---
title: Quantization
description: Reduce vector storage with int8, uint8, and binary quantization.
canonical_url: https://superlinked.com/docs/encode/quantization
last_updated: 2026-05-18
---

Quantization reduces vector storage and bandwidth. A 1024-dim float32 vector (4KB) becomes 1KB with int8 or 128 bytes with binary. Quality loss is typically 1-2% for int8 and 5-10% for binary.
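
The sizes above follow directly from the number of bits stored per dimension. A quick sketch of the arithmetic (the `vector_bytes` helper is illustrative and not part of the SDK):

```python
# Bits stored per dimension for each output dtype.
BITS_PER_DIM = {"float32": 32, "float16": 16, "int8": 8, "uint8": 8, "binary": 1}


def vector_bytes(dims: int, dtype: str) -> int:
    """Storage for one vector, rounded up to whole bytes."""
    return (dims * BITS_PER_DIM[dtype] + 7) // 8


sizes = {dtype: vector_bytes(1024, dtype) for dtype in BITS_PER_DIM}
for dtype, size in sizes.items():
    print(f"{dtype}: {size} bytes")
```

For 1024 dimensions this gives 4096 bytes for float32, 1024 for int8 and uint8, and 128 for binary.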

## Quick Example

Source: [packages/sie_server/src/sie_server/core/postprocessor.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/core/postprocessor.py)

#### Python

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Int8 quantization
result = client.encode(
    "BAAI/bge-m3",
    Item(text="text to encode"),
    output_dtype="int8"
)

# Result is int8 array, 4x smaller than float32
print(f"Dtype: {result['dense'].dtype}")  # int8
print(f"Range: [{result['dense'].min()}, {result['dense'].max()}]")  # within [-127, 127]
```

#### TypeScript

```typescript
import { SIEClient } from "@superlinked/sie-sdk";

const client = new SIEClient("http://localhost:8080");

// Int8 quantization
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "text to encode" },
  { outputDtype: "int8" }
);

// Result is still Float32Array but contains quantized values
// Server handles quantization, client receives appropriate format
console.log(`Dimensions: ${result.dense?.length}`);

await client.close();
```

## Quantization Types

| Type | Size Reduction | Quality Loss | Best For |
|------|----------------|--------------|----------|
| `float32` | 1x (baseline) | 0% | Quality-critical |
| `float16` | 2x | ~0% | Balance |
| `int8` | 4x | 1-2% | General storage |
| `uint8` | 4x | 1-2% | Qdrant compatibility |
| `binary` | 32x | 5-10% | Massive scale |

## Int8 Quantization

Symmetric per-vector quantization mapping values to [-127, 127]:

#### Python

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="int8"
)

# Each vector is independently scaled
# value_int8 = round(value_float32 / max_abs * 127)
```

#### TypeScript

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello world" },
  { outputDtype: "int8" }
);

// Each vector is independently scaled
// value_int8 = round(value_float32 / max_abs * 127)
```

Use with vector databases that support int8:
- Qdrant (scalar quantization)
- Milvus (int8 index)
- Pinecone (using product quantization)
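
To make the scaling formula from the code comments concrete, here is a minimal NumPy sketch of symmetric per-vector quantization. It is an illustration under assumptions, not the server's implementation: `quantize_int8` and the returned `scale` are hypothetical names, and this page does not say whether SIE exposes a scale factor.

```python
import numpy as np


def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-vector quantization to [-127, 127] (illustrative helper)."""
    max_abs = float(np.max(np.abs(vec)))
    if max_abs == 0.0:
        # An all-zero vector has nothing to scale.
        return np.zeros(vec.shape, dtype=np.int8), 1.0
    quantized = np.round(vec / max_abs * 127).astype(np.int8)
    return quantized, max_abs / 127


rng = np.random.default_rng(0)
vec = rng.standard_normal(1024).astype(np.float32)  # synthetic stand-in for an embedding

q, scale = quantize_int8(vec)
restored = q.astype(np.float32) * scale

cosine = float(np.dot(vec, restored) / (np.linalg.norm(vec) * np.linalg.norm(restored)))
print(q.dtype, int(q.min()), int(q.max()))
print(f"cosine(original, restored) = {cosine:.5f}")
```

Cosine similarity is unaffected by a per-vector scale factor, so int8 vectors can be compared with cosine directly without restoring the original magnitudes.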

## Uint8 Quantization

Uint8 linearly maps each vector's [min, max] range to [0, 255]:

#### Python

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="uint8"
)

# Maps [min, max] → [0, 255] per vector
```

#### TypeScript

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello world" },
  { outputDtype: "uint8" }
);

// Maps [min, max] → [0, 255] per vector
```

Qdrant's scalar quantization uses uint8 format.
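
The per-vector min/max mapping can be sketched in NumPy as follows. This is a hedged illustration rather than the server's code: `quantize_uint8`, `offset`, and `step` are hypothetical names, and whether SIE returns the offset and step needed to reverse the mapping is not stated here.

```python
import numpy as np


def quantize_uint8(vec: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Map [min, max] of one vector linearly onto [0, 255] (illustrative helper)."""
    lo, hi = float(vec.min()), float(vec.max())
    if hi == lo:
        # A constant vector collapses to a single level.
        return np.zeros(vec.shape, dtype=np.uint8), lo, 1.0
    step = (hi - lo) / 255
    levels = np.clip(np.round((vec - lo) / step), 0, 255)
    return levels.astype(np.uint8), lo, step


rng = np.random.default_rng(0)
vec = rng.standard_normal(1024).astype(np.float32)  # synthetic stand-in for an embedding

q, offset, step = quantize_uint8(vec)
restored = q.astype(np.float32) * step + offset
max_error = float(np.max(np.abs(vec - restored)))

print(q.dtype, int(q.min()), int(q.max()))
print(f"max reconstruction error: {max_error:.5f} (step = {step:.5f})")
```

Unlike the symmetric int8 scheme, this mapping shifts values by an offset, so the smallest component always lands on 0 and the largest on 255.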

## Binary Quantization

Each dimension is reduced to a single bit and bit-packed, making vectors 32x smaller than float32:

#### Python

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="binary"
)

# 1024-dim float32 (4KB) → 128 bytes
# Each dimension becomes 1 bit: positive → 1, negative → 0
print(f"Shape: {result['dense'].shape}")  # (128,) uint8
```

#### TypeScript

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello world" },
  { outputDtype: "binary" }
);

// 1024-dim float32 (4KB) → 128 bytes
// Each dimension becomes 1 bit: positive → 1, negative → 0
console.log(`Shape: ${result.dense?.length}`);  // 128
```
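
As a rough picture of how 1024 dimensions become 128 bytes, the sketch below packs sign bits with NumPy. It assumes the sign rule from the comments above (positive becomes 1, everything else 0). The `binarize` helper is hypothetical, and the server's exact bit order may differ.

```python
import numpy as np


def binarize(vec: np.ndarray) -> np.ndarray:
    """One bit per dimension (1 if positive, else 0), packed 8 dimensions per byte."""
    return np.packbits(vec > 0)


rng = np.random.default_rng(0)
vec = rng.standard_normal(1024).astype(np.float32)  # synthetic stand-in for an embedding

packed = binarize(vec)
print(packed.shape, packed.dtype, packed.nbytes)

# Unpacking recovers the sign pattern, but the magnitudes are gone.
signs = np.unpackbits(packed).astype(bool)
```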

Binary uses Hamming distance instead of cosine:

```python
import numpy as np

# Hamming distance = XOR + popcount over the bit-packed uint8 arrays
hamming = int(np.unpackbits(np.bitwise_xor(a_binary, b_binary)).sum())
```

Binary is useful for:
- First-stage candidate filtering
- Memory-constrained environments
- Pipelines that rerank candidates with full-precision vectors

## Float16 Precision

Half precision with minimal quality loss:

#### Python

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello world"),
    output_dtype="float16"
)

print(f"Dtype: {result['dense'].dtype}")  # float16
```

#### TypeScript

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello world" },
  { outputDtype: "float16" }
);

// Note: JavaScript doesn't have native float16, so values may be returned as float32
console.log(`Dimensions: ${result.dense?.length}`);
```

Float16 is effectively lossless for vector search in practice. Use it when your database supports it.
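
A quick way to check this locally is to round-trip a vector through half precision and compare it with the original. The sketch below uses a synthetic unit-length vector instead of real model output, so treat the numbers as indicative only.

```python
import numpy as np

rng = np.random.default_rng(0)
vec = rng.standard_normal(1024).astype(np.float32)
vec /= np.linalg.norm(vec)  # unit length, as is common for embeddings

half = vec.astype(np.float16)
restored = half.astype(np.float32)

cosine = float(np.dot(vec, restored) / (np.linalg.norm(vec) * np.linalg.norm(restored)))
print(f"{vec.nbytes} bytes -> {half.nbytes} bytes")
print(f"cosine(original, restored) = {cosine:.6f}")
```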

## Quality Impact

Approximate NDCG retention on standard benchmarks:

| Quantization | NDCG@10 Retention |
|--------------|-------------------|
| float32 | 100% (baseline) |
| float16 | ~99.9% |
| int8 | ~98-99% |
| uint8 | ~98-99% |
| binary | ~90-95% |

Actual impact varies by model and task. Run evals on your data.
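
One simple check is to compare the top-k results from float32 scores against the top-k from quantized scores. The sketch below measures top-k overlap, a cruder proxy than NDCG, on synthetic data. The `overlap_at_k` and `cosine_scores` helpers are illustrative, and the int8 step is a local stand-in for `output_dtype="int8"`. For a real eval, swap in your own embeddings and queries.

```python
import numpy as np


def overlap_at_k(exact_scores: np.ndarray, approx_scores: np.ndarray, k: int = 10) -> float:
    """Fraction of the exact top-k that the approximate ranking also returns."""
    exact = set(np.argsort(-exact_scores)[:k].tolist())
    approx = set(np.argsort(-approx_scores)[:k].tolist())
    return len(exact & approx) / k


def cosine_scores(matrix: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `matrix` and `query`."""
    matrix = matrix.astype(np.float32)
    return (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))


rng = np.random.default_rng(0)
docs = rng.standard_normal((2000, 256)).astype(np.float32)  # synthetic corpus
query = rng.standard_normal(256).astype(np.float32)

# Local stand-in for int8 output: symmetric per-vector scaling to [-127, 127].
max_abs = np.abs(docs).max(axis=1, keepdims=True)
docs_int8 = np.round(docs / max_abs * 127).astype(np.int8)

exact_scores = cosine_scores(docs, query)
approx_scores = cosine_scores(docs_int8, query)
overlap = overlap_at_k(exact_scores, approx_scores, k=10)
print(f"top-10 overlap: {overlap:.0%}")
```

Averaging this overlap across a representative set of queries gives a first signal before investing in a labeled NDCG evaluation.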

## Two-Stage Pattern

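Use binary for fast candidate retrieval, then full precision for reranking. In the examples below, `binary_index` and `rerank_with_full_precision` (`binaryIndex` and `rerankWithFullPrecision` in TypeScript) are placeholders for your own index and reranking code, and `model` and `query` are assumed to be defined: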

#### Python

```python
# Stage 1: Binary search over millions
binary_result = client.encode(model, query, output_dtype="binary")
candidates = binary_index.search(binary_result["dense"], top_k=1000)

# Stage 2: Full precision rerank of top candidates
full_result = client.encode(model, query)  # float32
reranked = rerank_with_full_precision(full_result["dense"], candidates)
```

#### TypeScript

```typescript
// Stage 1: Binary search over millions
const binaryResult = await client.encode(model, query, { outputDtype: "binary" });
const candidates = await binaryIndex.search(binaryResult.dense!, 1000);

// Stage 2: Full precision rerank of top candidates
const fullResult = await client.encode(model, query);  // float32
const reranked = rerankWithFullPrecision(fullResult.dense!, candidates);
```
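
For a concrete picture of the two stages, the following self-contained NumPy sketch runs both over a small synthetic in-memory corpus. It assumes sign-based bit packing as described for binary output and uses brute-force Hamming search in place of a real index. The corpus, query, and candidate counts are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic corpus of unit-length vectors; the query is a noisy copy of document 42.
docs = rng.standard_normal((5000, 256)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42] + 0.02 * rng.standard_normal(256).astype(np.float32)

# Stage 1: Hamming search over bit-packed sign vectors.
docs_bits = np.packbits(docs > 0, axis=1)   # shape (5000, 32), dtype uint8
query_bits = np.packbits(query > 0)         # shape (32,), dtype uint8
hamming = np.unpackbits(docs_bits ^ query_bits, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:100]      # keep the 100 closest by Hamming distance

# Stage 2: rerank only the candidates with full-precision dot products.
scores = docs[candidates] @ query
reranked = candidates[np.argsort(-scores)][:10]

print(reranked[0])  # 42
```

Stage 1 touches 32 bytes per document instead of 1024, and stage 2 reads full-precision vectors for only the 100 candidates.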

## Sparse Vector Quantization

Sparse vectors are NOT quantized. Only dense and multivector outputs are:

#### Python

```python
result = client.encode(
    "BAAI/bge-m3",
    Item(text="hello"),
    output_types=["dense", "sparse"],
    output_dtype="int8"
)

# Dense is int8
print(result["dense"].dtype)  # int8

# Sparse stays float32 (indices + values don't benefit from quantization)
print(result["sparse"]["values"].dtype)  # float32
```

#### TypeScript

```typescript
const result = await client.encode(
  "BAAI/bge-m3",
  { text: "hello" },
  { outputTypes: ["dense", "sparse"], outputDtype: "int8" }
);

// Dense is quantized
console.log(`Dense length: ${result.dense?.length}`);

// Sparse stays float32 (indices + values don't benefit from quantization)
console.log(`Sparse values: Float32Array`);
```

## HTTP API

Source: [packages/sie_server/src/sie_server/api/encode.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/encode.py)

The server defaults to msgpack for efficient binary transport. To get a JSON response instead, send the `Accept: application/json` header:

```bash
curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "quantized text"}],
    "params": {"output_dtype": "int8"}
  }'
```

Response includes int8 values:

```json
{
  "model": "BAAI/bge-m3",
  "items": [
    {
      "dense": {"dims": 1024, "dtype": "int8", "values": [23, -89, 12, ...]}
    }
  ]
}
```

Note: JSON represents int8 as integers. For msgpack, values are packed as int8.
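
If you consume the JSON form directly, the integer list can be turned back into a typed array. The sketch below assumes the response shape shown above and uses a shortened, made-up four-value sample. `dense_from_json` is a hypothetical helper that covers only dtypes NumPy knows by name, such as `int8`, so it does not handle `binary`.

```python
import json

import numpy as np

# Sample body shaped like the response above, shortened to 4 illustrative values.
body = """
{
  "model": "BAAI/bge-m3",
  "items": [
    {"dense": {"dims": 4, "dtype": "int8", "values": [23, -89, 12, 5]}}
  ]
}
"""


def dense_from_json(item: dict) -> np.ndarray:
    """Build a typed NumPy array from one item's `dense` field (illustrative helper)."""
    dense = item["dense"]
    vec = np.asarray(dense["values"], dtype=np.dtype(dense["dtype"]))
    if vec.shape[0] != dense["dims"]:
        raise ValueError(f"expected {dense['dims']} values, got {vec.shape[0]}")
    return vec


response = json.loads(body)
vec = dense_from_json(response["items"][0])
print(vec.dtype, vec.tolist())
```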

## What's Next

- [Model Catalog](/models#task=encode) - all supported models
