---
title: Multimodal
description: Encode images and documents alongside text.
canonical_url: https://superlinked.com/docs/encode/multimodal
last_updated: 2026-05-18
---

Multimodal models encode images and text into a shared vector space, so you can search images with text queries or find similar images directly. SIE supports CLIP, SigLIP, and document models such as ColPali.

## Quick Example

Source: [packages/sie_server/src/sie_server/adapters/clip/__init__.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/clip/__init__.py)

#### Python

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# Encode an image
with open("photo.jpg", "rb") as f:
    image_bytes = f.read()

result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(images=[{"data": image_bytes, "format": "jpeg"}])
)

print(f"Dense vector: {len(result['dense'])} dims")  # 512
```

#### TypeScript

```typescript
import { SIEClient } from "@superlinked/sie-sdk";
import { readFileSync } from "fs";

const client = new SIEClient("http://localhost:8080");

// Encode an image
const imageBytes = readFileSync("photo.jpg");

const result = await client.encode(
  "openai/clip-vit-base-patch32",
  { images: [imageBytes] }
);

console.log(`Dense vector: ${result.dense?.length} dims`);  // 512

await client.close();
```

## Image Input Formats

Source: [packages/sie_sdk/src/sie_sdk/images.py](https://github.com/superlinked/sie/blob/main/packages/sie_sdk/src/sie_sdk/images.py)

The SDK accepts images in multiple formats:

#### Python

```python
# From bytes with format hint
result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(images=[{"data": image_bytes, "format": "jpeg"}])
)

# From file (read bytes)
with open("image.png", "rb") as f:
    result = client.encode(
        "openai/clip-vit-base-patch32",
        Item(images=[{"data": f.read(), "format": "png"}])
    )

# Multiple images (averaged)
result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(images=[
        {"data": img1_bytes, "format": "jpeg"},
        {"data": img2_bytes, "format": "jpeg"},
    ])
)
```

#### TypeScript

```typescript
import { readFileSync } from "fs";

// From file (read bytes)
const imageBytes = readFileSync("photo.jpg");
const result = await client.encode(
  "openai/clip-vit-base-patch32",
  { images: [imageBytes] }
);

// Multiple images (averaged)
const img1 = readFileSync("image1.jpg");
const img2 = readFileSync("image2.jpg");
const multiResult = await client.encode(
  "openai/clip-vit-base-patch32",
  { images: [img1, img2] }
);
```

Supported formats: JPEG, PNG, WebP, BMP, GIF (first frame).
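
When images are read from disk, the format hint can be derived from the file extension. The helper below is a hypothetical sketch, not part of the SDK. Only the `"jpeg"` and `"png"` hint strings appear in the examples above; the strings for WebP, BMP, and GIF are assumed to follow the same lowercase pattern.

```python
from pathlib import Path

# Hypothetical mapping. "jpeg" and "png" are taken from the examples above;
# the remaining hint strings are assumptions.
_EXT_TO_FORMAT = {
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".png": "png",
    ".webp": "webp",
    ".bmp": "bmp",
    ".gif": "gif",
}


def image_format_hint(path: str) -> str:
    """Return a format hint for an image path, based on its extension."""
    ext = Path(path).suffix.lower()
    if ext not in _EXT_TO_FORMAT:
        raise ValueError(f"Unsupported image extension: {ext!r}")
    return _EXT_TO_FORMAT[ext]


def image_input(path: str) -> dict:
    """Build the {"data": ..., "format": ...} dict used in the examples above."""
    return {"data": Path(path).read_bytes(), "format": image_format_hint(path)}
```

With this in place, `Item(images=[image_input("image.png")])` would replace the hand-written dict, and unsupported extensions fail early on the client side.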

## Text-to-Image Search

CLIP and SigLIP encode text and images into the same vector space:

#### Python

```python
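# Index images (image_paths is your list of JPEG file paths;
# vector_db is a placeholder for your vector store client)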
image_embeddings = []
for image_path in image_paths:
    with open(image_path, "rb") as f:
        result = client.encode(
            "openai/clip-vit-base-patch32",
            Item(images=[{"data": f.read(), "format": "jpeg"}])
        )
        image_embeddings.append(result["dense"])

# Store in vector database
for i, embedding in enumerate(image_embeddings):
    vector_db.insert(id=f"img-{i}", vector=embedding)

# Search with text query
query_result = client.encode(
    "openai/clip-vit-base-patch32",
    Item(text="a cat sitting on a couch")
)

# Find similar images
results = vector_db.search(query_result["dense"], top_k=10)
```

#### TypeScript

```typescript
import { readFileSync } from "fs";

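// Index images (imagePaths is your list of JPEG file paths;
// vectorDb is a placeholder for your vector store client)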
const imageEmbeddings: Float32Array[] = [];
for (const imagePath of imagePaths) {
  const imageBytes = readFileSync(imagePath);
  const result = await client.encode(
    "openai/clip-vit-base-patch32",
    { images: [imageBytes] }
  );
  if (result.dense) {
    imageEmbeddings.push(result.dense);
  }
}

// Store in vector database
for (let i = 0; i < imageEmbeddings.length; i++) {
  await vectorDb.insert({ id: `img-${i}`, vector: imageEmbeddings[i] });
}

// Search with text query
const queryResult = await client.encode(
  "openai/clip-vit-base-patch32",
  { text: "a cat sitting on a couch" }
);

// Find similar images
const results = await vectorDb.search(queryResult.dense!, 10);
```
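
For a small collection, a vector database is not strictly needed. The sketch below ranks stored embeddings against a query by cosine similarity using NumPy. It is an in-memory stand-in for the placeholder `vector_db.search` call above; `cosine_top_k` is a hypothetical helper, not an SIE function.

```python
import numpy as np


def cosine_top_k(query, vectors, top_k=10):
    """Return (index, score) pairs for the top_k rows of `vectors` most similar to `query`."""
    q = np.asarray(query, dtype=np.float32)
    m = np.asarray(vectors, dtype=np.float32)

    # Normalize so that the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)

    scores = m @ q                         # one score per stored vector
    order = np.argsort(-scores)[:top_k]    # highest similarity first
    return [(int(i), float(scores[i])) for i in order]
```

Under these assumptions, `cosine_top_k(query_result["dense"], image_embeddings, top_k=10)` returns the positions of the best-matching images in `image_embeddings`. The same helper works for image-to-image search, since both query types produce vectors in the same space.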

## Image-to-Image Search

Search for visually similar images:

#### Python

```python
# Encode reference image
with open("reference.jpg", "rb") as f:
    ref_result = client.encode(
        "openai/clip-vit-base-patch32",
        Item(images=[{"data": f.read(), "format": "jpeg"}])
    )

# Find similar images in your database
similar = vector_db.search(ref_result["dense"], top_k=10)
```

#### TypeScript

```typescript
import { readFileSync } from "fs";

// Encode reference image
const refImage = readFileSync("reference.jpg");
const refResult = await client.encode(
  "openai/clip-vit-base-patch32",
  { images: [refImage] }
);

// Find similar images in your database
const similar = await vectorDb.search(refResult.dense!, 10);
```

## SigLIP Models

Source: [packages/sie_server/src/sie_server/adapters/siglip/__init__.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/siglip/__init__.py)

SigLIP often outperforms CLIP on image-text matching. Encoding works the same way; only the model name changes:

#### Python

```python
result = client.encode(
    "google/siglip-so400m-patch14-384",
    Item(images=[{"data": image_bytes, "format": "jpeg"}])
)

print(f"Dense vector: {len(result['dense'])} dims")  # 1152
```

#### TypeScript

```typescript
const result = await client.encode(
  "google/siglip-so400m-patch14-384",
  { images: [imageBytes] }
);

console.log(`Dense vector: ${result.dense?.length} dims`);  // 1152
```

SigLIP is trained with a pairwise sigmoid loss rather than CLIP's softmax-based contrastive loss, which can improve fine-grained matching.
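
For intuition, the two training objectives can be sketched on a small matrix of image-text similarity logits, where entry `(i, j)` scores image `i` against text `j` and the diagonal holds the matching pairs. This is an illustrative NumPy sketch of the published loss formulations, not code from SIE or from either model's training pipeline. The function names are hypothetical, and the learned temperature and bias terms are assumed to be already folded into `logits`.

```python
import numpy as np


def softmax_contrastive_loss(logits):
    """CLIP-style loss: symmetric cross-entropy over rows and columns."""
    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))   # matching pairs are on the diagonal
    logits = np.asarray(logits, dtype=np.float64)
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def sigmoid_pairwise_loss(logits):
    """SigLIP-style loss: every image-text pair is an independent binary decision."""
    logits = np.asarray(logits, dtype=np.float64)
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0            # +1 for matches, -1 for non-matches
    # -log(sigmoid(x)) == log(1 + exp(-x)) == logaddexp(0, -x)
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n
```

The softmax version normalizes each score against the rest of the batch, while the sigmoid version judges each pair on its own.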

## Document Search with ColPali

Source: [packages/sie_server/src/sie_server/adapters/colpali/__init__.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/adapters/colpali/__init__.py)

ColPali encodes document page images directly, with no OCR step. The model "sees" layout, tables, and figures:

#### Python

```python
# Encode a PDF page as image
result = client.encode(
    "vidore/colpali-v1.3-hf",
    Item(images=[{"data": page_image_bytes, "format": "png"}]),
    output_types=["multivector"]
)

# ColPali returns multi-vector (per-patch) embeddings
print(f"Patches: {result['multivector'].shape[0]}")
```

#### TypeScript

```typescript
// Encode a PDF page as image
const result = await client.encode(
  "vidore/colpali-v1.3-hf",
  { images: [pageImageBytes] },
  { outputTypes: ["multivector"] }
);

// ColPali returns multi-vector (per-patch) embeddings
console.log(`Patches: ${result.multivector?.length}`);
```

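ColPali follows the ColBERT late-interaction design: each page yields multiple vectors instead of one, and relevance is scored with MaxSim. For each query vector, MaxSim takes the maximum similarity over the page's vectors, then sums those maxima.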
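
MaxSim scoring can be sketched in a few lines of NumPy. This is a minimal illustration, not SIE's implementation: `maxsim_score` is a hypothetical helper, it uses the dot product as the similarity measure, and it assumes the query was encoded into its own multi-vector output with the same model.

```python
import numpy as np


def maxsim_score(query_vectors, doc_vectors):
    """Late-interaction score between a multi-vector query and a multi-vector page."""
    q = np.asarray(query_vectors, dtype=np.float32)  # (num_query_vectors, dim)
    d = np.asarray(doc_vectors, dtype=np.float32)    # (num_page_vectors, dim)

    sim = q @ d.T                   # similarity of every query vector to every page vector
    best_per_query = sim.max(axis=1)  # best-matching page vector for each query vector
    return float(best_per_query.sum())
```

To rank pages, compute `maxsim_score` between the query's multi-vector output and each page's `result["multivector"]`, then sort pages by descending score.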

## Vision Models

| Model | Dimensions | Resolution | Notes |
|-------|------------|------------|-------|
| `openai/clip-vit-base-patch32` | 512 | 224 | Fast, general |
| `openai/clip-vit-large-patch14` | 768 | 224 | Higher quality |
| `google/siglip-so400m-patch14-384` | 1152 | 384 | Best quality |
| `laion/CLIP-ViT-H-14-laion2B-s32B-b79K` | 1024 | 224 | Large-scale trained |
| `vidore/colpali-v1.3-hf` | 128 (multi) | 448 | Document pages |

## HTTP API

Source: [packages/sie_server/src/sie_server/api/encode.py](https://github.com/superlinked/sie/blob/main/packages/sie_server/src/sie_server/api/encode.py)

In HTTP requests, image bytes are base64-encoded. The server defaults to msgpack; to send and receive JSON instead, set the `Content-Type` and `Accept` headers:

```bash
# base64 -w0 is GNU coreutils syntax; on macOS use: base64 -i photo.jpg
curl -X POST http://localhost:8080/v1/encode/openai/clip-vit-base-patch32 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"items": [{"images": [{"data": "'$(base64 -w0 photo.jpg)'", "format": "jpeg"}]}]}'
```
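
The same request can be built without the SDK using only the Python standard library. The sketch below mirrors the URL, headers, and body shape of the `curl` command above; the helper names are hypothetical, and the response is returned as parsed JSON without assuming its structure.

```python
import base64
import json
import urllib.request


def build_encode_body(image_bytes: bytes, image_format: str) -> str:
    """Serialize one image into the JSON body shown in the curl example."""
    payload = {
        "items": [
            {
                "images": [
                    {
                        # JSON cannot carry raw bytes, so the image is base64 text.
                        "data": base64.b64encode(image_bytes).decode("ascii"),
                        "format": image_format,
                    }
                ]
            }
        ]
    }
    return json.dumps(payload)


def encode_image_http(base_url: str, model: str, image_bytes: bytes, image_format: str):
    """POST an image to the encode endpoint and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/encode/{model}",
        data=build_encode_body(image_bytes, image_format).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

For example, `encode_image_http("http://localhost:8080", "openai/clip-vit-base-patch32", image_bytes, "jpeg")` sends the same payload as the `curl` command.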

## What's Next

- [Multi-vector embeddings](/docs/encode/multivector/) - ColPali uses multivector output
- [Dense embeddings](/docs/encode/) - text-only encoding
