---
title: Modal → SIE
description: Consolidate sentence-transformers-on-Modal functions into one SIE deployment. Flat cost, no cold starts, multi-model on one GPU.
canonical_url: https://superlinked.com/docs/migrate/modal
last_updated: 2026-05-07
---

[Modal](https://modal.com) is a serverless container platform. Many
teams use it as the quickest way to put `sentence-transformers` on a
GPU and call it over the network. SIE is a purpose-built inference
engine that you self-host on Kubernetes (or run locally for dev).

## Why migrate

- **No per-call cold starts.** Modal scales to zero by default; the
  first call after idle pays container boot + model load. SIE is a
  long-lived pod with the model already loaded.
- **Multi-model on one GPU.** Modal isolates each `@app.function` in
  its own container. If you serve N models, you potentially pay for N
  cold-starting containers. SIE shares one GPU's memory across models
  with LRU eviction.
- **Flat cost above your break-even.** Modal bills per second of
  container uptime, with CPU and GPU metered separately; SIE on a
  self-managed GPU instance bills hourly whether you use it or not.
  **SIE wins for sustained workloads where the GPU stays busy enough
  that flat hourly beats per-second-with-overhead.** Modal wins for
  spiky or low-duty-cycle workloads where scale-to-zero dominates.
  There's no universally right answer; measure your duty cycle (a
  back-of-the-envelope sketch follows this list).
- **Data residency.** Modal runs your workloads in Modal's cloud
  accounts. SIE runs in yours.
- **Dedicated APIs.** Modal gives you raw RPC; you reinvent every
  embedding / scoring / extraction endpoint. SIE ships typed APIs out
  of the box.
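
To make the duty-cycle point concrete, here is a back-of-the-envelope
sketch. Every rate below is an illustrative placeholder, not current
Modal or cloud-provider pricing; plug in your own numbers and
cold-start profile.

```python
# Break-even sketch. All rates are ILLUSTRATIVE PLACEHOLDERS, not
# current Modal or cloud pricing -- substitute your own numbers.
MODAL_GPU_PER_SEC = 0.000164   # hypothetical per-second GPU rate
COLD_START_SEC = 30            # hypothetical boot + model-load overhead
SELF_HOSTED_PER_HOUR = 0.53    # hypothetical on-demand GPU instance rate

def modal_cost_per_hour(busy_sec: float, cold_starts: int) -> float:
    # Billable time = useful work plus cold-start overhead.
    return (busy_sec + cold_starts * COLD_START_SEC) * MODAL_GPU_PER_SEC

for duty in (0.05, 0.25, 0.50, 0.90):
    cost = modal_cost_per_hour(duty * 3600, cold_starts=4)
    winner = "Modal" if cost < SELF_HOSTED_PER_HOUR else "self-hosted"
    print(f"duty {duty:>4.0%}: Modal ~${cost:.2f}/h "
          f"vs ${SELF_HOSTED_PER_HOUR:.2f}/h flat -> {winner}")
```

With these placeholder rates the crossover sits near 85% duty cycle;
your own rates will move it, which is the point of measuring first.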

### What this migration costs you

Modal's headline benefit is "no infra". Be honest with yourself:
moving to SIE means you (or your platform team) now own:

- A Kubernetes cluster with GPU nodes (EKS / GKE / on-prem).
- Autoscaling config (the included KEDA values are a starting point,
  not a finished SLO).
- An HF weights mirror or PVC, plus image-pull credentials.
- On-call for the inference pods.

If your team doesn't already operate K8s, the operational tax can
exceed the cost savings until you're well past break-even. The
migration makes sense when (a) you already run K8s for other
services, or (b) sustained inference volume is high enough that the
savings fund a small platform investment.

## What this migration is not

A 1:1 framework swap. Modal is hosting; SIE is the engine. You're
typically replacing:

- A `@app.cls(gpu="T4")` wrapping `SentenceTransformer.encode(...)`
- ...possibly behind a `@modal.web_endpoint(...)`
- ...sometimes with a `modal.Volume` cache for HF weights

with one SIE deployment (Helm chart on EKS / GKE) plus the SIE Python
SDK.

If your current Modal app already runs TEI, follow
[TEI → SIE](/docs/migrate/tei/) for the engine swap and treat the
Modal piece as a pure hosting migration.

## Before

```python
import modal

app = modal.App("embeddings")
image = modal.Image.debian_slim().pip_install("sentence-transformers")

@app.cls(image=image, gpu="T4")
class Embedder:
    @modal.enter()
    def _load(self):
        # Runs once per container start -- this is the cold-start
        # model load you pay for after every scale-to-zero.
        from sentence_transformers import SentenceTransformer
        self._model = SentenceTransformer("BAAI/bge-small-en-v1.5")

    @modal.method()
    def embed(self, texts: list[str]) -> list[list[float]]:
        # Unit-normalized, so cosine similarity is a plain dot product.
        return self._model.encode(texts, normalize_embeddings=True).tolist()
```

```bash
modal deploy embeddings.py
```
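
For reference, the call site you are replacing typically looks like
the sketch below. `modal.Cls.from_name` is the current lookup
spelling; older Modal releases use `Cls.lookup` instead.

```python
import modal

# Look up the deployed class by app name and class name. Hedged:
# older Modal releases spell this `modal.Cls.lookup(...)`.
Embedder = modal.Cls.from_name("embeddings", "Embedder")

vectors = Embedder().embed.remote(["hello world", "modal to sie"])
print(len(vectors), len(vectors[0]))
```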

## After

```bash
# Local dev
mise run serve -- -m BAAI/bge-small-en-v1.5

# Production: install the published Helm chart
helm install sie superlinked/sie -f values.yaml
```

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

texts = ["hello world", "flat cost, no cold starts"]

client = SIEClient("http://sie.your-cluster.internal:8080")
result = client.encode(
    "BAAI/bge-small-en-v1.5",
    [Item(text=t) for t in texts],
)
```

## Mapping

| Modal pattern                                 | SIE equivalent                                              |
|-----------------------------------------------|-------------------------------------------------------------|
| `@app.cls(gpu="T4")` + `SentenceTransformer`  | SIE bundle config; `mise run serve` locally; Helm in prod   |
| `@modal.enter()` to load weights              | First request triggers lazy load; warm via Helm values      |
| `@modal.web_endpoint(method="POST")`          | SIE `/v1/embeddings` and `/encode`                          |
| `modal.Volume.from_name("hf-cache", ...)`     | Helm chart's `weightsPVC` + HF mirror                       |
| `modal deploy`                                | `helm install sie superlinked/sie -f values.yaml`           |
| Multiple `@app.function`s for different models | One bundle config, one cluster                              |
| Modal secrets                                 | Kubernetes secrets / sealed-secrets / Vault                 |
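
If your Modal app exposed a `@modal.web_endpoint`, existing HTTP
callers can usually be pointed at SIE's `/v1/embeddings` instead of
going through the SDK. A minimal sketch, assuming the endpoint accepts
an OpenAI-style payload; verify the exact request schema against your
SIE version.

```python
import requests

# Assumption: /v1/embeddings accepts an OpenAI-style body. Verify the
# field names against your SIE version before wiring up callers.
resp = requests.post(
    "http://sie.your-cluster.internal:8080/v1/embeddings",
    json={"model": "BAAI/bge-small-en-v1.5", "input": ["hello world"]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```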

## Re-embed required?

**No.** Same checkpoint, same vector space.

## Run it yourself

```bash
# SIE leg -- same checkpoint as the Modal snippet below.
mise run serve -- -m BAAI/bge-small-en-v1.5

# Modal leg: save the 'before' snippet from this page as
# embeddings.py, then deploy it once:
modal deploy embeddings.py
```

Call both, embed the same fixed corpus, and compare. On the same
checkpoint, expect pairwise cosine similarity at or above 0.999.
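
A minimal harness might look like the sketch below. It assumes the
local SIE serve listens on `localhost:8080`, that the SDK result
exposes per-item vectors as `.vector` (adapt to the actual return
type), and that the Modal app from the 'before' snippet is deployed.

```python
import modal
import numpy as np
from sie_sdk import SIEClient
from sie_sdk.types import Item

CORPUS = ["the cat sat on the mat", "flat cost, no cold starts"]
MODEL = "BAAI/bge-small-en-v1.5"

# Modal leg. Hedged: older Modal releases spell this `Cls.lookup`.
Embedder = modal.Cls.from_name("embeddings", "Embedder")
modal_vecs = np.array(Embedder().embed.remote(CORPUS))

# SIE leg. Assumptions: local serve on :8080, items expose `.vector`.
client = SIEClient("http://localhost:8080")
result = client.encode(MODEL, [Item(text=t) for t in CORPUS])
sie_vecs = np.array([item.vector for item in result])

# Row-wise cosine similarity between corresponding embeddings.
a = modal_vecs / np.linalg.norm(modal_vecs, axis=1, keepdims=True)
b = sie_vecs / np.linalg.norm(sie_vecs, axis=1, keepdims=True)
print((a * b).sum(axis=1))  # expect every value >= 0.999
```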
