# Modal → SIE
Modal is a serverless container platform. Many
teams use it as the quickest way to put sentence-transformers on a
GPU and call it over the network. SIE is a purpose-built inference
engine that you self-host on Kubernetes (or run locally for dev).
## Why migrate

- No per-call cold starts. Modal scales to zero by default; the first call after idle pays container boot + model load. SIE is a long-lived pod with the model already loaded.
- Multi-model on one GPU. Modal isolates per `@app.function`. If you serve N models, you potentially pay for N cold-starting containers. SIE shares one GPU’s memory across models with LRU eviction.
- Flat cost above your break-even. Modal bills per-second of container uptime plus per-CPU plus per-GPU; SIE on a self-managed GPU instance bills hourly whether you use it or not. SIE wins for sustained workloads where the GPU stays busy enough that flat hourly beats per-second-with-overhead. Modal wins for spiky or low-duty-cycle workloads where scale-to-zero dominates. There’s no universally right answer; measure your duty cycle (see the sketch after this list).
- Data residency. Modal runs in their accounts. SIE runs in yours.
- Dedicated APIs. Modal gives you raw RPC; you reinvent every embedding / scoring / extraction endpoint. SIE ships typed APIs out of the box.
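To make the duty-cycle point concrete, here is a minimal back-of-envelope sketch. All prices below are placeholders, not quotes; substitute your actual Modal rate card and your cloud’s instance pricing.

```python
# Back-of-envelope break-even between Modal's per-second billing and a
# flat-rate GPU instance running SIE. Prices are PLACEHOLDERS: plug in
# your real Modal rate card and instance pricing before deciding.

MODAL_GPU_PER_SEC = 0.000164   # placeholder $/s for a T4-class GPU
MODAL_CPU_PER_SEC = 0.0000131  # placeholder $/s per CPU core
FLAT_GPU_PER_HOUR = 0.53       # placeholder $/h for a self-managed T4 node

def break_even_duty_cycle(cpus: int = 4) -> float:
    """Fraction of wall-clock time a Modal container must be up
    (including keep-warm idle, which Modal also bills) before the
    flat-rate instance becomes cheaper."""
    modal_per_hour_at_full_duty = (MODAL_GPU_PER_SEC + cpus * MODAL_CPU_PER_SEC) * 3600
    return FLAT_GPU_PER_HOUR / modal_per_hour_at_full_duty

if __name__ == "__main__":
    print(f"Break-even duty cycle: {break_even_duty_cycle():.0%} container uptime")
```

If your containers are up (busy or kept warm) more than that fraction of the time, the flat instance wins; below it, Modal’s scale-to-zero wins.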
## What this migration costs you

Modal’s headline benefit is “no infra”. Be honest with yourself: moving to SIE means you (or your platform team) now own:
- A Kubernetes cluster with GPU nodes (EKS / GKE / on-prem).
- Autoscaling config (the included KEDA values are a starting point, not a finished SLO).
- An HF weights mirror or PVC, plus image-pull credentials.
- On-call for the inference pods.
If your team doesn’t already operate K8s, the operational tax can exceed the cost savings until you’re well past break-even. The migration makes sense when (a) you already run K8s for other services, or (b) sustained inference volume is high enough that the savings fund a small platform investment.
## What this migration is not

A 1:1 framework swap. Modal is the hosting layer; SIE is the engine. You’re typically replacing:
- A `@app.cls(gpu="T4")` wrapping `SentenceTransformer.encode(...)`
- …possibly behind a `@modal.web_endpoint(...)`
- …sometimes with a `modal.Volume` cache for HF weights
with one SIE deployment (Helm chart on EKS / GKE) plus the SIE Python SDK.
If your current Modal app already runs TEI on Modal, follow TEI → SIE for the engine swap and treat the Modal piece as a pure hosting migration.
## Before

```python
import modal

app = modal.App("embeddings")
image = modal.Image.debian_slim().pip_install("sentence-transformers")

@app.cls(image=image, gpu="T4")
class Embedder:
    @modal.enter()
    def _load(self):
        from sentence_transformers import SentenceTransformer
        self._model = SentenceTransformer("BAAI/bge-small-en-v1.5")

    @modal.method()
    def embed(self, texts: list[str]) -> list[list[float]]:
        return self._model.encode(texts, normalize_embeddings=True).tolist()
```

```bash
modal deploy embeddings.py
```

## After

```bash
# Local dev
mise run serve -- -m sentence-transformers/all-MiniLM-L6-v2

# Production: install the published Helm chart
helm install sie superlinked/sie -f values.yaml
```

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.your-cluster.internal:8080")
result = client.encode(
    "BAAI/bge-small-en-v1.5",
    [Item(text=t) for t in texts],
)
```

## Mapping
| Modal pattern | SIE equivalent |
|---|---|
| `@app.cls(gpu="T4")` + `SentenceTransformer` | SIE bundle config; `mise run serve` locally; Helm in prod |
| `@modal.enter()` to load weights | First request triggers lazy load; warm via Helm values (or a deploy-time warm-up request; see the sketch below) |
| `@modal.web_endpoint(method="POST")` | SIE `/v1/embeddings` and `/encode` |
| `modal.Volume.from_name("hf-cache", ...)` | Helm chart’s `weightsPVC` + HF mirror |
| `modal deploy` | `helm install sie superlinked/sie -f values.yaml` |
| Multiple `@app.function`s for different models | One bundle config, one cluster |
| Modal secrets | Kubernetes secrets / sealed-secrets / Vault |
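If you’d rather not maintain warm-model lists in Helm values, a deploy-time warm-up request achieves the same effect as `@modal.enter()`. This is a sketch built on the `SIEClient.encode` call shown above; the model list and endpoint URL are illustrative assumptions, not SIE defaults.

```python
# Hypothetical deploy-time warm-up: issue one tiny encode per model so
# the first real request doesn't pay the lazy-load penalty. MODELS and
# the endpoint URL are illustrative assumptions for your environment.
from sie_sdk import SIEClient
from sie_sdk.types import Item

MODELS = ["BAAI/bge-small-en-v1.5"]  # whatever your bundle config serves

def warm(base_url: str) -> None:
    client = SIEClient(base_url)
    for model in MODELS:
        client.encode(model, [Item(text="warmup")])  # triggers the lazy load

if __name__ == "__main__":
    warm("http://sie.your-cluster.internal:8080")
```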
## Re-embed required?

No. Same checkpoint, same vector space.
## Run it yourself

```bash
# SIE leg: serve the same checkpoint the Modal leg loads
mise run serve -- -m BAAI/bge-small-en-v1.5
```

```bash
# Modal leg: save the 'before' snippet from this page as
# embeddings.py, then deploy it once:
modal deploy embeddings.py
```

Call both, embed the same fixed corpus, and compare. On the same checkpoint, expect cosine similarity at or above 0.999.
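A minimal comparison harness might look like the sketch below. It assumes the ‘before’ app is deployed as written (app `embeddings`, class `Embedder`), and it guesses at the SDK’s result shape (`result.vectors`); adjust both the result access and the local URL to your setup.

```python
# Hedged comparison harness: embed one fixed corpus through both legs
# and report per-text cosine similarity. Assumes the Modal app
# "embeddings" / class "Embedder" from the 'before' snippet is deployed.
import math

import modal
from sie_sdk import SIEClient
from sie_sdk.types import Item

CORPUS = ["the quick brown fox", "jumps over the lazy dog"]
MODEL = "BAAI/bge-small-en-v1.5"

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Modal leg: call the deployed class remotely.
Embedder = modal.Cls.from_name("embeddings", "Embedder")
modal_vecs = Embedder().embed.remote(CORPUS)

# SIE leg: same checkpoint; point at wherever your SIE instance listens.
client = SIEClient("http://localhost:8080")
result = client.encode(MODEL, [Item(text=t) for t in CORPUS])
sie_vecs = result.vectors  # ASSUMPTION: adapt to your sie_sdk result type

for text, m, s in zip(CORPUS, modal_vecs, sie_vecs):
    print(f"{cosine(m, s):.6f}  {text!r}")
```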