
Modal → SIE

Modal is a serverless container platform. Many teams use it as the quickest way to put sentence-transformers on a GPU and call it over the network. SIE is a purpose-built inference engine that you self-host on Kubernetes (or run locally for dev).

  • No per-call cold starts. Modal scales to zero by default; the first call after idle pays container boot + model load. SIE is a long-lived pod with the model already loaded.
  • Multi-model on one GPU. Modal isolates per @app.function. If you serve N models, you potentially pay for N cold-starting containers. SIE shares one GPU’s memory across models with LRU eviction.
  • Flat cost above your break-even. Modal bills per second of container uptime, with per-CPU and per-GPU rates; a self-managed GPU instance running SIE bills hourly whether it is busy or not. SIE wins for sustained workloads where the GPU stays busy enough that flat hourly beats per-second-with-overhead; Modal wins for spiky or low-duty-cycle workloads where scale-to-zero dominates. There’s no universally right answer; measure your duty cycle (see the sketch after this list).
  • Data residency. Modal runs in their accounts. SIE runs in yours.
  • Dedicated APIs. Modal gives you raw RPC; you reinvent every embedding / scoring / extraction endpoint. SIE ships typed APIs out of the box.
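
The cost bullet boils down to simple arithmetic. A back-of-the-envelope sketch, with placeholder rates rather than real quotes (MODAL_GPU_PER_SEC, MODAL_CPU_PER_SEC, and FLAT_GPU_PER_HOUR below are hypothetical numbers; cold-start seconds and your own ops overhead are ignored):

MODAL_GPU_PER_SEC = 0.000164   # hypothetical Modal per-second GPU rate (placeholder)
MODAL_CPU_PER_SEC = 0.000010   # hypothetical Modal per-second CPU rate (placeholder)
FLAT_GPU_PER_HOUR = 0.55       # hypothetical hourly rate for a self-managed GPU node (placeholder)

# Modal cost for one fully busy hour, assuming the container only runs while
# serving (scale-to-zero, cold-start seconds ignored).
modal_cost_per_busy_hour = (MODAL_GPU_PER_SEC + MODAL_CPU_PER_SEC) * 3600

# Duty cycle above which a flat hourly instance is cheaper than per-second billing.
break_even = FLAT_GPU_PER_HOUR / modal_cost_per_busy_hour
print(f"Break-even duty cycle: ~{break_even:.0%} of each hour busy")

Plug in your actual Modal rates and the hourly price of the GPU instance you would run SIE on.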

Modal’s headline benefit is “no infra”. Be honest with yourself: moving to SIE means you (or your platform team) now own:

  • A Kubernetes cluster with GPU nodes (EKS / GKE / on-prem).
  • Autoscaling config (the included KEDA values are a starting point, not a finished SLO).
  • An HF weights mirror or PVC, plus image-pull credentials.
  • On-call for the inference pods.

If your team doesn’t already operate K8s, the operational tax can exceed the cost savings until you’re well past break-even. The migration makes sense when (a) you already run K8s for other services, or (b) sustained inference volume is high enough that the savings fund a small platform investment.

This isn’t a 1:1 framework swap: Modal is hosting; SIE is the engine. You’re typically replacing:

  • A @app.cls(gpu="T4") wrapping SentenceTransformer.encode(...)
  • …possibly behind a @modal.web_endpoint(...)
  • …sometimes with a modal.Volume cache for HF weights

with one SIE deployment (Helm chart on EKS / GKE) plus the SIE Python SDK.

If your current Modal app already runs TEI on Modal, follow TEI → SIE for the engine swap and treat the Modal piece as a pure hosting migration.

import modal

app = modal.App("embeddings")
image = modal.Image.debian_slim().pip_install("sentence-transformers")

@app.cls(image=image, gpu="T4")
class Embedder:
    @modal.enter()
    def _load(self):
        from sentence_transformers import SentenceTransformer
        self._model = SentenceTransformer("BAAI/bge-small-en-v1.5")

    @modal.method()
    def embed(self, texts: list[str]) -> list[list[float]]:
        return self._model.encode(texts, normalize_embeddings=True).tolist()
modal deploy embeddings.py
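
For reference, the client-side call this deployment replaces usually looks something like the sketch below. It assumes the app and class names from the snippet above and uses Modal's modal.Cls.from_name lookup (older SDK versions use modal.Cls.lookup).

import modal

# Look up the deployed class by app name and class name, then call it remotely.
Embedder = modal.Cls.from_name("embeddings", "Embedder")
vectors = Embedder().embed.remote(["hello world"])
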
# Local dev: run SIE directly
mise run serve -- -m sentence-transformers/all-MiniLM-L6-v2
# Production: install the published Helm chart
helm install sie superlinked/sie -f values.yaml
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.your-cluster.internal:8080")
result = client.encode(
    "BAAI/bge-small-en-v1.5",
    [Item(text=t) for t in texts],
)
Modal pattern → SIE equivalent:

  • @app.cls(gpu="T4") + SentenceTransformer → SIE bundle config; mise run serve locally; Helm in prod
  • @modal.enter() to load weights → first request triggers a lazy load; warm via Helm values
  • @modal.web_endpoint(method="POST") → SIE /v1/embeddings and /encode
  • modal.Volume.from_name("hf-cache", ...) → the Helm chart’s weights PVC + HF mirror
  • modal deploy → helm install sie superlinked/sie -f values.yaml
  • Multiple @app.function definitions for different models → one bundle config, one cluster
  • Modal secrets → Kubernetes secrets / sealed-secrets / Vault

The embeddings themselves don’t change: same checkpoint, same vector space.

# SIE leg: serve the same checkpoint as the Modal snippet
mise run serve -- -m BAAI/bge-small-en-v1.5
# Modal leg: save the 'before' snippet from this page as
# embeddings.py, then deploy it once:
modal deploy embeddings.py

Call both legs, embed the same fixed corpus, and compare. On the same checkpoint, expect cosine similarity at or above 0.999.
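
A minimal comparison harness might look like the sketch below. It reuses the deployed Modal class and the SIE SDK call shown earlier, assumes a local SIE listening on port 8080, and assumes both legs return plain lists of vectors (check your SDK version's return type).

import modal
import numpy as np
from sie_sdk import SIEClient
from sie_sdk.types import Item

corpus = ["a fixed test sentence", "another fixed test sentence"]

# Modal leg: call the deployed Embedder from the 'before' snippet.
Embedder = modal.Cls.from_name("embeddings", "Embedder")
modal_vecs = np.array(Embedder().embed.remote(corpus))

# SIE leg: same checkpoint, served locally (assumes encode returns vectors).
client = SIEClient("http://localhost:8080")
sie_vecs = np.array(client.encode("BAAI/bge-small-en-v1.5", [Item(text=t) for t in corpus]))

# Row-wise cosine similarity; on the same checkpoint expect >= 0.999.
cos = (modal_vecs * sie_vecs).sum(axis=1) / (
    np.linalg.norm(modal_vecs, axis=1) * np.linalg.norm(sie_vecs, axis=1)
)
print(cos.min())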
