
Modal → SIE

Modal is a serverless container platform. Many teams use it as the quickest way to put sentence-transformers on a GPU and call it over the network. SIE is a purpose-built inference engine that you self-host on Kubernetes (or run locally for dev).

  • No per-call cold starts. Modal scales to zero by default; the first call after idle pays container boot + model load. SIE is a long-lived pod with the model already loaded.
  • Multi-model on one GPU. Modal isolates per @app.function. If you serve N models, you potentially pay for N cold-starting containers. SIE shares one GPU’s memory across models with LRU eviction.
  • Flat cost above your break-even. Modal bills per second of container uptime, with per-CPU and per-GPU rates; a self-managed GPU instance running SIE bills hourly whether it is busy or not. SIE wins for sustained workloads where the GPU stays busy enough that flat hourly beats per-second-with-overhead; Modal wins for spiky or low-duty-cycle workloads where scale-to-zero dominates. There’s no universally right answer; measure your duty cycle (see the sketch after this list).
  • Data residency. Modal runs in their accounts. SIE runs in yours.
  • Dedicated APIs. Modal gives you raw RPC; you reinvent every embedding / scoring / extraction endpoint. SIE ships typed APIs out of the box.
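
The cost bullet boils down to simple arithmetic. A back-of-the-envelope sketch, with placeholder rates rather than real quotes (MODAL_GPU_PER_SEC, MODAL_CPU_PER_SEC, and FLAT_GPU_PER_HOUR below are hypothetical numbers; cold-start seconds and your own ops overhead are ignored):

MODAL_GPU_PER_SEC = 0.000164   # hypothetical Modal per-second GPU rate (placeholder)
MODAL_CPU_PER_SEC = 0.000010   # hypothetical Modal per-second CPU rate (placeholder)
FLAT_GPU_PER_HOUR = 0.55       # hypothetical hourly rate for a self-managed GPU node (placeholder)

# Modal cost for one fully busy hour, assuming the container only runs while
# serving (scale-to-zero, cold-start seconds ignored).
modal_cost_per_busy_hour = (MODAL_GPU_PER_SEC + MODAL_CPU_PER_SEC) * 3600

# Duty cycle above which a flat hourly instance is cheaper than per-second billing.
break_even = FLAT_GPU_PER_HOUR / modal_cost_per_busy_hour
print(f"Break-even duty cycle: ~{break_even:.0%} of each hour busy")

Plug in your actual Modal rates and the hourly price of the GPU instance you would run SIE on.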

Modal’s headline benefit is “no infra”. Be honest with yourself: moving to SIE means you (or your platform team) now own:

  • A Kubernetes cluster with GPU nodes (EKS / GKE / on-prem).
  • Autoscaling config (the included KEDA values are a starting point, not a finished SLO).
  • An HF weights mirror or PVC, plus image-pull credentials.
  • On-call for the inference pods.

If your team doesn’t already operate K8s, the operational tax can exceed the cost savings until you’re well past break-even. The migration makes sense when (a) you already run K8s for other services, or (b) sustained inference volume is high enough that the savings fund a small platform investment.

This isn’t a 1:1 framework swap: Modal is hosting; SIE is the engine. You’re typically replacing:

  • A @app.cls(gpu="T4") wrapping SentenceTransformer.encode(...)
  • …possibly behind a @modal.web_endpoint(...)
  • …sometimes with a modal.Volume cache for HF weights

with one SIE deployment (Helm chart on EKS / GKE) plus the SIE Python SDK.

If your current Modal app already runs TEI on Modal, follow TEI → SIE for the engine swap and treat the Modal piece as a pure hosting migration.

import modal

app = modal.App("embeddings")
image = modal.Image.debian_slim().pip_install("sentence-transformers")

@app.cls(image=image, gpu="T4")
class Embedder:
    @modal.enter()
    def _load(self):
        from sentence_transformers import SentenceTransformer
        self._model = SentenceTransformer("BAAI/bge-small-en-v1.5")

    @modal.method()
    def embed(self, texts: list[str]) -> list[list[float]]:
        return self._model.encode(texts, normalize_embeddings=True).tolist()
modal deploy embeddings.py
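
For reference, the client-side call this deployment replaces usually looks something like the sketch below. It assumes the app and class names from the snippet above and uses Modal's modal.Cls.from_name lookup (older SDK versions use modal.Cls.lookup).

import modal

# Look up the deployed class by app name and class name, then call it remotely.
Embedder = modal.Cls.from_name("embeddings", "Embedder")
vectors = Embedder().embed.remote(["hello world"])
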
# Local dev: run SIE directly
mise run serve -- -m sentence-transformers/all-MiniLM-L6-v2
# Production: install the published Helm chart
helm install sie superlinked/sie -f values.yaml
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://sie.your-cluster.internal:8080")
result = client.encode(
    "BAAI/bge-small-en-v1.5",
    [Item(text=t) for t in texts],
)
Modal pattern → SIE equivalent:

  • @app.cls(gpu="T4") + SentenceTransformer → SIE bundle config; mise run serve locally; Helm in prod
  • @modal.enter() to load weights → first request triggers a lazy load; warm via Helm values
  • @modal.web_endpoint(method="POST") → SIE /v1/embeddings and /encode
  • modal.Volume.from_name("hf-cache", ...) → the Helm chart’s weights PVC + HF mirror
  • modal deploy → helm install sie superlinked/sie -f values.yaml
  • Multiple @app.function definitions for different models → one bundle config, one cluster
  • Modal secrets → Kubernetes secrets / sealed-secrets / Vault

The embeddings themselves don’t change: same checkpoint, same vector space.

# SIE leg: serve the same checkpoint as the Modal snippet
mise run serve -- -m BAAI/bge-small-en-v1.5
# Modal leg: save the 'before' snippet from this page as
# embeddings.py, then deploy it once:
modal deploy embeddings.py

Call both legs, embed the same fixed corpus, and compare. On the same checkpoint, expect cosine similarity at or above 0.999.
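
A minimal comparison harness might look like the sketch below. It reuses the deployed Modal class and the SIE SDK call shown earlier, assumes a local SIE listening on port 8080, and assumes both legs return plain lists of vectors (check your SDK version's return type).

import modal
import numpy as np
from sie_sdk import SIEClient
from sie_sdk.types import Item

corpus = ["a fixed test sentence", "another fixed test sentence"]

# Modal leg: call the deployed Embedder from the 'before' snippet.
Embedder = modal.Cls.from_name("embeddings", "Embedder")
modal_vecs = np.array(Embedder().embed.remote(corpus))

# SIE leg: same checkpoint, served locally (assumes encode returns vectors).
client = SIEClient("http://localhost:8080")
sie_vecs = np.array(client.encode("BAAI/bge-small-en-v1.5", [Item(text=t) for t in corpus]))

# Row-wise cosine similarity; on the same checkpoint expect >= 0.999.
cos = (modal_vecs * sie_vecs).sum(axis=1) / (
    np.linalg.norm(modal_vecs, axis=1) * np.linalg.norm(sie_vecs, axis=1)
)
print(cos.min())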
