
Release Notes

Latest version: v0.3.2 (2026-05-08).

  • New capabilities: default score_pairs() in BaseAdapter + baseline reranking targets; bump cold-start schema to v6 with deserialize/warmup split; per-model perf concurrency defaults for OCR adapters; adapter-triggered quality eval on persistent L4 runner; make destroy conditional on workflow_dispatch input; nightly loadtest pipeline + baseline recorder
  • Reliability and operations: raise loadtest job timeout to GH Actions ceiling (360 min / 6h); clarify experimental NATS health mode; lang tag on fenced block; fail fast on invalid scenarios; surface error/no_results rows in MD; harden parse_label against unexpected filenames; adapt OCR perf Item shape to model.inputs and fail loudly on errors
  • adapters: default score_pairs() in BaseAdapter + baseline reranking targets
  • bench,charts: bump cold-start schema to v6 with deserialize/warmup split
  • bench: per-model perf concurrency defaults for OCR adapters
  • ci: adapter-triggered quality eval on persistent L4 runner
  • ci: make destroy conditional on workflow_dispatch input
  • ci: nightly loadtest pipeline + baseline recorder
  • colbert: add score_pairs support and expand model coverage
  • dashboard: add status and kind filters to runs list
  • dashboard: introduce run-group concept (run = 3 scenarios)
  • dashboard: loadtest results dashboard (Next.js + SST + DynamoDB)
  • dashboard: render every metric in the perf-lab archive
  • dashboard: scaffold loadtest dashboard (Next.js + SST)
  • dashboard: track run_status; gh-API one-time backfill
  • docling: add ocr profile that defaults do_ocr=true
  • gateway: expose OpenAPI contract
  • gateway: unify API errors and align probe contracts
  • helm: expose probes value trees for worker/gateway/config
  • helm: tighten startup/readiness probes for faster pod-ready
  • helm: TLS termination via cert-manager + BYO matrix docs
  • helm: wire probe templates to values trees
  • infra: opt-in S3 cluster model cache
  • ltfr: cache-vs-no-cache compare chart, with-cache run data, and refresh of 8 single-mode charts
  • matrix: add task_class stamping to eval measurements
  • server,bench: split deserialize/warmup in cold-start instrumentation (v6)
  • server: cap torch CPU threads at worker startup
  • sie_server: per-stage timing markers in lifespan for engine_boot attribution
  • sie_server: split adapter.warmup() out of load() with cold-start log markers
  • tools: bump cold-start bench to v5 with scenario flag
  • tools: LTFR per-scenario bench tooling + results (issue #652)
  • tools: ltfr-bench orchestrator (issue #652)
  • bench: adapt OCR perf Item shape to model.inputs and fail loudly on errors
  • bench: address CodeRabbit review on PR #779
  • bench: correctly detect v6 split presence in flattened runs[]
  • bench: derive emitted gpu_load_s from v6 deserialize+warmup split when available
  • chart: pass --cluster-cache to sie-server and correct populate command in docs
  • charts: vertical legend so ‘image pull + container init’ and ‘node prov’ aren’t clipped
  • ci+terraform: fix three deterministic root causes of loadtest pipeline failures
  • ci: address CodeRabbit findings on quality-adapter PR
  • ci: address CodeRabbit’s second-pass review on quality-adapter
  • ci: address CodeRabbit’s third-pass review on quality-adapter
  • ci: auto-clear stale terraform state lock from prior runner crashes
  • ci: drop double cuda12 suffix + force codebuild for missing images
  • ci: forensic dump on argo failure + LB-ENI release before destroy
  • ci: gate stale-lock clear behind force_unlock input + pass --aws-region to destroy
  • ci: override registry/gpu-selector/tolerations + Python heredoc
  • ci: parse markdown bench output → result.json synthesis
  • ci: pass WORKFLOW env to run_scenarios.sh in loadtest.yml
  • ci: preflight env-var check in run_scenarios + finalize scripts
  • ci: provision GH PAT secret + in-cluster github-token before bootstrap
  • ci: raise loadtest job timeout to GH Actions ceiling (360 min / 6h)
  • ci: read bench-config from local clone, not raw.githubusercontent.com
  • ci: right-size bench pod + worker pod resources for cluster shape
  • cluster: move orphan-LB sweep into cmd_destroy, drop parallel script
  • cluster: use project_name (not example name) for orphan-LB VPC tag lookup
  • dashboard,ci: keep run_status consistent between S3 and DynamoDB
  • dashboard,ci: wire real Prometheus matrix shape + extend headlines
  • dashboard: drop time-based legacy run grouping (was unsafe)
  • dashboard: GPU util shown as 0-100 (was being multiplied by 100 again)
  • dashboard: include duration_seconds in run-meta.json (was DynamoDB-only)
  • dashboard: normalize array-shaped searchParams before .trim()
  • deps: bump plotly to >=6.1.1 for kaleido compat
  • docker: stub bundles/ and models/ in deps stage
  • docling: cache DocumentConverter per (device, ocr_enabled)
  • docling: mark adapter unloaded in unload()
  • docling: thread device through PdfPipelineOptions accelerator_options
  • gateway: address latest coderabbit contract notes
  • gateway: address PR review for NATS health mode
  • gateway: address probe and SDK review findings
  • gateway: align CreatePoolRequest OpenAPI with runtime validation
  • gateway: clarify experimental NATS health mode
  • gateway: close remaining review contract gaps
  • gateway: preserve embeddings timing headers
  • gateway: preserve scale-from-zero request path
  • gateway: reject unsupported embeddings token arrays
  • helm: validate ACME server and privateKeySecretRef in validateTls
  • infra: grant kms:Decrypt to workers when model cache uses SSE-KMS
  • infra: normalize whitespace-only model cache string inputs
  • infra: treat empty model_cache_kms_key_id as unset
  • infra: use flat lifecycle key for s3-bucket module v5
  • loadtest-ci: force-delete orphan elbv2 LBs before terraform destroy and stop swallowing destroy failures
  • loadtest-ci: poll workflow phase instead of relying on argo submit --wait
  • ltfr-bench,notes: lang tag on fenced block; fail fast on invalid scenarios; surface error/no_results rows in MD
  • ltfr-bench: hoist imports to top; guard payload.results shape
  • ltfr-bench: mark request_failed rows in scenario-a/b MD tables
  • ltfr-bench: preserve failure context in aggregated rows; add request_failed status
  • ltfr-bench: treat no_results cells as failures in exit code
  • ltfr-charts: strip legend clip-path so labels render full width
  • ltfr: tighten UID/timestamp guards in capture_image_pull_events
  • multi_pod_cold_start: raise on ASG terminate fail; UID-filter pull events; isolate scenario-c pod
  • paddleocr_vl: pass use_cache=True to generate to enable KV-cache
  • review: tighten score_pairs options handling and query text validation
  • sie-server: include model and bundle directories in wheel distribution
  • terraform: detect HF-cache EBS by NVMe model + size, not by Linux name
  • terraform: set resolve_conflicts_on_* = OVERWRITE on EKS addons
  • tools: drop module-level docstrings (AGENTS.md rule)
  • tools: guard fig_per_cell_table aggregation against empty results
  • tools: guard mean() against empty engine_boot_s in aggregate()
  • tools: harden parse_label against unexpected filenames
  • tools: mark cold-start-bench.py executable (EXE001)
  • tools: remove module docstring from cold_start_charts.py (repo rule)
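The docling converter-cache entry above ("cache DocumentConverter per (device, ocr_enabled)") is plain memoization. A minimal sketch, with a stub class standing in for docling's real DocumentConverter:

```python
from functools import lru_cache


class DocumentConverter:
    """Stand-in for docling's converter; the real one is expensive to build."""

    def __init__(self, device: str, do_ocr: bool):
        self.device = device
        self.do_ocr = do_ocr


@lru_cache(maxsize=None)
def get_converter(device: str, ocr_enabled: bool) -> DocumentConverter:
    # One converter per (device, ocr_enabled) combination; repeated calls
    # with the same key return the cached instance.
    return DocumentConverter(device, do_ocr=ocr_enabled)
```

Repeated requests on the same device and OCR profile then reuse one converter instead of paying construction cost on every call.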
  • New capabilities: add OmniDocBench OCR quality loader; support /v1/score with dense/sparse/colbert/hybrid modes; add Marqo/marqo-ecommerce-embeddings-B via open_clip backend
  • Reliability and operations: add terminal failed state to model registry (sie-test#85)
  • bench: add OmniDocBench OCR quality loader
  • bge-m3: support /v1/score with dense/sparse/colbert/hybrid modes
  • siglip: add Marqo/marqo-ecommerce-embeddings-B via open_clip backend
  • server: add terminal failed state to model registry (sie-test#85)
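A terminal failed state means the registry stops retrying a model that will never load. A sketch of the idea; the state names besides failed and the transition table are assumptions, not the registry's actual API:

```python
from enum import Enum


class ModelState(Enum):
    LOADING = "loading"
    READY = "ready"
    FAILED = "failed"  # terminal: nothing transitions out of it


# Allowed transitions (illustrative; the real registry may differ).
_ALLOWED = {
    ModelState.LOADING: {ModelState.READY, ModelState.FAILED},
    ModelState.READY: {ModelState.LOADING},  # e.g. reload/eviction cycle
    ModelState.FAILED: set(),  # terminal state: no way out
}


def transition(current: ModelState, target: ModelState) -> ModelState:
    # Reject any move the table does not permit, including retries
    # out of FAILED.
    if target not in _ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```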
  • Breaking change: openapi.json is now a committed artifact that must be regenerated and committed when API changes are made
  • New capabilities: add GLM-OCR adapter; add Qwen3-VL-Embedding-2B and Qwen3-VL-Reranker-2B multimodal adapters; add GLiNER2 and GLiNER-bi adapters; add Qwen3-Reranker-0.6B and 4B causal LM reranker support; add SigLIP 2 base-patch16-224 vision-language encoder; add minimal cache weights snapshot command for offline deployments
  • Reliability and operations: retry only transient connection errors under wait_for_capacity; surface unrouteable models loudly and helm-repo-add on pristine hosts; emit identical NATS payload to bundle and _all subjects; surface mixed-profile unrouteable models and keep snapshot consistent on writes; add retry logic for deadsnakes PPA to handle Launchpad outages
  • Performance: cache JPEG-encoded corpus images across queries; lazily JPEG-encode corpus images on first use; cache SDK version parse, integer audit latency, UUIDv7; cut hot-path allocations, fuse numpy decode, tighten backpressure
  • openapi: openapi.json is now a committed artifact that must be regenerated and committed when API changes are made
  • adapters: add GLM-OCR adapter
  • adapters: add Qwen3-VL-Embedding-2B and Qwen3-VL-Reranker-2B multimodal adapters
  • add GLiNER2 and GLiNER-bi adapters
  • add Qwen3-Reranker-0.6B and 4B causal LM reranker support
  • add SigLIP 2 base-patch16-224 vision-language encoder
  • admin: add minimal cache weights snapshot command for offline deployments
  • bench: honor SIE_BENCH_SERVER_READY_TIMEOUT in eval orchestrator
  • ci: nightly loadtest gate against dedicated EKS cluster
  • ci: nightly loadtest gate, ephemeral cluster per run
  • extract: add Docling adapter for PDF/DOCX/HTML extraction
  • extract: plumb document items and structured data results
  • observability: add Prometheus metrics to sie-config and expand sie-gateway coverage
  • oom: implement defensive exception fan-out and improve recovery metrics
  • oom: improve error semantics and budget exhaustion detection
  • openapi: add static spec export and validation
  • router: import Rust gateway source tree
  • server: add reactive OOM recovery and proactive idle eviction
  • social: daily social content pipeline with 5-source drafts + engagement
  • types: add document input modality across SDKs, server, and metadata
  • adapters: add input validation guards for empty/failed visual inputs
  • adapters: address review findings for Qwen3-VL adapters
  • adapters: clarify video placeholder, validate token IDs, fix torch_dtype key
  • add client-side hour filter to search_x_posts (was date-level only)
  • address CodeRabbit review feedback
  • address follow-up PR review nits
  • address remaining CodeRabbit feedback (round 2)
  • address review findings — negative truncation guard, score() options, constant dedup
  • bench: show correct unit labels for MP/s throughput in --print-gap
  • bundles: declare Qwen3-VL adapters in default bundle
  • ci: use Blacksmith runner in CI
  • client: retry only transient connection errors under wait_for_capacity
  • cluster: address PR #701 review comments
  • cluster: correct kubectl flag combo and reorder LB sweep before helm uninstall
  • cluster: helm uninstall before terraform destroy to clean up AWS LB leftovers
  • cluster: unblock end-to-end mise run cluster create --build
  • config,cluster: surface unrouteable models loudly and helm-repo-add on pristine hosts
  • config: emit identical NATS payload to bundle and _all subjects
  • config: surface mixed-profile unrouteable models and keep snapshot consistent on writes
  • docker: add retry logic for deadsnakes PPA to handle Launchpad outages
  • docker: propagate failure when all add-apt-repository retries exhausted
  • docling: per-task converter, hf_revision guard, callable typing (CodeRabbit)
  • docs: update packages/sie_server/Dockerfile.cuda11
  • fail closed on missing/unparsable timestamps in lookback filter
  • gateway,config,sdk: resiliency, concurrency, and cross-service hash parity
  • gateway,config: address PR review — 404 for unknown models, 202 on default routing, full YAML propagation
  • gateway,config: harden auth, trusted NATS producers, and recovery path; drop gateway HA default
  • gateway,sdk: map upstream timeouts to 503+MODEL_LOADING for SDK retry
  • gateway: add GET /v1/models/{model} detail route
  • gateway: address PR #716 review feedback
  • gateway: align /v1/models error and list shapes
  • gateway: drop double-counted REQUEST_COUNT / REQUEST_LATENCY emit
  • gateway: emit X-SIE-Error-Code header on model-loading 503
  • gateway: keep record_request async to match main’s call shape
  • gateway: make sie-config single source of truth for bundles with live resync
  • gateway: normalize model ids in NATS work subjects + docs/tooling/ha cleanup
  • gateway: pre-instantiate request/demand metric families on startup
  • gateway: prioritize epoch-rewind branch; harden no-thrash test; correct arch-guide on ephemeral restart
  • guard score() and score_pairs() against empty input lists
  • helm: default clusterRouting to “queue” on import-sie-router-rust
  • helm: enable NATS + JetStream by default to match queue clusterRouting
  • helm: fail fast when gateway has no bundle source
  • kind-smoke: add --no-pool-isolation for static clusters + contract-drift fixes
  • kind-smoke: address bot review feedback
  • kind-smoke: enable configStore and harden config/gateway tests
  • kind-smoke: enable JetStream on test NATS and drop duplicate subchart
  • kind-smoke: wire sie-config image and helm overrides into kind cluster fixture
  • kind-smoke: wire sie-config image into kind cluster fixture
  • observability: address PR review blockers on metrics PR
  • sdk: cluster cache prefix probe uses list, not head (Refs #732, #654)
  • sdk: has_children filters folder-marker objects (Refs #732, #654)
  • sdk: preserve caller-supplied document format over inferred (CodeRabbit)
  • sdk: retry mid-flight transport disconnects, not just timeouts
  • sdk: retry on connection errors and generic 503s
  • sie_config: address PR review feedback
  • terraform/aws: set 100GB root volume on cpu node group to avoid DiskPressure
  • terraform/gcp: undo router→gateway rename on GCP Cloud Router + NAT
  • tests: include sie-config in expected missing-image list
  • tests: restore docker gateway smoke test after router rename
  • tmux-scripts: improve robustness of session parsing and argument handling
  • types: adapt to ty 0.0.32 stricter ignore handling
  • use searchTerms for X tweet-scraper actor (was searchQueries)
  • bench: cache JPEG-encoded corpus images across queries
  • bench: lazily JPEG-encode corpus images on first use
  • docker: add --link + move ARG BUNDLE to eliminate cross-bundle layer noise
  • docker: normalize mtimes so shared venv layer is dedupable
  • docker: reorder stages for maximum BuildKit cache reuse
  • docker: split worker venv into shared + bundle-specific layers
  • gateway: cache SDK version parse, integer audit latency, UUIDv7
  • gateway: cut hot-path allocations, fuse numpy decode, tighten backpressure
  • gateway: fuse msgpack_numpy decode into the response path
  • gateway: move score-endpoint unwrap instead of cloning
  • gateway: pass msgpack items through as rmpv::Value
  • gateway: publish work items concurrently + borrow shared fields
  • gateway: tighten cold-pool backpressure + cheaper QPS counter
  • gateway: trim per-request work on the inference hot path
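Several entries above narrow retries to transient connection errors so that permanent failures surface immediately. A minimal sketch of that split; the exception classification here is an assumption, not the SDK's real list:

```python
import time

# What counts as "transient" is the crux; this tuple is illustrative.
TRANSIENT = (ConnectionError, TimeoutError)


def call_with_retry(fn, attempts=3, base_delay=0.5):
    """Retry only transient failures; anything else propagates immediately."""
    for attempt in range(attempts):
        try:
            return fn()
        except TRANSIENT:
            if attempt == attempts - 1:
                raise  # budget exhausted: let the caller see the error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```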
  • Breaking change: Removed --model CLI args from worker startup; use SIE_PRELOAD_MODELS env var or --preload flag instead
  • New capabilities: add ModernBERT flash dense embedding support with fallback mechanism; add OCR quality benchmarks (olmOCR-bench); add pages/sec throughput metric for OCR perf eval; add perf metrics to OCR eval pipeline; also report query throughput in mpix/s for image queries
  • Reliability and operations: add missing NATS Helm repo to release workflow; don’t set NODE_AUTH_TOKEN for OIDC npm publishes; harden affinity spill with bounds check, clamp, and debug log; make rejected requests visible to KEDA scaling metrics; remove redundant tokenizer validation and unused template parameter
  • workers: Removed --model CLI args from worker startup; use SIE_PRELOAD_MODELS env var or --preload flag instead
  • adapters: add ModernBERT flash dense embedding support with fallback mechanism
  • bench: add OCR quality benchmarks (olmOCR-bench)
  • bench: add pages/sec throughput metric for OCR perf eval
  • bench: add perf metrics to OCR eval pipeline
  • bench: also report query throughput in mpix/s for image queries
  • benchmarks: add MTEB NFCorpus evaluation results for ModernBERT-based embedders
  • bench: report vision corpus throughput in mpix/s instead of items/s
  • deps: migrate from pynvml to nvidia-ml-py package
  • haystack: add haystack_integrations namespace-convention aliases
  • observability: add anonymous usage telemetry
  • sdk: add max_concurrency param to SIEAsyncClient to prevent connection pool exhaustion
  • server: add lightonai/LightOnOCR-2-1B OCR adapter with next bundle
  • workers: implement model preloading at startup to reduce first-request latency
  • adapters: remove redundant tokenizer validation and unused template parameter
  • address PR review — panel title, namespace variable
  • bench: handle unloaded images in pixel count computation
  • bench: use concurrent async requests for OCR perf eval
  • bench: validate image entries before computing pixel counts
  • bench: validate pixel counts before using them for image corpus throughput
  • build: downgrade dockerfile syntax version to 1 for broader compatibility
  • ci: add missing NATS Helm repo to release workflow
  • ci: don’t set NODE_AUTH_TOKEN for OIDC npm publishes
  • dashboard: queue routing dashboard accuracy and usability
  • docs: add update date to portfolio header
  • docs: correct PR reference in reranker reclassification note
  • docs: populate reranker data and simplify table header
  • docs: update stale model counts after reranker reclassification
  • haystack: rename namespace alias to sie
  • install uv via curl instead of COPY --from ghcr.io
  • preload smoke test checks model.loaded instead of nonexistent workers field
  • readme: heading format
  • release: add LanceDB integrations to release-please config
  • router: add overflow spill to break model affinity deadlock
  • router: harden affinity spill with bounds check, clamp, and debug log
  • router: make rejected requests visible to KEDA scaling metrics
  • sie-bench: account for in-flight drain in throughput calculation
  • sie-bench: use union wall-clock for multiprocess throughput merge
  • tester-cluster: patient KEDA scale-down for worker pools
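The --model removal above is a one-line migration. Only SIE_PRELOAD_MODELS and --preload come from the notes; the worker binary name and the comma-separated list format below are assumptions:

```shell
# Before (no longer accepted):
#   <worker-binary> --model bge-m3 --model colbert-v2
# After: list preload models in the environment, then start with --preload
# (comma-separated format is an assumption, check your deployment docs)
export SIE_PRELOAD_MODELS="bge-m3,colbert-v2"
#   <worker-binary> --preload
echo "$SIE_PRELOAD_MODELS"
```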
  • New capabilities: add async, chunking, and streaming to Weaviate document enricher; improve DLQ routing and score response handling; implement Config Management API with NATS-based distribution and review fixes; add LanceDB integration (Python + TypeScript); queue routing dashboard + NATS prom exporter + router image tag
  • Reliability and operations: correct cluster routing condition, stream max_age units, and reconnect state ordering; add recreate strategy for router deployment when nats config restore is enabled; restore Chart.yaml deps from main, keep appVersion v-prefix; queue routing dashboard PromQL for NATS wait; configurable NATS fetch budget, Helm-wired queue params
  • Performance: decouple scanner and SIE batch sizes in enrich_table; stream enrich_table batch-by-batch instead of full materialization; use Lance scanner for column projection in enrich_table; bypass FastAPI for hot proxy paths via raw ASGI middleware
  • add async, chunking, and streaming to Weaviate document enricher
  • dlq,pull-loop: improve DLQ routing and score response handling
  • implement Config Management API with NATS-based distribution and review fixes
  • integrations: add LanceDB integration (Python + TypeScript)
  • observability: queue routing dashboard + NATS prom exporter + router image tag
  • sdk: add get_model() and configure LanceDB release workflows
  • terraform: add AWS eval-eu EKS evaluation cluster with multi-GPU support and updated configurations
  • terraform: add node labels, adjust pool sizes for tester cluster
  • config,queue,nats: correct cluster routing condition, stream max_age units, and reconnect state ordering
  • handle BytesIO images in LlamaIndex and validate Weaviate classify config
  • helm: add recreate strategy for router deployment when nats config restore is enabled
  • helm: restore Chart.yaml deps from main, keep appVersion v-prefix
  • helm: use generic release-please updater for appVersion
  • helm: use generic updater for both Chart.yaml version fields
  • helm: use l4-spot/rtx6000-spot naming convention for spot profiles
  • integrations: address CodeRabbit review findings for LanceDB PR
  • observability: queue routing dashboard PromQL for NATS wait
  • queue-routing: configurable NATS fetch budget, Helm-wired queue params
  • queue-routing: resolve bugs, add configurable NATS params, fix score wire format
  • queue-routing: score response format and DLQ fallback routing key
  • release: use NPM_TOKEN for initial sie-lancedb publish
  • router: use “scores” key in queue-mode score responses
  • terraform: add GPU subnet coverage validation
  • terraform: relax AZ validation and clarify defaults
  • terraform: review fixes for tester cluster infra
  • terraform: switch tester-cluster to us-east-2 and update deployment docs
  • terraform: validate gpu_node_groups for duplicate and reserved names
  • test: add buildx builder pause recovery and improve build error diagnostics
  • update adapter tests and address code review feedback
  • use OCI registry URI for helm chart in README
  • lancedb: decouple scanner and SIE batch sizes in enrich_table
  • lancedb: stream enrich_table batch-by-batch instead of full materialization
  • lancedb: use Lance scanner for column projection in enrich_table
  • router: bypass FastAPI for hot proxy paths via raw ASGI middleware
  • router: reduce thread pool pressure by inlining small deserialization
  • router: remove msgpack_numpy global patch and BaseHTTPMiddleware
  • router: replace stdlib json with orjson for 3-10x faster serialization
  • sdk+router: lazy msgpack_numpy.patch and pure ASGI middleware
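The raw-ASGI middleware entry above sidesteps BaseHTTPMiddleware's per-request Request/Response wrapping by speaking the ASGI protocol directly. A minimal sketch; the hot prefix and canned response are illustrative, not the router's real behavior:

```python
class HotPathMiddleware:
    """Pure-ASGI middleware: answer hot proxy paths directly, delegate the rest."""

    def __init__(self, app, hot_prefix: str = "/v1/embeddings"):
        self.app = app
        self.hot_prefix = hot_prefix

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http" and scope["path"].startswith(self.hot_prefix):
            # Hot path: emit raw ASGI events, skipping framework routing,
            # validation, and response-object construction.
            await send({"type": "http.response.start", "status": 200,
                        "headers": [(b"content-type", b"application/octet-stream")]})
            await send({"type": "http.response.body", "body": b"proxied"})
            return
        # Everything else goes through the normal framework stack.
        await self.app(scope, receive, send)
```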
  • Reliability and operations: increase docker smoke test timeouts and add retry; include $platform in worker image tag format; revert pool names to machine profile names; remove --provenance flag (requires public repo)
  • helm: include $platform in worker image tag format
  • helm: revert pool names to machine profile names
  • increase docker smoke test timeouts and add retry
  • remove --provenance flag (requires public repo)
  • Reliability and operations: add sie-qdrant and sie-weaviate to release-please config; point sync-terraform default repos to production; remove --provenance from npm publish for private repo; correct image.tag comment to reflect actual format; remove duplicate platform suffix from worker image tag
  • add sie-qdrant and sie-weaviate to release-please config
  • ci: point sync-terraform default repos to production
  • ci: remove --provenance from npm publish for private repo
  • helm: correct image.tag comment to reflect actual format
  • helm: remove duplicate platform suffix from worker image tag
  • remove internal-only references from COMPATIBILITY.md
  • New capabilities: add profiling script for sparse encoding hot path; add GitHub Actions workflow to sync Terraform modules to registry repos; apply QoL improvements from PR #484 review comments; switch default GPU from g5 (A10G) to g6 (L4); add rerank/score support to TEI runner; implement configurable document length limits and custom prefix token registration
  • Reliability and operations: restore triggering ref for source checkout; restore quality by enabling causal attention and QK-normalization; restore dev-l4-spot zones to us-central1 for GPU availability; check /metrics endpoint in test_prometheus_metrics_exist; add per-attempt timeout to lease renewal fetch
  • Performance: optimize MoE expert dispatch with sorted-expert routing; batch MaxSim scoring across documents on GPU; batch sparse aggregation with segment_reduce and fuse relu; batch split_embeddings + validate ColBERT performance
  • adapters: add profiling script for sparse encoding hot path
  • add GitHub Actions workflow to sync Terraform modules to registry repos
  • apply QoL improvements from PR #484 review comments
  • aws: switch default GPU from g5 (A10G) to g6 (L4)
  • bench: add rerank/score support to TEI runner
  • colbert: implement configurable document length limits and custom prefix token registration
  • deploy: move namespace, SA, and HF token secret management to Helm chart
  • deploy: prepare Terraform modules for public registry publishing
  • deploy: rewrite example module sources to registry references
  • deploy: rewrite Helm and internal references for public release
  • deploy: two-artifact model — GCP Terraform infra-only, batteries-included Helm chart
  • docker: add --docker-platform flag to docker build task
  • extend create_pool API/SDK with minimum_worker_count and bundle
  • helm: add batteries-included sub-chart dependencies to sie-cluster
  • helm: add image pre-pull DaemonSet for GPU worker pools
  • helm: add step to build Helm chart dependencies in Kind smoke tests
  • helm: default router to image-embedded model configs
  • helm: enable image pre-pull DaemonSet by default
  • helm: port health gates from Terraform to Helm post-install hooks
  • helm: remove prometheus alias, bump to v0.2.0, standardize chart
  • infra: add Modal GPU sandbox for remote benchmark execution
  • infra: add rollout warning and explicit image_type for GCFS
  • infra: enable GCFS image streaming on GPU node pools
  • infra: set min_node_count=1 on L4 spot GPU node pools
  • integrations: add Qdrant integration with native sparse vector support
  • integrations: add Weaviate v4 integration with Go module spec
  • multiprocess loadtest + SDK aiohttp migration
  • sdk: add version negotiation headers between SDK and server
  • sdk: default wait_for_capacity=True and timeout=900s
  • sie-bench: add dataset/input_type fields for mTEB corpus inputs
  • sie-bench: built-in multiprocess loadtest mode
  • skills: add eval-model skill for HF model integration assessment
  • tei-runner: add /embed_sparse support for sparse models and auto-detect pooling mode
  • terraform/aws: restore cluster autoscaler helm release to infra module
  • terraform/aws: strip k8s resources, restructure as infra module with examples
  • terraform: add cluster name and artifact registry variables; update node pool configuration
  • terraform: add EBS CSI driver, NVIDIA device plugin, default StorageClass
  • terraform: strip gcp k8s/ layer; examples use infra-only module
  • tools: add ColBERT query vs document profiling script
  • tools: add dense P50 latency profiling script
  • adapters: sort IDF unique_ids to satisfy SparseVector contract
  • add missing production example to tf validate; fix tempfile leak; remove module docstring
  • address PR #478 review feedback
  • address PR review — GPU alert formula, kubectl parsing, CI path filter
  • address review feedback for npm publish
  • address review findings - race prevention, cleanup, lighter checkout
  • alloy: add stage.cri{} before stage.json to unwrap CRI log envelopes
  • alloy: explicitly set configMap name and key for sub-chart wiring
  • alloy: scope pod discovery to current node via field selector
  • bench: complete g5 to g6 migration in AWS eval configs and GPU mapping
  • benchmarks: use TEI /embed_all for ColBERT multi-vector models
  • bench: skip loading candidates_model for single-model servers
  • chart: update home URL and Helm install command in README
  • CI compatibility and consistent env var usage
  • CI compatibility for sync-terraform workflow
  • ci: add contents: read permission to publish-pypi-oidc job
  • ci: add helm repo add + dep build to kind-smoke workflow
  • ci: g5 already refactored to g6
  • cluster: build concrete helm command in status from infra_outputs
  • cluster: guard helm/kubectl post-create log when outputs are empty
  • deploy: clean terraform init artifacts before push
  • deploy: correct smoke test TypedDict access and helm dry-run args
  • deploy: correct StatefulSet rollout semantics, PDB scope, and KEDA pause
  • deploy: remove dangling kubernetes_namespace_v1.sie references from health_gates.tf
  • deploy: restore triggering ref for source checkout
  • deploy: update default destination repos for GCP and AWS modules
  • deploy: use triggering ref for source checkout in sync-terraform
  • disable LoRA adapter layers after loading to prevent quality corruption
  • docs: clarify optional image push in AWS and GCP README files
  • docs: update Helm chart path in AWS and GCP README files
  • fix integration test
  • helm,hook: deploy/helm/sie-cluster/templates/hooks/prometheus-ready-test.yaml
  • helm: add before-hook-creation to Job delete policies; document count==0 expectation
  • helm: address coderabbit findings on health gate hooks
  • helm: address non-blocking review findings from PR #336
  • helm: address review findings in batteries-included sub-chart PR
  • helm: address reviewer suggestions for health gate hooks
  • helm: address second-pass review findings
  • helm: aggregate buckets by le in p95 latency alert
  • helm: bump chart version to 0.1.1 (patch, not minor)
  • helm: clarify prometheusAddress comment — ignored when sub-chart is installed
  • helm: correct kube-prometheus-stack semver constraint and remove hardcoded grafana password
  • helm: correct misleading validation comment in router-deployment.yaml
  • helm: don’t emit ScaledObject CRDs unless KEDA is confirmed present
  • helm: downscope KEDA RBAC to Role/RoleBinding; remove runtime apk installs
  • helm: fix loki service URL and extract alloy config to file
  • helm: fix three blocking review issues in sub-chart dependencies
  • helm: improve temporary values file handling in helm_template function
  • helm: improve, simplify, and modularize sie-cluster chart
  • helm: move ‘app.kubernetes.io/part-of’ label to selector labels for consistency
  • helm: remove autoscaling.enabled from values-aws.yaml
  • helm: render KEDA ScaledObjects via post-install hook to avoid CRD chicken-and-egg
  • helm: replace hardcoded namespace in provisioning alert rules
  • helm: require non-empty hfToken.value when hfToken.create is true
  • helm: sub-chart naming, Loki compactor, event exporter ECR, Grafana folders
  • helm: use autoscaling.prometheusAddress in prometheus hook; remove stub health_gates.tf
  • helm: use full FQDN for Prometheus service in KEDA and health gates
  • helm: use router.service.port in NOTES.txt instead of hardcoded 8080
  • infra: update min_node_count default in top-level GCP module
  • normalize SDK version warned-set key to major.minor
  • pool error types, add pool/progress test coverage
  • profiling: add flash variant registry, device validation, top-level import
  • profiling: sync GPU before tensor timing, move script to tools/
  • profiling: use in-place relu_ to match production code path
  • qwen3: restore quality by enabling causal attention and QK-normalization
  • readme: correct helm chart path
  • readme: correct helm install command
  • release: track all package versions via release-please extra-files
  • release: track TS SDK version.ts via release-please
  • replace corrupted bge-m3 NanoFiQA2018 target + set bfloat16 precision
  • review items
  • router: increase pool lease TTL to survive rolling upgrades
  • router: resolve default pool GPU for scale-up when gpu/pool omitted
  • router: use effective_pool instead of pool_name for default pool GPU extraction
  • sdk: defer aiohttp session creation to fix “no running event loop” in SIEAsyncClient
  • sie_bench: improve --print-gap report accuracy and readability
  • tei-runner: validate /embed_all returns per-token embeddings
  • tei-runner: validate output_type in TEIRunner init
  • terraform/aws: add full -backend-config flags to production init command
  • terraform/aws: add precondition asserting >=2 GPU-capable AZs exist
  • terraform/aws: address review findings post-restructure
  • terraform/aws: correct helm chart path in dev-g5-spot example comment
  • terraform/aws: filter VPC AZs to only zones offering the GPU instance type
  • terraform/aws: fix invalid splat on instance type offerings locations
  • terraform/aws: remove provider aws block from child module
  • terraform/aws: use var.project_name in VPC subnet cluster tags
  • terraform: add validation for GPU node pool zones to ensure they match the configured region
  • terraform: restore dev-l4-spot zones to us-central1 for GPU availability
  • terraform: update GPU instance type description for clarity and add dev-g6-spot example
  • terraform: update stale k8s module references in comments
  • terraform: upgrade AWS modules and fix deprecations
  • test: check /metrics endpoint in test_prometheus_metrics_exist
  • test: update EKS tests from g5 to g6 after GPU instance type change
  • ts-sdk: add per-attempt timeout to lease renewal fetch
  • use prepack instead of prepublishOnly
  • validate minimum_worker_count input and soften docstrings
  • adapter: optimize MoE expert dispatch with sorted-expert routing
  • adapters: batch MaxSim scoring across documents on GPU
  • adapters: batch sparse aggregation with segment_reduce and fuse relu
  • adapters: batch split_embeddings + validate ColBERT performance
  • adapters: batch split_embeddings in ColBERT adapters
  • adapters: eliminate GPU overhead from IDF query encode path
  • florence2: greedy decoding for OCR (-23% P50)
  • florence2: switch OCR configs from beam search to greedy decoding
  • server,bench: add batch coalescing, query warmup, and benchmark stability improvements
  • server: dispatch immediately when worker is idle
  • server: optimize BertFlashAdapter inference path (+35% corpus throughput)
  • server: reduce batch wait timeout 10ms → 2ms for lower Doc P50
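The SDK fix above ("defer aiohttp session creation to fix 'no running event loop'") follows a common asyncio pattern: loop-bound resources must be created inside a running event loop, not in `__init__`. A minimal stdlib sketch of that pattern — the `AsyncClient` class and `_session` attribute here are illustrative stand-ins, not the actual SDK code:

```python
import asyncio

class AsyncClient:
    """Sketch: defer creation of loop-bound resources until a coroutine runs.

    Creating an aiohttp.ClientSession (or any loop-bound object) in
    __init__ raises "no running event loop" when the client is built
    outside async code. Deferring to first use avoids that.
    """

    def __init__(self) -> None:
        self._session = None  # created lazily, inside the running loop

    async def _ensure_session(self):
        if self._session is None:
            # stand-in for aiohttp.ClientSession(); any loop-bound resource
            self._session = asyncio.Queue()
        return self._session

    async def send(self, item):
        session = await self._ensure_session()
        await session.put(item)
        return await session.get()

# Safe: constructing the client outside any event loop no longer fails.
client = AsyncClient()
result = asyncio.run(client.send("ping"))
print(result)  # -> ping
```

The same deferral also makes the client safe to construct at module import time, which is where the original error typically surfaced.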
  • Breaking change: remove florence2 and gliner standalone bundles — extraction adapters (gliner, glirel, gliclass) are now included in the default bundle
  • New capabilities: add native MTEB reranking task support with MRR metric; encode-dense matrix eval — 3 models × 8 tasks; add date-prefixed versioning for chronological filename ordering; reorganize and expand model size lookup table with alphabetical ordering; add detailed perf metrics, metric filter, and threshold selector; add marimo benchmark dashboard notebook
  • Reliability and operations: restore /var/cache/apt mounts, keep /var/lib/apt removed; add HF_TOKEN auth and config kwargs and fix stella models; add dense projection support to Qwen2FlashAdapter; apply query_template from runtime options in SentenceTransformerDenseAdapter; replace pip install with uv add in docker error messages
  • Performance: vectorize GTE sparse encode path; vectorize tokenization and packing for gte-multilingual-base; vectorize tokenization and packing to reduce throughput gap; switch Qwen/GTE models to flash attention adapter
  • bundles: remove florence2 and gliner standalone bundles — extraction adapters (gliner, glirel, gliclass) are now included in the default bundle
  • bench: add native MTEB reranking task support with MRR metric
  • bench: encode-dense matrix eval — 3 models × 8 tasks
  • benchmarks: add date-prefixed versioning for chronological filename ordering
  • benchmarks: reorganize and expand model size lookup table with alphabetical ordering
  • benchview: add detailed perf metrics, metric filter, and threshold selector
  • benchview: add marimo benchmark dashboard notebook
  • benchview: add perf metric selector to Model Size tab
  • benchview: detailed perf metrics, metric filter, threshold selector
  • router,bench,sdk: improve throughput with inflight tracking, batching, and connection pooling
  • server: typed request parsing with msgspec
  • sie_server: add gliner, glirel, and gliclass extraction dependencies to the default bundle
  • adapter: add dense projection support to Qwen2FlashAdapter
  • adapter: apply query_template from runtime options in SentenceTransformerDenseAdapter
  • apply CodeRabbit auto-fixes
  • bench: replace pip install with uv add in docker error messages
  • benchview: add missing statistics import and use _median helper
  • bundles: include sglang bundle in default cluster and eval-matrix configs
  • client: update websocket header parameter name from extra_headers to additional_headers
  • colbert: enable native mode fallback for non-CUDA devices and add Matryoshka truncation
  • deps: cap timm upper bound and fix lazy handler init
  • docker: clear stale apt lists before update to prevent 404s
  • docker: remove all apt cache mounts from Dockerfiles
  • docker: remove no-op /var/lib/apt cache mount from apt RUN blocks
  • docker: restore /var/cache/apt mounts, keep /var/lib/apt removed
  • helm: increase CPU worker pool memory limits for expanded default bundle
  • model: add missing query_template to stella_en_400M_v5
  • models: switch all-MiniLM-L6-v2 to SentenceTransformerDenseAdapter
  • multilingual-e5-large-instruct: use instruct query template, NFCorpus 0.3521 → 0.3567
  • replace invalid HTML entities in SVG with XML numeric entities
  • rope_flash: clear cached _rope_dummy on unload and use torch.cat for packing
  • router: also resolve pool-derived GPU names to spot variants
  • router: resolve bare GPU types to spot variants for KEDA scaling
  • sdk: resolve sync/async client inconsistencies in score() and encode()
  • server: centralize request validation to prevent 500s from malformed items
  • server: use BertFlashAdapter for e5-small-v2, resolve e5 perf anomalies
  • server: use BertFlashAdapter for intfloat/e5-small-v2 and remove stale benchmarks
  • set compute_precision to bfloat16 for stella_en_1.5B_v5
  • sie_server: add HF_TOKEN auth and config kwargs and fix stella models
  • sie_server: resolve BGE-M3 linear weights loading for HF model IDs and fix test fixtures
  • sie_server: support NV-Embed-v2 with PyTorch embedding adapter
  • splade: align special token filtering and guard empty batches
  • test: always rebuild Docker images to pick up code changes
  • typecheck: move ty type checker from mise tool to uv dependency
  • adapters: vectorize GTE sparse encode path
  • rope_flash: vectorize tokenization and packing for gte-multilingual-base
  • rope_flash: vectorize tokenization and packing to reduce throughput gap
  • server: switch Qwen/GTE models to flash attention adapter
  • splade: vectorize tokenization and sparse aggregation (1.5x throughput)
  • splade: vectorize tokenization and sparse aggregation in SPLADEFlashAdapter
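The SPLADE entries above replace per-token Python loops with a single scatter-style aggregation. A rough NumPy illustration of the idea using scatter-max (`np.maximum.at`); the real adapters run on GPU tensors with full vocabulary sizes, so the names and shapes here are illustrative only:

```python
import numpy as np

# Sketch: SPLADE-style sparse aggregation. Each document's sparse vector
# is the max over token positions of relu(logits) per vocab id. A Python
# loop over tokens is slow; one scatter-max call vectorizes it.

vocab_size = 8                                    # tiny toy vocabulary
token_ids = np.array([1, 3, 3, 5])                # token -> vocab id
token_logits = np.array([0.2, 0.9, 0.4, -0.1])    # per-token logits

weights = np.maximum(token_logits, 0.0)           # fused relu on the logits

doc_vector = np.zeros(vocab_size)
np.maximum.at(doc_vector, token_ids, weights)     # scatter-max into vocab dim

print(doc_vector)  # vocab id 3 keeps max(0.9, 0.4); the -0.1 relus to 0
```

The claimed 1.5x throughput comes from collapsing the per-token loop into a single vectorized pass, not from changing the aggregation semantics.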
  • New capabilities: add GLiNER v2.5 model configs; stream request bodies through proxy instead of buffering; add classification model configs for GLiClass-large and cross-encoder NLI
  • Reliability and operations: release pipeline cache collision and smoke test timeout; strip content-length header from streamed proxy responses
  • Performance: stream response body to eliminate bytes.join bottleneck
  • models: add GLiNER v2.5 model configs
  • router: stream request bodies through proxy instead of buffering
  • sie_server: add classification model configs for GLiClass-large and cross-encoder NLI
  • release pipeline cache collision and smoke test timeout
  • router: strip content-length header from streamed proxy responses
  • router: stream response body to eliminate bytes.join bottleneck
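The two proxy changes above interact: once the router streams bodies instead of buffering, the upstream Content-Length no longer describes what the client will receive, so it must be dropped (along with hop-by-hop headers) before the response is re-chunked. A minimal sketch of that filtering — the hop-by-hop set follows RFC 9110; the router's actual code is not shown in these notes:

```python
# Sketch: when a proxy re-streams a response as chunked transfer, the
# upstream Content-Length no longer describes the bytes the client will
# see, so it (and hop-by-hop headers) must be stripped before forwarding.

HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
}

def forwardable_headers(upstream_headers: dict) -> dict:
    """Return headers safe to forward on a re-streamed response."""
    return {
        name: value
        for name, value in upstream_headers.items()
        if name.lower() not in HOP_BY_HOP and name.lower() != "content-length"
    }

headers = {
    "Content-Type": "application/json",
    "Content-Length": "482",        # stale once the body is re-chunked
    "Transfer-Encoding": "chunked",
    "X-Request-Id": "abc123",
}
print(forwardable_headers(headers))
# -> {'Content-Type': 'application/json', 'X-Request-Id': 'abc123'}
```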
  • Reliability and operations: revert sharing=locked, add cache-read-only for build step
  • revert sharing=locked, add cache-read-only for build step
  • Reliability and operations: revert token lifetime extension, re-auth before push instead; revert token lifetime, re-auth before push
  • revert token lifetime extension, re-auth before push instead
  • revert token lifetime, re-auth before push
  • Reliability and operations: release image builds failing from GCP token expiry
  • release image builds failing from GCP token expiry
  • Reliability and operations: update bundle definitions to replace legacy and gte-qwen2 with gliner
  • bundles: update bundle definitions to replace legacy and gte-qwen2 with gliner
  • Breaking change: HTTP 409 dependency conflict responses are removed from all API endpoints; the DEPENDENCY_CONFLICT error code no longer exists; .beads/ issue tracking data removed from repository
  • New capabilities: add X-SIE-Worker response header for per-worker metrics tracking; add encode-image-text measurements to benchmarks dir; add encode-multivector perf measurements; add encode-multivector performance measurements; add encode-visual-document perf measurements; add encode-visual-document performance measurements
  • Reliability and operations: increase helm install timeout from 10m to 15m; add trailing empty line to gitignore; align release-images workflow with docker task flags; register GLiClass and DeBERTa models in bundles; build and deploy gliner bundle in Kind smoke tests
  • Performance: add connection pooling load test results (Feb 24); pool httpx client and add X-SIE-Worker header in router proxy; pool httpx client in router proxy to eliminate per-request TCP overhead; move transformers imports to module level
  • deps: HTTP 409 dependency conflict responses are removed from all API endpoints; the DEPENDENCY_CONFLICT error code no longer exists
  • .beads/ issue tracking data removed from repository
  • deps: model config files no longer support the dependencies field
  • add X-SIE-Worker response header for per-worker metrics tracking
  • benchmarks: add encode-image-text measurements to benchmarks dir
  • benchmarks: add encode-multivector perf measurements
  • benchmarks: add encode-multivector performance measurements
  • benchmarks: add encode-visual-document perf measurements
  • benchmarks: add encode-visual-document performance measurements
  • benchmarks: add extract-detection L4-SPOT performance measurements
  • benchmarks: add extract-kie-docvqa measurements to benchmarks dir
  • benchmarks: add extract-relation L4-SPOT performance measurement
  • benchmarks: add score-colbert perf measurements
  • benchmarks: add score-colbert performance measurements
  • models: add encode-image-text measurements
  • models: add extract-detection measurements
  • models: add extract-kie-docvqa measurements
  • models: add extract-relation measurements
  • router: add structured audit logging for API requests
  • .claude: add trailing empty line to gitignore
  • align release-images workflow with docker task flags
  • bundles: register GLiClass and DeBERTa models in bundles
  • ci: build and deploy gliner bundle in Kind smoke tests
  • colbert: remove CUDA requirement and improve device compatibility
  • eval: read ‘sie_id’ instead of ‘name’ from model configs in runner
  • extract: use dict access for Entity TypedDict in sort
  • gliner: relax stale transformers<4.52 pin
  • increase helm install timeout from 10m to 15m
  • reduce cpu-gliner resource requests for Kind CI
  • router: read ‘sie_id’ instead of ‘name’ from model configs
  • server: migrate NLI adapter to classifications and improve API consistency
  • server: migrate nli_classification adapter and improve type annotations
  • server: populate classifications instead of entities in GLiClass adapter
  • use manifest mode for release-please and reset to v0.0.0
  • use nested .gitignore for .claude/ directory
  • add connection pooling load test results (Feb 24)
  • pool httpx client and add X-SIE-Worker header in router proxy
  • pool httpx client in router proxy to eliminate per-request TCP overhead
  • pytorch-embedding: move transformers imports to module level
  • server: use uvloop as default event loop for uvicorn
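The X-SIE-Worker response header introduced above lets a client attribute latency and throughput to individual workers behind the router. A toy sketch of consuming it for per-worker counters — the header name comes from these notes, but the responses are mocked as plain dicts for illustration:

```python
from collections import Counter

# Sketch: tally responses per worker using the X-SIE-Worker response
# header added by the router. Responses are mocked as plain dicts; in
# practice the header comes from the HTTP client's response object.

responses = [
    {"headers": {"X-SIE-Worker": "worker-0"}, "latency_ms": 12.0},
    {"headers": {"X-SIE-Worker": "worker-1"}, "latency_ms": 30.5},
    {"headers": {"X-SIE-Worker": "worker-0"}, "latency_ms": 9.8},
]

per_worker = Counter(r["headers"]["X-SIE-Worker"] for r in responses)
print(per_worker)  # Counter({'worker-0': 2, 'worker-1': 1})
```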
  • keep CONTRIBUTING.md clone URLs pointing to sie.git
  • remove beads, agent prompts, mypy refs; consolidate ty config
  • deps: move adapter dependencies from per-adapter pyproject.toml to bundle YAML
  • deps: remove model-level dependencies feature
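The last two deps entries move dependency declarations out of per-adapter `pyproject.toml` files and into bundle YAML, consolidating what each bundle ships. A hypothetical fragment of what such a bundle definition might look like — the field names are invented for illustration, since the actual schema is not shown in these notes:

```yaml
# Hypothetical bundle definition; field names are illustrative only.
name: default
models:
  - gliner
  - glirel
  - gliclass
dependencies:   # moved here from each adapter's pyproject.toml
  - gliner
  - transformers
```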
