Qwen/Qwen3-4B-Instruct-2507

Primitive: /generate · Generate · Qwen3

Long context · Tool calling · Constrained output · Streaming · Code · SQL

Overview

Hardware: — drives latency, throughput & cost

Size: 4.0B params
Tasks: /generate
License: apache-2.0
Latency: 576 ms
Throughput: 472 tok/s
Cost: $1.78 / 1M tok

Cost is approximate — computed from list GPU prices; your actual price depends on the provider you deploy SIE with.
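
To make the cost note concrete, here is a minimal sketch of how a per-token cost can be derived from a GPU's hourly price and a measured throughput. The formula and the function names are assumptions for illustration; the page does not state the exact method or the GPU price it uses.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(gpu_price_per_hour: float, throughput_tok_s: float) -> float:
    """Approximate cost in dollars of generating 1M tokens on one GPU.

    Assumes the GPU is fully utilised at the given throughput and that
    the only cost is the hourly GPU price (an assumption, not the page's
    documented method).
    """
    if throughput_tok_s <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = throughput_tok_s * SECONDS_PER_HOUR
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


def implied_gpu_price_per_hour(cost_per_1m_tok: float, throughput_tok_s: float) -> float:
    """Invert the formula above: the hourly price implied by a listed cost."""
    return cost_per_1m_tok * throughput_tok_s * SECONDS_PER_HOUR / 1_000_000


# A hypothetical GPU at $3.60/hour sustaining 500 tok/s:
print(cost_per_million_tokens(3.60, 500))
```

Under the same assumption, the listed figures (472 tok/s, $1.78 per 1M tokens) would correspond to roughly $3.02 per GPU-hour via `implied_gpu_price_per_hour(1.78, 472)`. That is a back-of-envelope inference, not a price quoted on this page.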

Generation

Capabilities: Tool calling · Constrained output (JSON Schema, Regex) · Streaming · Code · SQL
Context length: 32,768 tokens
Max output tokens: 4,096
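
The two limits above interact when sizing a request. The sketch below shows one way to pick a safe output budget, assuming that prompt and generated tokens share the same 32,768-token window (typical for decoder-only models, but not confirmed here for this deployment). The helper name `max_new_tokens` is hypothetical and is not part of any documented API.

```python
CONTEXT_LENGTH = 32_768      # from the Generation section
MAX_OUTPUT_TOKENS = 4_096    # from the Generation section


def max_new_tokens(prompt_tokens: int, requested: int) -> int:
    """Largest output budget that respects both limits.

    Assumes prompt tokens and generated tokens count against the same
    context window (an assumption, see the lead-in).
    """
    if prompt_tokens < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    remaining = CONTEXT_LENGTH - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt already fills the context window")
    return min(requested, MAX_OUTPUT_TOKENS, remaining)


# A short prompt is capped by the output limit,
# a long prompt by what is left of the context window.
print(max_new_tokens(1_000, 8_000))
print(max_new_tokens(30_000, 4_096))
```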

Benchmarks

HumanEval

code generation en

Quality
pass@1 0.8659

MBPP

code generation en

Quality
pass@1 0.7400

Spider

sql generation en

Quality
execution acc 0.6900

BFCL (simple)

tools generation en

Quality
AST match 0.9375

BFCL (multiple)

tools generation en

Quality
AST match 0.9200

CaseHOLD

legal generation en

Quality
accuracy 0.6033
Performance RTX-PRO-6000 b1 c4
Throughput 441 tok/s
p50 latency 607.3ms

GPQA Diamond

scientific generation en

Quality
accuracy 0.4444
Performance RTX-PRO-6000 b1 c4
Throughput 495 tok/s
p50 latency 1.2s

MedQA

medical generation en

Quality
accuracy 0.5700
Performance RTX-PRO-6000 b1 c4
Throughput 475 tok/s
p50 latency 545.4ms

MMLU-Pro

general generation en

Quality
accuracy 0.5333
Performance RTX-PRO-6000 b1 c4
Throughput 468 tok/s
p50 latency 446.4ms
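
For readers unfamiliar with the pass@1 metric reported on the HumanEval and MBPP cards above, the sketch below implements the commonly used unbiased pass@k estimator: with n samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The page does not say how many samples were drawn or which estimator was used, so treat this as background rather than the exact evaluation procedure.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one must pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


# With k = 1 the estimator reduces to the fraction of correct samples, c / n.
print(pass_at_k(10, 3, 1))
```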
