Qwen/Qwen3-0.6B

Primitive: /generate · Generate · Qwen3

Overview

Hardware: not specified (the hardware choice drives latency, throughput, and cost)

Size: 600M params
Tasks: /generate
License: apache-2.0
Latency: 413 ms
Throughput: 595 tok/s
Cost: $1.41 / 1M tok

Cost is approximate — computed from list GPU prices; your actual price depends on the provider you deploy SIE with.
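
The relationship between GPU price, throughput, and cost per token can be sketched as follows. This is a minimal illustration, assuming the cost is simply the hourly GPU price divided by the tokens produced in an hour; the page does not state the exact formula or the list price used, so both function names and the back-solved hourly figure are assumptions rather than published values.

```python
def cost_per_million_tokens(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    """Cost in USD per 1M tokens, assuming one GPU billed by the hour (assumption)."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


def implied_gpu_usd_per_hour(usd_per_million: float, tokens_per_second: float) -> float:
    """Invert the formula above to recover the hourly price it would imply."""
    return usd_per_million * tokens_per_second * 3600 / 1_000_000


# Figures from the overview: $1.41 per 1M tokens at 595 tok/s.
hourly = implied_gpu_usd_per_hour(1.41, 595)
print(round(hourly, 2))  # 3.02 under this assumed formula, not a quoted list price

# Round trip: the implied hourly price reproduces the listed cost.
print(round(cost_per_million_tokens(hourly, 595), 2))  # 1.41
```

Plugging in your own provider's hourly rate in place of `hourly` gives a rough estimate of what the same throughput would cost you.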

Generation

Capabilities: Streaming
Context length: 4,096
Max output tokens: 1,024
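
To see how the two limits above might interact, here is a small sketch that computes how many output tokens a request can actually receive. It assumes the 4,096-token context window is shared between the prompt and the generated output, which the page does not confirm; the helper `output_budget` is hypothetical and not part of any documented API.

```python
CONTEXT_LENGTH = 4096      # from the Generation section
MAX_OUTPUT_TOKENS = 1024   # from the Generation section


def output_budget(prompt_tokens: int, requested_tokens: int) -> int:
    """Return the number of output tokens a request can use.

    Assumes prompt and output share the context window (an assumption,
    not stated on the page).
    """
    if prompt_tokens < 0 or requested_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if prompt_tokens >= CONTEXT_LENGTH:
        raise ValueError("prompt leaves no room for output")
    remaining = CONTEXT_LENGTH - prompt_tokens
    # The effective limit is the tightest of the three constraints.
    return min(requested_tokens, MAX_OUTPUT_TOKENS, remaining)


print(output_budget(3000, 2000))  # 1024: capped by the max output limit
print(output_budget(3500, 1024))  # 596: capped by the remaining context
```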

Benchmarks

CaseHOLD

legal generation en

Quality
accuracy 0.4600
Performance RTX-PRO-6000 b1 c4
Throughput 621 tok/s
p50 latency 1.7s

GPQA Diamond

scientific generation en

Quality
accuracy 0.2475
Performance RTX-PRO-6000 b1 c4
Throughput 598 tok/s
p50 latency 508.2ms

MedQA

medical generation en

Quality
accuracy 0.2533
Performance RTX-PRO-6000 b1 c4
Throughput 593 tok/s
p50 latency 317.4ms

MMLU-Pro

general generation en

Quality
accuracy 0.2367
Performance RTX-PRO-6000 b1 c4
Throughput 573 tok/s
p50 latency 216.5ms
