Qwen/Qwen3.5-4B
Primitive: /generate · Generate
Qwen3 MoE
Multimodal · Long context · Tool calling · Constrained output · Streaming
Overview
Hardware: not specified. The hardware choice drives latency, throughput, and cost.
| Property | Value |
|---|---|
| Size | 4.0B params |
| Tasks | /generate |
| License | apache-2.0 |
| Latency | 762 ms |
| Throughput | 353 tok/s |
| Cost | $2.38 /1M tok |
Cost is approximate: it is computed from list GPU prices, and your actual price depends on the provider you deploy SIE with.
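The page does not state the formula behind the cost figure. A minimal sketch, assuming cost is the hourly GPU list price divided by the tokens generated in one hour at the listed throughput. The function name and the $3.02/hour price are hypothetical; the price is back-calculated from the figures above, not taken from any provider.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_second: float) -> float:
    """Approximate USD cost to generate one million tokens on a single GPU."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# Hypothetical hourly price (assumption), paired with the listed 353 tok/s.
estimate = cost_per_million_tokens(gpu_price_per_hour=3.02, tokens_per_second=353)
print(f"${estimate:.2f} per 1M tokens")
```

Under this assumption, cost scales inversely with throughput, so the same GPU price at a lower tok/s rate yields a proportionally higher cost per token.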
Generation
| Property | Value |
|---|---|
| Capabilities | Tool calling · Constrained output (JSON Schema, Regex) · Streaming |
| Context length | 8,192 |
| Max output tokens | 4,096 |
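A minimal sketch of how the two limits could interact, assuming the 8,192-token context length covers the prompt and the generated output together (the page does not say so explicitly). The constant and function names are hypothetical.

```python
CONTEXT_LENGTH = 8192      # listed context length
MAX_OUTPUT_TOKENS = 4096   # listed max output tokens


def allowed_output_tokens(prompt_tokens: int, requested: int = MAX_OUTPUT_TOKENS) -> int:
    """Return how many output tokens fit for a prompt of the given size.

    Assumes prompt and output share one context window (an assumption).
    """
    if prompt_tokens < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    remaining = CONTEXT_LENGTH - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone fills the context window")
    # The tightest of: the caller's request, the output cap, the space left.
    return min(requested, MAX_OUTPUT_TOKENS, remaining)


print(allowed_output_tokens(1000))   # short prompt: the output cap applies
print(allowed_output_tokens(6000))   # long prompt: remaining window applies
```

Under that assumption, any prompt longer than 4,096 tokens reduces the output budget below the listed maximum.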
Benchmarks
Performance measured on RTX-PRO-6000 b1 c4.

| Benchmark | Accuracy | Throughput | p50 latency |
|---|---|---|---|
| CaseHOLD | 0.5867 | 234 tok/s | 788.1 ms |
| GPQA Diamond | 0.4495 | 343 tok/s | 863.8 ms |
| MedQA | 0.6700 | 364 tok/s | 735.2 ms |
| MMLU-Pro | 0.5767 | 390 tok/s | 587.2 ms |