Qwen/Qwen3-4B-Instruct-2507

Primitive: /generate · Generate · Qwen3

Long context · Tool calling · Constrained output · Streaming · Code · SQL

Overview

Hardware: — drives latency, throughput & cost

Size	4.0B params
Tasks	/generate
License	apache-2.0
Latency	576 ms
Throughput	472 tok/s
Cost	$1.78 /1M tok

Cost is approximate — computed from list GPU prices; your actual price depends on the provider you deploy SIE with.
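As a rough illustration of how a per-token cost can follow from a GPU list price, the sketch below divides an hourly price by the tokens generated in an hour. It assumes a single GPU running at a steady throughput. The function name and the 3.02 USD/hour price are hypothetical, with the price chosen only so the result lands near the listed figure. The page does not state the exact formula or price used.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, throughput_tok_s: float) -> float:
    """USD per 1M generated tokens for one GPU at a steady throughput (illustrative)."""
    if throughput_tok_s <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = throughput_tok_s * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# 472 tok/s is the throughput listed above; 3.02 USD/hour is a hypothetical list price.
estimate = cost_per_million_tokens(gpu_price_per_hour=3.02, throughput_tok_s=472)
print(f"${estimate:.2f} /1M tok")  # prints "$1.78 /1M tok"
```

Because the estimate scales linearly with the hourly price, a provider charging twice the list price would roughly double the cost per token at the same throughput.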

Generation

Capabilities	Tool calling · Constrained output (JSON Schema, Regex) · Streaming · Code · SQL
Context length	32,768
Max output tokens	4,096
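The two limits above interact when sizing a request. The sketch below assumes, as is typical for decoder-only models but not stated on this page, that prompt and generated tokens share the 32,768-token context window. `plan_output_budget` is a hypothetical helper, not part of any SIE API.

```python
CONTEXT_LENGTH = 32_768     # context length from the table above
MAX_OUTPUT_TOKENS = 4_096   # max output tokens from the table above


def plan_output_budget(prompt_tokens: int, requested_output: int = MAX_OUTPUT_TOKENS) -> int:
    """Return the largest output budget that fits alongside the prompt.

    Assumes prompt and output tokens share one context window.
    """
    if prompt_tokens < 0 or requested_output < 0:
        raise ValueError("token counts must be non-negative")
    if prompt_tokens >= CONTEXT_LENGTH:
        raise ValueError("prompt alone fills the context window")
    room = CONTEXT_LENGTH - prompt_tokens
    return min(requested_output, MAX_OUTPUT_TOKENS, room)


print(plan_output_budget(1_000))    # short prompt: capped by the output limit
print(plan_output_budget(30_000))   # long prompt: capped by the remaining context
```

Under this assumption, prompts longer than 28,672 tokens leave less than the full 4,096-token output budget.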

Benchmarks

HumanEval

code · generation · en

Quality	pass@1 0.8659
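For readers unfamiliar with the metric, pass@1 is the fraction of problems solved by a single generated sample. The sketch below uses the widely cited unbiased pass@k estimator, which reduces to c/n when k = 1. The evaluation harness and sampling settings behind the score above are not described on this page, so this is only an illustration of the definition. The function names and the toy results are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Toy data: four problems, one sample each, three of them passing.
toy_results = [(1, 1), (1, 0), (1, 1), (1, 1)]
print(mean_pass_at_k(toy_results, k=1))  # prints 0.75
```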

MBPP

code · generation · en

Quality	pass@1 0.7400

Spider

sql · generation · en

Quality	execution acc 0.6900
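Execution accuracy scores a generated query by running it and comparing its result with the result of the reference query, rather than comparing SQL text. The sketch below illustrates that idea with Python's built-in `sqlite3` module. It is a simplification: rows are compared without regard to order, and the table, queries, and helper names are hypothetical rather than taken from Spider or from the harness used here.

```python
import sqlite3
from collections import Counter


def same_result(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries return the same multiset of rows (order ignored)."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    return Counter(predicted) == Counter(gold)


def execution_accuracy(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) query pairs whose results match."""
    return sum(same_result(conn, p, g) for p, g in pairs) / len(pairs)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Bo', 24), (3, 'Cy', 45);
""")

pairs = [
    # Different SQL text, same rows: counted as correct.
    ("SELECT name FROM singer WHERE age > 30", "SELECT name FROM singer WHERE age >= 31"),
    ("SELECT COUNT(*) FROM singer", "SELECT COUNT(id) FROM singer"),
    # Different rows: counted as wrong.
    ("SELECT name FROM singer ORDER BY age", "SELECT name FROM singer WHERE age < 30"),
]
score = execution_accuracy(conn, pairs)
print(f"execution acc {score:.4f}")
```

Comparing results instead of text gives credit to queries that are phrased differently from the reference but return the same rows.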

BFCL (simple)

tools · generation · en

Quality	AST match 0.9375

BFCL (multiple)

tools · generation · en

Quality	AST match 0.9200

CaseHOLD

legal · generation · en

Quality	accuracy 0.6033
Performance (RTX-PRO-6000, b1 c4)	Throughput 441 tok/s · p50 latency 607.3 ms

GPQA Diamond

scientific · generation · en

Quality	accuracy 0.4444
Performance (RTX-PRO-6000, b1 c4)	Throughput 495 tok/s · p50 latency 1.2 s

MedQA

medical · generation · en

Quality	accuracy 0.5700
Performance (RTX-PRO-6000, b1 c4)	Throughput 475 tok/s · p50 latency 545.4 ms

MMLU-Pro

general · generation · en

Quality	accuracy 0.5333
Performance (RTX-PRO-6000, b1 c4)	Throughput 468 tok/s · p50 latency 446.4 ms
