the proof ledger

Measured, never estimated.

In plain terms: we tested whether an AI writes cheaper, more reliable code in curt than in other languages. The honest answer is mixed — and it's all here. Every claim carries a receipt — a script that reproduces the number exactly — and the wins and the losses share the same table.

Claim	Measured	Reproduce
Token cost vs Python	1.10× median (n=22)	tools/tokens ↗
Token cost vs Go / Rust	2.34× / 2.63× cheaper	tools/tokens ↗
Concurrent TCP echo server	32 vs 55 · 94 · 123	DESIGN.md ↗
Grammar-masked generation	0 / 200 violations	grammar/DEMO ↗
vs Zerolang — claude-sonnet	curt wins 3/3 · 7.7× $	VS-ZERO ↗
vs Zerolang — claude-haiku	split: 78% vs 89%	VS-ZERO ↗
Fix synthesis (1-turn repair)	24/32 · 0% wrong-payload	BENCHMARK ↗
Python wins small tasks	honest · 54/54	VS-ZERO ↗

The Zerolang head-to-head

A pre-registered showdown (frozen before any lane was generated): nine shared RosettaCode tasks, each language carrying its own canonical docs, single shot plus up to two repair turns, Python control, two models × three samples. 162 cells, $0.77, all transcripts committed.

$/solved and median output tokens on solved cells — lower is better.
model	lang	solved	$/solved	median out-tok	turns
haiku	curt	21/27	$0.0062	33	1.24
haiku	zero	24/27	$0.0050	168	1.00
haiku	python	27/27	$0.0005	37	1.00
sonnet	curt	27/27	$0.0020	35	1.00
sonnet	zero	27/27	$0.0156	168	1.15
sonnet	python	27/27	$0.0011	40	1.00

Split decision, reported in full. On claude-sonnet, curt won all three pre-registered axes — 7.7× cheaper per solved task at perfect success parity, ~5× leaner output (35 vs 168 tokens), faster convergence. On claude-haiku it missed success parity (78% vs 89%), which voids the dollar axis by rule. And the Python control won every cost axis on both models — on small, well-trained tasks, no agent language beats the incumbent. curt's losses are filed as roadmap work with measured targets.

The diagnostics tournament

curt's prose error hint vs Zerolang's typed-repair-identifier design, tournamented on curt's own repair corpus — 32 toolchain-verified broken programs, each rendered four ways, fed to claude-haiku for repair. The verdict rule was pre-registered before any API call.

turn-1 repair rate by diagnostic rendering — higher is better.
rendering	diag o200k	turn-1	final	lane $
A — shipped prose hint	38.4	18/32	21/32	$0.152
B — typed (Zero-style)	43.3	21/32	25/32	$0.143
C — hybrid	60.3	22/32	26/32	$0.139
D — typed + replacement payload	82.8	32/32	32/32	$0.107

curt adopted the winner the same day. Typed fields beat the prose hint by +9.4pp turn-1 at 1.13× tokens, so curt's diagnostics now emit typed want/got fields plus a stable repair{id,summary} — at single-line economy (~44 tokens vs Zero's captured 114). Row D is the prize: a machine-applicable replacement repairs 32/32 single-shot, because turn count dominates diagnostic size in loop dollars.

Try the toolchain in your browser Benchmark dataset ↗ Full vs-Zerolang record ↗

Reproduce any number: .ci-venv/bin/python tools/bench/headtohead.py report · tools/bench/diag_tourney.py report · tools/tokens/count.py — each deterministic over committed transcripts.