the proof ledger

Measured, never estimated.

In plain terms: we tested whether an AI writes cheaper, more reliable code in curt than in other languages. The honest answer is mixed — and it's all here. Every claim carries a receipt — a script that reproduces the number exactly — and the wins and the losses share the same table.

ClaimMeasuredReproduce
Token cost vs Python1.10× median (n=22)tools/tokens ↗
Token cost vs Go / Rust2.34× / 2.63× cheapertools/tokens ↗
Concurrent TCP echo server32 vs 55 · 94 · 123DESIGN.md ↗
Grammar-masked generation0 / 200 violationsgrammar/DEMO ↗
vs Zerolang — claude-sonnetcurt wins 3/3 · 7.7× $VS-ZERO ↗
vs Zerolang — claude-haikusplit: 78% vs 89%VS-ZERO ↗
Fix synthesis (1-turn repair)24/32 · 0% wrong-payloadBENCHMARK ↗
Python wins small taskshonest · 54/54VS-ZERO ↗

The Zerolang head-to-head

A pre-registered showdown (frozen before any lane was generated): nine shared RosettaCode tasks, each language carrying its own canonical docs, single shot plus up to two repair turns, Python control, two models × three samples. 162 cells, $0.77, all transcripts committed.

$/solved and median output tokens on solved cells — lower is better.
modellangsolved$/solvedmedian out-tokturns
haikucurt21/27$0.0062331.24
haikuzero24/27$0.00501681.00
haikupython27/27$0.0005371.00
sonnetcurt27/27$0.0020351.00
sonnetzero27/27$0.01561681.15
sonnetpython27/27$0.0011401.00

Split decision, reported in full. On claude-sonnet, curt won all three pre-registered axes — 7.7× cheaper per solved task at perfect success parity, ~5× leaner output (35 vs 168 tokens), faster convergence. On claude-haiku it missed success parity (78% vs 89%), which voids the dollar axis by rule. And the Python control won every cost axis on both models — on small, well-trained tasks, no agent language beats the incumbent. curt's losses are filed as roadmap work with measured targets.

The diagnostics tournament

curt's prose error hint vs Zerolang's typed-repair-identifier design, tournamented on curt's own repair corpus — 32 toolchain-verified broken programs, each rendered four ways, fed to claude-haiku for repair. The verdict rule was pre-registered before any API call.

turn-1 repair rate by diagnostic rendering — higher is better.
renderingdiag o200kturn-1finallane $
A — shipped prose hint38.418/3221/32$0.152
B — typed (Zero-style)43.321/3225/32$0.143
C — hybrid60.322/3226/32$0.139
D — typed + replacement payload82.832/3232/32$0.107

curt adopted the winner the same day. Typed fields beat the prose hint by +9.4pp turn-1 at 1.13× tokens, so curt's diagnostics now emit typed want/got fields plus a stable repair{id,summary} — at single-line economy (~44 tokens vs Zero's captured 114). Row D is the prize: a machine-applicable replacement repairs 32/32 single-shot, because turn count dominates diagnostic size in loop dollars.

Reproduce any number: .ci-venv/bin/python tools/bench/headtohead.py report · tools/bench/diag_tourney.py report · tools/tokens/count.py — each deterministic over committed transcripts.