You measure models for a living. So I'm not going to describe myself.
You spend real care on the parts most people wave past — token budgets, progressive disclosure, what a model actually costs to run once it's doing a job instead of a demo. I care about the same thing. So instead of telling you I'm credible, I pointed a tool I built at exactly that question and left it here for you to poke at: six models, three vendors, four kinds of work, measured the same way and priced honestly.
Your app is only as good as its evals. So I ran every model through the same ones.
No gateway, no routing, nobody reselling you inference — just commodity API keys, one harness that calls every vendor the same way, and cost worked out from each response's own token count × the published price (never an estimate). Five evals, scored the way you'd score them in a runner like Evalite — exact match, JSON/field checks, toolCallAccuracy, and a faithfulness judge. Every model runs its base, lowest-reasoning path, so it's like-for-like. Poke at it:
Five evals, one fair method — each maps to a scorer you'd recognize from a runner like Evalite. What each one asks a model to do, and how it's scored:
- Classification (scorer: exactMatch). Read a support ticket, return one of five labels. Deterministic, no judge.
- Structured extraction (scorer: valid-JSON + field). Messy invoice → clean JSON in a fixed shape; scored on whether every key parses and every field is exactly right.
- Tool-calling (scorer: toolCallAccuracy). Eight tools on the table: call the right one(s) with the right args, including multi-tool requests, and call nothing when none fit. Deterministic.
- RAG faithfulness (scorer: faithfulness, LLM-as-judge). Answer only from the context; half are traps with no answer present, so a faithful model has to say so. Scored 1–5 by a three-vendor judge panel.
- Reasoning · math (scorer: numeric match). Word problem → one number; the gold is computed when the problem is generated, so scoring is exact.
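For a sense of what the deterministic scorers involve, here is a minimal sketch. It is illustrative rather than lifted from the harness: the function names, the all-or-nothing scoring of extraction, the order-insensitive comparison of tool calls, and the example labels are all assumptions.

```python
import json
import math


def exact_match(predicted: str, gold: str) -> float:
    """1.0 if the label matches after trimming whitespace and case, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def json_field_score(raw: str, gold: dict) -> float:
    """1.0 only if the output parses as JSON and equals the gold object exactly.

    Assumption: all-or-nothing scoring; a per-field fraction is an equally
    plausible design.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if parsed == gold else 0.0


def tool_call_accuracy(predicted: list[dict], gold: list[dict]) -> float:
    """1.0 if the same calls (name + args) were made, ignoring call order.

    An empty gold list means the right move is to call nothing.
    """
    def canon(calls: list[dict]) -> list[str]:
        # Serialize with sorted keys so argument order does not matter.
        return sorted(json.dumps(call, sort_keys=True) for call in calls)

    return 1.0 if canon(predicted) == canon(gold) else 0.0


def numeric_match(predicted: str, gold: float, tol: float = 1e-9) -> float:
    """1.0 if the answer parses as a number equal to gold within tol."""
    try:
        value = float(predicted.strip().replace(",", ""))
    except ValueError:
        return 0.0
    return 1.0 if math.isclose(value, gold, rel_tol=tol, abs_tol=tol) else 0.0
```

None of these needs a model call, which is what makes them cheap enough to run on every change.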
The harness (model-exhaust v0.2.0) is ~600 lines of Python that calls every model through its official SDK on a commodity key — same code path for all four vendors, cost from each response's own token count × the published price (never an estimate). It splits scorers the way eval-driven development does: deterministic ones (exact match, JSON/field, numeric, toolCallAccuracy) cheap enough to run on every change, and an LLM-as-a-judge panel for faithfulness — three models, three vendors, median, so nothing grades its own family. Every bar carries a 95% confidence interval; reasoning is held to each model's base path. Suites are purpose-built for clean, reproducible gold. (Tool-calling in this suite is single-shot parallel selection — the multi-step agent loops and the smart-zone context-degradation fingerprint graduated into their own section below. noiseSensitivity under distractor docs is the next one on the bench.)
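The three pieces of arithmetic in that paragraph can be sketched in a few lines. This is an illustration under stated assumptions, not the harness code: the model name and prices are made up, the function names are hypothetical, and the Wilson score interval is one common choice for a 95% interval on a pass rate (the page does not say which interval the harness uses).

```python
import math
from statistics import median

# Hypothetical prices in USD per 1M tokens. Real values come from each
# vendor's published price sheet.
PRICES = {"example-model": {"input": 1.00, "output": 4.00}}


def call_cost(model: str, input_tokens: int, output_tokens: int,
              prices: dict = PRICES) -> float:
    """Cost of one call: the response's own token counts × the listed price."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def panel_score(scores_by_judge: dict[str, int]) -> float:
    """Median of the judges' 1-5 scores, so one outlier judge cannot swing it."""
    return median(scores_by_judge.values())


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z = 1.96 gives roughly 95%."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)
```

With three judges the median is simply the middle score, which is why a panel of three is the smallest one that can outvote a single biased grader.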
Your repo is the instruction set. I work on the runtime. You've nailed what a good skill looks like for a single agent — the discipline one agent runs on. I've spent this year on the layer above it: getting a lot of agents to work together without it turning to mush. Parallel dispatch against interface contracts, memory that survives across sessions, a convergence gate before anything merges, and neutral evals — like the one you just poked at — to keep them honest. The benchmark above wasn't a side quest; it's that eval layer turned on itself. Skills make one agent reliable; orchestration makes a fleet of them reliable. That's the seam between your layer and mine — and the part I'd actually want to compare notes on.
Single calls are easy. Coordinating many is where models split.
A good single tool call doesn't tell you whether a model can run a multi-step job. These two evals push into the layer I actually build — coordination, and long context — and they're scored the way the serious agent benchmarks are (tau-bench, ToolSandbox): on the final state, run for reliability, not on vibes.
Orchestration: a real tool-execution loop in a stateful world, covering five task families: dependency, error recovery, restraint (don't over-act), multi-hop, and synthesis. Each task is run 3× and only counts if it passes all three; a wrong or forbidden tool call zeroes it. Dots = the five task families; $/1k = cost to run 1,000 attempts; ★ = the value pick (cheapest at a perfect score).
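The all-three-or-nothing rule is easy to state in code. A minimal sketch, assuming a hypothetical `Attempt` record and a per-task set of disallowed tools (covering both wrong and forbidden calls); the tool names in the usage example are invented:

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    final_state_ok: bool      # did the world end up in the expected state?
    tools_called: list[str]   # names of every tool the model invoked


def attempt_passes(attempt: Attempt, disallowed: set[str]) -> bool:
    """A single run passes only with a correct final state and no disallowed call."""
    if any(tool in disallowed for tool in attempt.tools_called):
        return False
    return attempt.final_state_ok


def pass_all_k(attempts: list[Attempt], disallowed: set[str], k: int = 3) -> bool:
    """pass^k: the task counts only if every one of its k runs passes."""
    if len(attempts) != k:
        raise ValueError(f"expected {k} attempts, got {len(attempts)}")
    return all(attempt_passes(a, disallowed) for a in attempts)


def dollars_per_1k(total_cost_usd: float, n_attempts: int) -> float:
    """Scale measured spend to the cost of 1,000 attempts."""
    return 1000 * total_cost_usd / n_attempts
```

For example, `pass_all_k([good, good, bad], {"delete_account"})` is false even though two of three runs were clean. That strictness is the point: the metric rewards reliability, not the best run.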
Long context: the smart-zone / dumb-zone test. Three facts are scattered through a growing haystack, followed by a "which is largest" question that needs all three. The result is accuracy as the context grows, showing where each model falls out of the smart zone. (Single-needle retrieval is saturated; this test is compositional.)
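One way to generate an item like this, with the gold answer fixed at generation time. This is a sketch under assumptions: the reservoir framing, the names, and the filler sentences are invented for illustration, and haystack size here is counted in filler sentences rather than tokens.

```python
import random


def build_item(n_filler: int, seed: int) -> tuple[str, str, str]:
    """Return (context, question, gold) with three facts hidden in filler.

    Requires n_filler >= 2 so that three distinct insert positions exist.
    """
    rng = random.Random(seed)
    names = ["alpha", "bravo", "charlie"]       # hypothetical entity names
    values = rng.sample(range(100, 1000), 3)    # distinct, so one clear maximum
    facts = [f"The {n} reservoir holds {v} units." for n, v in zip(names, values)]

    doc = [f"Filler sentence number {i}." for i in range(n_filler)]
    positions = rng.sample(range(n_filler + 1), 3)
    # Insert from the back so earlier insert positions stay valid.
    for pos, fact in sorted(zip(positions, facts), reverse=True):
        doc.insert(pos, fact)

    question = "Which reservoir holds the most units? Answer with its name only."
    gold = names[values.index(max(values))]
    return " ".join(doc), question, gold


def score(answer: str, gold: str) -> float:
    """Exact match on the entity name; needs all three facts to get right."""
    return 1.0 if answer.strip().lower() == gold else 0.0
```

Growing `n_filler` while holding the seed fixed keeps the three facts and the gold identical, so any drop in accuracy comes from context length alone.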
Straight talk: at these scales the capable models are reliable — six of seven pass every orchestration task 3/3 and hold long-context accuracy out to 110K tokens. Llama-3.3-70B is the one that breaks: it over-acts when the job is to stop, botches the multi-step chains, and drops a fact half the time once the haystack hits 110K. Telling the frontier models apart from each other is the harder problem — it would take 20–40-step plans and 200K+ context, and that's the next bench. Both evals here are deliberately small-N probes (pass^3; n=6 items per context size), built to show the shape, not to rank the top four.
There's a seam here worth talking about.
I'd like to map where your skills repo ends and an orchestration layer begins — find the complementary edge and see if there's something worth building together. No deck. Two builders comparing notes on what comes after a single agent runs a single skill.
Your layer: what a good skill looks like. The instruction set, and the discipline one agent runs on.
My layer: coordinating a lot of them, with parallel dispatch, memory, convergence gates, and neutral evals. The runtime.
None of this is a one-off. It comes out of DojoGenesis — what I've spent this year building.
DojoGenesis is the agent platform under everything on this page — a self-hosted gateway, a CLI, and MCP servers, built around a library of methodology skills that encode the discipline agents skip on their own: verify before you believe, remember across sessions, converge before you ship. The benchmark above, the demos, and the products in production all run on it. The skills are the layer that overlaps your repo most directly — which is the whole reason I'm writing.
Two things worth seeing up close — the platform itself, and the flagship product that proves it in production:
How I work.
Spec first. Then I dispatch parallel agents against interface contracts and gate everything behind a convergence check before it merges. The system keeps memory across sessions, so it isn't relearning the project every morning. Then I verify before I believe it — including the benchmark above, which I smoke-tested end to end and sanity-checked against the raw rows before it went on this page.
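In miniature, dispatch-then-gate could look like the sketch below. It is a toy under loud assumptions, not DojoGenesis code: `run_agent` is a stand-in that does no real work, and the "interface contract" is reduced to a hypothetical set of required keys plus a passing-tests flag.

```python
import asyncio

# Hypothetical contract: every agent must return these keys.
REQUIRED_KEYS = {"module", "exports", "tests_passed"}


async def run_agent(name: str) -> dict:
    """Stand-in for a real agent run; yields control once, then reports."""
    await asyncio.sleep(0)
    return {"module": name, "exports": [f"{name}_api"], "tests_passed": True}


def meets_contract(result: dict) -> bool:
    """True if the result has every required key and its tests passed."""
    return REQUIRED_KEYS.issubset(result) and result["tests_passed"] is True


async def dispatch_and_gate(names: list[str]) -> tuple[bool, list[dict]]:
    """Run all agents concurrently, then allow a merge only if all converge."""
    results = await asyncio.gather(*(run_agent(n) for n in names))
    converged = all(meets_contract(r) for r in results)
    return converged, list(results)
```

Running `asyncio.run(dispatch_and_gate(["parser", "scorer"]))` returns the gate decision alongside the per-agent results, so a failed gate still shows which contract was missed.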
I build for people the tooling forgets
Bilingual by default, because the folks I build for are. Equity data, civic infrastructure, tools for people on variable income and night shifts. The systems lens is what turns that into something that scales instead of staying a one-off.
I label my own uncertainty
If something's a toy model, I say so. If a number's directional, it says directional on the tin. The benchmark here ships its own confidence intervals and a plain list of what I didn't test. I'd rather you trust the small claims than oversell the big ones.