You measure models for a living. So I'm not going to describe myself.
You spend real care on the parts most people wave past — token budgets, progressive disclosure, what a model actually costs to run once it's doing a job instead of a demo. I care about the same thing. So instead of telling you I'm credible, I pointed a tool I built at exactly that question and left it here for you to poke at: six models, three vendors, four kinds of work, measured the same way and priced honestly.
A neutral benchmark, and I make zero dollars no matter which model wins.
No gateway. No routing. Nobody reselling you inference. Just commodity API keys anyone can buy, one harness that calls every vendor the same way, and cost worked out from each response's own reported token count times the published price — never an estimate. Every model runs its standard, lowest-reasoning path, so it's a like-for-like fight, not a test-time-compute arms race. Here's what came back.
Four task types, one fair method — no black box. Here's exactly what each tab asks a model to do, and how it's scored:
Read a customer-support ticket and return one of five labels — billing, technical, account, feature request, complaint. The boring, high-volume task you'd actually hand to a model.
Read a messy invoice line and return clean JSON in a fixed shape. Scored twice: did it emit valid JSON with every key (reliability), and is every field exactly right (accuracy)?
Solve a word problem and return one number. The gold answer is computed when the problem is generated, so the scoring is exact and unarguable.
Answer using only the provided context. Half the questions are traps with no answer in the text — a faithful model has to say so instead of inventing one.
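To make the two exact-scored tasks concrete, here is a minimal sketch of how the extraction and word-problem scoring described above could work. It is an illustration, not the harness's actual code: the key names in `REQUIRED_KEYS`, the word-problem template, and the function names are all hypothetical, since the page does not publish the schema or the generator.

```python
import json
import random

# Hypothetical fixed shape for an invoice line; the real schema is not given here.
REQUIRED_KEYS = ("description", "quantity", "unit_price", "total")


def score_extraction(raw_output, gold):
    """Return (reliable, accurate) for one model response.

    reliable: the output parses as a JSON object containing every required key.
    accurate: additionally, every required field equals the gold value exactly.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, False
    reliable = isinstance(parsed, dict) and all(k in parsed for k in REQUIRED_KEYS)
    accurate = reliable and all(parsed[k] == gold[k] for k in REQUIRED_KEYS)
    return reliable, accurate


def make_problem(rng):
    """Generate a word problem and compute its gold answer at the same moment."""
    boxes = rng.randint(2, 9)
    per_box = rng.randint(3, 12)
    sold = rng.randint(1, boxes * per_box - 1)
    question = (
        f"A shop has {boxes} boxes of {per_box} apples and sells {sold}. "
        "How many apples are left?"
    )
    gold = boxes * per_box - sold  # computed from the same numbers, never hand-labelled
    return question, gold


def score_math(raw_output, gold):
    """Exact match on a single number; anything unparseable counts as wrong."""
    try:
        return float(raw_output.strip()) == gold
    except ValueError:
        return False
```

Because the gold answer comes from the same random draw as the question text, there is no separate labelling step that could disagree with the problem.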
The harness (model-exhaust v0.2.0) is ~600 lines of Python that calls every model through its own official SDK on a standard, buyable key, with the same code path for all three vendors. Per call it logs latency, tokens, and cost, computed from the response's own token count × the published price (never an estimate). v0.2.0 is the cross-vendor build: reasoning is held to each model's base path so it's like-for-like, faithfulness is graded by a panel of three models from three vendors so nothing scores its own family, and every bar carries a 95% confidence interval. The task suites are purpose-built rather than scraped from production, which keeps the gold clean and the whole run reproducible: same harness, same data, same keys → same table.
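The cost and interval arithmetic can be sketched in a few lines. This is a sketch under assumptions rather than the harness itself: the `PRICES` table, the model name, and the split into input and output rates are hypothetical placeholders, and the Wilson score interval is just one common way to get a 95% interval on an accuracy rate. The page does not say which interval method the harness uses.

```python
import math

# Hypothetical published prices in USD per million tokens (placeholder values).
PRICES = {"example-model": {"input": 1.00, "output": 4.00}}


def call_cost(model, input_tokens, output_tokens):
    """Cost of one call from the token counts the response itself reported."""
    price = PRICES[model]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for an accuracy rate; z=1.96 gives roughly 95%."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half
```

For example, `wilson_interval(50, 100)` gives an interval centred on 0.5 that is a little under ten points wide on each side, which is the kind of error bar a 100-item suite can support.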
Your repo is the instruction set. I work on the runtime. You've nailed what a good skill looks like for a single agent — the discipline one agent runs on. I've spent this year on the layer above it: getting a lot of agents to work together without it turning to mush. Parallel dispatch against interface contracts, memory that survives across sessions, a convergence gate before anything merges, and neutral evals — like the one you just poked at — to keep them honest. The benchmark above wasn't a side quest; it's that eval layer turned on itself. Skills make one agent reliable; orchestration makes a fleet of them reliable. That's the seam between your layer and mine — and the part I'd actually want to compare notes on.
There's a seam here worth talking about.
I'd like to map where your skills repo ends and an orchestration layer begins — find the complementary edge and see if there's something worth building together. No deck. Two builders comparing notes on what comes after a single agent runs a single skill.
What a good skill looks like. The instruction set, and the discipline one agent runs on.
Coordinating a lot of them — parallel dispatch, memory, convergence gates, neutral evals. The runtime.
None of this is a one-off. It comes out of DojoGenesis — what I've spent this year building.
DojoGenesis is the agent platform under everything on this page — a self-hosted gateway, a CLI, and MCP servers, built around a library of methodology skills that encode the discipline agents skip on their own: verify before you believe, remember across sessions, converge before you ship. The benchmark above, the demos, and the products in production all run on it. The skills are the layer that overlaps your repo most directly — which is the whole reason I'm writing.
Two things worth seeing up close — the platform itself, and the flagship product that proves it in production:
How I work.
Spec first. Then I dispatch parallel agents against interface contracts and gate everything behind a convergence check before it merges. The system keeps memory across sessions, so it isn't relearning the project every morning. Then I verify before I believe it — including the benchmark above, which I smoke-tested end to end and sanity-checked against the raw rows before it went on this page.
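As a rough illustration of that loop, the sketch below dispatches several agents in parallel and only reports success if every result satisfies a shared contract. Everything here is hypothetical: `run_agent` is a stand-in for a real agent call, and "convergence" is reduced to "every output carries the keys the contract requires", which is an assumption made for the example, not a description of how DojoGenesis defines its gate.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Result:
    agent: str
    output: dict


async def run_agent(name, task):
    """Stand-in for a real agent call (hypothetical); returns a canned result."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return Result(agent=name, output={"task": task, "status": "done"})


def satisfies_contract(result, required_keys):
    """Toy interface contract: the output must contain every required key."""
    return all(key in result.output for key in required_keys)


async def dispatch_and_gate(tasks, required_keys):
    """Run all agents concurrently, then gate on every result meeting the contract."""
    results = await asyncio.gather(
        *(run_agent(name, task) for name, task in tasks.items())
    )
    converged = all(satisfies_contract(r, required_keys) for r in results)
    return converged, list(results)
```

Calling `asyncio.run(dispatch_and_gate({"parser": "write parser", "tests": "write tests"}, ("task", "status")))` returns a `(converged, results)` pair; a merge step would proceed only when `converged` is true.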
I build for people the tooling forgets
Bilingual by default, because the folks I build for are. Equity data, civic infrastructure, tools for people on variable income and night shifts. The systems lens is what turns that into something that scales instead of staying a one-off.
I label my own uncertainty
If something's a toy model, I say so. If a number's directional, it says directional on the tin. The benchmark here ships its own confidence intervals and a plain list of what I didn't test. I'd rather you trust the small claims than have me oversell the big ones.