one page · one reader

Hey Matt.

Your skills repo is the clearest thing I've read on how to actually write the things — I've sent it to half the people I work with. I'm Cruz. I build agent orchestration out of Madison, Wisconsin, and I've been living one layer up from where your repo stops. A cold email felt like the wrong way to start, so I built you this.

Yeah — I made a whole page to ask for sixty seconds. It felt more honest than a clever subject line.
↓ there's a live benchmark down here. try to break it.

You measure models for a living. So I'm not going to describe myself.

You spend real care on the parts most people wave past: token budgets, progressive disclosure, what a model actually costs to run once it's doing a job instead of a demo. I care about the same thing. So instead of telling you I'm credible, I pointed a tool I built at exactly that question and left it here for you to poke at: seven models, four vendors, four kinds of work, measured the same way and priced honestly.

A neutral benchmark — and I make zero dollars from which model wins.

No gateway. No routing. Nobody reselling you inference. Just commodity API keys anyone can buy, one harness that calls every vendor the same way, and cost worked out from each response's own reported token count times the published price — never an estimate. Every model runs its standard, lowest-reasoning path, so it's a like-for-like fight, not a test-time-compute arms race. Here's what came back.
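The cost arithmetic described above can be sketched in a few lines. This is an illustration, not the harness's code: the model names, the prices in `PRICES_PER_MTOK`, and the `Usage` shape are all hypothetical placeholders, and the only idea carried over from the text is "reported token count times published price".

```python
from dataclasses import dataclass

# Hypothetical per-million-token prices in USD. Placeholders only,
# not any vendor's real price sheet.
PRICES_PER_MTOK = {
    "example-small": {"input": 0.10, "output": 0.40},
    "example-large": {"input": 3.00, "output": 15.00},
}


@dataclass
class Usage:
    """Token counts as reported in an API response (assumed shape)."""
    input_tokens: int
    output_tokens: int


def call_cost(model: str, usage: Usage) -> float:
    """Cost of one call in USD: reported tokens times the published price."""
    price = PRICES_PER_MTOK[model]
    total = (usage.input_tokens * price["input"]
             + usage.output_tokens * price["output"])
    return total / 1_000_000


usage = Usage(input_tokens=1200, output_tokens=300)
for model in PRICES_PER_MTOK:
    print(model, f"{call_cost(model, usage):.6f}")
```

Because the token counts come from the response itself, the only outside input is the price table, which is the part a reader would want to check against each vendor's published pricing.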

[Interactive chart: Model Buy Signal (vendor-neutral, reproducible). Legend: Anthropic (Claude), OpenAI (GPT), Google (Gemini), Meta Llama (via OpenRouter).]
The part you'd want me to say out loud: faithfulness is scored by a three-vendor judge panel — the median of a Claude, a GPT, and a Gemini model, not one model grading its own family, because a single same-vendor judge is the first thing you'd rightly poke a hole in. Every bar carries a 95% confidence interval, and the buy call always names the cheapest model that isn't statistically worse than the best, never just the top line. On two of the four tasks — basic math and faithfulness-with-traps — all seven models top out together; that's a finding, not a dud benchmark: where the capability is commoditized the only lever left is price, and the cheapest model came back 40–75× cheaper for the identical result. Gemini 2.5 Flash is in the run; I left Gemini 2.5 Pro out because it can't turn its thinking off, which would make it the one model not racing on the same footing as the rest. None of this is the moat — anyone can rebuild the harness in a weekend. The moat is the run history piling up over releases, and the fact that I have nothing to sell you about who wins.
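One way to read "the cheapest model that isn't statistically worse than the best" is sketched below. The assumptions are mine, not the page's: accuracy gets a 95% Wilson score interval, and a model counts as "not worse" when its upper bound reaches the best model's lower bound. The model names, counts, and costs are invented for illustration.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an accuracy of correct/total."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def buy_call(results: list[dict]) -> dict:
    """Cheapest model whose interval overlaps the top scorer's interval."""
    best = max(results, key=lambda r: r["correct"] / r["total"])
    best_low, _ = wilson_interval(best["correct"], best["total"])
    contenders = [
        r for r in results
        if wilson_interval(r["correct"], r["total"])[1] >= best_low
    ]
    return min(contenders, key=lambda r: r["cost_per_call"])


# Invented example rows, not results from the captured run.
RESULTS = [
    {"model": "model-a", "correct": 98, "total": 100, "cost_per_call": 0.0100},
    {"model": "model-b", "correct": 96, "total": 100, "cost_per_call": 0.0002},
    {"model": "model-c", "correct": 70, "total": 100, "cost_per_call": 0.0001},
]
print(buy_call(RESULTS)["model"])
```

With these invented rows the call lands on model-b: it scores within the noise of model-a at a fraction of the cost, while model-c is cheaper but clearly worse. When every model tops out together, every model is a contender and the rule reduces to picking the lowest price.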

Four task types, one fair method — no black box. Here's exactly what each tab asks a model to do, and how it's scored:

Classification (exact match)

Read a customer-support ticket and return one of five labels — billing, technical, account, feature request, complaint. The boring, high-volume task you'd actually hand to a model.

Structured extraction (valid JSON + field match)

Read a messy invoice line and return clean JSON in a fixed shape. Scored twice: did it emit valid JSON with every key (reliability), and is every field exactly right (accuracy)?
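A minimal sketch of that two-part score, assuming the gold record is a flat dictionary. The field names and the sample invoice are hypothetical; the page does not specify the actual schema.

```python
import json


def score_extraction(raw_output: str, gold: dict) -> dict:
    """Score one model output twice: reliability, then accuracy."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Not valid JSON at all: fails both checks.
        return {"reliable": False, "accurate": False}
    # Reliability: a JSON object that contains every expected key.
    reliable = isinstance(parsed, dict) and all(key in parsed for key in gold)
    # Accuracy: every expected field matches the gold value exactly.
    accurate = reliable and all(parsed[key] == gold[key] for key in gold)
    return {"reliable": reliable, "accurate": accurate}


# Hypothetical gold record for one invoice line.
GOLD = {"vendor": "Acme", "quantity": 3, "unit_price": 19.5, "total": 58.5}

print(score_extraction('{"vendor": "Acme", "quantity": 3, "unit_price": 19.5, "total": 58.5}', GOLD))
print(score_extraction('{"vendor": "Acme", "quantity": 3, "unit_price": 19.5, "total": 60}', GOLD))
print(score_extraction("Sure! Here is the JSON you asked for", GOLD))
```

Keeping the two scores separate shows whether a model fails by breaking the format or by getting a field wrong, which are different problems with different fixes.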

Reasoning · math (numeric match)

Solve a word problem and return one number. The gold answer is computed when the problem is generated, so the scoring is exact and unarguable.
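The idea of computing the gold answer at generation time can be sketched like this. The problem template and the function names are made up for illustration; the real suite's problems are not shown on this page.

```python
import random


def make_problem(rng: random.Random) -> tuple[str, int]:
    """Generate one word problem together with its exact answer."""
    boxes = rng.randint(2, 9)
    per_box = rng.randint(3, 12)
    removed = rng.randint(1, per_box)
    question = (
        f"A warehouse has {boxes} boxes with {per_box} parts in each. "
        f"{removed} parts are removed. How many parts remain?"
    )
    # The gold answer comes from the same numbers that built the question,
    # so no human or model has to label it afterwards.
    gold = boxes * per_box - removed
    return question, gold


def score_numeric(model_output: str, gold: int) -> bool:
    """Numeric match: the output must parse as a number equal to gold."""
    try:
        return float(model_output.strip()) == gold
    except ValueError:
        return False


rng = random.Random(0)  # fixed seed so the suite is reproducible
question, gold = make_problem(rng)
print(question)
print(score_numeric(str(gold), gold))
```

Seeding the generator is what makes "same data" hold from run to run: the same seed always yields the same problems and the same gold answers.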

RAG faithfulness (3-vendor judge panel)

Answer using only the provided context. Half the questions are traps with no answer in the text — a faithful model has to say so instead of inventing one.
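A sketch of how a three-judge median might be aggregated, assuming each judge returns a score between 0 and 1 per answer. The judge labels and the scores are hypothetical, and the judging prompts themselves are out of scope here.

```python
from statistics import mean, median


def panel_score(judge_scores: dict[str, float]) -> float:
    """Median of exactly three judges, one per vendor."""
    if len(judge_scores) != 3:
        raise ValueError("expected one score from each of three judges")
    return median(judge_scores.values())


def faithfulness(per_item_scores: list[dict[str, float]]) -> float:
    """Task score: mean of the per-item panel medians."""
    return mean(panel_score(scores) for scores in per_item_scores)


# Invented scores for two answers, not rows from the captured run.
ITEMS = [
    # One judge dissents; the median follows the two that agree.
    {"claude_judge": 1.0, "gpt_judge": 1.0, "gemini_judge": 0.0},
    # A trap question where the model invented an answer.
    {"claude_judge": 0.0, "gpt_judge": 0.5, "gemini_judge": 0.0},
]
print(faithfulness(ITEMS))
```

With three judges the median is the middle score, so a single judge that is generous to its own vendor's models cannot move the result by itself.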

The harness (model-exhaust v0.2.0) is ~600 lines of Python that calls every model through its own official SDK on a standard, buyable key — the same code path for all four vendors. Per call it logs latency, tokens, and cost, computed from the response's own token count × the published price (never an estimate). v0.2.0 is the cross-vendor build: reasoning is held to each model's base path so it's like-for-like, faithfulness is graded by a panel of three models from three vendors so nothing scores its own family, and every bar carries a 95% confidence interval. The task suites are purpose-built rather than scraped from production — which keeps the gold clean and the whole run reproducible: same harness, same data, same keys → same table.

Your repo is the instruction set. I work on the runtime.

You've nailed what a good skill looks like for a single agent: the discipline one agent runs on. I've spent this year on the layer above it: getting a lot of agents to work together without it turning to mush. Parallel dispatch against interface contracts, memory that survives across sessions, a convergence gate before anything merges, and neutral evals, like the one you just poked at, to keep them honest. The benchmark above wasn't a side quest; it's that eval layer turned on itself. Skills make one agent reliable; orchestration makes a fleet of them reliable. That's the seam between your layer and mine, and the part I'd actually want to compare notes on.

There's a seam here worth talking about.

I'd like to map where your skills repo ends and an orchestration layer begins — find the complementary edge and see if there's something worth building together. No deck. Two builders comparing notes on what comes after a single agent runs a single skill.

your layer

What a good skill looks like. The instruction set, and the discipline one agent runs on.

+
my layer

Coordinating a lot of them — parallel dispatch, memory, convergence gates, neutral evals. The runtime.

Reply to my email — let's compare notes
(There's already a real email in your inbox. This page is the part that wouldn't fit in a subject line.)

None of this is a one-off. It comes out of DojoGenesis — what I've spent this year building.

DojoGenesis is the agent platform under everything on this page — a self-hosted gateway, a CLI, and MCP servers, built around a library of methodology skills that encode the discipline agents skip on their own: verify before you believe, remember across sessions, converge before you ship. The benchmark above, the demos, and the products in production all run on it. The skills are the layer that overlaps your repo most directly — which is the whole reason I'm writing.

Two things worth seeing up close — the platform itself, and the flagship product that proves it in production:

DojoGenesis · OPEN SOURCE
"Infrastructure for disciplined AI." The platform everything on this page runs on — and the thesis I'd bet you agree with: models are already smart enough; what they lack is method.
So DojoGenesis encodes the method instead of hoping the model supplies it. A self-hosted Gateway (one Go binary) routes across eight LLM providers with intent classification, DAG orchestration for multi-step work, and real-time SSE token observability. A CLI puts that same engine in your terminal across eight named surfaces. An MCP server injects the methodology skills straight into Claude Code as cognitive scaffolds — "every other MCP gives Claude more data; this one gives it better methods." And the Garden persists decisions and context across sessions, so the system stops relearning your project every morning. Self-hosted, zero vendor lock-in.
~148K lines of Go · 8 providers · 19 runtime modules · CLI via Homebrew · MCP for Claude Code · Apache-2.0
dojogenesis.com →
Night Shift · LIVE
"Cash, tip & shift tracking for service workers." The flagship product — live, and taking real payments.
It's a financial-clarity app for the people most fintech ignores: bartenders, dancers, drivers, salon and delivery workers living on variable, multi-gig income. It shows real take-home after tip-outs, fees, and gas; auto-reserves taxes (with the new federal No-Tax-on-Tips deduction built in, up to $25K); and automates savings into goal jars you can name. Offline-first and privacy-first — your data stays on the phone by default — with optional encrypted backup, CSV export for tax season, and a full bilingual EN/ES build. One founder, real users, real money in.
iOS + installable PWA · offline-first · bilingual · 30-day free trial, then $99.99 first year
nightshift.cash →

How I work.

Spec first. Then I dispatch parallel agents against interface contracts and gate everything behind a convergence check before it merges. The system keeps memory across sessions, so it isn't relearning the project every morning. Then I verify before I believe it — including the benchmark above, which I smoke-tested end to end and sanity-checked against the raw rows before it went on this page.

parallel-agent dispatch · spec-first · memory-native · convergence-gated · verify before believe · bilingual ES / EN

I build for people the tooling forgets

Bilingual by default, because the folks I build for are. Equity data, civic infrastructure, tools for people on variable income and night shifts. The systems lens is what turns that into something that scales instead of staying a one-off.

I label my own uncertainty

If something's a toy model, I say so. If a number's directional, it says directional on the tin. The benchmark here ships its own confidence intervals and a plain list of what I didn't test. I'd rather earn your trust on the small claims than oversell the big ones.