one page · one reader

Hey Matt.

Your skills repo is the clearest thing I've read on how to actually write the things — I've sent it to half the people I work with. I'm Cruz. I build agent orchestration out of Madison, Wisconsin, and I've been living one layer up from where your repo stops. A cold email felt like the wrong way to start, so I built you this.

Yeah — I made a whole page to ask for sixty seconds. It felt more honest than a clever subject line.
↓ there's a live benchmark down here. try to break it.

You measure models for a living. So I'm not going to describe myself.

You spend real care on the parts most people wave past: token budgets, progressive disclosure, what a model actually costs to run once it's doing a job instead of a demo. I care about the same thing. So instead of telling you I'm credible, I pointed a tool I built at exactly that question and left it here for you to poke at: six models, four vendors, five kinds of work, measured the same way and priced honestly.

Your app is only as good as its evals. So I ran every model through the same ones.

No gateway, no routing, nobody reselling you inference — just commodity API keys, one harness that calls every vendor the same way, and cost worked out from each response's own token count × the published price (never an estimate). Five evals, scored the way you'd score them in a runner like Evalite — exact match, JSON/field checks, toolCallAccuracy, and a faithfulness judge. Every model runs its base, lowest-reasoning path, so it's like-for-like. Poke at it:

Model Buy Signal
● VENDOR-NEUTRAL · REPRODUCIBLE
Anthropic (Claude) · OpenAI (GPT) · Google (Gemini) · Meta Llama (OpenRouter)

Five evals, one fair method — each maps to a scorer you'd recognize from a runner like Evalite. What each one asks a model to do, and how it's scored:

  • Classification (exactMatch): Read a support ticket, return one of five labels. Deterministic, no judge.
  • Structured extraction (valid-JSON + field): Messy invoice → clean JSON in a fixed shape; scored on whether every key parses and every field is exactly right.
  • Tool-calling (toolCallAccuracy): With eight tools on the table, call the right one(s) with the right args, including multi-tool requests, and call nothing when none fit. Deterministic.
  • RAG faithfulness (faithfulness · LLM-as-judge): Answer only from the context; half are traps with no answer present, so a faithful model has to say so. Scored 1–5 by a three-vendor judge panel.
  • Reasoning · math (numeric match): Word problem → one number; the gold is computed when the problem is generated, so scoring is exact.
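
For a concrete picture of the four deterministic scorers in the list above, here is a minimal sketch in Python. The function names, the input shapes, and the choice to score extraction as a fraction of correct fields are all assumptions made for illustration; they are not taken from the harness itself.

```python
import json
import math


def exact_match(pred: str, gold: str) -> float:
    """1.0 when the predicted label equals the gold label, ignoring case and outer whitespace."""
    return float(pred.strip().lower() == gold.strip().lower())


def json_field_score(raw: str, gold: dict) -> float:
    """0.0 if the output is not a JSON object; otherwise the share of gold fields matched exactly.

    Scoring per field (rather than all-or-nothing) is an assumption of this sketch.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    hits = sum(1 for key, value in gold.items() if key in parsed and parsed[key] == value)
    return hits / len(gold)


def numeric_match(pred: str, gold: float, tol: float = 1e-6) -> float:
    """1.0 when the answer parses as a number within an absolute tolerance of the gold value."""
    try:
        value = float(pred.strip().replace(",", ""))
    except ValueError:
        return 0.0
    return float(math.isclose(value, gold, rel_tol=0.0, abs_tol=tol))


def tool_call_accuracy(pred_calls: list[dict], gold_calls: list[dict]) -> float:
    """1.0 only when the predicted calls match gold exactly, in any order.

    An empty gold list means the right answer is to call nothing.
    """
    def canon(calls: list[dict]) -> list[str]:
        return sorted(json.dumps(call, sort_keys=True) for call in calls)

    return float(canon(pred_calls) == canon(gold_calls))
```

Because none of these call a model, they are cheap enough to rerun on every change, which is the property the deterministic scorers are chosen for.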

The harness (model-exhaust v0.2.0) is ~600 lines of Python that calls every model through its official SDK on a commodity key — same code path for all four vendors, cost from each response's own token count × the published price (never an estimate). It splits scorers the way eval-driven development does: deterministic ones (exact match, JSON/field, numeric, toolCallAccuracy) cheap enough to run on every change, and an LLM-as-a-judge panel for faithfulness — three models, three vendors, median, so nothing grades its own family. Every bar carries a 95% confidence interval; reasoning is held to each model's base path. Suites are purpose-built for clean, reproducible gold. (Tool-calling in this suite is single-shot parallel selection — the multi-step agent loops and the smart-zone context-degradation fingerprint graduated into their own section below. noiseSensitivity under distractor docs is the next one on the bench.)
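
The two pieces of arithmetic in that paragraph, cost from token counts and the median of a three-judge panel, can be sketched as below. The model name and the prices are placeholders invented for the example, not any vendor's published price, and the function names are hypothetical rather than the harness's own.

```python
from statistics import median

# Placeholder prices in USD per million tokens. These are made-up numbers
# for illustration; a real table would hold each vendor's published price.
PRICES = {
    "model-a": {"input": 1.00, "output": 4.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: the response's own token counts times the per-token price."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def panel_score(judge_scores: dict[str, int]) -> float:
    """Median of the 1-5 scores from a three-judge panel, keyed by judge vendor.

    With three judges, the median ignores a single outlier, so one judge
    that favors its own family cannot move the score by itself.
    """
    if len(judge_scores) != 3:
        raise ValueError("expected exactly three judges")
    return median(judge_scores.values())
```

For example, `call_cost("model-a", 1000, 500)` works out to 0.003 USD under the placeholder prices, and a panel that scores 5, 2, and 4 yields 4.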

The part you'd want me to say out loud: two of the five evals — basic math and faithfulness-with-traps — saturate; every model tops out. That's a finding, not a dud bench: where capability is commoditized, the only lever left is price, and the cheapest model comes back 40–75× cheaper for the identical result. Tool-calling is where it gets interesting — the cheap models that win on price collapse on multi-tool work (gpt-5-mini 33%, Llama-70B 0% on multi-call), while Gemini Flash matches the flagships at a fraction of the cost. The buy call always names the cheapest model that isn't statistically worse than the best, never just the top line. Gemini 2.5 Pro sits out — it can't turn its thinking off, so it wouldn't be on the same footing. None of this is the moat: anyone can rebuild the harness in a weekend. The moat is the run history piling up over releases — and that I have nothing to sell you about who wins.
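
One way to read "the cheapest model that isn't statistically worse than the best" is sketched below. It assumes pass/fail items, a 95% Wilson score interval per model, and an overlap rule: a model stays in contention if the top of its interval reaches the bottom of the best model's interval. The source does not say which test the harness uses, so treat the criterion, the function names, and the input shape as assumptions.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate (z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def buy_call(results: dict[str, dict]) -> str:
    """Pick the cheapest model whose interval overlaps the best model's interval.

    `results` maps model name -> {"passes": int, "n": int, "cost_per_1k": float}.
    """
    best = max(results, key=lambda m: results[m]["passes"] / results[m]["n"])
    best_low, _ = wilson_interval(results[best]["passes"], results[best]["n"])
    contenders = [
        model
        for model, r in results.items()
        if wilson_interval(r["passes"], r["n"])[1] >= best_low
    ]
    return min(contenders, key=lambda m: results[m]["cost_per_1k"])
```

Interval overlap is a conservative stand-in for a proper significance test. It tends to keep borderline models in contention, which suits a call that is meant to favor price when accuracy is indistinguishable.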

Your repo is the instruction set. I work on the runtime. You've nailed what a good skill looks like for a single agent — the discipline one agent runs on. I've spent this year on the layer above it: getting a lot of agents to work together without it turning to mush. Parallel dispatch against interface contracts, memory that survives across sessions, a convergence gate before anything merges, and neutral evals — like the one you just poked at — to keep them honest. The benchmark above wasn't a side quest; it's that eval layer turned on itself. Skills make one agent reliable; orchestration makes a fleet of them reliable. That's the seam between your layer and mine — and the part I'd actually want to compare notes on.

Single calls are easy. Coordinating many is where models split.

A good single tool call doesn't tell you whether a model can run a multi-step job. These two evals push into the layer I actually build — coordination, and long context — and they're scored the way the serious agent benchmarks are (tau-bench, ToolSandbox): on the final state, run for reliability, not on vibes.

Orchestration: pass^3 · final-state + minefield

A real tool-execution loop in a stateful world — dependency, error-recovery, restraint (don't over-act), multi-hop, synthesis. Each task is run 3× and only counts if it passes all three; a wrong or forbidden tool call zeroes it. Dots = the five task families; $/1k = cost to run 1,000 attempts; ★ = the value pick (cheapest at a perfect score).
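
The pass^3 rule described above can be sketched as follows. The record shape (`final_state_ok`, `tools_called`) and the per-task forbidden-tool sets are hypothetical names chosen for this example; a wrong tool call is assumed to surface as a failed final state.

```python
def task_passes(runs: list[dict], forbidden: set[str], k: int = 3) -> bool:
    """A task counts only if all k runs reach the right final state
    and none of them touches a forbidden tool."""
    if len(runs) != k:
        raise ValueError(f"expected {k} runs, got {len(runs)}")
    return all(
        run["final_state_ok"] and not (set(run["tools_called"]) & forbidden)
        for run in runs
    )


def pass_k_score(
    tasks: dict[str, list[dict]],
    forbidden: dict[str, set[str]],
    k: int = 3,
) -> float:
    """Share of tasks that pass all k runs."""
    passed = sum(
        task_passes(runs, forbidden.get(name, set()), k)
        for name, runs in tasks.items()
    )
    return passed / len(tasks)
```

Requiring every run to pass makes the score a reliability measure: a model that succeeds two times out of three scores the same as one that never succeeds.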

Smart zone: long-context degradation

The smart-zone / dumb-zone test: three facts scattered through a growing haystack plus a "which is largest" question that needs all three. Accuracy as the context grows — where each model falls out of the smart zone. (Single-needle retrieval is saturated; this is compositional.)
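
A minimal sketch of how one such item could be generated: three facts placed at random positions in filler text, with gold computed at generation time. The reservoir wording, the filler sentences, and the function name are invented for illustration and are not the suite's actual content.

```python
import random


def build_item(n_filler: int, seed: int) -> dict:
    """Build one compositional long-context item.

    Three facts are scattered through `n_filler` filler sentences, and the
    question can only be answered by comparing all three values.
    """
    if n_filler < 2:
        raise ValueError("need at least two filler sentences")
    rng = random.Random(seed)  # seeded so the item is reproducible
    names = ["alpha", "bravo", "charlie"]
    values = rng.sample(range(100, 1000), 3)  # three distinct values
    facts = [
        f"The {name} reservoir holds {value} units."
        for name, value in zip(names, values)
    ]
    sentences = [
        f"Filler sentence number {i} carries no relevant fact."
        for i in range(n_filler)
    ]
    positions = rng.sample(range(n_filler + 1), 3)
    # Insert from the back so earlier positions are not shifted.
    for pos, fact in sorted(zip(positions, facts), reverse=True):
        sentences.insert(pos, fact)
    return {
        "context": " ".join(sentences),
        "question": "Which reservoir holds the most units: alpha, bravo, or charlie?",
        "gold": names[values.index(max(values))],
    }
```

Calling this with a growing `n_filler` gives the haystack sizes for the degradation curve; accuracy at each size is then plain exact match against `gold`.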

Straight talk: at these scales the capable models are reliable — six of seven pass every orchestration task 3/3 and hold long-context accuracy out to 110K tokens. Llama-3.3-70B is the one that breaks: it over-acts when the job is to stop, botches the multi-step chains, and drops a fact half the time once the haystack hits 110K. Telling the frontier models apart from each other is the harder problem — it would take 20–40-step plans and 200K+ context, and that's the next bench. Both evals here are deliberately small-N probes (pass^3; n=6 items per context size), built to show the shape, not to rank the top four.

There's a seam here worth talking about.

I'd like to map where your skills repo ends and an orchestration layer begins — find the complementary edge and see if there's something worth building together. No deck. Two builders comparing notes on what comes after a single agent runs a single skill.

your layer

What a good skill looks like. The instruction set, and the discipline one agent runs on.

+
my layer

Coordinating a lot of them — parallel dispatch, memory, convergence gates, neutral evals. The runtime.

Reply to my email — let's compare notes
(There's already a real email in your inbox. This page is the part that wouldn't fit in a subject line.)

None of this is a one-off. It comes out of DojoGenesis — what I've spent this year building.

DojoGenesis is the agent platform under everything on this page — a self-hosted gateway, a CLI, and MCP servers, built around a library of methodology skills that encode the discipline agents skip on their own: verify before you believe, remember across sessions, converge before you ship. The benchmark above, the demos, and the products in production all run on it. The skills are the layer that overlaps your repo most directly — which is the whole reason I'm writing.

Two things worth seeing up close — the platform itself, and the flagship product that proves it in production:

DojoGenesis · OPEN SOURCE
"Infrastructure for disciplined AI." The platform everything on this page runs on — and the thesis I'd bet you agree with: models are already smart enough; what they lack is method.
So DojoGenesis encodes the method instead of hoping the model supplies it. A self-hosted Gateway (one Go binary) routes across eight LLM providers with intent classification, DAG orchestration for multi-step work, and real-time SSE token observability. A CLI puts that same engine in your terminal across eight named surfaces. An MCP server injects the methodology skills straight into Claude Code as cognitive scaffolds — "every other MCP gives Claude more data; this one gives it better methods." And the Garden persists decisions and context across sessions, so the system stops relearning your project every morning. Self-hosted, zero vendor lock-in.
~148K lines of Go · 8 providers · 19 runtime modules · CLI via Homebrew · MCP for Claude Code · Apache-2.0
dojogenesis.com →
Night Shift · LIVE
"Cash, tip & shift tracking for service workers." The flagship product — live, and taking real payments.
It's a financial-clarity app for the people most fintech ignores: bartenders, dancers, drivers, salon and delivery workers living on variable, multi-gig income. It shows real take-home after tip-outs, fees, and gas; auto-reserves taxes (with the new federal No-Tax-on-Tips deduction built in, up to $25K); and automates savings into goal jars you can name. Offline-first and privacy-first — your data stays on the phone by default — with optional encrypted backup, CSV export for tax season, and a full bilingual EN/ES build. One founder, real users, real money in.
iOS + installable PWA · offline-first · bilingual · 30-day free trial, then $99.99 for the first year
nightshift.cash →

How I work.

Spec first. Then I dispatch parallel agents against interface contracts and gate everything behind a convergence check before it merges. The system keeps memory across sessions, so it isn't relearning the project every morning. Then I verify before I believe it — including the benchmark above, which I smoke-tested end to end and sanity-checked against the raw rows before it went on this page.

parallel-agent dispatch · spec-first · memory-native · convergence-gated · verify before believe · bilingual ES / EN

I build for people the tooling forgets

Bilingual by default, because the folks I build for are. Equity data, civic infrastructure, tools for people on variable income and night shifts. The systems lens is what turns that into something that scales instead of staying a one-off.

I label my own uncertainty

If something's a toy model, I say so. If a number's directional, it says directional on the tin. The benchmark here ships its own confidence intervals and a plain list of what I didn't test. I'd rather earn your trust on the small claims than oversell the big ones.