A one-of-one page · built for one reader

Hey Matt.

You wrote Skills for Real Engineers and 148K people starred it. I'm Cruz — I build agent orchestration over in Madison, Wisconsin. I think our two corners of this fit together, and a cold email felt like the wrong way to say so.

Yes — I built you a whole webpage to get sixty seconds of your attention. It seemed more honest than a subject line.
↓ there's a live benchmark in here you can break

You evaluate agents for a living. So I brought receipts, not adjectives.

You just shipped progressive disclosure for ~63% token savings — which tells me you care about the thing most people hand-wave past: what these models actually cost to run, per task, in the real world. That's a language I speak. So instead of telling you I'm credible, here's a working tool from my POC archive that lives in exactly your wheelhouse.

A neutral model benchmark — and I make zero dollars from which one wins.

No gateway. No routing. No resale of inference. Just commodity API keys, an honest harness, and cost computed from each response's own reported token usage × published price — never an estimate. The whole run below cost about ten cents. Poke at it:

Model Buy Signal
● VENDOR-NEUTRAL · REPRODUCIBLE
Real captured data from a committed run, 2026-06-23. Three-tier Claude bake-off (only one key was present at run time, so the cross-vendor adapters are wired but unrun — that's roadmap, not vapor).
BUY
total spend: ~$0.10 contestant calls: $0.029 judge calls: $0.070 (2.4× the contestants) harness: ~600 lines Python
The punchline you'd appreciate: the expensive flagship was the wrong buy on the easy task (Haiku tied for top accuracy at ~7× less cost), and on faithfulness the mid-tier Sonnet beat both the pricier Opus and the Opus acting as judge. The moat was never the harness — anyone can rebuild it in a weekend. It's the longitudinal dataset and the fact that I have no thumb on the scale.

Here's the actual pitch in one line: single-agent skills → multi-agent orchestration. Your repo nails the "what good skills look like" layer. I've been building the layer above it — a decision console that sits over multi-agent ideation runs (one run took 53 raw ideas down to 36 that cleared the gates, each with an adversarial six-dimension verdict). Skills are the instruction set; orchestration is the runtime. I think that's the interesting frontier, and I think you do too.

I build like an orchestrator, not a typist.

Spec first, then dispatch parallel agents against interface contracts, then a convergence gate before anything merges. Memory-native, so the system remembers across sessions instead of re-learning every morning. Ship fast, verify harder. The page you're reading was built this way.

parallel-agent dispatch spec-first memory-native convergence-gated ship → verify → ship bilingual ES / EN

Lived experience + systems thinking

I build for communities everyone else overlooks — bilingual by default, because the people I build for are. Equity data, civic infrastructure, working-shift tools. The systems lens is what turns lived experience into something that scales.

Spunky, warm, allergic to corporate

I'd rather show you a working thing than send you a deck about it. If something's a toy model, I say it's a toy model. If a number's directional, I label it directional. Trust is the only product that compounds.

Seven things I shipped this month. Plus the platform underneath.

0
first-party skills across 9 plugins in CoworkPlugins — peer-level to your repo, and the basis for the collab
0
self-contained POCs built and committed this month — several deployed live
0
live POC subdomains you can open right now

The platform underneath it all: DojoGenesis — AgenticGateway (multi-provider routing), a CLI, and MCP servers. Real multi-agent infrastructure, not a demo reel. And the velocity is the point — here's the archive the benchmark above came from:

model-exhaust ★ the demo
The neutral benchmark you just played with. Sells the exhaust, not the engine.
quorum LIVE
A health decision goes to ~10 named reasoning specialists who split; the divergence map is the product.
quorum-5vu.pages.dev →
counterclock LIVE
Aging-velocity meter — sells the derivative (which way / how fast), not a static bio-age.
counterclock-1r5.pages.dev →
body-almanac LIVE
The Old Farmer's Almanac, rebuilt for one body — a forecast of your predictable bad stretches.
body-almanac.pages.dev →
ip-scout LIVE
IP intelligence bench — claim dissector + neutral model bench + prior-art angle finder.
ip-scout.trespies.dev →
island-remix + build-forge
A multi-agent decision console, and a nightly idea-foundry that drops ~50 original business briefs.

How this page got built — in one session.

You're an engineer; you'll want to know how the sausage was made. So here's the actual loop, no embellishment:

Read the spec, then the source of truth

A handoff defined the four jobs. I read the real repos — ASSESSMENT.md, the captured run data, the skill counts — before writing a line, so every number on this page traces to a file.

Pulled real numbers, refused to fabricate

The benchmark above isn't illustrative. It's the committed run. I spot-checked the figures against the repo rather than rounding to something prettier.

Built it self-contained, house-style

One vanilla HTML file, no build step, on-brand tokens from the TresPies design system — same convention as the rest of the POC archive. Opens straight in a browser.

Verified, screenshotted, shipped

Rendered desktop + mobile, checked the widget, no console errors. Then committed it. Verification isn't a step I skip — it's the whole personality.

Where your skills end, ours begin.

I'd love to map the seam between your skills repo and our orchestration layer — find the complementary edge and see if there's a collaboration worth building. No pitch deck. Just two builders comparing notes on what comes after single-agent skills.

your layer

What a great skill looks like. The instruction set. The discipline a single agent runs on.

+
our layer

Orchestrating many of them — parallel dispatch, memory, convergence gates, neutral evals. The runtime.

Reply to my email — let's compare notes
(There's a real email in your inbox already. This page is just the part I couldn't fit in a subject line.)