Evidence

One prompt. One matrix. We change one thing at a time and watch what moves. Two experiments sit here side by side:

Same skill, two models: hold the skill fixed, swap Sonnet for Opus. This is the floor and the ceiling.
Same model, skill on or off: hold the model fixed, flip /primer-react on and off. This is the cost of search.

Both run the standard pr-merged-switch-dark-mode prompt across five effort tiers per model. Nothing here is a projection. It is what came out of the runs.
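As a rough sketch of that run matrix: the tier labels and the dict layout below are assumptions made for illustration (the text says only that there are five effort tiers), not the harness that produced the runs.

```python
from itertools import product

PROMPT = "pr-merged-switch-dark-mode"
# Hypothetical labels: the write-up says only that there are five effort tiers.
TIERS = ["tier-1", "tier-2", "tier-3", "tier-4", "tier-5"]


def experiment_1():
    """Same skill, two models: the skill stays on, the model varies."""
    return [
        {"prompt": PROMPT, "model": model, "skill": True, "tier": tier}
        for model, tier in product(["sonnet", "opus"], TIERS)
    ]


def experiment_2(model="opus"):
    """Same model, skill on or off: the model stays fixed, the skill varies."""
    return [
        {"prompt": PROMPT, "model": model, "skill": skill, "tier": tier}
        for skill, tier in product([True, False], TIERS)
    ]


if __name__ == "__main__":
    print(len(experiment_1()), "builds in Experiment 1")
    print(len(experiment_2()), "builds per model in Experiment 2")
```

Each experiment is a plain cross product over the one variable it moves and the five tiers; everything else in a cell is a constant.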

Experiment 1: Same skill, two models

We run /primer-react on Sonnet and on Opus and put the results side by side.

The skill holds the quality floor across models; the model changes the ceiling.

What we measured

Across the 10 builds (five effort tiers on each of the two models), the floor held every time:

Typecheck passes: the code compiles.
Both color modes are correct: light and dark each look right, from the design system's own tokens.
The merge box is editable: you can act on it once the checks pass.
The merge lock holds: while checks run, merging is genuinely unavailable, not just greyed out.

That floor is the skill's work, and both models clear it.
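One way to read "the floor held" is as an all-or-nothing gate over the four checks. The sketch below is an illustration only: the class and field names are made up to mirror the list above, and nothing here is the authors' grading code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildChecks:
    # Field names are illustrative; they mirror the four checks listed above.
    typecheck_passes: bool
    color_modes_correct: bool
    merge_box_editable: bool
    merge_lock_holds: bool


def floor_held(build: BuildChecks) -> bool:
    """A build clears the floor only if every one of the four checks passes."""
    return all(
        (
            build.typecheck_passes,
            build.color_modes_correct,
            build.merge_box_editable,
            build.merge_lock_holds,
        )
    )


def floor_held_everywhere(builds: list[BuildChecks]) -> bool:
    """True only when no build in the set misses any check."""
    return all(floor_held(b) for b in builds)
```

Under this reading, a single failed check in a single build is enough to say the floor did not hold.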

The ceiling is where the models split. Opus reaches for richer GitHub chrome: it adds a Timeline and review cards, which brings it closer to the real page. Sonnet stays with a flatter checks list. Both are correct. One is fuller.

The two flips show up in both runs: the light↔dark color-mode toggle, and the capsule going Open → Merged.

Experiment 2: Same model, skill on or off

Now we hold the model fixed and flip the skill. Same prompt, same five effort tiers, but one arm runs /primer-react, the other gets only the PRD, node_modules, and web access and must infer Primer React on its own. To keep it honest, the skill folder was deleted from the no-skill worktrees so the model could not read it even uninvoked.
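A guard like the following could confirm that each worktree matches its arm before a run starts. It is a minimal sketch under stated assumptions: `.claude/skills/` comes from this page, while the `primer-react` folder name beneath it and both function names are hypothetical.

```python
from pathlib import Path

# `.claude/skills/` comes from the text; the "primer-react" folder name under it
# is an assumption made for illustration.
SKILL_DIR = Path(".claude") / "skills" / "primer-react"


def skill_visible(worktree: Path) -> bool:
    """True if the skill folder exists inside this worktree."""
    return (worktree / SKILL_DIR).is_dir()


def check_arm(worktree: Path, skill_on: bool) -> None:
    """Raise if the worktree does not match the arm it is meant to run."""
    visible = skill_visible(worktree)
    if skill_on and not visible:
        raise RuntimeError(f"{worktree}: skill arm, but the skill folder is missing")
    if not skill_on and visible:
        raise RuntimeError(
            f"{worktree}: no-skill arm, but the skill folder is still readable"
        )
```

Failing loudly here is the point: a no-skill arm that can still read the skill folder would quietly shrink the gap being measured.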

We focus on Opus: the expensive model, where the effect is real and clean.

First, the honest part: on Opus the quality came out even. Both arms ship 10/10 typecheck PASS, a working labelled dark-mode toggle every time, correct merge gating, and zero console errors, with or without the skill. The skill does not unlock a capability Opus otherwise lacks here. So the clean, measured difference is cost, and because the only variable is the skill, cost is exactly the right thing to read.

Same prompt, same model. The skill reduces Opus's search cost.

What the cost chart shows

Every Opus tier costs more without the skill. Across the five effort tiers, the no-skill arm spent $58.31; with /primer-react, $47.60. The skill cut $10.71 off the total, about 18% of the no-skill spend; put the other way, going without the skill cost about 22% more. The gap holds at every tier (between +10% and +35%), so it is a steady effect, not one outlier.
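The two totals can be checked directly, and the percentage depends on which arm is the baseline. The sketch below uses only the two dollar figures quoted above; the helper name is ours.

```python
# Totals across the five Opus effort tiers, as quoted in the text (USD).
NO_SKILL_TOTAL = 58.31
WITH_SKILL_TOTAL = 47.60


def percent_change(baseline: float, value: float) -> float:
    """Change from baseline to value, as a percentage of baseline."""
    return (value - baseline) / baseline * 100


saving = round(NO_SKILL_TOTAL - WITH_SKILL_TOTAL, 2)
# Baseline = with-skill total: how much more the no-skill arm costs.
premium_without_skill = percent_change(WITH_SKILL_TOTAL, NO_SKILL_TOTAL)
# Baseline = no-skill total: how much the skill takes off the bill.
reduction_with_skill = -percent_change(NO_SKILL_TOTAL, WITH_SKILL_TOTAL)

print(f"saving: ${saving:.2f}")
print(f"no-skill premium: +{premium_without_skill:.1f}%")
print(f"reduction with skill: {reduction_with_skill:.1f}%")
```

Measured against the with-skill total, the no-skill arm costs roughly 22.5% more; measured against the no-skill total, the skill takes roughly 18.4% off. Both describe the same $10.71.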

On Opus, reliability was a wash

The cost is the whole story here. Both arms were clean on every check: typecheck, a working labelled dark-mode toggle, correct merge gating, and zero console errors, with or without the skill. On Opus the skill is not buying you correctness; it is buying a shorter, cheaper path to the same correct result. The next chart shows where that money goes.

The cost of search: why a private design system is different

Here is the part that matters most. The saving did not come from writing less code: Opus's output was close either way (302k tokens with the skill, 336k without). It came from input: Opus's cache-read input jumped +43% without the skill, from 25.0M to 35.8M tokens.

That is the cost of search. Without the recipe in front of it, the model reads and re-reads node_modules and fetches docs to reconstruct what the skill states up front. The footnote on every number: n=1 per cell · build-only total_cost_usd · 2026-06-15.
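To see how lopsided the two deltas are, the token figures quoted above can be compared side by side. This sketch uses only those four numbers; it does not convert tokens to dollars, since per-token prices are not given here.

```python
# Opus token totals as quoted in the text.
OUTPUT_WITH_SKILL = 302_000
OUTPUT_NO_SKILL = 336_000
CACHE_READ_WITH_SKILL = 25_000_000
CACHE_READ_NO_SKILL = 35_800_000


def growth_pct(with_skill: int, no_skill: int) -> float:
    """Percentage increase when the skill is removed."""
    return (no_skill - with_skill) / with_skill * 100


output_growth = growth_pct(OUTPUT_WITH_SKILL, OUTPUT_NO_SKILL)
cache_read_growth = growth_pct(CACHE_READ_WITH_SKILL, CACHE_READ_NO_SKILL)

extra_output_tokens = OUTPUT_NO_SKILL - OUTPUT_WITH_SKILL
extra_cache_read_tokens = CACHE_READ_NO_SKILL - CACHE_READ_WITH_SKILL

print(f"output: +{output_growth:.0f}% ({extra_output_tokens:,} extra tokens)")
print(
    f"cache-read input: +{cache_read_growth:.0f}% "
    f"({extra_cache_read_tokens:,} extra tokens)"
)
```

Output grows by about 11% (34,000 tokens) without the skill, while cache-read input grows by about 43% (10.8 million tokens), so the extra reading is far larger than the extra writing.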

primer-react is open source, and the model has already seen a great deal of it, so that search is relatively cheap and the gap stays modest. Now picture a private, undocumented design system the model has never encountered. The same search mechanism is still running, but it has far less to work with: there is no public knowledge to lean on, only node_modules to dig through. That is where we expect the gap we measured here to stop being modest. We are not putting a number on that case, because we did not measure it. We are showing you the mechanism and letting you scale it to your own codebase.

Why we change one thing at a time

Each experiment moves exactly one variable, and that discipline is what makes the numbers mean anything.

Cost across models is not a fair comparison. Sonnet and Opus are different products at different prices; lining them up on a dollar axis tells you nothing about the skill. That is why Experiment 1 is a quality story, not a price one.

Cost across skill-on/off is a fair comparison. Same model, one variable flipped, everything else identical. Here the dollar axis is the cleanest thing we measured.

And we keep the tool fixed too. Cursor will not auto-load .claude/skills/, so a run there starts without the skill. If a result looked worse, you could not tell whether the model was weaker or the skill simply never loaded. The two causes tangle and the comparison means nothing. Native Claude Code loads the skill for every model, so when we move from Sonnet to Opus the skill is identical on both sides. The only thing that changed is the thing we meant to change.
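The one-variable rule can be written as a small check on two run configurations. The config keys, the `skill_loaded` flag, and both function names below are assumptions for illustration, not part of any real harness.

```python
def changed_fields(run_a: dict, run_b: dict) -> set:
    """Names of the settings whose values differ between two run configs."""
    return {
        key
        for key in run_a.keys() | run_b.keys()
        if run_a.get(key) != run_b.get(key)
    }


def is_fair_comparison(run_a: dict, run_b: dict, intended: str) -> bool:
    """Fair only if the intended variable is the single thing that changed."""
    return changed_fields(run_a, run_b) == {intended}


# Illustrative configs; the keys and the "skill_loaded" flag are assumptions.
sonnet_native = {"tool": "claude-code", "model": "sonnet", "skill_loaded": True}
opus_native = {"tool": "claude-code", "model": "opus", "skill_loaded": True}
opus_cursor = {"tool": "cursor", "model": "opus", "skill_loaded": False}

if __name__ == "__main__":
    print(is_fair_comparison(sonnet_native, opus_native, "model"))
    print(sorted(changed_fields(sonnet_native, opus_cursor)))
```

Comparing `sonnet_native` with `opus_native` changes only the model. Comparing `sonnet_native` with `opus_cursor` changes the tool, the model, and whether the skill loaded, which is the tangle described above.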

Where this sits

Generate: run the one prompt yourself, then try a second model.
Three Frames: the map this fits into, where a skill gets used, built, and grown.
Takeaways: what to carry home.
