What's Real and What's Noise in the Bullpen Ledger

We built the Bullpen Ledger to name a hero and a goat every night — the fireman who escaped the jam, the arsonist who torched the lead. It is honest about what happened. But it raised a question we owed our readers an answer to: if a reliever blows three leads in June, is that him — a thing he’ll keep doing — or is it mostly the bounce of a very small ball? So we did the boring, important thing. We took every qualified reliever, lined up his 2025 next to his 2026, and asked which of the numbers actually carry over.

The answer splits cleanly in two, and it is the same shape we keep finding — in the Jump Tax, in the Adjustable Swing: the traits repeat, the results don’t. A reliever’s stuff and command are real, identifiable, durable skills. The win-probability ledger those skills produce is, season to season, very close to noise.

1. What repeats, and what doesn’t

Take eleven things you might measure about a reliever — how hard he throws, how many bats he misses, how often he’s in the zone, and on the other side, the win probability he adds, the leads he holds, the leads he blows — and ask the simplest question in sports analytics: does this year’s number predict next year’s?

[Chart: year-over-year Spearman correlation, 2025 → 2026, for n = 177 relievers, with 5,000-resample CIs. Metrics are grouped into skill inputs (how he throws; repeat) and ledger outputs (what he’s credited; noise).]

The split is not subtle. A reliever’s fastball velocity is essentially the same pitcher year to year (0.95); his whiff rate (0.63) and the rest of his stuff and command sit comfortably above a coin flip. Then the line falls off a cliff. Every output — fireman rate, mean win-probability swing, WPA per appearance (0.13, and its interval crosses zero), blown-lead rate (0.06) — lands near or below the noise floor. The thing the ledger dramatizes every night is the thing that least belongs to the pitcher.

That is not a knock on the relievers, and it is not a knock on the ledger. It is a fact about leverage. A reliever throws twenty innings in two months in the highest-variance situations the game has; a couple of bloop hits with the bases loaded can swing his whole season’s ledger. The skill underneath is steady. The win probability it happens to produce in a half-season is mostly weather.
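For readers who want to run the same kind of check, here is a minimal sketch of a year-over-year Spearman correlation with a percentile bootstrap interval. It is not the code behind the figures above. The data are synthetic, the function names are ours, and the 95% percentile interval is an assumption, since the article only states the resample count.

```python
import random
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one side has no variance
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def bootstrap_ci(x, y, n_boot=5000, alpha=0.05, seed=0):
    """Percentile interval from resampling pitchers with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        r = spearman([x[i] for i in idx], [y[i] for i in idx])
        if r == r:  # drop undefined (NaN) resamples
            stats.append(r)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi

# Synthetic stand-ins (NOT the real panel): a stable trait and a noisy one.
rng = random.Random(42)
n = 177
velo_2025 = [rng.gauss(95.0, 2.0) for _ in range(n)]
velo_2026 = [v + rng.gauss(0.0, 0.6) for v in velo_2025]
wpa_2025 = [rng.gauss(0.0, 0.02) for _ in range(n)]
wpa_2026 = [rng.gauss(0.0, 0.02) for _ in range(n)]

for name, a, b in (("velocity", velo_2025, velo_2026),
                   ("WPA/app", wpa_2025, wpa_2026)):
    r = spearman(a, b)
    lo, hi = bootstrap_ci(a, b, n_boot=1000)  # fewer resamples to keep the demo quick
    print(f"{name}: rho={r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Resampling whole pitchers, rather than individual seasons, keeps each pitcher’s 2025 and 2026 values paired inside every bootstrap draw.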

2. See it with your own eyes

You don’t need the correlations to feel this. Plot the same 177 relievers twice — 2025 on the bottom, 2026 up the side — once for velocity, once for the win-probability ledger.

Same 177 relievers, each plotted by his 2025 value (x) against his 2026 value (y). On the dashed line, a reliever repeats himself.

On the left, velocity: the cloud hugs the diagonal, because the fireballer last year is the fireballer this year. On the right, WPA per appearance: a shapeless blur. Knowing what a reliever’s ledger looked like in 2025 tells you almost nothing about 2026. Same pitchers, same axes, two completely different pictures — that gap is the difference between a skill and a scoreboard.

3. The scoreboard barely correlates with itself

Here is the part that surprised us. You might assume the ledger fails to repeat across years because a reliever’s role changes, or his team changes, or his luck regresses. But the noise is more fundamental than that. Split a single 2026 season in half (his odd-numbered appearances against his even-numbered ones) and a reliever’s WPA per appearance correlates with itself at only about 0.17. The stat can barely agree with itself for the same pitcher in the same season. The instability isn’t a cross-year artifact; it’s baked into what win probability measures over twenty innings.

So when a closer’s blown-save count doubles, or a setup man suddenly leads the league in win probability added, the safest bet is not that something changed about the pitcher. It’s that the small-sample ball bounced differently. The stuff is the constant; the ledger is the variable.
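The odd/even split described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the analysis itself: the per-appearance WPA values are simulated, the input layout (one chronological list per pitcher) is hypothetical, and Pearson correlation is used for brevity because the article does not say which coefficient produced the 0.17.

```python
import random
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5

def split_half(appearances_by_pitcher):
    """Correlate each pitcher's odd-appearance mean with his even-appearance mean.

    appearances_by_pitcher: dict of pitcher -> list of per-appearance WPA,
    in chronological order (hypothetical layout).
    """
    odd_means, even_means = [], []
    for wpas in appearances_by_pitcher.values():
        odd = wpas[0::2]    # 1st, 3rd, 5th ... appearance
        even = wpas[1::2]   # 2nd, 4th, 6th ... appearance
        if odd and even:
            odd_means.append(mean(odd))
            even_means.append(mean(even))
    return pearson(odd_means, even_means)

# Simulated season: a small true talent gap buried under large per-game swings.
rng = random.Random(1)
season = {}
for p in range(177):
    talent = rng.gauss(0.0, 0.005)
    season[f"pitcher_{p}"] = [talent + rng.gauss(0.0, 0.1) for _ in range(30)]

print(f"split-half r = {split_half(season):.2f}")
```

Alternating appearances, instead of comparing April to June, keeps both halves spread over the same stretch of the calendar, so a mid-season role change or injury affects them about equally.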

4. What we are not claiming

The honest fine print, as loud as the headline:

This is a half-season of 2026 (March 27–June 29) lined up against a full 2025. We compared rate stats, not totals, to keep it fair — but the 2026 side is young, and that widens every interval.
It’s an incumbents-only study. We required at least 15 appearances in both seasons, which means relievers who pitched their way out of a job (or into one) aren’t here. That survivorship probably makes the skills look a touch more stable than the full population would — it can’t manufacture the output noise, but it’s a real limit on the input side.
We’re reporting repeatability, not a precise noise percentage. An earlier pass tried to pin the exact share of WPA that is “signal” and how far the elite regress; our own cross-review judged those numbers too fragile at a half-season to print as precise figures. So we’re showing you the year-over-year carryover — which is robust — and leaving the decimal-point theatrics out.

5. Why we trust this: two methods, made to fight

One agent measured this as a reliability problem: year-over-year and split-half correlations, with bootstrap intervals on every number. The other refused correlations entirely and treated it as forecasting: train on a reliever’s 2025 numbers, predict his 2026, and see which 2025 numbers actually carry predictive weight, his skills or his prior results. Neither read the other’s work until both analyses were filed.

Question	Reliability	Forecasting	Verdict
Do skill inputs repeat?	velo `0.95`, whiff `0.63`	same anchor passes	Yes — converge
Does the output ledger repeat?	WPA `0.13`; none > 0.5	no output forecastable	No — converge
Skills vs. prior results?	skills far more reliable	2025 skills beat 2025 results	Skill carries — converge
Can we forecast the ledger?	—	both models sub-coin-flip	No — even skill barely helps

They converged on the split and kept each other honest about its size. The forecasting agent’s models, asked to predict 2026 win probability from 2025, were sub-coin-flip in absolute terms — so the right statement isn’t “skill forecasts results,” it’s the more humbling “results carry even less than skill, and skill itself barely forecasts the ledger.” And the reliability agent’s prettier diagnostics — an exact “signal share,” a tidy regression-to-the-mean figure — got marked down by the other as half-season-fragile and pulled from the headline. What survived both is the plain finding: the inputs repeat, the output doesn’t.

What this means for tonight’s game

When our card crowns tonight’s arsonist, read it for what it is: a true account of the most damaging half-inning of the night, not a verdict on the man. The reliever who blew it is, with very high confidence, the same arm he was last week — same velocity, same whiffs, the things that actually describe him. The blown lead is real, and it cost a real win. It just isn’t a promise he’ll do it again.

It’s the companion truth to the one we told in The Firemen the Save Stat Can’t See: the win probability a reliever banks is the best record of what happened, and a poor guide to what he is. Cheer the escape. Judge the pitcher by his stuff.

Methodology

How we built and stress-tested this

Data. Every relief appearance in 2025 (full season) and 2026 (through June 29), from Statcast, aggregated to per-reliever-season rates. Win probability is event-row WPA from the pitching team’s perspective — the same engine behind our nightly Bullpen Ledger. The panel is relievers (role RP) with at least 15 appearances in both seasons: 177 pitchers. We compare rate/per-appearance metrics, never raw totals, because 2026 is a half-season.
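The panel rule described above (at least 15 appearances in both seasons, per-appearance rates instead of totals) can be sketched as follows. The row layout and the names `build_panel` and `MIN_APPEARANCES` are hypothetical; real Statcast fields and the role filter are not modeled here.

```python
from collections import defaultdict

MIN_APPEARANCES = 15  # threshold stated in the article

def build_panel(rows):
    """Return pitcher -> (2025 WPA per appearance, 2026 WPA per appearance).

    rows: iterable of (pitcher, season, wpa) tuples, one per relief
    appearance (hypothetical layout). Pitchers below the appearance
    threshold in either season are dropped.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for pitcher, season, wpa in rows:
        counts[(pitcher, season)] += 1
        totals[(pitcher, season)] += wpa

    panel = {}
    for pitcher in {p for p, _ in counts}:
        n25 = counts.get((pitcher, 2025), 0)
        n26 = counts.get((pitcher, 2026), 0)
        if n25 >= MIN_APPEARANCES and n26 >= MIN_APPEARANCES:
            # Rates, not totals, so the half-season of 2026 is comparable.
            panel[pitcher] = (totals[(pitcher, 2025)] / n25,
                              totals[(pitcher, 2026)] / n26)
    return panel

# Toy rows: "A" qualifies in both seasons, "B" has only 5 appearances in 2026.
rows = ([("A", 2025, 0.01)] * 20 + [("A", 2026, -0.02)] * 20
        + [("B", 2025, 0.03)] * 40 + [("B", 2026, 0.03)] * 5)
panel = build_panel(rows)
print(sorted(panel))
```

Dropping "B" is the survivorship effect in miniature: a pitcher who loses his job mid-2026 never reaches the comparison.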

Two divergent methods. Agent A (reliability theory): year-over-year Spearman and intraclass correlations with 5,000-sample bootstrap CIs; within-2026 odd/even split-half reliability; variance decomposition and regression-to-the-mean estimates. Agent B (machine learning): leave-one-pitcher-out gradient-boosted prediction of each 2026 output from 2025 features, comparing the predictive weight of a reliever’s 2025 skills against his 2025 results, with bootstrap ensembles. Neither read the other’s work until both were filed; then each refereed the other.
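To make the leave-one-pitcher-out idea concrete, here is a deliberately simplified sketch. It swaps the gradient-boosted models for a one-variable least-squares line and scores with mean absolute error against a "predict the average" baseline; both substitutions are ours, made to keep the example dependency-free, and the data are simulated.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return 0.0, my
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def loo_mae(x, y, use_feature=True):
    """Hold out one pitcher at a time, fit on the rest, score the held-out one."""
    errors = []
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if use_feature:
            slope, intercept = fit_line(train_x, train_y)
            pred = slope * x[i] + intercept
        else:
            pred = mean(train_y)  # baseline: ignore the feature entirely
        errors.append(abs(pred - y[i]))
    return mean(errors)

# Simulated pitchers: 2026 output depends weakly on 2025 skill, not on 2025 output.
rng = random.Random(7)
n = 60
skill_2025 = [rng.gauss(0, 1) for _ in range(n)]
wpa_2025 = [rng.gauss(0, 1) for _ in range(n)]
wpa_2026 = [0.2 * s + rng.gauss(0, 1) for s in skill_2025]

print("baseline MAE:   ", round(loo_mae(skill_2025, wpa_2026, use_feature=False), 3))
print("from 2025 skill:", round(loo_mae(skill_2025, wpa_2026), 3))
print("from 2025 WPA:  ", round(loo_mae(wpa_2025, wpa_2026), 3))
```

The baseline matters as much as the models: a feature only "carries predictive weight" if it beats guessing the average for a pitcher the model has never seen.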

What the cross-review changed. The forecasting models are sub-baseline in absolute terms (out-of-fold AUC intervals that include 0.5), so we frame the result as “skills carry more than results, and even skills barely forecast,” not as a usable projection. The reliability pass’s exact “~1% signal” variance share and “~100% regression” figure were judged too fragile at a half-season and cut from the headline in favor of the year-over-year carryover, which is robust. A mislabeled feature (a strand-rate outcome placed among the skill inputs) was flagged and excluded from the skill anchor.

Limitations. Half-season 2026; an incumbents-only panel (survivorship can inflate input stability, though not output noise); descriptive, not a projection system. The conclusion is about repeatability, which the data supports cleanly, not about the precise magnitude of the noise, which it does not.
