The Firemen the Save Stat Can't See

Start with a number that doesn’t fit the story baseball tells about relievers. Through three months of 2026, Colin Holderman of the Guardians has added roughly 1.4 wins of win probability out of the bullpen. That is a genuinely elite figure, and our bootstrap says it is real (95% interval [0.75, 1.98], comfortably clear of zero). He has done it across 29 appearances. He has done almost none of it in the ninth inning with a lead. So when you go looking for him on the leaderboard fans and front offices actually quote, the saves list, he is invisible. He has zero.

That is not an accident of one reliever. It is how the save works. The save, an official MLB statistic since 1969, rewards the man who finishes a close game. It rewards the inning, not the difficulty. And once you measure relief work by the win probability it actually swings, rather than by who happened to be standing there for the last three outs, a strange thing falls out of the data: the save situation is where the least bullpen value is created, and almost nobody is keeping score of the rest.

This started as the engine behind our nightly Bullpen Ledger. To pick each night’s “arsonist” and “fireman” we already compute, for every relief appearance, the swing in the pitching team’s win probability — the same Statcast win-expectancy math behind WPA. We had a half-season of it sitting there: 8,092 relief appearances. So we asked it a question the save stat can’t answer: where does bullpen value actually come from, and who is banking it without credit?

1. 83% of bullpen value happens outside the save

Take every relief appearance this season and split the positive win probability — the good a reliever did — by the situation he entered.

Only 17% of the win probability relievers create comes in save situations — entering the ninth (or later) with a lead of one to three runs. The single biggest slice, 41%, comes in high-leverage non-save spots: the seventh with the bases loaded, the eighth in a tie, the inherited-runner jam that decides the game two innings before the save man ever picks up a ball. Add the rest of the non-save work and 83% of all positive bullpen value is created outside the lane the save stat watches.
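For readers who want to see the arithmetic, here is a minimal sketch of that split. It assumes each appearance has already been labeled with an entry situation; the bucket names, the function name, and the toy numbers are ours for illustration, not the production pipeline or real 2026 data.

```python
from collections import defaultdict


def positive_wpa_shares(appearances):
    """Return each situation's share of total positive WPA.

    `appearances` is an iterable of (situation, wpa) pairs, one per relief
    appearance. Only appearances with wpa > 0 count toward the split.
    """
    totals = defaultdict(float)
    for situation, wpa in appearances:
        if wpa > 0:  # negative and zero outings are left out of the split
            totals[situation] += wpa
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: value / grand_total for name, value in totals.items()}


# Toy numbers, invented purely for illustration.
sample = [
    ("save", 0.10),
    ("high_leverage_non_save", 0.25),
    ("other_non_save", 0.15),
    ("save", -0.30),            # a blown lead: ignored by the positive split
    ("other_non_save", -0.05),
]
shares = positive_wpa_shares(sample)
for name, share in sorted(shares.items()):
    print(f"{name}: {share:.0%}")
```

Note that the split counts only the good outings, so a bucket's share says where value was created, not how relievers fared there on net.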

If you only knew relievers by their saves, you would be looking at about a sixth of the picture.

2. Saves and value are barely related

You might expect saves to at least correlate with value — closers are usually good, after all. They are. But the rank-order relationship between how much win probability a reliever creates and how much save credit he gets is weak: a Spearman correlation of just 0.22. Fold in holds — the stat invented precisely because saves missed setup men — and it climbs only to 0.42, still far from the kind of number you’d want before trusting one stat to stand in for the other.
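A rank correlation like this is simple to reproduce. The sketch below is a plain-Python Spearman calculation, written on the assumption that ties get averaged ranks (which matters here, because many relievers share the same save count of zero). The function names are ours, and this is one standard way to compute it rather than a copy of our analysis code.

```python
def average_ranks(values):
    """Rank values 1..n in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(wpa_totals, save_counts)` on two equal-length lists, one entry per qualified reliever, would give the kind of figure quoted above.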

Every qualified reliever, plotted by how much win probability he created against how much saves-and-holds credit he got. On the dashed line, value and credit agree. The relievers who sit far above it (high value, almost no credit) are the point of this article.

The relievers farthest off that line are creating wins the box score files under nobody.

3. Meet the firemen

Here is where the rigor matters more than the ranking. At a half-season, the order of a bullpen leaderboard is genuinely unstable — resample the games and a reliever who looks seventh could plausibly land anywhere in the top seventy (more on that in §5). So we are not going to hand you a numbered top-twenty and pretend the ordering means something. Instead we name only the relievers whose total win-probability contribution is, with bootstrap confidence, positive and large — and who got almost none of the save credit for it.
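What "bootstrap confidence" means in practice can be sketched in a few lines. This version assumes the resampling unit is the reliever's own appearances and uses a simple percentile interval with 2,000 resamples; the function names and the fixed seed are illustrative choices, not a description of our exact code.

```python
import random


def bootstrap_total_ci(wpa_values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a reliever's summed WPA.

    `wpa_values` holds one WPA figure per appearance. Each replicate draws
    the same number of appearances with replacement and sums them.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(wpa_values)
    totals = sorted(sum(rng.choices(wpa_values, k=n)) for _ in range(n_boot))
    lo = totals[int((alpha / 2) * n_boot)]
    hi = totals[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def excludes_zero(interval):
    """True when the whole interval sits on one side of zero."""
    lo, hi = interval
    return lo > 0 or hi < 0
```

A reliever makes the named list only if something like `excludes_zero(bootstrap_total_ci(his_wpa))` holds and the total is large; a flashy point estimate with an interval that touches zero does not qualify.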

Gold = an under-credited closer (Suarez gets saves, but his value outruns them); green = hidden firemen the save stat misses.

Named on magnitude, never on rank. Every interval excludes zero. Colin Holderman (CLE) is the cleanest case in baseball: a genuinely elite total, the tightest interval here, and effectively all of it invisible to saves. Gordon Graceffo (STL) has the biggest point estimate, with the wide interval a multi-inning high-leverage role produces. John King (MIA) and Ian Seymour (TB) round out the hidden group. And Robert Suarez (ATL) is a different animal — an under-credited closer: he gets the saves, but his win-probability value outruns even them. The gap runs both ways.

We deliberately left names off this list. Plenty of relievers have eye-catching point estimates whose intervals still touch zero at a half-season — the data simply can’t promise their value yet, so we won’t sell it to you. You can watch all of them accrue in real time on each pitcher’s Bullpen Ledger page.

4. What we are not claiming

Honesty is the whole brand, so here is the fine print, as loud as the finding:

This is a half-season, to date (March 27–June 29). Everything here describes 2026 so far, not a career verdict.
The leaderboard order is noisy. Both of our methods flagged it independently: resampled rank intervals are enormous at this sample. That is exactly why we name on win-probability magnitude with confidence intervals, never on rank. “Holderman has banked about 1.4 wins, and we’re confident it’s real and large” is supportable. “Holderman is the 7th-best fireman” is not.
Saves and holds here are reconstructions, not official scoring. Without MLB’s scoring table we approximate the save situation from the game state (ninth or later, one-to-three-run lead). The credit gap is therefore directional — strong enough to say “the save stat misses most bullpen value,” not precise enough to litigate one reliever’s exact hold count.
This is descriptive, not predictive. Win probability added measures what happened — the wins a reliever banked. It is not a claim about who is best or who will keep doing it. In fact, in a companion study we find that single-season bullpen win-probability barely repeats year to year. The value is real; the repeatability is another question.

5. Why we trust this: two methods, made to fight

We don’t publish a leaderboard from one analysis. We ran two, with deliberately different instincts, and kept only what both agreed on. One was transparent bookkeeping: define the save situation by a hard rule, add up positive win probability inside it and outside it, and bootstrap a confidence interval onto every reliever’s total. The other refused to trust a hand-drawn rule at all — it trained a gradient-boosted model to learn the save region from game context, then measured how much value lived outside what the model considered a save. Neither read the other’s work until it was filed; then each refereed the other as a hostile peer.

Question	Bookkeeping	Machine learning	Verdict
Share of value outside saves	`82.9%`	`83.6%`	Converge
Value vs. save credit (Spearman)	`0.22`	`0.17`	Barely related — converge
Is the leaderboard order stable?	No (CI width > 60)	No (CI width > 130)	Name on magnitude — converge
Who are the hidden firemen?	Holderman, Graceffo, King, Seymour	same core	Converge

The fight earned its keep. Both methods independently tripped their own rank-stability alarm — which is why this article names on magnitude, not rank. The machine-learning pass initially wanted to name a dozen firemen; its own bootstrap intervals refuted half of them, so they’re gone. And an early attempt to sort relievers into tidy “usage archetypes” produced clusters with essentially no separation — so we threw the taxonomy out and will tell you the honest version instead: closers separate cleanly from everyone else, and everyone else is a continuum, not a set of neat boxes. What survived two methods trying to break it is what you just read.

What this means for tonight’s game

Next time a reliever you’ve barely heard of jogs in during the seventh with the bases loaded and one out in a tie game, and strikes his way out of it — understand that you may have just watched the single most valuable half-inning any pitcher throws tonight. He won’t get a save. He probably won’t get a hold. The closer who comes on in the ninth with a three-run lead and nobody on will get the line in the recap. But the win was mostly already saved, two innings earlier, by a name the box score is about to forget.

Watch the leverage, not the inning. The nightly Bullpen Ledger is built to show you exactly that — and every reliever’s page now carries his own running ledger of the wins he’s quietly banking.

Methodology

How we built and stress-tested this

Data. Every relief appearance in the 2026 regular season through June 29 — 8,092 appearances — from Statcast pitch-by-pitch data. For each appearance we sum the change in the pitching team’s win probability across the reliever’s own plate appearances (event-row WPA, pitching-team perspective; the same win-expectancy model behind FanGraphs WPA, and the same engine that powers our nightly Bullpen Ledger). Starters are excluded. The qualified pool is relievers with at least 20 appearances (207 relievers).
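As a rough illustration of the per-appearance sum, the sketch below assumes each event row carries the change in the home team's win probability plus a flag for whether the pitcher is on the home team. The field names (`game_id`, `pitcher_id`, `pitcher_is_home`, `delta_home_wp`) are hypothetical stand-ins, not the actual column names in our data.

```python
from collections import defaultdict


def appearance_wpa(events):
    """Sum WPA per (game, pitcher) from the pitching team's perspective.

    Each event is a dict with hypothetical keys:
      game_id, pitcher_id, pitcher_is_home (bool), delta_home_wp (float).
    A gain for the home team is a gain for a home pitcher and a loss
    for a visiting pitcher, so the sign flips for visitors.
    """
    totals = defaultdict(float)
    for event in events:
        sign = 1.0 if event["pitcher_is_home"] else -1.0
        totals[(event["game_id"], event["pitcher_id"])] += sign * event["delta_home_wp"]
    return dict(totals)


# Toy rows, invented for illustration.
rows = [
    {"game_id": "g1", "pitcher_id": "home_rp", "pitcher_is_home": True, "delta_home_wp": 0.05},
    {"game_id": "g1", "pitcher_id": "home_rp", "pitcher_is_home": True, "delta_home_wp": -0.02},
    {"game_id": "g1", "pitcher_id": "away_rp", "pitcher_is_home": False, "delta_home_wp": -0.10},
]
per_appearance = appearance_wpa(rows)
```

Keying on game and pitcher treats every outing by a pitcher in one game as a single appearance, which is the simplification this sketch makes.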

Save/hold proxy. With no official scoring table available, a “save situation” is approximated as entering the ninth inning or later with a one-to-three-run lead; “high leverage” is a leverage index of 1.5 or more on entry. Because these are reconstructions, credit gaps are read as directional, not as exact save/hold counts.
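The proxy rule is short enough to write down directly. In this sketch the function name, argument names, and bucket labels are ours, and checking the save condition before the leverage condition is our reading of "high-leverage non-save"; treat it as an illustration of the rule as stated, not the production code.

```python
def classify_entry(inning, lead, leverage_index):
    """Label a relief entry using the article's proxy definitions.

    inning:          inning in which the reliever enters (1-based)
    lead:            pitching team's lead on entry (negative if trailing)
    leverage_index:  leverage index on entry
    """
    # Save proxy: ninth inning or later with a one-to-three-run lead.
    if inning >= 9 and 1 <= lead <= 3:
        return "save"
    # High leverage: leverage index of 1.5 or more, outside the save proxy.
    if leverage_index >= 1.5:
        return "high_leverage_non_save"
    return "other_non_save"
```

For example, `classify_entry(7, 0, 2.4)` (a tie game in the seventh at high leverage) lands in the high-leverage non-save bucket, while `classify_entry(9, 4, 0.3)` is ordinary non-save work because a four-run lead falls outside the proxy.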

Two divergent methods. Agent A (interpretability): a transparent decomposition of positive-WPA share inside vs. outside the save proxy, with 2,000-sample bootstrap confidence intervals on every reliever’s total and on the value-vs-credit rank correlation, plus a usage-archetype clustering attempt. Agent B (machine learning): a gradient-boosted model that learns the save region from entry context and measures value outside it, with bootstrap rank-stability checks. Neither read the other until filed; then each reviewed the other as a hostile peer.

What the cross-review changed. Both methods independently flagged the leaderboard order as unstable at this sample (resampled top-twenty rank-CI widths of 60–130 places), so we name relievers on win-probability magnitude with confidence intervals rather than on rank. The machine-learning pass’s longer name list was cut to the relievers whose totals exclude zero in both passes. The usage-archetype clustering was dropped entirely: the best solution had essentially no cluster separation (silhouette near zero), so the honest description is “closers separate; everyone else is a continuum.”
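A rank-stability check of this kind can be sketched as follows. It assumes each reliever's appearances are resampled independently and that rank 1 goes to the largest total; the function name, the default of 500 resamples, and the seed are illustrative assumptions rather than the settings either method used.

```python
import random


def rank_interval_widths(wpa_by_reliever, n_boot=500, alpha=0.05, seed=0):
    """Width of each reliever's bootstrap rank interval, in places.

    `wpa_by_reliever` maps a name to that reliever's per-appearance WPA list.
    Every replicate resamples each reliever's appearances with replacement,
    re-sums, and re-ranks the whole pool (1 = largest total).
    """
    rng = random.Random(seed)
    names = list(wpa_by_reliever)
    seen_ranks = {name: [] for name in names}
    for _ in range(n_boot):
        totals = {
            name: sum(rng.choices(values, k=len(values)))
            for name, values in wpa_by_reliever.items()
        }
        ordered = sorted(names, key=lambda name: totals[name], reverse=True)
        for position, name in enumerate(ordered, start=1):
            seen_ranks[name].append(position)
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    widths = {}
    for name, ranks in seen_ranks.items():
        ranks.sort()
        widths[name] = ranks[hi_idx] - ranks[lo_idx]
    return widths
```

A width near zero means a reliever holds his place no matter how the appearances are reshuffled; widths of 60 or more places in a pool of 207 are the signal that the ordering should not be published as a ranking.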

Limitations. Half-season and descriptive, not predictive. Win probability added credits what happened, including leverage and sequence luck; it is not a skill projection (a companion study finds single-season bullpen WPA barely repeats year to year). Save/hold credit is proxy-based, and the leaderboard order is unstable — hence naming on magnitude, never on rank.

