The Jump Tax — CalledThird

Start with a fact that shouldn’t be possible. In 2025, Josh Naylor — a first baseman whose sprint speed sits near the very bottom of the league — stole 22 bases and was caught once. That is a 96% success rate, better than Bobby Witt Jr.’s, better than Elly De La Cruz’s, from a man most of those runners would lap. The obvious explanation is that slow guys who steal must do it with a huge lead, getting a running start the wheels can’t provide. So we looked. Naylor’s lead is one of the smallest in baseball.

So how? We pulled Baseball Savant’s lead-distance tracking, sprint speed, and basestealing run value for every qualified runner over 2023–25, and we ran the whole thing the way we ran the Adjustable Swing and the Pressure Grade: two analytical agents with opposite instincts — one interpretability-first, one a gradient-boosted machine — each working the data blind to the other across three rounds, then forced to referee each other. The answer to the Naylor riddle turned out to open a much bigger door: a base-stealer’s value isn’t in his tools at all.

What we found

Naylor is an ambush. His resting lead is in the 3rd percentile. But on the pitches he picks to go, his lead jumps to the 97th percentile — the largest gap between “standing lead” and “go lead” of any runner in baseball. He hides on the bag, then strikes.
The jump and the legs are separate skills. How much extra lead a runner takes and how fast he is correlate at r = -0.06 — essentially independent. Both are remarkably repeatable year to year (the jump at 0.79, speed at 0.87).
Neither predicts who banks value. Once a model knows a runner’s speed, adding his lead does not reliably improve its prediction of his steal value — not in a single season, and not even pooled over three years. Speed stays the stronger proxy; the jump never overtakes the legs.
Value is conversion. Run value tracks one thing above all: whether the attempts a runner picks actually cash. Success rate and run value correlate at r = 0.87. And conversion barely repeats — a runner’s steal value one year tells you almost nothing about the next.
The single-season leaderboard is mostly noise. About 44% of the spread in one season’s steal value is small-sample luck. You need roughly 21 attempts before the number is even half-reliable.
There are at least two ways to be elite. Pure speed (Buxton, Witt) and pure ambush (Soto, Naylor) bank essentially the same run value — wheels and a well-timed jump are substitutes, not a hierarchy.

One honesty note up front, because it governs everything below: this is measured on the attempts runners chose to make, with public season aggregates. We can show you that a stable jump doesn’t travel with stable value. We can’t see the pitcher’s clock, the catcher, the count, or the green light on each individual attempt — so when we say value lives in “conversion,” that bucket mixes a runner’s judgment with his opponents and with plain luck. Here’s the case.

1. The ambush

Every runner leads a little bigger when he’s about to go — that’s what going means. The question is how much bigger, and whether the size of that swing is unusual. Savant tracks both: a runner’s average secondary lead across all his opportunities, and his lead specifically on the pitches where he breaks for second. The gap between them is a fingerprint.

Ambushers (Soto, Naylor) keep a small standing lead, then open ~13–15 ft on the attempt. Burners (O. Cruz, Witt Jr.) barely change their lead — the legs do the work. 2025, ≥15 attempts.

Naylor’s resting lead (14.3 ft, 3rd percentile among qualified runners) is tiny — he looks like a non-threat, a slow first baseman glued to the bag. Then on his pitch, his lead leaps to 29.2 ft, the 97th percentile. That +14.9-foot swing is the biggest in baseball. Compare a Burner like Byron Buxton, who leads roughly the same whether he’s going or not — he doesn’t need to hide, he just runs.
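For readers who want to check the arithmetic, here is a minimal sketch of the gap calculation. It is an illustration, not the pipeline behind this piece: the function names are hypothetical, the percentile convention is one simple choice among several, and the only real numbers are Naylor's two leads quoted above.

```python
def ambush_gap(attempt_lead_ft, resting_lead_ft):
    """Extra lead, in feet, a runner takes on the pitches where he goes."""
    return attempt_lead_ft - resting_lead_ft


def percentile_rank(value, population):
    """Percent of the population at or below `value` (one simple convention)."""
    at_or_below = sum(1 for x in population if x <= value)
    return 100.0 * at_or_below / len(population)


# Naylor's 2025 leads as quoted in the text: 29.2 ft on attempts, 14.3 ft at rest.
naylor_gap = round(ambush_gap(29.2, 14.3), 1)
print(naylor_gap)  # 14.9

# Made-up gaps for illustration only; these are not real league values.
toy_gaps = [2.0, 5.5, 8.0, naylor_gap]
print(percentile_rank(naylor_gap, toy_gaps))  # 100.0
```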

This is the whole trick. Naylor doesn’t beat the throw with his feet; he beats it with timing. He stands close, gives the battery no reason to hold him, waits for the pitch and the situation he wants, and only then takes an enormous jump. The deception is part of the engine: because his default lead is so small, nobody is ready for the one that counts. And it isn’t a fluke of one season: in the 2026-to-date confirmation sample he ran the same play (10-for-11). Juan Soto, who went a perfect 30-for-30 in 2025 despite sprint speed in the 3rd percentile, is the same animal: a slow man with a 99th-percentile ambush.

So the natural next question — the one a coaching staff would ask — is whether the jump is the thing. If Naylor and Soto win on the ambush, is “the jump” the coachable skill the running game has been underrating? We went looking. The answer is more interesting than yes.

2. Two real skills that don’t pay

The jump is real, and it is a skill in the strict sense: it’s separable from speed and it persists. How much extra lead a runner takes has almost nothing to do with how fast he is — the two correlate at r = -0.06, which is to say not at all. And a runner’s jump in one year predicts his jump the next at r = 0.79, about as stable as sprint speed itself (0.87). On the two tests we usually use to certify a skill — is it distinct, and does it repeat — the jump passes clean.
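Both tests reduce to a Pearson correlation: across runners for distinctness, and across seasons for repeatability. The sketch below shows the year-over-year version under stated assumptions: the helper names are hypothetical, runners are matched on a shared id, and the sample values are invented purely to make the code run.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def year_over_year(metric_y1, metric_y2):
    """Correlate a metric across two seasons for runners present in both."""
    shared = sorted(set(metric_y1) & set(metric_y2))
    return pearson_r([metric_y1[r] for r in shared],
                     [metric_y2[r] for r in shared])


# Invented ambush gaps (feet) keyed by a hypothetical runner id.
gap_first_year = {"r1": 4.1, "r2": 9.8, "r3": 13.5, "r4": 6.0}
gap_second_year = {"r1": 5.0, "r2": 8.9, "r3": 14.2, "r4": 7.1, "r5": 3.3}
print(round(year_over_year(gap_first_year, gap_second_year), 2))
```

A value near 1 means runners keep their rank from one season to the next; the 0.79 and 0.87 figures in the text are this kind of number.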

Then it fails the only test that matters. We built the prediction the honest way: start with a model that knows each runner’s speed, then add his jump and watch whether the forecast of his steal value actually improves out of sample. It doesn’t. In a single season, adding the lead buys nothing. Pool three seasons together, roughly tripling the attempts behind each number, and the jump finally shows a faint, in-sample association with value, but it still doesn’t out-predict speed, and speed stays the larger term. Both agents, the transparent one and the machine, agreed: the jump is a beautiful, reliable trait that you cannot turn into a forecast of who produces.
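The shape of that test can be sketched in a few dozen lines. This is a simplified stand-in, not the agents' code: it swaps the gradient-boosted trees for ordinary least squares, the function names and fold scheme are hypothetical, and the data are synthetic, generated so that value depends on speed only. The point is the mechanics: every season of a runner stays in one fold, and the jump is judged by the change in out-of-fold R².

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(features, targets):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(f) for f in features]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(X, targets)) for i in range(p)]
    return solve_linear(xtx, xty)


def predict(coefs, feature_row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], feature_row))


def grouped_oof_r2(rows, feature_keys, n_folds=5):
    """Out-of-fold R^2 where all seasons of a runner share one fold."""
    runners = sorted({r["runner"] for r in rows})
    fold_of = {runner: i % n_folds for i, runner in enumerate(runners)}
    sse = 0.0
    for fold in range(n_folds):
        train = [r for r in rows if fold_of[r["runner"]] != fold]
        held_out = [r for r in rows if fold_of[r["runner"]] == fold]
        coefs = fit_ols([[r[k] for k in feature_keys] for r in train],
                        [r["value"] for r in train])
        sse += sum((r["value"] - predict(coefs, [r[k] for k in feature_keys])) ** 2
                   for r in held_out)
    mean_y = sum(r["value"] for r in rows) / len(rows)
    sst = sum((r["value"] - mean_y) ** 2 for r in rows)
    return 1.0 - sse / sst


# Synthetic data: value is driven by speed plus noise; the jump is irrelevant.
rng = random.Random(0)
rows = []
for runner in range(60):
    speed = rng.gauss(27.0, 1.5)
    jump = rng.gauss(8.0, 3.0)
    for _season in range(3):
        value = 0.5 * (speed - 27.0) + rng.gauss(0.0, 1.0)
        rows.append({"runner": runner, "speed": speed, "jump": jump, "value": value})

r2_speed = grouped_oof_r2(rows, ["speed"])
r2_both = grouped_oof_r2(rows, ["speed", "jump"])
delta_r2 = r2_both - r2_speed
print(round(r2_speed, 3), round(delta_r2, 3))
```

Unlike an in-sample R², the out-of-fold delta can go negative when the added feature only fits noise, which is what makes it a fair test.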

If that sounds familiar, it’s the same shape we found in the Pressure Grade and the Adjustable Swing: a trait can be among the most repeatable things you can measure about a player and still be wired to nothing he’s paid for. So if it isn’t the jump and it isn’t the legs, what banks the runs?

3. The four ways to steal a base

Plot every qualified runner by his two independent tools — speed across the bottom, the size of his ambush up the side — and the league spreads into a usable map. There are no natural clusters here; runners fill a continuum. But four corners of behavior are worth naming, because they’re four genuinely different ways to be good at this.

n = 70 qualified stealers (≥15 attempts, 2025). Each dot is one runner. The four labels are a vocabulary over a continuum, not discovered clusters. Faster runners need a smaller jump (r = −0.54).

The downward trend is the headline of the picture: among these 70 qualified stealers, speed and the ambush jump behave as substitutes (r = -0.54). The faster you are, the smaller the jump you need; the slower you are, the bigger the jump you must manufacture. Burners (Buxton, Witt, Oneil Cruz) live bottom-right: wheels, ordinary jump. Ambushers (Soto, Naylor, Lindor) live top-left: ordinary-or-slow legs, enormous jump. Pressure runners lead big all the time; Balanced runners do a little of everything.
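To make "a vocabulary over a continuum" concrete, here is one hypothetical way such labels could be imposed. The exact rule is not spelled out in this piece, so everything in the sketch is an assumption: the three traits, the z-score standardization, the 0.5 threshold, and the invented runners.

```python
from statistics import mean, pstdev

# Hypothetical mapping from a runner's strongest trait to a label.
LABELS = {"speed": "Burner", "ambush_gap": "Ambusher", "resting_lead": "Pressure"}


def zscores(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def label_archetypes(runners, threshold=0.5):
    """Label each runner by his largest z-score; 'Balanced' if none stands out."""
    traits = list(LABELS)
    z = {t: zscores([r[t] for r in runners]) for t in traits}
    labels = []
    for i in range(len(runners)):
        best = max(traits, key=lambda t: z[t][i])
        labels.append(LABELS[best] if z[best][i] >= threshold else "Balanced")
    return labels


# Invented runners (speed in ft/s, leads in ft); not real players.
toy_runners = [
    {"speed": 30.0, "ambush_gap": 4.0, "resting_lead": 18.0},
    {"speed": 25.0, "ambush_gap": 15.0, "resting_lead": 14.0},
    {"speed": 27.0, "ambush_gap": 6.0, "resting_lead": 24.0},
    {"speed": 27.5, "ambush_gap": 8.0, "resting_lead": 19.0},
]
labels = label_archetypes(toy_runners)
print(labels)  # ['Burner', 'Ambusher', 'Pressure', 'Balanced']
```

A rule like this always returns tidy labels, which is why labels produced this way say nothing about whether natural groups exist in the data.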

The map makes the substitution concrete. Nobody is elite on both axes, because you don’t need to be: a runner with Buxton’s top speed can go on a normal lead, and a runner with Naylor’s wheels survives only by manufacturing a jump nobody sees coming. Two roads, same destination. Which raises the obvious question about the destination itself — do these roads actually arrive at the same place?

4. Value is conversion, and it’s on par

They do. Group the league by archetype and the run value lands in a narrow band — and crucially, the slowest group is right there with the fastest.

Whiskers are 2,000-resample bootstrap 95% CIs over n = 70 qualified stealers. The Ambusher and Burner intervals overlap heavily — the slow-ambush route is on par with speed, not better.

Ambushers averaged +2.14 runs, Burners +1.98 — statistically a tie (the difference is +0.16 with a confidence interval from -0.99 to +1.33, and it flips negative if you remove Soto). The two specialist routes — pure speed and pure ambush — both beat “lead big all the time” (+1.08) and “average everything” (+0.55). Whiskers are bootstrap 95% intervals; with only 11 Ambushers they’re wide, and we read this as “on par,” never “better.”
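The interval on that difference comes from a bootstrap, which is easy to sketch. The version below is a plain percentile bootstrap under stated assumptions: the function name is hypothetical, the groups are resampled independently, and the run values are invented (only the group size of 11 and the 2,000 resamples come from the text).

```python
import random


def bootstrap_mean_diff_ci(group_a, group_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        a = [rng.choice(group_a) for _ in group_a]
        b = [rng.choice(group_b) for _ in group_b]
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    low = diffs[int(round((alpha / 2) * n_resamples))]
    high = diffs[int(round((1 - alpha / 2) * n_resamples)) - 1]
    return low, high


# Invented run values for illustration only; not the real archetype data.
toy_ambushers = [5.1, 3.4, 2.8, 2.2, 1.9, 1.7, 1.5, 1.2, 0.9, 0.4, -0.6]
toy_burners = [4.0, 3.6, 3.1, 2.7, 2.4, 2.0, 1.8, 1.6, 1.3, 1.0, 0.7, 0.2, -0.3]
low, high = bootstrap_mean_diff_ci(toy_ambushers, toy_burners)
print(round(low, 2), round(high, 2))
```

If the interval straddles zero, as the real one does, the data cannot distinguish the two groups, which is the basis for reading the result as "on par."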

What actually separates the productive base-stealers from the unproductive ones isn’t on the map at all. It’s conversion: whether the attempts a runner chooses turn into steals instead of outs. Run value tracks success rate at r = 0.87; it is very nearly a straight function of it. A caught stealing costs roughly two and a half times what a successful steal gains in run value, so a runner’s ledger is dominated by his misses, and the runners who pick their spots ruthlessly (Soto and Naylor again, at 100% and 96%) are the ones who bank. The tools get you to the attempt. The decision is what cashes it.
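That ratio implies a break-even success rate, and the arithmetic is short. The sketch below treats the rough two-and-a-half-to-one figure as exact, which is an assumption, and it measures value in units of one successful steal rather than in runs. The function names are hypothetical.

```python
# Cost of one caught stealing, in units of one successful steal.
# The text gives this only as "roughly two and a half"; treated as exact here.
CS_COST_RATIO = 2.5


def break_even_rate(cs_cost_ratio=CS_COST_RATIO):
    """Success rate p at which p * 1 equals (1 - p) * cs_cost_ratio."""
    return cs_cost_ratio / (1.0 + cs_cost_ratio)


def net_value_in_sb_units(steals, caught, cs_cost_ratio=CS_COST_RATIO):
    """Net ledger in successful-steal units, not runs."""
    return steals - cs_cost_ratio * caught


print(round(break_even_rate(), 3))   # 0.714
print(net_value_in_sb_units(22, 1))  # 19.5  (Naylor, 2025)
print(net_value_in_sb_units(30, 0))  # 30.0  (Soto, 2025)
```

Under that assumption a runner has to convert a little over 71% of his attempts just to break even, which is why one extra caught stealing moves the ledger more than one extra steal.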

5. Why the steal-value leaderboard lies to you

Here is the uncomfortable corollary, and it’s the most useful thing in this piece for anyone who plays fantasy or runs a front office. The thing that drives a runner’s value — his conversion — is also the thing that doesn’t repeat. A runner’s jump carries over from one year to the next at 0.79; his speed at 0.87. His run value carries over at roughly 0.09. The skills are stable; the scoreboard is nearly random.

A lot of that is simply sample size. We measured how many stolen-base attempts it takes before a single season’s run value is even half-reliable: about 21. Most runners never get there. At a typical 26-attempt season, roughly 44% of the gap between runners on the leaderboard is small-sample noise, and pooling three years — the most data any active runner has — only cuts it to about 30%. Out-of-sample, the predictability of steal value from a runner’s tools rises from almost nothing in one season to modest over three. In other words: the player who “led the league in baserunning value” last year is, more often than not, telling you about his luck and his opportunities, not his ability. Don’t pay for it.
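The single-season numbers hang together under a standard reliability curve. The sketch below assumes the common form n / (n + k), with k set to the 21-attempt half-reliability point from the text. That functional form is an assumption, not the estimator used for this piece, so it should not be expected to reproduce every figure quoted here.

```python
HALF_RELIABLE_ATTEMPTS = 21  # attempts at which reliability reaches 0.5 (from the text)


def reliability(attempts, k=HALF_RELIABLE_ATTEMPTS):
    """Assumed share of observed spread that is signal after `attempts` tries."""
    return attempts / (attempts + k)


def noise_share(attempts, k=HALF_RELIABLE_ATTEMPTS):
    """Complement of reliability: share of spread attributed to sampling noise."""
    return 1.0 - reliability(attempts, k)


print(reliability(21))             # 0.5
print(round(noise_share(26), 2))   # 0.45
```

At 26 attempts the assumed curve puts the noise share at about 0.45, in line with the roughly 44% quoted above.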

Naylor is the proof in miniature. His tools barely moved across three seasons — same slow legs, same tiny resting lead, same below-average first step. His steal value went -0.01, then -0.6, then +3.4. What changed wasn’t the runner. It was that he started running — nine attempts became twenty-three — and kept cashing them. The skill was there the whole time; the value showed up when the decisions did.

6. Why we trust this: two methods, three rounds, made to fight

A single model that concluded “the jump doesn’t pay” would be easy to wave off. So we didn’t run one. An interpretability-first agent (partial correlations, variance decomposition, bootstrap reliability) and a machine-learning agent (gradient-boosted trees with grouped cross-validation, SHAP, permutation checks) each worked the data independently across three rounds — decompose, then test whether the noise is fundamental or just small samples, then formalize the archetypes — reviewing each other as hostile peers between rounds.

The cross-examination earned its keep. In an early round the interpretability agent flagged that one of its own kill-gates was statistically rigged: it relied on an in-sample “improvement” metric that mathematically can’t go negative. The agent threw that gate out in favor of the machine’s honest out-of-sample test. (It also caught the author of this piece overstating Naylor’s first-step burst, which on the proper peer group is below average, not elite. The ambush is in the lead, not the legs.) When two methods that share no code agree, the finding is much less likely to be a modeling artifact.

Question	Interpretability	Machine learning	Verdict
Are jump and speed separate?	`r = -0.06`	same	Yes — converge
Is the jump repeatable?	YoY `0.79`	same	Yes — converge
Does the jump beat speed for value?	no incremental signal	CV ΔR² `-0.04`	No — converge
Where do value swings live?	conversion (~66%)	success mix (SHAP 62%)	Conversion — converge
Is the ambush route > speed?	+0.16, CI spans 0	same	On par — converge

Five questions, two independent methods, five agreements, including the verdict that the headline skill of the piece is worth, predictively, about nothing.

What this means for tonight’s game

When the slow guy on first takes off and beats the throw, you’re not watching an upset — you’re watching a different craft. There are two ways to steal a base: outrun the ball, or out-think the battery and take a jump nobody saw coming. Byron Buxton does the first. Josh Naylor and Juan Soto do the second, and it works exactly as well. The radar gun on the basepaths tells you how a runner will steal, not whether he’ll be good at it.

And when a broadcast flashes “led the majors in baserunning value,” hold it loosely. The runner’s tools are real and they last; the value number on the screen is close to half luck, and much of the rest reflects the bases he chose to chase. The steal, in the end, is a decision, and decisions don’t show up on a leaderboard the way wheels do.

Methodology

How we built and stress-tested this

Data. Baseball Savant basestealing run-value leaderboards (per-runner stolen bases, caught stealings, run value, and primary/secondary lead distances both overall and on stolen-base attempts), joined to Statcast sprint speed, for 2023–25 (with 2026-to-date as a confirmation sample). The primary pool is qualified runners with at least 15 stolen-base attempts in a season (70 in 2025) or 30 across the pooled window. The unit is the runner (MLBAM id); the “ambush gap” is a runner’s secondary lead on attempt pitches minus his secondary lead across all opportunities.

Two divergent methods. Agent A (interpretability): partial correlations and variance decomposition of run value into volume, success, and leverage; split-half and year-over-year reliability with Spearman-Brown correction; bootstrap confidence intervals on every group statistic. Agent B (machine learning): gradient-boosted trees (LightGBM) predicting run value and success rate with GroupKFold by runner so no runner appears in both train and validation; SHAP with permutation-importance sanity checks; grouped-bootstrap intervals. Neither read the other’s work until it was filed; each then reviewed the other across three rounds.

Pre-registered kill gates. Independence of jump and speed (passes, r = -0.06); persistence of the jump (passes, 0.79); incremental predictive value of the jump over speed (fails — single-season and pooled out-of-fold ΔR² both span zero); the sample-vs-signal reliability curve (~21 attempts for half-reliability); and the archetype formalization (the four labels recover an imposed argmax rule cleanly but unsupervised clustering finds no natural groups, so we present them as a vocabulary over a continuum, not as discovered classes).

Limitations. Everything is associational and built on public season aggregates. Attempt-level context — pitcher time-to-plate, catcher, count, score, who held the runner — is not in public data and cannot be reliably reconstructed from pitch feeds, so the “conversion” bucket that drives value mixes a runner’s decision quality with his opponents and with binomial luck; we make no causal or coaching claim. We also tested first-step burst from Statcast running splits: it is separable from top-end speed but is not a proven value predictor here, and it does not rescue the jump story. Lead-distance tracking is a young dataset, and qualified base-stealer pools are small (11 Ambushers), so group intervals are wide by construction.
