Tuesday, April 2, 2013

A Crude Attempt to Statistically Determine Baseball's Best Player for both Pitchers and Hitters

I was bored at work, and with the 2013 MLB season starting, I wanted to go back and see if I could statistically quantify last year's MVP race (and eventually the Cy Young race). As many know, Miguel Cabrera won the MVP voting, as much as the stat geeks/sabermetricians/general managers disagreed.

(For a Glossary behind the stats being used, see: http://www.fangraphs.com/library/ )

I started with the basis that no one stat is perfect (ESPN has fetishized WAR to the point of no return, but then again, ESPN is the worst example of sports journalism pretty much ever). From here, I basically took a basket of accurate and well-supported advanced stats among the baseball community, and assigned a percentage to them in how they would eventually weigh on the final evaluation. This percentage was based roughly on accuracy and general support among statisticians.


For pitching, I came up with the following formula to determine how good a player was:

SIERA (25%) WAR (30%) WPA/LI (15%) WPA (5%) FIP (8%) xFIP (8%) tERA (9%)

The percentages measure how much that stat will eventually be weighted in the final calculation.

I gave WAR the most precedence of any stat because it is still a good catch-all for how good a player is, and more than raw stats, wins are the ultimate goal of baseball.

SIERA is still a relatively new "true" ERA measure that has outperformed tERA and xFIP in recent years in terms of predictive and evaluative power. I made that the second most weighted stat.

WPA stats have some flaws, what with being somewhat prone to "well, in hindsight..." looks, but are still decent ways of accounting for what we want most here: how the player helped garner victories.

I used the standard advanced pitching statistics trifecta to make up the remaining 25%: FIP, xFIP, and tERA.

The nice thing about this is we have a 50/50 split between probability and value based stats (WAR, WPA/LI, and WPA) and raw pitching stats (SIERA, FIP, xFIP, tERA) which I feel strikes a decent balance.
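For anyone who wants to double check the arithmetic, here's a quick Python sketch of the pitching weights. The dictionary and variable names are made up for illustration; only the percentages come from the formula above.

```python
# Pitching weights from the formula above, in percent.
# All names here are illustrative, not part of any library.
PITCHING_WEIGHTS = {
    "SIERA": 25, "WAR": 30, "WPA/LI": 15, "WPA": 5,
    "FIP": 8, "xFIP": 8, "tERA": 9,
}

# The probability/value side of the split; everything else is a raw pitching stat.
VALUE_STATS = {"WAR", "WPA/LI", "WPA"}

total = sum(PITCHING_WEIGHTS.values())
value_share = sum(w for stat, w in PITCHING_WEIGHTS.items() if stat in VALUE_STATS)
raw_share = total - value_share

print(total, value_share, raw_share)  # 100 50 50
```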

Positional Players

Positional players were more straightforward. In previous years, I have considered wOBA a very good statistic, but it's not park adjusted, and I didn't want any non-neutral stat to enter the formula, especially given the huge discrepancy between, say, the Yankees' "Little League" stadium and the Padres' "I swallow all of your flyballs" stadium.

wRAA (25%) wRC+ (25%) WAR (30%) WPA/LI (15%) WPA (5%)

Once again, WAR is given a slight emphasis over all other stats. wRAA and wRC+ make up the raw statistical side of the equation, which once again comes out to a 50/50 split in projection/probability and pure numbers. This is eventually what I worked with to crudely calculate a true MVP, although it's way less crude than "he plays hard" or "it's the Triple Crown, stupid!" or whatever other regurgitated garbage pundits come up with.

Mike Trout vs. Miguel Cabrera (2012 MLB Season)

So, how I did this. I wanted to base this off of separation from the median player for one simple reason. If I had settled, as I originally planned to, on simply assigning a points system based on ranking, it would have penalized a large lead. For instance:

Batting Averages: Scenario A
1. .364
2. .328
3. .327

Player 1 is clearly way ahead of Player 2 here, but a points-based structure would have assigned the same gap between Players 1 and 2 as between Players 2 and 3. Not good. So medians it is.
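To make the difference concrete, here's a rough Python sketch using the three Scenario A averages. The 3-2-1 points scheme is a hypothetical stand-in for a ranking system, and the median here is just the median of these three numbers, not a league median.

```python
from statistics import median

averages = [0.364, 0.328, 0.327]  # Scenario A, already sorted best to worst

# Hypothetical rank-based points: 3 for first, 2 for second, 1 for third.
rank_points = [len(averages) - i for i in range(len(averages))]
rank_gaps = [rank_points[i] - rank_points[i + 1] for i in range(len(averages) - 1)]

# Separation from the median of this tiny sample instead.
mid = median(averages)
ratios = [avg / mid for avg in averages]
ratio_gaps = [ratios[i] - ratios[i + 1] for i in range(len(averages) - 1)]

print(rank_gaps)                          # [1, 1]
print([round(g, 4) for g in ratio_gaps])  # [0.1098, 0.003]
```

The ranking scheme sees two identical one-point gaps, while the median-based ratio keeps Player 1's lead roughly 36 times larger than the gap between Players 2 and 3.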

All medians are based on players with the minimum plate appearances to qualify for end-of-year stats. That meant 143 players qualified, so the median is the 72nd player. I used both the NL and AL to make sure the sample size wasn't cut even smaller.

So, here's the data for each player with their stat totals and the league (AL & NL) median.

Median WAR: 2.8
Trout WAR: 10.0
Cabrera WAR: 6.9

Median wRAA: 8.5
Trout wRAA: 48.2
Cabrera wRAA: 57.3

Median wRC+: 110
Trout wRC+: 166
Cabrera wRC+: 166

Median WPA/LI: 0.94
Trout WPA/LI: 6.00
Cabrera WPA/LI: 6.34

Median WPA: 0.99
Trout WPA: 5.32
Cabrera WPA: 4.82

From here, I quickly calculated how much better each player was than the league median.

Trout WAR: 3.57143 times better than league median (357.143%)
Cabrera WAR: 2.46429 times better than league median (246.429%)

Trout wRAA: 5.67059 times better than league median (567.059%)
Cabrera wRAA: 6.74118 times better than league median (674.118%)

Trout wRC+: 1.50909 times better than league median (150.909%)
Cabrera wRC+: 1.50909 times better than league median (150.909%)

Trout WPA/LI: 6.38298 times better than league median (638.298%)
Cabrera WPA/LI: 6.74468 times better than league median (674.468%)

Trout WPA: 5.37374 times better than league median (537.374%)
Cabrera WPA: 4.86869 times better than league median (486.869%)
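The ratios above are just each player's stat divided by the league median. A minimal Python sketch that reproduces them, with the table layout and the helper name `ratio_to_median` being illustrative choices:

```python
# Stat lines from the article: (league median, Trout, Cabrera).
STATS = {
    "WAR":    (2.8, 10.0, 6.9),
    "wRAA":   (8.5, 48.2, 57.3),
    "wRC+":   (110, 166, 166),
    "WPA/LI": (0.94, 6.00, 6.34),
    "WPA":    (0.99, 5.32, 4.82),
}

def ratio_to_median(value, league_median):
    """How many times the league median a player's stat is."""
    return value / league_median

# Map each stat to (Trout ratio, Cabrera ratio).
ratios = {
    stat: (ratio_to_median(trout, med), ratio_to_median(cabrera, med))
    for stat, (med, trout, cabrera) in STATS.items()
}

for stat, (trout, cabrera) in ratios.items():
    print(f"{stat:7s} Trout {trout:.5f}  Cabrera {cabrera:.5f}")
```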

Trout: ((3.57143 * 3) + (5.67059 * 2.5) + (1.50909 * 2.5) + (6.38298 * 1.5) + (5.37374 * .5)) / 10 = 4.092483 total value

Cabrera: ((2.46429 * 3) + (6.74118 * 2.5) + (1.50909 * 2.5) + (6.74468 * 1.5) + (4.86869 * .5)) / 10 = 4.056991 total value

Trout was roughly 4.09 times as good as the median player (409.2483% of the median)
Cabrera was roughly 4.06 times as good as the median player (405.6991% of the median)

Ok, so what happened here. I basically took each player's stats in their respective categories, compared them to the league median, and ran the results through the formula from earlier in the article (the one with the percentages).
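Here's the whole calculation as a short Python sketch. The weights are written as fractions (0.30 instead of multiplying by 3 and dividing by 10), which is the same arithmetic. The names `WEIGHTS`, `MEDIAN`, and `composite` are illustrative, and because this uses unrounded ratios the last digits differ slightly from the hand-calculated totals.

```python
# Position-player weights from the formula earlier in the article, as fractions.
WEIGHTS = {"WAR": 0.30, "wRAA": 0.25, "wRC+": 0.25, "WPA/LI": 0.15, "WPA": 0.05}

# 2012 league medians and player totals quoted in the article.
MEDIAN  = {"WAR": 2.8,  "wRAA": 8.5,  "wRC+": 110, "WPA/LI": 0.94, "WPA": 0.99}
TROUT   = {"WAR": 10.0, "wRAA": 48.2, "wRC+": 166, "WPA/LI": 6.00, "WPA": 5.32}
CABRERA = {"WAR": 6.9,  "wRAA": 57.3, "wRC+": 166, "WPA/LI": 6.34, "WPA": 4.82}

def composite(player):
    """Weighted sum of each stat's ratio to the league median."""
    return sum(WEIGHTS[s] * player[s] / MEDIAN[s] for s in WEIGHTS)

trout_score = composite(TROUT)
cabrera_score = composite(CABRERA)

print(f"Trout {trout_score:.4f}, Cabrera {cabrera_score:.4f}")
# Trout 4.0925, Cabrera 4.0570
```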

Trout should have won MVP, as the stat geeks proclaimed. (To use a crude example of how crappy the Triple Crown is: even if Ichiro hit .450 with 30 HRs and 100 stolen bases as a leadoff hitter one year, he'd never win the Triple Crown because of RBI dependency.) This was widely supported among voters at sites like Fangraphs and Baseball Prospectus, but I wanted to put a somewhat simplistic and crude, if fair and accurate, number on it.

Obvious Flaws:

One thing that stands out is that some stats have larger spreads than others, and thus will play a bigger role in the final tally because of a higher total. Because the best players will generally congregate towards similar stats over the course of the year anyway, I wasn't too worried about this. You'll notice that whether the players were ~3 times better or ~6 times better than the median, the spread between Trout and Cabrera was very similar in all 5 categories.
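To see how big this effect is, here's a rough Python sketch comparing each stat's nominal weight with the share of Trout's final score it actually ends up supplying. The variable names are illustrative, and "effective share" is just a label for this sketch, not a standard term.

```python
# Nominal weights, league medians, and Trout's totals from the article.
WEIGHTS = {"WAR": 0.30, "wRAA": 0.25, "wRC+": 0.25, "WPA/LI": 0.15, "WPA": 0.05}
MEDIAN  = {"WAR": 2.8,  "wRAA": 8.5,  "wRC+": 110, "WPA/LI": 0.94, "WPA": 0.99}
TROUT   = {"WAR": 10.0, "wRAA": 48.2, "wRC+": 166, "WPA/LI": 6.00, "WPA": 5.32}

# Each stat's weighted contribution to the composite score.
contrib = {s: WEIGHTS[s] * TROUT[s] / MEDIAN[s] for s in WEIGHTS}
total = sum(contrib.values())

# Fraction of the final score that each stat supplies.
share = {s: contrib[s] / total for s in WEIGHTS}

for s in WEIGHTS:
    print(f"{s:7s} nominal {WEIGHTS[s]:.0%}  effective {share[s]:.1%}")
```

wRAA and wRC+ both carry a 25% weight on paper, but wRAA supplies roughly 35% of Trout's score and wRC+ only about 9%, simply because elite wRAA totals sit much further from the median in ratio terms.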

This measure tends to still tilt towards hitting at the expense of fielding and baserunning a bit much, but the WAR and WPA stats help negate that a bit.

As we can see, the final numbers gave Trout a small but noticeably higher score than Cabrera. Given that Trout was a much, much, much better fielder and baserunner, that also lends credence to his season being better.