Thursday, May 29, 2014

DICE: Defense Independent Component ERA

I'm reposting an article from July 2000, because Baseball Mogul players keep asking me what 'DICE' stands for on the pitcher Scouting Report (and because the text in the original article is tiny and hard to read).

Defense Independent Component ERA

July 19, 2000

If you play Baseball Mogul, you have already encountered Defense Independent Component ERA ("DICE"), even if you didn't realize it. That's because the artificial intelligence in Baseball Mogul uses DICE to evaluate pitching talent.


We also use it at Sports Mogul to create our annual player projections.

DICE starts with the concept of "Component ERA" invented by Bill James. The concept is pretty simple -- use the components of a pitcher's statistical performance (such as hits allowed and hit batters) to predict a pitcher's ERA. Because there is a strong correlation between these individual events and the pitcher's ERA, you can actually estimate a pitcher's ERA in a season by just looking at the components. In other words, you can predict earned runs allowed by looking at the individual events (such as walks and home runs) that led to the runs themselves.

ERA is a somewhat luck-based stat. One season is a relatively small sample size, and earned runs given up in one season may not be a true indicator of the pitcher's overall ability level. The pitcher might have given up several home runs with the bases loaded, causing his ERA to be higher than it would have been if the home runs had been distributed randomly throughout the season.

By deriving a value from hits, walks, hit batters and home runs, Component ERA attempts to be a better evaluator of a pitcher's true ability to prevent runs.

Here is James' formula for Component ERA (CERA):

CERA=(((H+BB+HBP)*(.89*(1.255*H+2.745*HR)+.56*(BB+HBP-IBB)))/(BFP*IP))*9-.56
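For anyone who would rather not punch that into a calculator, here is a minimal Python sketch of the formula above. The function name, argument names and the sample stat line are my own illustrations (not real data), and it assumes innings pitched is entered as a true decimal rather than box-score shorthand.

```python
def component_era(h, bb, hbp, ibb, hr, bfp, ip):
    """Component ERA, transcribed term by term from the formula above.

    ip is innings pitched as a true decimal (234 2/3 innings -> 234.667),
    not the box-score shorthand 234.2.
    """
    # Everyone who reached base via hit, walk or hit batter
    baserunners = h + bb + hbp
    # Weighted advancement: hits and home runs, plus unintentional walks and hit batters
    advancement = 0.89 * (1.255 * h + 2.745 * hr) + 0.56 * (bb + hbp - ibb)
    return baserunners * advancement / (bfp * ip) * 9 - 0.56


# Hypothetical stat line, invented for illustration only
sample_cera = component_era(h=200, bb=60, hbp=5, ibb=3, hr=20, bfp=900, ip=220.0)
print(round(sample_cera, 2))
```

Seven inputs and five decimal constants is a lot to carry around in your head.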

But there are a few problems with CERA:

The biggest is that it includes hits. Hits aren't a great indicator of a pitcher's true pitching ability. With the exception of home runs, the number of hits allowed by any pitcher is largely affected by the quality of the defense behind him. This makes sense, but it also stands up to statistical analysis. A pitcher's Strikeout Ratio (strikeouts per 9 innings) is relatively consistent from year to year. However, a pitcher's Hit-Out Ratio (ratio of hits to outs, after removing strikeouts and home runs) doesn't have the same consistency.

The second problem I have with CERA is that it's tough to calculate. Although they aren't perfect, I like measures such as Slugging Percentage and Total Average with formulae that are pretty easy to remember.

So, I created a slightly different form of Component ERA called "Defense Independent Component ERA" (or DICE). It uses the variables in Component ERA but removes hits, except for home runs, which stay in because they are almost never affected by defense.

At first, it looked something like this:

DICE = x + (y*(BB + HBP) + z*HR) / IP

Using all active pitchers in 1999 with 500 or more career Innings Pitched, I performed a regression on the above function to determine the constants x, y and z such that DICE best predicted their career average ERA. (There were 229 pitchers in this data set.)
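If you want to try this kind of fit yourself, here is a rough sketch in plain Python. It assumes an ordinary least-squares fit solved through the normal equations, which may not be exactly how the original regression was run. The pitcher lines are synthetic, and the "true" constants 1.0, 4.0 and 12.0 are made up purely so you can check that the fit recovers them; they are not the values from the real data set.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's lists are left alone
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_dice_constants(pitchers):
    """Least-squares fit of ERA = x + (y*(BB + HBP) + z*HR) / IP."""
    # One design row per pitcher: [1, (BB+HBP)/IP, HR/IP]
    rows = [[1.0, (p["bb"] + p["hbp"]) / p["ip"], p["hr"] / p["ip"]] for p in pitchers]
    targets = [p["era"] for p in pitchers]
    # Normal equations: (A^T A) * coefficients = A^T * targets
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve_linear_system(ata, atb)


# Synthetic career lines (NOT real pitchers), with ERA generated from made-up constants
TRUE_X, TRUE_Y, TRUE_Z = 1.0, 4.0, 12.0
synthetic = [
    {"bb": 300, "hbp": 20, "hr": 90, "ip": 1000.0},
    {"bb": 250, "hbp": 10, "hr": 120, "ip": 900.0},
    {"bb": 400, "hbp": 30, "hr": 70, "ip": 1200.0},
    {"bb": 180, "hbp": 15, "hr": 60, "ip": 600.0},
    {"bb": 500, "hbp": 25, "hr": 200, "ip": 1500.0},
]
for p in synthetic:
    p["era"] = TRUE_X + (TRUE_Y * (p["bb"] + p["hbp"]) + TRUE_Z * p["hr"]) / p["ip"]

x, y, z = fit_dice_constants(synthetic)
print(round(x, 3), round(y, 3), round(z, 3))  # should come back close to the made-up constants
```

With real data the ERAs won't sit exactly on the line, so the fitted constants are simply the ones that leave the smallest squared error.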

But after some experimenting, I noticed that ERAs were also strongly correlated with strikeouts, even when the other stats (walks, hit batters, and home runs) were already taken into account. This is somewhat counter-intuitive. After all, a ground out can be just as good as a strikeout to end an inning. But the regression doesn't lie -- strikeouts are more effective than other types of outs at reducing earned runs. Or, more accurately, strikeout numbers are useful in predicting a pitcher's ERA. And since strikeouts are also defense-independent, it makes sense to add them to the formula.

So I added strikeouts to the formula and performed another regression to determine the correct coefficients to use in the formula. Finally, I found the integer coefficients that best matched the data (because integers make the math easier than that required for CERA):

DICE = 3 + (3*(BB + HBP) + 13*HR - 2*K) / IP

(The Mean Squared Error for this formula, across all 229 pitchers, is .100697. The square root of the Mean Squared Error is about .317, meaning that about 2/3 of all actual ERA values should fall within .317 runs of a pitcher's DICE value.)
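For reference, here is a small sketch of how a figure like that is computed. The `rmse` helper and the three predicted/actual pairs are hypothetical, made up only to show the arithmetic. The last line just takes the square root of the Mean Squared Error quoted above. The "about 2/3" reading assumes the errors are roughly normally distributed.

```python
import math


def rmse(predicted, actual):
    """Square root of the mean squared difference between two equal-length lists."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical DICE values and actual ERAs for three made-up pitchers
predicted_era = [3.10, 4.25, 3.80]
actual_era = [3.40, 4.05, 3.80]
print(round(rmse(predicted_era, actual_era), 3))

# Square root of the Mean Squared Error quoted above
print(round(math.sqrt(0.100697), 3))
```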

So there you have it:
1. Start with a value of 3 times the number of walks and hit batters
2. Add 13 for every home run allowed
3. Subtract 2 for every strikeout
4. Divide this total by the number of innings pitched
5. Finally, add this result to 3.00 to get the pitcher's Defense-Independent Component ERA (aka DICE).
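The five steps above translate almost line for line into code. This is just a sketch: the function name, argument names and the sample stat line are mine (the stat line is invented, not a real pitcher), and innings pitched is assumed to be a true decimal rather than box-score shorthand like 234.2.

```python
def dice(bb, hbp, hr, k, ip):
    """Defense Independent Component ERA, following the five steps above."""
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    total = 3 * (bb + hbp)   # step 1: 3 for every walk and hit batter
    total += 13 * hr         # step 2: add 13 for every home run allowed
    total -= 2 * k           # step 3: subtract 2 for every strikeout
    per_inning = total / ip  # step 4: divide by innings pitched
    return 3.0 + per_inning  # step 5: add the result to 3.00


# Hypothetical stat line, invented for illustration only
sample_dice = dice(bb=50, hbp=5, hr=20, k=180, ip=200.0)
print(round(sample_dice, 2))
```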

Here's an example using Roger Clemens' 1997 season (one of his Cy Young Award seasons):

DICE = 3.00 + (3 * (68 BB + 12 HBP) + 13 * 9 HR - 2 * 292 K) / 264 IP = 2.14
Roger's actual ERA in 1997 was 2.05.

Anyway, I first developed this stat to help me predict how a pitcher would perform in my rotisserie league. DICE is a better predictor of a pitcher's ERA in the upcoming year than any other stat I could find (such as his previous year's actual ERA). Using these predictions, I was able to win the league for 4 years out of 6 (and I'm currently in 1st place in year 7). And of course DICE is one of many tools we use inside the Baseball Mogul game engine.