Showing posts with label stats. Show all posts

Thursday, May 29, 2014

DICE: Defense Independent Component ERA

I'm reposting an article from July 2000, because Baseball Mogul players keep asking me what 'DICE' stands for on the pitcher Scouting Report (and because the text in the original article is tiny and hard to read).

Defense Independent Component ERA

July 19, 2000

If you play Baseball Mogul, you have already encountered Defense Independent Component ERA ("DICE"), even if you didn't realize it. This is because the artificial intelligence in Baseball Mogul uses DICE to evaluate pitching talent.


We also use it at Sports Mogul to create our annual player projections.

DICE starts with the concept of "Component ERA" invented by Bill James. The concept is pretty simple -- use the components of a pitcher's statistical performance (such as hits allowed and hit batters) to predict a pitcher's ERA. Because there is a strong correlation between these individual events and the pitcher's ERA, you can actually estimate a pitcher's ERA in a season by just looking at the components. In other words, you can predict earned runs allowed by looking at the individual events (such as walks and home runs) that led to the runs themselves.

ERA is a somewhat luck-based stat. One season is a relatively small sample size, and earned runs given up in one season may not be a true indicator of the pitcher's overall ability level. The pitcher might have given up several home runs with the bases loaded, causing his ERA to be higher than it would have been if the home runs had been distributed randomly throughout the season.

By deriving a value from hits, walks, hit batters and home runs, Component ERA attempts to be a better evaluator of a pitcher's true ability to prevent runs.

Here is James' formula for Component ERA (CERA):

CERA=(((H+BB+HBP)*(.89*(1.255*H+2.745*HR)+.56*(BB+HBP-IBB)))/(BFP*IP))*9-.56
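For anyone who would rather not do this by hand, here is a minimal Python sketch that transcribes the formula above term by term. The function name, the argument names, and the sample stat line are my own placeholders (the stat line is made up, not a real pitcher), and it assumes innings are passed as a true decimal, so 234 2/3 innings is 234.667 rather than the box-score shorthand 234.2.

```python
def component_era(h, hr, bb, hbp, ibb, bfp, ip):
    """Component ERA, transcribed directly from the CERA formula above.

    h, hr, bb, hbp, ibb: hits, home runs, walks, hit batters and
    intentional walks allowed. bfp: batters faced. ip: innings pitched
    as a true decimal (assumption: not box-score notation).
    """
    baserunners = h + bb + hbp
    advancement = 0.89 * (1.255 * h + 2.745 * hr) + 0.56 * (bb + hbp - ibb)
    return (baserunners * advancement) / (bfp * ip) * 9 - 0.56


# Hypothetical stat line, for illustration only.
cera = component_era(h=200, hr=20, bb=60, hbp=5, ibb=3, bfp=900, ip=220.0)
print(f"CERA = {cera:.2f}")
```

Seven inputs and six constants is a lot to carry around in your head, which is part of the motivation for a simpler formula.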

But there are a few problems with CERA:

The biggest is that it includes hits. Hits aren't a great indicator of a pitcher's true pitching ability. With the exception of home runs, the number of hits allowed by any pitcher is largely affected by the quality of the defense behind him. This makes sense, but it also stands up to statistical analysis. A pitcher's Strikeout Ratio (strikeouts per 9 innings) is relatively consistent from year to year. However, a pitcher's Hit-Out Ratio (ratio of hits to outs, after removing strikeouts and home runs) doesn't have the same consistency.

The second problem I have with CERA is that it's tough to calculate. Although they aren't perfect, I like measures such as Slugging Percentage and Total Average with formulae that are pretty easy to remember.

So, I created a slightly different form of Component ERA called "Defense Independent Component ERA" (or DICE). It uses the variables in Component ERA but removes hits, while keeping home runs, because these are almost never affected by defense.

At first, it looked something like this:

DICE = x + (y*(BB + HBP) + z*HR) / IP

Using all active pitchers in 1999 with 500 or more career Innings Pitched, I performed a regression on the above function to determine the constants x, y and z such that DICE best predicted their career average ERA. (There were 229 pitchers in this data set.)
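To make the regression step concrete, here is a rough Python sketch of one way to fit x, y and z. It assumes ordinary least squares, which the article does not actually specify, and it runs on a handful of synthetic pitchers whose "ERA" is generated from arbitrary coefficients (2.0, 3.0 and 13.0) purely so the fit has something known to recover. None of the names or numbers below come from the real 229-pitcher data set.

```python
def solve_linear(a, b):
    """Solve a small system a * x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_coefficients(pitchers):
    """Least-squares fit of ERA = x + (y*(BB + HBP) + z*HR) / IP."""
    rows = [[1.0, (p["bb"] + p["hbp"]) / p["ip"], p["hr"] / p["ip"]]
            for p in pitchers]
    eras = [p["era"] for p in pitchers]
    k = 3
    # Normal equations: (A^T A) * coeffs = A^T * era
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * e for r, e in zip(rows, eras)) for i in range(k)]
    return solve_linear(ata, atb)


# Synthetic career lines (IP, BB, HBP, HR); the "true" coefficients are arbitrary.
TRUE_X, TRUE_Y, TRUE_Z = 2.0, 3.0, 13.0
raw = [(1000, 300, 20, 90), (800, 350, 30, 60), (1500, 400, 40, 200),
       (600, 150, 10, 80), (1200, 500, 25, 100)]
pitchers = []
for ip, bb, hbp, hr in raw:
    era = TRUE_X + (TRUE_Y * (bb + hbp) + TRUE_Z * hr) / ip
    pitchers.append({"ip": ip, "bb": bb, "hbp": hbp, "hr": hr, "era": era})

x, y, z = fit_coefficients(pitchers)
print(f"x = {x:.3f}, y = {y:.3f}, z = {z:.3f}")
```

Because the synthetic ERAs contain no noise, the fit should hand back the coefficients that generated them. Real career data would be noisy, and the fitted values would only be the best compromise across all pitchers.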

But after some experimenting, I noticed that ERAs were also strongly correlated with strikeouts, even when the other stats (walks, hit batters, and home runs) were already taken into account. This is somewhat counter-intuitive. After all, a ground out can be just as good as a strikeout to end an inning. But the regression doesn't lie -- strikeouts are more effective than other types of outs at reducing earned runs. Or more accurately, strikeout numbers are useful in predicting a pitcher's ERA. And as strikeouts are also defense-independent, it makes sense to add them to the formula.

So I added strikeouts to the formula and performed another regression to determine the correct coefficients to use in the formula. Finally, I found the integer coefficients that best matched the data (because integers make the math easier than that required for CERA):

DICE = 3 + (3*(BB + HBP) + 13*HR - 2*K) / IP

(The Mean Squared Error for this formula, across all 229 pitchers, is .100697. The Square Root of the Mean Squared Error is about .317 -- meaning that, if the errors are roughly normally distributed, about 2/3 of all actual ERA values should fall within .317 runs of a pitcher's DICE value.)
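If you want to check an error figure like this against your own numbers, a small Python sketch follows. The function name and the three sample ERA/DICE pairs are made up for illustration; only the .100697 figure comes from the article.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared gap between actual and predicted."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# The article's reported Mean Squared Error, converted to runs of ERA.
reported_rmse = math.sqrt(0.100697)
print(f"{reported_rmse:.3f}")

# Hypothetical actual ERAs and DICE values for three pitchers.
sample_rmse = root_mean_squared_error([3.50, 4.20, 2.90], [3.30, 4.50, 3.00])
print(f"{sample_rmse:.3f}")
```

Taking the square root puts the error back into the same units as ERA, which is why .317 can be read directly as "runs per nine innings".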

So there you have it:
1. Start with a value of 3 times the number of walks and hit batters
2. Add 13 for every home run allowed
3. Subtract 2 for every strikeout
4. Divide this total by the number of innings pitched
5. Finally, add this result to 3.00 to get the pitcher's Defense-Independent Component ERA (aka DICE).
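The five steps above can be sketched in a few lines of Python. The function and argument names are my own, the stat line is a made-up pitcher rather than real data, and innings are assumed to be a true decimal (so 200 1/3 innings would be passed as 200.333, not 200.1).

```python
def dice(bb, hbp, hr, k, ip):
    """Defense-Independent Component ERA, following the five steps above."""
    total = 3 * (bb + hbp)   # step 1: 3 for every walk or hit batter
    total += 13 * hr         # step 2: add 13 for every home run allowed
    total -= 2 * k           # step 3: subtract 2 for every strikeout
    per_inning = total / ip  # step 4: divide by innings pitched
    return 3.00 + per_inning # step 5: add the result to 3.00


# Hypothetical season line, for illustration only.
value = dice(bb=50, hbp=5, hr=20, k=180, ip=200.0)
print(f"DICE = {value:.2f}")
```

Note that the result can dip below 3.00 only when the strikeouts outweigh the walks, hit batters and home runs combined, which is what you would expect from a dominant season.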

Here's an example using Roger Clemens' 1997 season (one of his Cy Young Award seasons):

DICE = 3.00 + (3 * (68 BB + 12 HBP) + 13 * 9 HR - 2 * 292 K) / 264 IP = 2.14
Roger's actual ERA in 1997 was 2.05.

Anyway, I first developed this stat to help me predict how a pitcher would perform in my rotisserie league. DICE is a better predictor of a pitcher's ERA in the upcoming year than any other stat I could find (such as his previous year's actual ERA). Using these predictions, I was able to win the league in 4 of 6 years (and I'm currently in 1st place in year 7). And of course DICE is one of many tools we use inside the Baseball Mogul game engine.

Monday, September 2, 2013

A Note On Tackles

"Tackles" have been an official stat since 2001, but there is still some confusion about what the term means. For example, CBS Sports and NFL.com both show Luke Kuechly with 164 Tackles in 2012. But Pro-Football-Reference only gives him 103 tackles. This is because CBS and the NFL are adding together "Solo Tackles" and "Assisted Tackles", but Pro-Football-Reference is only counting "Solo Tackles" (with a column next to it for "Assisted Tackles").

[Image caption: Luke Kuechly had 103 "tackles" last year. Or did he?]

ESPN adds more confusion. Instead of a column called "tackles", they have a column called COMB (for "combined") and one called TOTAL. This doesn't clarify anything, because "total" and "combined" are essentially synonyms, both meaning to "add up".

(This convention even confuses ESPN's own writers. Their Fantasy Projection for Kuechly mentions "200 total tackles" when it is clear that what they really mean, according to their own nomenclature, is "200 combined tackles".)


So... for Football Mogul, we are sticking to the NFL's official definition:
[A tackle is] recorded when a defensive player makes contact with an offensive player, forcing him to go to the ground. Tackles can be recorded as either "solo tackles" or "assisted tackles".
In other words, "tackles" includes both "solo tackles" and "assisted tackles". For every tackle that occurs in the simulation, Football Mogul either awards a "solo tackle" to one defensive player, or an "assisted tackle" to each of two different players.
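As a rough illustration of that bookkeeping, here is a small Python sketch. It is not Football Mogul's actual code: the function names, the player labels, the 40% assist chance, and the uniform random choice of tacklers are all assumptions made up for the example. The only real numbers are Kuechly's 164 and 103 from above, which imply 61 assisted tackles.

```python
import random
from collections import defaultdict


def record_tackle(stats, defenders, rng, assist_chance=0.4):
    """Credit one tackle: one solo, or an assist to each of two players."""
    if rng.random() < assist_chance:
        first, second = rng.sample(defenders, 2)  # two different players
        stats[first]["assisted"] += 1
        stats[second]["assisted"] += 1
    else:
        stats[rng.choice(defenders)]["solo"] += 1


def total_tackles(line):
    """'Tackles' under the official definition: solo plus assisted."""
    return line["solo"] + line["assisted"]


# Simulate 100 tackles among four placeholder defenders.
stats = defaultdict(lambda: {"solo": 0, "assisted": 0})
rng = random.Random(2012)
defenders = ["LB1", "LB2", "S1", "CB1"]
for _ in range(100):
    record_tackle(stats, defenders, rng)

solo = sum(line["solo"] for line in stats.values())
assisted = sum(line["assisted"] for line in stats.values())
print(solo, assisted, solo + assisted // 2)  # last value is tackles on the field

# Kuechly's 2012 line, with assists implied by the two totals quoted above.
kuechly = {"solo": 103, "assisted": 164 - 103}
print(total_tackles(kuechly))
```

One consequence of this scheme is that the sum of every player's "tackles" is larger than the number of plays that ended in a tackle, because each assisted tackle is credited twice.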

Thursday, February 28, 2013

Real Minor League Stats (Baseball Mogul)

For Baseball Mogul 2014, we have added over 1 million lines of minor league batting, pitching, fielding and catching data, going back to the 1880s.

The above screenshot shows Jurickson Profar, the Rangers' #1 prospect. The Baseball Mogul AI thinks he should be in their starting lineup on Opening Day.