Friday, February 1, 2013

The Skin Color Project

Over the years, I have heard baseball researchers ask if there was a "race database" somewhere.

For example, the National League was better than the American League throughout the 1960s and 1970s. From 1960 to 1982, the National League won 23 All-Star Games. The American League only won 2.

I would surmise that the National League dominated because its teams integrated more quickly, adding the best black players while a number of American League teams (most notably, the Red Sox) stayed all-white.

But I haven't seen data supporting that theory. That's because we have stats going back to the 1870s, but we don't have a database of racial/ethnic categorization. One reason is that every time it comes up, it becomes clear that "race" is a cultural phenomenon, not a scientific one. For example, President Obama is (at least) half-white and yet we call him the first "black" president.

So, forget "race". I'm working on a skin color database, ranking everyone from 1-9 like so:

It's hard to tell from some pictures exactly where someone's skin color falls. But that's okay. Even if we can't agree whether Ken Griffey Jr. is a "7" or an "8", we still have enough data to ask some interesting questions, like whether skin color affects arbitration awards or ball/strike calls.

Also, we can record multiple "votes" for each player, using the wisdom of crowds to get more accurate relative rankings.
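As a rough sketch of how multiple votes could be combined (this is not the actual database code; the vote format and the player IDs below are made up for illustration), one option is to take the median grade for each player. The median only relies on the ordering of the grades, and `median_low` always returns a grade that somebody actually voted for, so the result stays on the 1-9 scale.

```python
from collections import defaultdict
from statistics import median_low


def aggregate_votes(votes):
    """Collapse (player_id, grade) votes into one consensus grade per player.

    `votes` is an assumed format: an iterable of (player_id, grade) pairs,
    where grade is an integer from 1 to 9.
    """
    by_player = defaultdict(list)
    for player_id, grade in votes:
        if not 1 <= grade <= 9:
            raise ValueError(f"grade out of range for {player_id}: {grade}")
        by_player[player_id].append(grade)
    # median_low picks the lower middle value when the vote count is even,
    # so the consensus is always one of the submitted grades.
    return {pid: median_low(grades) for pid, grades in by_player.items()}


# Hypothetical votes; the IDs and grades are placeholders, not real data.
votes = [
    ("playera01", 7),
    ("playera01", 8),
    ("playera01", 7),
    ("playerb01", 2),
    ("playerb01", 3),
]
consensus = aggregate_votes(votes)
```

With more voters per player, a single odd grade from a badly lit photo has less pull on the result than it would with an average.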

I'm using the same ID numbers used by the "Lahman" baseball database. (These are very similar to the ID numbers used by

For example, the first 10 lines of the database look like this:


If you are able to do some of your own research (even just 20 or 30 players) please e-mail it to me and I'll add it to the database.

One use for this data will be to better represent different players in the new Baseball Mogul animations. But I'll also make it public for anyone who wishes to use it for research. As with everything on this blog, the skin color database is covered by a Creative Commons license making it free to "use and remix".


Danny said...

Is skin colour really a simple gradation of brightness moving from light/white to dark/black? What about asian skin colours? Or skin colours that differentiate in part due to perceived differences in hue (red, olive)? It's not obvious to me where they would go on your 1-9 scale, or how they would impact future analysis. E.g., is Ichiro a 2? Does signing Ichiro make the Yankees "whiter" then, by pushing their average skin-score lower? If Louis Sockalexis was a 6 or 7, how could he be playing baseball in the NL pre-integration?

neilshyminsky said...

Re: Danny's comment. Yep, this. My first reaction was also 'where do Asian guys fall on this?' Skin tone might be reducible to monochromatic variations, but we don't actually perceive race this way.

The problem with measuring skin color is that racial perception goes well beyond it - your name, your language/dialect/accent, your country of origin, or physical features in addition to skin tone. It's strange to distill it down to this one thing. You might be discovering OBP, but you're just as likely (more likely?) to be counting RBI.

I'd also add that the attempt to introduce the large range, while really well-meaning, works against the project because we tend to perceive race as either/or (or either/and). Obama might have had parents of different races, but he's perceived as simply black, for the most part. And I suspect that most white people see little difference between a 3 and a 5, or a 6 and a 9, on this list - but most people see a tremendous difference between a 2 and a 3. (It also belies the reality that some people can slide along the scale very easily.)

Not saying it's a useless mission, by any means. And if it's just the first, exploratory step of some much larger project, that's great. But I think you'll find a lot of noise as you move away from guys at the extremes.

Clay said...

Great points. This is just the beginning. I mentioned that I'd like to use this data for Baseball Mogul, which means an RGB value would be ideal.

But grading 1-9 is a lot easier for humans to do than specifying an exact color. You can't just color grab a pixel from a picture of a player, because that depends greatly on the lighting conditions. But our brains have the lovely ability to tell the difference between Barry Bonds in bright sunlight and Obama in dim light.

At the moment, all the Asians are ending up as "3" -- partly so I know where they are if I want to flag them later.

blogaboutbaseball said...

Will the pitchers have appropriate skin color as well?

Clay said...

Yes. We're building a DB of skin colors for pitchers and batters. And the new pitcher and batter anims each have skin color and uniform color variation built in.

D said...

Regarding Neil's comment, I think a really important point is how the big jump seems to come from 2 to 3, where a person moves (to Americans anyway) from being a white person to being a black person. Whereas the difference between a 7 and an 8 is almost irrelevant, or perceived as hardly a difference at all. This implies your scale is not an interval scale but is merely ordinal. If that's true it really limits what data analysis you can legitimately perform in the future. For example, you can't calculate a mean from a set of ordinal numbers.

dlm said...

Interesting work Clay. If Sammy Sosa had still been playing after his skin bleaching incident, we could have compared his Stats* before and after.

I had a friend pass away recently. He was white and was invited to pitch with the Lebanon and Indianapolis Clowns in the 60's. He had some interesting stories to tell as a sort of successor to Eddie Klep. Once, when his father came to see him play, the other fathers asked, "Which one is your son?"

It would be unfortunate if the study revealed preferential treatment or calls, but your game could "level the playing field" and we could learn some lessons from your work.