Statistics

Correlation

Look, if you don't like math, turn back. You don't need to understand this to understand this blog.

OK, to those of you remaining, this explains correlation. I'll use correlation when looking at background issues. Correlation won't help explain why Miguel Cabrera is a good hitter or why Clayton Kershaw's a good pitcher, but it will help determine how we can make those judgments.

Correlation measures how related two variables are. Sometimes, two different measures will be perfectly correlated. Temperature in Fahrenheit and temperature in Celsius are perfectly correlated. For every one degree rise in Celsius, the Fahrenheit temperature rises exactly 1.8 degrees. That's true at every value of Celsius and Fahrenheit. If you drew a graph, it'd be a straight line. Correlation is measured on a scale from -1 to 1: 1 means the two variables rise together in lockstep, 0 means they have nothing to do with each other, and -1 means one falls in lockstep as the other rises. The Celsius/Fahrenheit correlation is 1.
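If you'd like to see that 1 fall out of the arithmetic, here's a minimal sketch in Python. It assumes that "correlation" here means the standard Pearson correlation coefficient, which is the usual choice. The function name pearson and the sample temperatures are mine, picked only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables move together around their averages
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return co_movement / math.sqrt(spread_x * spread_y)

# Arbitrary sample temperatures (illustrative, not measurements)
celsius = [-10, 0, 10, 20, 30, 40]
fahrenheit = [1.8 * c + 32 for c in celsius]  # the exact conversion rule

r_temp = pearson(celsius, fahrenheit)

# Flip the sign of one variable and the relationship becomes perfectly negative
r_flipped = pearson(celsius, [-f for f in fahrenheit])

print(round(r_temp, 6), round(r_flipped, 6))
```

The first number comes out as 1 and the second as -1 (give or take floating-point rounding), because both relationships are exact straight lines, one sloping up and one sloping down.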

Other variables are strongly correlated, if not perfectly. As the temperature rises, electricity consumption rises, because people use their air conditioners, and the AC has to work harder. It's not a perfect relationship, but it's close, probably in the neighborhood of 0.7 or 0.8. Other variables are not well correlated. The amount of cereal people buy at the grocery store has nothing to do with their height. That correlation's close to zero. Other pairs are negatively correlated: as one rises, the other falls. The amount of chicken people buy is negatively correlated to price, because as the price rises, consumers substitute cheaper foods like turkey or beef or tofu or something. The correlation between chicken purchased and the price of chicken is, I don't know, something like -0.5, I'd guess.

So why do we need to know correlation in order to analyze baseball? Because correlation helps us determine what's important. On offense, the goal is to score runs. On pitching and defense, the goal is to prevent runs. We can determine the correlation between various offensive measures and run scoring or prevention. For example, batting average is pretty well correlated to runs. The better the batting average, the more runs the team scores. Makes sense, right? Home runs are correlated well. So are bases on balls. Doubles are too, but not as much as the three I've mentioned. Strikeouts are negatively correlated.

Here, I'll show you. Here are the correlations between runs scored and various offensive measures, based on every team from every season since 1995:

  • Doubles: 0.56
  • Triples: 0.05
  • Home runs: 0.71
  • Stolen bases: 0.04
  • Walks: 0.63
  • Strikeouts: -0.24
  • Batting average: 0.80
  • On base percentage: 0.88
  • Slugging percentage: 0.90
  • Grounded into double play: 0.25

What this tells you is that while doubles are correlated to runs, they're not as well correlated as walks, which aren't as well correlated as home runs, which aren't as well correlated as batting average, which isn't as well correlated as on base percentage, which isn't quite as well correlated as slugging percentage.
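For the curious, a ranking like the one above could be produced along these lines. This is only a sketch of the general approach, not the actual calculation behind the figures in this post. The function names are mine, and the five "team seasons" below are numbers I made up so the example runs; they are not real team data and won't reproduce the correlations listed above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)

def rank_by_correlation(target, measures):
    """Return (name, correlation) pairs, best correlated with target first."""
    scored = [(name, pearson(values, target)) for name, values in measures.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# INVENTED placeholder numbers: one entry per hypothetical team season
runs = [620, 680, 700, 750, 810]
measures = {
    "home_runs": [140, 150, 175, 170, 210],
    "stolen_bases": [100, 80, 120, 90, 105],
    "strikeouts": [1300, 1150, 1250, 1100, 1120],
}

ranking = rank_by_correlation(runs, measures)
for name, r in ranking:
    print(f"{name}: {r:.2f}")
```

With real data, you'd have one entry per team per season in each list, and you'd add a list for every measure you want to compare against runs.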

So if you want to figure out whether a team, and by extension, a player on a team, can score runs, the first thing you should look at is slugging percentage, because it's the best correlated to run scoring. The second best is on base percentage. Often, you'll see a player's batting average and home runs flash on the screen as he walks to the plate. That's nice information, but it's not as important as slugging and on base percentage. Why not? Because correlation tells us so.

So we can use correlation to evaluate the statistics we use to measure baseball. The higher a statistic's correlation with runs, the better the measure. Pretty neat.

(Two figures above surprised me, but when you think about it, they make sense. First, there is almost no correlation between stolen bases and runs. That's not what I would've thought; stealing bases helps you score runs. But when you think about it, stealing bases is a small-ball strategy. You do it when you're trying to eke out a run. It's a strategy for teams that can't sit back and wait for a slugger to go deep. The Royals led the majors in stolen bases this year, and they were a below-average scoring team. So while an individual's stolen bases are good, on a team level, they're indicative of a problem scoring runs in bunches. Same with grounding into a double play: How can that be correlated with scoring? Because the teams that ground into a lot of double plays are the teams that put a lot of runners on base. San Diego and Minnesota grounded into the fewest double plays in the majors last year, and they had weak offenses. By contrast, the seven teams that grounded into the most double plays were all above-average at scoring runs.)
