Tuesday, April 10, 2012

2011-2012 NHL Competitive Balance

Now that the 2011-2012 NHL regular season is over, let's find out how competitively balanced the NHL was over the 2011-2012 regular season. In order to do this we need a measure of competitive balance, and while there are a number of measures, I am going to choose the Noll-Scully measure of competitive balance. Noll and Scully were looking at basketball and baseball respectively and in each case, there are two outcomes for each game, you win or lose. In some sports that is not always true; hockey being an excellent example. It is possible in hockey to end up with three outcomes: a win, a loss or a tie. Typically, I get around the possibilities of a tie by just calculating each team's winning percentage (# of wins * 2 + # of ties)/(# of games played *2) and use the typical Noll-Scully approach to measure competitive balance - such as we did in our book, The Wages of Wins. If I do that I end up with a Noll-Scully for the 2011-2012 NHL regular season equal to 1.15, which would indicate that the NHL is fairly competitively balanced last year. (If you use the standard deviation of a population instead of the standard deviation of a sample the Noll-Scully is now equal to 1.13).

Yet, what about the possibility of ties and how does that change the competitive balance measure under the Noll-Scully assumption of equal playing strength? Well, David Richardson in a paper "Pay, Performance, and Competitive Balance in the National Hockey League" published in the Eastern Economic Journal in 2000 addressed exactly that question and reports that the formula for calculating the idealized standard deviation of the Noll-Scully measure of competitive balance under a trinomial distribution is: [(1-p)/4n]1/2, where p is the probability of a tie under the equal playing strength assumption and n is the number of games played. Richardson estimates that the probability of a tie (using Stanley Cup playoff series that went 6 or 7 games and ended up tied during regulation to be equal to 0.162). So let's use that for now as our probability of teams with equal playing strength in the NHL of tying to calculate the ideal standard deviation. When doing so, now the Noll-Scully measure of competitive balance is equal to 1.414 - or less competitively balanced than by not including the probability of a tie occurring. (Using the standard deviation of the population results in the Noll-Scully now to be equal to 1.391).

Update: Step-by-step guide to calculate the Noll-Scully Competitive Balance Measure.

Monday, April 9, 2012

2011-2012 NHL Pay and Performance

Given the NHL regular season has just concluded, let's take a look at how team payroll and team performance relate for just the 2011-2012 season. Since we have only one season (and I admit this is a limited sample size so any conclusions should be viewed cautiously), we do not need to calculate relative payroll, as the average payroll is not changing over time - again only one year. Thus all we need to do is gather the data (payroll from USA Today and Standing from hockey-reference) and calculate the r-squared between NHL team payroll and NHL team winning percent for the 2011-2012 regular season. Having done this, I find that the r-squared = 0.198, indicating that the variation in NHL team payrolls "explain" about 20% of the variation in NHL team winning percent for the 2011-2012 NHL regular season.

I plan on coming back to this payroll and performance over a longer period of time next month.

Friday, April 6, 2012

2012 MLB Salary

The USA Today has a nice piece on MLB salaries for 2012. They also have a MLB salary database, which I have used in estimating the relationship between payroll and performance.

Wednesday, April 4, 2012

Estimating MLB Player Run Productivity

Yesterday, I went though a step-by-step guide as to how to determine the marginal product of various MLB team statistics with respect to run productivity at the team level. Yesterday, I treated each team as equal - which may not exactly be true since teams play in different ballparks and this may affect run production. I will come back to this later this month and re-do the analysis by breaking out each team in the overall analysis. I might also just look at different leagues as the AL might be more run producing since pitchers do not bat. It should not take long, so hopefully I will have this done soon.

Today, I want to use the estimates at the team level (see the end of the blog for the actual numbers), and use them to estimate the runs scored productivity at the batter level for last (2011) season. This is not all that difficult, so here are the steps.

Step 1: Open the Batting.csv file and copy the batting data to another file. (You do not have to do this, but I do, so I am calling this step 1).

Step 2: Copy the regression results (estimated coefficients) from the MLB team regression to the right of the player data.

Step 3: Create a formula for the first players run productivity. This is done my taking the coefficient of a single times the number of singles for the first player, plus the coefficient of a double times the number of doubles for the first player, plus the coefficient of a triple times the number of triples for the first player. Keep doing this for all ten variables. Once you have this formula, then copy the formula for the first player and paste it for all the players that year. Note: you might want to put a $ sign in front of the row number of the coefficient in the formula so that Excel keeps referring back to that cell. For example if the coefficient of a single is in cell AE3, then type in $AE$3 instead. This will force Excel to always use that cell when calculating the coefficient of a single. Do this for each of the ten variables.

Step 4: Sort the data by whatever you label the cell for the formula above. I called it Runs Scored, and sort by largest to smallest. Doing so yields the following. Note: that the batters name is using Mr. Lahmen's labeling, but you should be able to figure out who is who rather easily. Here are the top 50, with Miguel Cabrera the most productive batter last year, just above Jose Bautista.

Rank playerID teamID lgID Runs Scored
1 cabremi01 DET AL 114.74
2 bautijo02 TOR AL 113.97
3 ellsbja01 BOS AL 111.70
4 kempma01 LAN NL 110.94
5 braunry02 MIL NL 108.43
6 vottojo01 CIN NL 106.54
7 gonzaad01 BOS AL 105.99
8 fieldpr01 MIL NL 103.42
9 grandcu01 NYA AL 102.57
10 pedrodu01 BOS AL 97.77
11 uptonju01 ARI NL 93.97
12 gordoal01 KCA AL 92.97
13 konerpa01 CHA AL 91.48
14 youngmi02 TEX AL 90.95
15 canoro01 NYA AL 90.92
16 pujolal01 SLN NL 88.55
17 berkmla01 SLN NL 88.48
18 kinslia01 TEX AL 86.93
19 ortizda01 BOS AL 86.04
20 teixema01 NYA AL 85.38
21 ramirar01 CHN NL 85.12
22 tulowtr01 COL NL 84.38
23 reyesjo01 NYN NL 82.24
24 mccutan01 PIT NL 82.19
25 morsemi01 WAS NL 81.96
26 stantmi03 FLO NL 81.73
27 cabreme01 KCA AL 80.32
28 zobribe01 TBA AL 79.82
29 phillbr01 CIN NL 78.12
30 napolmi01 TEX AL 77.13
31 avilaal01 DET AL 76.96
32 butlebi03 KCA AL 75.95
33 francje02 KCA AL 75.69
34 gonzaca01 COL NL 75.55
35 swishni01 NYA AL 75.55
36 martivi01 DET AL 75.44
37 castrst01 CHN NL 75.15
38 beltrad01 TEX AL 74.48
39 howarry01 PHI NL 74.10
40 victosh01 PHI NL 73.99
41 hamiljo03 TEX AL 73.48
42 cabreas01 CLE AL 73.33
43 markani01 BAL AL 73.21
44 santaca01 CLE AL 73.07
45 bruceja01 CIN NL 71.89
46 jonesad01 BAL AL 71.66
47 peraljh01 DET AL 71.14
48 penaca01 CHN NL 71.04
49 longoev01 TBA AL 70.86
50 hollima01 SLN NL 70.37

Tuesday, April 3, 2012

Estimating MLB Team Run Productivity

The "boys of Summer" are back in full swing this week, which got me to thinking about MLB teams and player productivity. In our book, The Wages of Wins, we examined MLB player productivity and how consistent MLB players are in terms of production in chapter 9. We found that MLB players are more consistent than NFL QB's, but less consistent than NBA players. In my Sports Economics class I go through the details as to how this analysis is carried out, and I thought that I would write up the process here.

Specifically, I am going to look at the marginal value (economists call this marginal product) of a MLB player. To accomplish this involves two tasks; one is the estimate the marginal value of various MLB team statistics with respect to runs scored and the second is to take these values and use them to estimate the marginal productivity of MLB players. Today I am going to go through the first task - estimating the MLB team runs scored model and tomorrow I plan on going over the second task and report the top 50 MLB players for last season using this model. So, let's turn to the first task.

Today, we are estimating MLB team runs scored production. Given that baseball is a much different sport than either football or hockey, we can analyze the model by just using the team's on-field production numbers that significantly affect runs scored. This method is similar to Ferdinard Lane, Pete Palmer's Linear Weights and Asher Blass' Runs Created methodologies. So here is a step-by-step guide as to how to estimate the marginal value of various MLB team statistics.

Step 1: Download (for free) from Sean Lahman's website the latest baseball database, which as of April 2012 is version 5.9.1 and I choose the comma-delimited version, which opens up nicely as Microsoft Excel files.

Step 2: Unzip file file and open up the Teams.csv file. Since the file starts with 1871 season, I just copy the team data that I am going to use (2000-2011) to another Excel file.

Step 3a: I organize the team data. Since I will need the following variables: R, AB, H, 2B, 3B, HR, SB, SF, CS and BB from this file, I put them all next to each other in the new team file.
Step 3b: I will need to calculate the following variables: 1B = H - 2B - 3B - HR; NBB = BB - IBB; GIDPCS = GIDP + CS; and Outs = AB - H. Now I have good news and bad news. The good news is that Mr. Lahman's database has all but two of the variables already entered at the team level. The bad news is that Mr. Lahman's database has all but two (IBB and GIDP) of the variables to make those calculations above already entered at the team level. Thus I will need to get this data at the team level to finish out the calculations.

Step 4: To get the team level IBB (intentional walks) and GIDP (grounded into a double play), I have to open the Batting.csv file that I unzipped in step 2. In the Batting.csv file, column S has the IBB data by player and column W has the GIDP data by player. So to go from the player level to the team level, I will need to aggregate up to the team level. Since the data is sorted alphabetically by player name, this data will need to be sorted for each season and for each team. Thankfully, Excel has an excellent sort function that will do that rather quickly.

Step 5: Sort the player data in the Batting.csv file. I highlight the entire data (click on the upper left of the grid in Excel) and under Data click on Sort, which opens up a box. In that box, I choose at the top right the box next to "My Data has headers", and then in the dropdown menu to the right of "Sort by" I choose YearID. Next I click Add Level - which is at the very top left of this box - and in the new dropdown menu choose lgID; and finally click Add Level again and in the newest dropdown menu choose TeamID. Now click OK near the bottom right of the box.

Step 6: Once step 5 is completed, on row 79251 begins the 2000 season, and the 2011 season ends at row 95196. I just copied this data to another Excel file to work on. At this point I insert a blank row at the end of each team for each season for a place to make the IBB and GIDP calculations, which might take a few minutes. Once that is done, just sum up the columns (by team) for IBB and GIDP starting with the 2000 season and finishing with the 2011 season. In order to save a little time in the next step, I also include each time I calculate a team's IBB and GIDP the season, teamID and League, so that when I sort this data in the next step, it will copy and paste nicely into the Team.csv file in the right order.

Step 7: Sort the IBB and GIDP data by season, league and team (similar to step 5). This should allow you to copy and paste the IBB and GIDP data back into the Team.csv file where each datapoint correctly corresponds to the season, league and team in the Team.csv file. Check on this, or step 6 will be done in vain, and your results will be wrong.

Step 8: Run the following linear regression: R = f(1B, 2B, 3B, HR, NBB, HBP, SB, GIDPCS, SF, Outs). I actually choose to do this in another program, which allowed me to control for heteroskedasticity (unequal scatter in the data). Here are the estimated results for these 12 seasons of data.


Coefficient Std. Error t-stat prob.
C 151.167 121.681 1.242 0.215
SINGLE 0.539 0.021 25.618 0.000
DOUBLE 0.710 0.045 15.925 0.000
TRIPLE 1.147 0.134 8.571 0.000
HR 1.430 0.041 34.607 0.000
NBB 0.329 0.021 15.793 0.000
HBP 0.420 0.089 4.742 0.000
SB 0.100 0.041 2.401 0.017
GIDPCS -0.176 0.069 -2.563 0.011
SF 0.604 0.176 3.424 0.001
OUTS -0.146 0.028 -5.128 0.000

A few observations: first, that other than the constant term, each of the variables is statistically significant at the 99% confidence level and of the correct sign. Only GIDPCS and OUTS are negative, which both are using up the finite and scarce resource in baseball: outs. Second, also that HR's have a higher coefficient than Singles (1B), which means that a HR will on average generate more runs than a single. All the other coefficients seem to make sense as well. Finally, note that the coefficient on GIDPCS is greater in absolute value than SB, which means that if a player made two stolen base attempts, and was successful on one attempt but not on the other, that player on average would have cost his team more than he benefited his team.

Tomorrow I plan on reporting the top 50 MLB batters for the 2011 season.