Team Sports Analysis

Tuesday, April 10, 2012

2011-2012 NHL Competitive Balance

Now that the 2011-2012 NHL regular season is over, let's find out how competitively balanced the NHL was over the 2011-2012 regular season. In order to do this we need a measure of competitive balance, and while there are a number of measures, I am going to choose the Noll-Scully measure of competitive balance. Noll and Scully were looking at basketball and baseball respectively and in each case, there are two outcomes for each game, you win or lose. In some sports that is not always true; hockey being an excellent example. It is possible in hockey to end up with three outcomes: a win, a loss or a tie. Typically, I get around the possibilities of a tie by just calculating each team's winning percentage (# of wins * 2 + # of ties)/(# of games played *2) and use the typical Noll-Scully approach to measure competitive balance - such as we did in our book, The Wages of Wins. If I do that I end up with a Noll-Scully for the 2011-2012 NHL regular season equal to 1.15, which would indicate that the NHL is fairly competitively balanced last year. (If you use the standard deviation of a population instead of the standard deviation of a sample the Noll-Scully is now equal to 1.13).

Yet, what about the possibility of ties and how does that change the competitive balance measure under the Noll-Scully assumption of equal playing strength? Well, David Richardson in a paper "Pay, Performance, and Competitive Balance in the National Hockey League" published in the Eastern Economic Journal in 2000 addressed exactly that question and reports that the formula for calculating the idealized standard deviation of the Noll-Scully measure of competitive balance under a trinomial distribution is: [(1-p)/4n]^1/2, where p is the probability of a tie under the equal playing strength assumption and n is the number of games played. Richardson estimates that the probability of a tie (using Stanley Cup playoff series that went 6 or 7 games and ended up tied during regulation to be equal to 0.162). So let's use that for now as our probability of teams with equal playing strength in the NHL of tying to calculate the ideal standard deviation. When doing so, now the Noll-Scully measure of competitive balance is equal to 1.414 - or less competitively balanced than by not including the probability of a tie occurring. (Using the standard deviation of the population results in the Noll-Scully now to be equal to 1.391).

Update: Step-by-step guide to calculate the Noll-Scully Competitive Balance Measure.

Monday, April 9, 2012

2011-2012 NHL Pay and Performance

Given the NHL regular season has just concluded, let's take a look at how team payroll and team performance relate for just the 2011-2012 season. Since we have only one season (and I admit this is a limited sample size so any conclusions should be viewed cautiously), we do not need to calculate relative payroll, as the average payroll is not changing over time - again only one year. Thus all we need to do is gather the data (payroll from USA Today and Standing from hockey-reference) and calculate the r-squared between NHL team payroll and NHL team winning percent for the 2011-2012 regular season. Having done this, I find that the r-squared = 0.198, indicating that the variation in NHL team payrolls "explain" about 20% of the variation in NHL team winning percent for the 2011-2012 NHL regular season.

I plan on coming back to this payroll and performance over a longer period of time next month.

Friday, April 6, 2012

2012 MLB Salary

The USA Today has a nice piece on MLB salaries for 2012. They also have a MLB salary database, which I have used in estimating the relationship between payroll and performance.

Thursday, April 5, 2012

MLB Pay and Performance Revisted

Given MLB is now underway, I thought that I would link to four previous blogs on MLB pay and performance.

First, estimating MLB payroll and performance in 2011.
Second, estimating MLB payroll and performance from 1988 to 2011.
Third, a step-by-step guide as to how to get the results from above.
Fourth, a popular blog from my Hawkonomics blog.

Enjoy!

Wednesday, April 4, 2012

Estimating MLB Player Run Productivity

Yesterday, I went though a step-by-step guide as to how to determine the marginal product of various MLB team statistics with respect to run productivity at the team level. Yesterday, I treated each team as equal - which may not exactly be true since teams play in different ballparks and this may affect run production. I will come back to this later this month and re-do the analysis by breaking out each team in the overall analysis. I might also just look at different leagues as the AL might be more run producing since pitchers do not bat. It should not take long, so hopefully I will have this done soon.

Today, I want to use the estimates at the team level (see the end of the blog for the actual numbers), and use them to estimate the runs scored productivity at the batter level for last (2011) season. This is not all that difficult, so here are the steps.

Step 1: Open the Batting.csv file and copy the batting data to another file. (You do not have to do this, but I do, so I am calling this step 1).

Step 2: Copy the regression results (estimated coefficients) from the MLB team regression to the right of the player data.

Step 3: Create a formula for the first players run productivity. This is done my taking the coefficient of a single times the number of singles for the first player, plus the coefficient of a double times the number of doubles for the first player, plus the coefficient of a triple times the number of triples for the first player. Keep doing this for all ten variables. Once you have this formula, then copy the formula for the first player and paste it for all the players that year. Note: you might want to put a $ sign in front of the row number of the coefficient in the formula so that Excel keeps referring back to that cell. For example if the coefficient of a single is in cell AE3, then type in $AE$3 instead. This will force Excel to always use that cell when calculating the coefficient of a single. Do this for each of the ten variables.

Step 4: Sort the data by whatever you label the cell for the formula above. I called it Runs Scored, and sort by largest to smallest. Doing so yields the following. Note: that the batters name is using Mr. Lahmen's labeling, but you should be able to figure out who is who rather easily. Here are the top 50, with Miguel Cabrera the most productive batter last year, just above Jose Bautista.

Rank	playerID	teamID	lgID	Runs Scored
1	cabremi01	DET	AL	114.74
2	bautijo02	TOR	AL	113.97
3	ellsbja01	BOS	AL	111.70
4	kempma01	LAN	NL	110.94
5	braunry02	MIL	NL	108.43
6	vottojo01	CIN	NL	106.54
7	gonzaad01	BOS	AL	105.99
8	fieldpr01	MIL	NL	103.42
9	grandcu01	NYA	AL	102.57
10	pedrodu01	BOS	AL	97.77
11	uptonju01	ARI	NL	93.97
12	gordoal01	KCA	AL	92.97
13	konerpa01	CHA	AL	91.48
14	youngmi02	TEX	AL	90.95
15	canoro01	NYA	AL	90.92
16	pujolal01	SLN	NL	88.55
17	berkmla01	SLN	NL	88.48
18	kinslia01	TEX	AL	86.93
19	ortizda01	BOS	AL	86.04
20	teixema01	NYA	AL	85.38
21	ramirar01	CHN	NL	85.12
22	tulowtr01	COL	NL	84.38
23	reyesjo01	NYN	NL	82.24
24	mccutan01	PIT	NL	82.19
25	morsemi01	WAS	NL	81.96
26	stantmi03	FLO	NL	81.73
27	cabreme01	KCA	AL	80.32
28	zobribe01	TBA	AL	79.82
29	phillbr01	CIN	NL	78.12
30	napolmi01	TEX	AL	77.13
31	avilaal01	DET	AL	76.96
32	butlebi03	KCA	AL	75.95
33	francje02	KCA	AL	75.69
34	gonzaca01	COL	NL	75.55
35	swishni01	NYA	AL	75.55
36	martivi01	DET	AL	75.44
37	castrst01	CHN	NL	75.15
38	beltrad01	TEX	AL	74.48
39	howarry01	PHI	NL	74.10
40	victosh01	PHI	NL	73.99
41	hamiljo03	TEX	AL	73.48
42	cabreas01	CLE	AL	73.33
43	markani01	BAL	AL	73.21
44	santaca01	CLE	AL	73.07
45	bruceja01	CIN	NL	71.89
46	jonesad01	BAL	AL	71.66
47	peraljh01	DET	AL	71.14
48	penaca01	CHN	NL	71.04
49	longoev01	TBA	AL	70.86
50	hollima01	SLN	NL	70.37

Tuesday, April 3, 2012

Estimating MLB Team Run Productivity

The "boys of Summer" are back in full swing this week, which got me to thinking about MLB teams and player productivity. In our book, The Wages of Wins, we examined MLB player productivity and how consistent MLB players are in terms of production in chapter 9. We found that MLB players are more consistent than NFL QB's, but less consistent than NBA players. In my Sports Economics class I go through the details as to how this analysis is carried out, and I thought that I would write up the process here.

Specifically, I am going to look at the marginal value (economists call this marginal product) of a MLB player. To accomplish this involves two tasks; one is the estimate the marginal value of various MLB team statistics with respect to runs scored and the second is to take these values and use them to estimate the marginal productivity of MLB players. Today I am going to go through the first task - estimating the MLB team runs scored model and tomorrow I plan on going over the second task and report the top 50 MLB players for last season using this model. So, let's turn to the first task.

Today, we are estimating MLB team runs scored production. Given that baseball is a much different sport than either football or hockey, we can analyze the model by just using the team's on-field production numbers that significantly affect runs scored. This method is similar to Ferdinard Lane, Pete Palmer's Linear Weights and Asher Blass' Runs Created methodologies. So here is a step-by-step guide as to how to estimate the marginal value of various MLB team statistics.

Step 1: Download (for free) from Sean Lahman's website the latest baseball database, which as of April 2012 is version 5.9.1 and I choose the comma-delimited version, which opens up nicely as Microsoft Excel files.

Step 2: Unzip file file and open up the Teams.csv file. Since the file starts with 1871 season, I just copy the team data that I am going to use (2000-2011) to another Excel file.

Step 3a: I organize the team data. Since I will need the following variables: R, AB, H, 2B, 3B, HR, SB, SF, CS and BB from this file, I put them all next to each other in the new team file.
Step 3b: I will need to calculate the following variables: 1B = H - 2B - 3B - HR; NBB = BB - IBB; GIDPCS = GIDP + CS; and Outs = AB - H. Now I have good news and bad news. The good news is that Mr. Lahman's database has all but two of the variables already entered at the team level. The bad news is that Mr. Lahman's database has all but two (IBB and GIDP) of the variables to make those calculations above already entered at the team level. Thus I will need to get this data at the team level to finish out the calculations.

Step 4: To get the team level IBB (intentional walks) and GIDP (grounded into a double play), I have to open the Batting.csv file that I unzipped in step 2. In the Batting.csv file, column S has the IBB data by player and column W has the GIDP data by player. So to go from the player level to the team level, I will need to aggregate up to the team level. Since the data is sorted alphabetically by player name, this data will need to be sorted for each season and for each team. Thankfully, Excel has an excellent sort function that will do that rather quickly.

Step 5: Sort the player data in the Batting.csv file. I highlight the entire data (click on the upper left of the grid in Excel) and under Data click on Sort, which opens up a box. In that box, I choose at the top right the box next to "My Data has headers", and then in the dropdown menu to the right of "Sort by" I choose YearID. Next I click Add Level - which is at the very top left of this box - and in the new dropdown menu choose lgID; and finally click Add Level again and in the newest dropdown menu choose TeamID. Now click OK near the bottom right of the box.

Step 6: Once step 5 is completed, on row 79251 begins the 2000 season, and the 2011 season ends at row 95196. I just copied this data to another Excel file to work on. At this point I insert a blank row at the end of each team for each season for a place to make the IBB and GIDP calculations, which might take a few minutes. Once that is done, just sum up the columns (by team) for IBB and GIDP starting with the 2000 season and finishing with the 2011 season. In order to save a little time in the next step, I also include each time I calculate a team's IBB and GIDP the season, teamID and League, so that when I sort this data in the next step, it will copy and paste nicely into the Team.csv file in the right order.

Step 7: Sort the IBB and GIDP data by season, league and team (similar to step 5). This should allow you to copy and paste the IBB and GIDP data back into the Team.csv file where each datapoint correctly corresponds to the season, league and team in the Team.csv file. Check on this, or step 6 will be done in vain, and your results will be wrong.

Step 8: Run the following linear regression: R = f(1B, 2B, 3B, HR, NBB, HBP, SB, GIDPCS, SF, Outs). I actually choose to do this in another program, which allowed me to control for heteroskedasticity (unequal scatter in the data). Here are the estimated results for these 12 seasons of data.

	Coefficient	Std. Error	t-stat	prob.
C	151.167	121.681	1.242	0.215
SINGLE	0.539	0.021	25.618	0.000
DOUBLE	0.710	0.045	15.925	0.000
TRIPLE	1.147	0.134	8.571	0.000
HR	1.430	0.041	34.607	0.000
NBB	0.329	0.021	15.793	0.000
HBP	0.420	0.089	4.742	0.000
SB	0.100	0.041	2.401	0.017
GIDPCS	-0.176	0.069	-2.563	0.011
SF	0.604	0.176	3.424	0.001
OUTS	-0.146	0.028	-5.128	0.000

A few observations: first, that other than the constant term, each of the variables is statistically significant at the 99% confidence level and of the correct sign. Only GIDPCS and OUTS are negative, which both are using up the finite and scarce resource in baseball: outs. Second, also that HR's have a higher coefficient than Singles (1B), which means that a HR will on average generate more runs than a single. All the other coefficients seem to make sense as well. Finally, note that the coefficient on GIDPCS is greater in absolute value than SB, which means that if a player made two stolen base attempts, and was successful on one attempt but not on the other, that player on average would have cost his team more than he benefited his team.

Tomorrow I plan on reporting the top 50 MLB batters for the 2011 season.

Monday, April 2, 2012

MLB Competitive Balance Tax Totals

BizofBaseball.com has some information on the amounts that Major League Baseball teams have paid in what is commonly called the luxury tax, or what is actually referred to as the Competitive Balance Tax. Notice that the majority of the total payouts are from the New York Yankees.

Where is all that money going? Humm.