Team Sports Analysis: April 2012

Tuesday, April 24, 2012

NHL Goalie Performance Measure

Here is a step-by-step guide to David Berri's and my NHL goalie measure from our article in the Journal of Sports Economics in 2010.

Step 1: Calculate the Marginal Value of a Goal Against. I have previously posted on how to do this, so follow the steps in this blog.
Step 2: Download NHL Goalie Data I downloaded the 2011-2012 NHL goalie data from NHL.com. Go to the web page, click on Stats, then choose Individuals. Once that page opens, I choose 2011-2012 Regular Season and under Position I choose Goalie. This opens up a page with all the NHL goalies for the time period you are analyzing. Make sure you get all the goalie data by choosing the additional tables at the bottom of the first table. I then copied and pasted each table into Microsoft Excel.
Step 3: Wins Above Average (WAA) Calculation. WAA is measured as the absolute value (since a goal against has a negative effect on team wins) of the marginal value of a goal against divided by two (since each win is worth two standings points) times the number of shots on goal that goalie faces times the difference in the save percentage of the goalie and the average save percentage of all goalies for that season. Let's break this calculation up into three parts.
Part A: This is already done, since the data is just the columns listed as SA and SV% from NHL.com for each goalie, which in my spreadsheet is column I for SA and column M for SV%.
Part B: Calculate the average save percentage for the league by finding the total number of shots against and the total number of goals against and then subtract the total number of goals against from the total number of shots against and divide that by the total number of shots against. In Excel this would be (S6-S9)/S6, if total shots against is in cell S6 and total goals against is in cell S9. Suppose this formula is put in cell S12.
Part C: The first part of the equation [the marginal value of a goal against] was found in step 1. Take the absolute value of this number and divide by two. If the marginal value of a goal against is in cell S3, then in Excel type ABS($S$3)/2.
Put it all together and in Excel we have the following formula: =(M2-$S$12)*I2*(ABS($S$3)/2). Note that M2 is the first goalie listed save percentage and cell I2 is the first goalie listed shots against. Then copy and paste this formula down in the spreadsheet for each goalie and you have an estimate of each goalie's Wins Above Average (WAA) for that season.
Step 4: Rank Most Productive Goalies. I then rank each goalie by their WAA from highest to lowest for that season. You can do this by sorting from highest to lowest. I do something slightly different but it is the same thing, it just allows me to do this without actually having to use the sort option in Excel.

Monday, April 23, 2012

NCAA Basketball Coaches Salary

New York Times piece on the differences in salary of male and female NCAA basketball coaches.

Thursday, April 19, 2012

Calculating the Marginal Value of a Goal in the NHL

In a previous post, I calculated the regular season performance of NHL goalies. In that calculation, I needed to know the what the marginal value of a goal against equaled. Here, I will go through a step-by-step process of how to estimate the marginal value of a goal against in the NHL. So here it is.

Step 1: Get the NHL Regular Season Standings Data. I got the NHL regular season standings data from Hockey Reference since the data copies and pastes nicely into Microsoft Excel. I choose to include the 1995-1996 to 2011-2012 regular seasons (minus the 2004-2005 season that was cancelled). As you are doing this, make sure that you know which season is which, as that will help if you are going to follow step #2. My choice of which seasons to include is somewhat arbitrary; in our NHL goalies paper published in the Journal of Sports Economics we used a longer time period.

Step 2: Create Dummy Variables for each NHL Regular Season. Please note, you can skip this step and still estimate the marginal value of a goal against. I choose to do this, since the overall fit of the model is substantially better given the NHL has changed the formula it uses to calculate the dependent variable in the regression step. In case you are wondering, a dummy variable is a variable that is equal to 1 if the condition is true and a zero otherwise. To do this insert a column (I choose to the right of the standings data) and title your dummy variable. I titled the dummy variable for the 2011-2012 season d2011. Thus for the new variable I created titled d2011 I set it equal to 1 for the rows of data for the 2011-2012 season and set it equal to zero for all the other rows of data. Then I repeated this process for each of the seasons, making sure that I only have the number 1 for the years corresponding to the dummy variable. If you set up the dummy variables in order, then in the spreadsheet you should have 1's that are moving downward to the right when looking at each season.

Step 3: Run a Linear Regression. I choose to do this is a statistical package called Stata. I also controlled for heteroskedasticity in the regression estimation by using robust standard errors. Here is the linear regression, where pts is the dependent variable, gf = goals for, ga = goals against and dxxxx is a dummy variable for each season included:

pts = f(gf, ga, d2011, d2010, d2009, d2008, d2007, d2006, d2005, d2003, d2002, d2001, d2000, d1999, d1998, d1997, d1996, d1995)

After running the linear regression, I estimate that the value of a goal against in terms of NHL regular season standings points is equal to -0.340430.

Tuesday, April 17, 2012

How to Calculate the Noll-Scully Competitive Balance Measure

Here is a step-by-step guide as to how to calculate the Noll-Scully measure of competitive balance. Note that I will include both the case where a team can have only two outcomes (a win or a loss) and where a team can have three outcomes (a win, a loss or a tie). I will also report both calculations of the actual standard deviation of winning percent using a sample and using a population. They will also help for comparison purposes.

Remember the Noll-Scully measure of competitive balance is calculating how far a particular sport or league's distribution of wins deviates from the purely random outcome (i.e. equal playing strength).

Step 1: Get League Standing Data. Download the league standings data - typically at the end of the regular season - and insert it into a spreadsheet. (If you are having trouble getting the standings data from a webpage to paste easily into Excel, I suggest that you use the Mozilla Firefox browser and copy the data from the data using Firefox to Notepad (can be found under All Programs | Accessories on a Windows computer) and paste the data into Notepad, then copy all of that data from Notepad and paste into Excel - works great for me, though I don't know why).

Step 2a: Calculate Actual Standard Deviation. For leagues with only wins and losses, such as basketball or baseball, to calculate the actual standard deviation of winning percentage in Microsoft Excel if the winning percent data is in Column C and there are 30 teams, then for a sample you will use the formula =stdev(C2:C31) and for a population use the formula =stdevp(C2:C31).
Step 2b: For leagues with wins, losses and ties, first calculate winning percent as (# of wins * 2 + # of ties)/(# of games played * 2), and then in Microsoft Excel if the winning percent formula is in Column C and there are 30 teams for a sample you will use the formula =stdev and for a population use the formula =stdevp. (I will let the statisticians argue which is more appropriate of the two).

Step 3a: Calculate Idealized Standard Deviation. For leagues with only wins and losses, such as basketball or baseball, to calculate the idealized standard deviation of winning percentage in Microsoft Excel if the winning percent data is in Column C and there are 30 teams, you will use the formula =average(C2:C31)/SQRT(B2), where B2 is the number of regular season games played.
Step 3b: For leagues with wins, losses and ties, [(1-p)/4n]^1/2, where p is the probability of a tie under the equal playing strength assumption and n is the number of games played. Richardson's Eastern Economic Journal paper suggests using playoff series games that went six or seven games and calculating how many were tied at the end of regulation. That seems reasonable to me, so this will need to be figured out separately for the league you are using. Once that is done, use the probability of a tie for (p) and the number of regular season games played for n, and this is the idealized standard deviation of winning percent for a league that has wins, losses and ties.

Step 4: Calculate the Noll-Scully Competitive Balance Measure. For both types of leagues, if the actual standard deviation of winning percent is in cell D2 and the idealized standard deviation is in cell D3, then in cell D4 insert the following formula: = D2/D3, and that is the Noll-Scully competitive balance measure.

Wednesday, April 11, 2012

NHL Regular Season Goalie Evaluation

Tonight starts the NHL playoffs, so I thought that I would take a look at how productive NHL goalies were during the 2011-2012 NHL regular season. In order to do so David Berri and I published a paper in the Journal of Sports Economics in 2010 on NHL goalies. We named our measure of NHL goalie productivity WAA. WAA the absolute value (since a goal against has a negative effect on team wins) of the marginal value of a goal against divided by two (since each win is worth two standings points) times the number of shots on goal that goalie faces times the difference in the save percentage of the goalie and the average save percentage of all goalies for that season. I am using -0.221519 for the marginal value of a goal. I plan on updating this value using data from the 2011-2012 season, but have not had the time. Note, any change in the marginal value will just be a monotonic transformation of the WAA (either increase or decrease) from its current value.

Given our measure of NHL goalie performance, here is the top 20 ranking for NHL goalies over the 2011-2012 regular season. (I hope to get some time to give a step-by-step guide as to how I am getting the WAA numbers later this month.)

Rank	Player	Team	WAA	SV%
1	Mike Smith	PHX	3.767	0.930
2	Henrik Lundqvist	NYR	3.196	0.930
3	Jonathan Quick	LAK	3.191	0.929
4	Brian Elliott	STL	2.849	0.940
5	Cory Schneider	VAN	2.456	0.937
6	Pekka Rinne	NSH	2.256	0.923
7	Miikka Kiprusoff	CGY	1.686	0.921
8	Jaroslav Halak	STL	1.672	0.926
9	Kari Lehtonen	DAL	1.630	0.922
10	Tim Thomas	BOS	1.187	0.920
11	Jimmy Howard	DET	1.071	0.920
12	Tuukka Rask	BOS	1.064	0.929
13	Roberto Luongo	VAN	0.954	0.919
14	Niklas Backstrom	MIN	0.786	0.919
15	Jose Theodore	FLA	0.576	0.917
16	Carey Price	MTL	0.522	0.916
17	Jean-Sebastien Giguere	COL	0.514	0.919
18	Tomas Vokoun	WSH	0.498	0.917
19	Ryan Miller	BUF	0.488	0.916
20	Justin Peters	CAR	0.451	0.931

Tuesday, April 10, 2012

2011-2012 NHL Competitive Balance

Now that the 2011-2012 NHL regular season is over, let's find out how competitively balanced the NHL was over the 2011-2012 regular season. In order to do this we need a measure of competitive balance, and while there are a number of measures, I am going to choose the Noll-Scully measure of competitive balance. Noll and Scully were looking at basketball and baseball respectively and in each case, there are two outcomes for each game, you win or lose. In some sports that is not always true; hockey being an excellent example. It is possible in hockey to end up with three outcomes: a win, a loss or a tie. Typically, I get around the possibilities of a tie by just calculating each team's winning percentage (# of wins * 2 + # of ties)/(# of games played *2) and use the typical Noll-Scully approach to measure competitive balance - such as we did in our book, The Wages of Wins. If I do that I end up with a Noll-Scully for the 2011-2012 NHL regular season equal to 1.15, which would indicate that the NHL is fairly competitively balanced last year. (If you use the standard deviation of a population instead of the standard deviation of a sample the Noll-Scully is now equal to 1.13).

Yet, what about the possibility of ties and how does that change the competitive balance measure under the Noll-Scully assumption of equal playing strength? Well, David Richardson in a paper "Pay, Performance, and Competitive Balance in the National Hockey League" published in the Eastern Economic Journal in 2000 addressed exactly that question and reports that the formula for calculating the idealized standard deviation of the Noll-Scully measure of competitive balance under a trinomial distribution is: [(1-p)/4n]^1/2, where p is the probability of a tie under the equal playing strength assumption and n is the number of games played. Richardson estimates that the probability of a tie (using Stanley Cup playoff series that went 6 or 7 games and ended up tied during regulation to be equal to 0.162). So let's use that for now as our probability of teams with equal playing strength in the NHL of tying to calculate the ideal standard deviation. When doing so, now the Noll-Scully measure of competitive balance is equal to 1.414 - or less competitively balanced than by not including the probability of a tie occurring. (Using the standard deviation of the population results in the Noll-Scully now to be equal to 1.391).

Update: Step-by-step guide to calculate the Noll-Scully Competitive Balance Measure.

Monday, April 9, 2012

2011-2012 NHL Pay and Performance

Given the NHL regular season has just concluded, let's take a look at how team payroll and team performance relate for just the 2011-2012 season. Since we have only one season (and I admit this is a limited sample size so any conclusions should be viewed cautiously), we do not need to calculate relative payroll, as the average payroll is not changing over time - again only one year. Thus all we need to do is gather the data (payroll from USA Today and Standing from hockey-reference) and calculate the r-squared between NHL team payroll and NHL team winning percent for the 2011-2012 regular season. Having done this, I find that the r-squared = 0.198, indicating that the variation in NHL team payrolls "explain" about 20% of the variation in NHL team winning percent for the 2011-2012 NHL regular season.

I plan on coming back to this payroll and performance over a longer period of time next month.

Friday, April 6, 2012

2012 MLB Salary

The USA Today has a nice piece on MLB salaries for 2012. They also have a MLB salary database, which I have used in estimating the relationship between payroll and performance.

Thursday, April 5, 2012

MLB Pay and Performance Revisted

Given MLB is now underway, I thought that I would link to four previous blogs on MLB pay and performance.

First, estimating MLB payroll and performance in 2011.
Second, estimating MLB payroll and performance from 1988 to 2011.
Third, a step-by-step guide as to how to get the results from above.
Fourth, a popular blog from my Hawkonomics blog.

Enjoy!

Wednesday, April 4, 2012

Estimating MLB Player Run Productivity

Yesterday, I went though a step-by-step guide as to how to determine the marginal product of various MLB team statistics with respect to run productivity at the team level. Yesterday, I treated each team as equal - which may not exactly be true since teams play in different ballparks and this may affect run production. I will come back to this later this month and re-do the analysis by breaking out each team in the overall analysis. I might also just look at different leagues as the AL might be more run producing since pitchers do not bat. It should not take long, so hopefully I will have this done soon.

Today, I want to use the estimates at the team level (see the end of the blog for the actual numbers), and use them to estimate the runs scored productivity at the batter level for last (2011) season. This is not all that difficult, so here are the steps.

Step 1: Open the Batting.csv file and copy the batting data to another file. (You do not have to do this, but I do, so I am calling this step 1).

Step 2: Copy the regression results (estimated coefficients) from the MLB team regression to the right of the player data.

Step 3: Create a formula for the first players run productivity. This is done my taking the coefficient of a single times the number of singles for the first player, plus the coefficient of a double times the number of doubles for the first player, plus the coefficient of a triple times the number of triples for the first player. Keep doing this for all ten variables. Once you have this formula, then copy the formula for the first player and paste it for all the players that year. Note: you might want to put a $ sign in front of the row number of the coefficient in the formula so that Excel keeps referring back to that cell. For example if the coefficient of a single is in cell AE3, then type in $AE$3 instead. This will force Excel to always use that cell when calculating the coefficient of a single. Do this for each of the ten variables.

Step 4: Sort the data by whatever you label the cell for the formula above. I called it Runs Scored, and sort by largest to smallest. Doing so yields the following. Note: that the batters name is using Mr. Lahmen's labeling, but you should be able to figure out who is who rather easily. Here are the top 50, with Miguel Cabrera the most productive batter last year, just above Jose Bautista.

Rank	playerID	teamID	lgID	Runs Scored
1	cabremi01	DET	AL	114.74
2	bautijo02	TOR	AL	113.97
3	ellsbja01	BOS	AL	111.70
4	kempma01	LAN	NL	110.94
5	braunry02	MIL	NL	108.43
6	vottojo01	CIN	NL	106.54
7	gonzaad01	BOS	AL	105.99
8	fieldpr01	MIL	NL	103.42
9	grandcu01	NYA	AL	102.57
10	pedrodu01	BOS	AL	97.77
11	uptonju01	ARI	NL	93.97
12	gordoal01	KCA	AL	92.97
13	konerpa01	CHA	AL	91.48
14	youngmi02	TEX	AL	90.95
15	canoro01	NYA	AL	90.92
16	pujolal01	SLN	NL	88.55
17	berkmla01	SLN	NL	88.48
18	kinslia01	TEX	AL	86.93
19	ortizda01	BOS	AL	86.04
20	teixema01	NYA	AL	85.38
21	ramirar01	CHN	NL	85.12
22	tulowtr01	COL	NL	84.38
23	reyesjo01	NYN	NL	82.24
24	mccutan01	PIT	NL	82.19
25	morsemi01	WAS	NL	81.96
26	stantmi03	FLO	NL	81.73
27	cabreme01	KCA	AL	80.32
28	zobribe01	TBA	AL	79.82
29	phillbr01	CIN	NL	78.12
30	napolmi01	TEX	AL	77.13
31	avilaal01	DET	AL	76.96
32	butlebi03	KCA	AL	75.95
33	francje02	KCA	AL	75.69
34	gonzaca01	COL	NL	75.55
35	swishni01	NYA	AL	75.55
36	martivi01	DET	AL	75.44
37	castrst01	CHN	NL	75.15
38	beltrad01	TEX	AL	74.48
39	howarry01	PHI	NL	74.10
40	victosh01	PHI	NL	73.99
41	hamiljo03	TEX	AL	73.48
42	cabreas01	CLE	AL	73.33
43	markani01	BAL	AL	73.21
44	santaca01	CLE	AL	73.07
45	bruceja01	CIN	NL	71.89
46	jonesad01	BAL	AL	71.66
47	peraljh01	DET	AL	71.14
48	penaca01	CHN	NL	71.04
49	longoev01	TBA	AL	70.86
50	hollima01	SLN	NL	70.37

Tuesday, April 3, 2012

Estimating MLB Team Run Productivity

The "boys of Summer" are back in full swing this week, which got me to thinking about MLB teams and player productivity. In our book, The Wages of Wins, we examined MLB player productivity and how consistent MLB players are in terms of production in chapter 9. We found that MLB players are more consistent than NFL QB's, but less consistent than NBA players. In my Sports Economics class I go through the details as to how this analysis is carried out, and I thought that I would write up the process here.

Specifically, I am going to look at the marginal value (economists call this marginal product) of a MLB player. To accomplish this involves two tasks; one is the estimate the marginal value of various MLB team statistics with respect to runs scored and the second is to take these values and use them to estimate the marginal productivity of MLB players. Today I am going to go through the first task - estimating the MLB team runs scored model and tomorrow I plan on going over the second task and report the top 50 MLB players for last season using this model. So, let's turn to the first task.

Today, we are estimating MLB team runs scored production. Given that baseball is a much different sport than either football or hockey, we can analyze the model by just using the team's on-field production numbers that significantly affect runs scored. This method is similar to Ferdinard Lane, Pete Palmer's Linear Weights and Asher Blass' Runs Created methodologies. So here is a step-by-step guide as to how to estimate the marginal value of various MLB team statistics.

Step 1: Download (for free) from Sean Lahman's website the latest baseball database, which as of April 2012 is version 5.9.1 and I choose the comma-delimited version, which opens up nicely as Microsoft Excel files.

Step 2: Unzip file file and open up the Teams.csv file. Since the file starts with 1871 season, I just copy the team data that I am going to use (2000-2011) to another Excel file.

Step 3a: I organize the team data. Since I will need the following variables: R, AB, H, 2B, 3B, HR, SB, SF, CS and BB from this file, I put them all next to each other in the new team file.
Step 3b: I will need to calculate the following variables: 1B = H - 2B - 3B - HR; NBB = BB - IBB; GIDPCS = GIDP + CS; and Outs = AB - H. Now I have good news and bad news. The good news is that Mr. Lahman's database has all but two of the variables already entered at the team level. The bad news is that Mr. Lahman's database has all but two (IBB and GIDP) of the variables to make those calculations above already entered at the team level. Thus I will need to get this data at the team level to finish out the calculations.

Step 4: To get the team level IBB (intentional walks) and GIDP (grounded into a double play), I have to open the Batting.csv file that I unzipped in step 2. In the Batting.csv file, column S has the IBB data by player and column W has the GIDP data by player. So to go from the player level to the team level, I will need to aggregate up to the team level. Since the data is sorted alphabetically by player name, this data will need to be sorted for each season and for each team. Thankfully, Excel has an excellent sort function that will do that rather quickly.

Step 5: Sort the player data in the Batting.csv file. I highlight the entire data (click on the upper left of the grid in Excel) and under Data click on Sort, which opens up a box. In that box, I choose at the top right the box next to "My Data has headers", and then in the dropdown menu to the right of "Sort by" I choose YearID. Next I click Add Level - which is at the very top left of this box - and in the new dropdown menu choose lgID; and finally click Add Level again and in the newest dropdown menu choose TeamID. Now click OK near the bottom right of the box.

Step 6: Once step 5 is completed, on row 79251 begins the 2000 season, and the 2011 season ends at row 95196. I just copied this data to another Excel file to work on. At this point I insert a blank row at the end of each team for each season for a place to make the IBB and GIDP calculations, which might take a few minutes. Once that is done, just sum up the columns (by team) for IBB and GIDP starting with the 2000 season and finishing with the 2011 season. In order to save a little time in the next step, I also include each time I calculate a team's IBB and GIDP the season, teamID and League, so that when I sort this data in the next step, it will copy and paste nicely into the Team.csv file in the right order.

Step 7: Sort the IBB and GIDP data by season, league and team (similar to step 5). This should allow you to copy and paste the IBB and GIDP data back into the Team.csv file where each datapoint correctly corresponds to the season, league and team in the Team.csv file. Check on this, or step 6 will be done in vain, and your results will be wrong.

Step 8: Run the following linear regression: R = f(1B, 2B, 3B, HR, NBB, HBP, SB, GIDPCS, SF, Outs). I actually choose to do this in another program, which allowed me to control for heteroskedasticity (unequal scatter in the data). Here are the estimated results for these 12 seasons of data.

	Coefficient	Std. Error	t-stat	prob.
C	151.167	121.681	1.242	0.215
SINGLE	0.539	0.021	25.618	0.000
DOUBLE	0.710	0.045	15.925	0.000
TRIPLE	1.147	0.134	8.571	0.000
HR	1.430	0.041	34.607	0.000
NBB	0.329	0.021	15.793	0.000
HBP	0.420	0.089	4.742	0.000
SB	0.100	0.041	2.401	0.017
GIDPCS	-0.176	0.069	-2.563	0.011
SF	0.604	0.176	3.424	0.001
OUTS	-0.146	0.028	-5.128	0.000

A few observations: first, that other than the constant term, each of the variables is statistically significant at the 99% confidence level and of the correct sign. Only GIDPCS and OUTS are negative, which both are using up the finite and scarce resource in baseball: outs. Second, also that HR's have a higher coefficient than Singles (1B), which means that a HR will on average generate more runs than a single. All the other coefficients seem to make sense as well. Finally, note that the coefficient on GIDPCS is greater in absolute value than SB, which means that if a player made two stolen base attempts, and was successful on one attempt but not on the other, that player on average would have cost his team more than he benefited his team.

Tomorrow I plan on reporting the top 50 MLB batters for the 2011 season.

Monday, April 2, 2012

MLB Competitive Balance Tax Totals

BizofBaseball.com has some information on the amounts that Major League Baseball teams have paid in what is commonly called the luxury tax, or what is actually referred to as the Competitive Balance Tax. Notice that the majority of the total payouts are from the New York Yankees.

Where is all that money going? Humm.