Friday, February 17, 2012

Payroll and Performance Analysis

Yesterday, I wrote about the relationship between MLB regular season team payroll and regular season performance. Today, I am going to go through how I calculated this, step-by-step, for those that are interested in how I came up with the numbers yesterday. As I have been looking at MLB recently, I will be using MLB as the example.

Step 1: Copy the team payroll data for the time period you are interested in analyzing and paste into a spreadsheet - I use Excel. For the team payroll data I used USA Today's salary database covering the 1988 to 2011 seasons. I also add a column named season and typed in the season year so I can easily keep track of each season.
Step 2: For each season of team payroll data, sort the data by team name, or if you include the season, then just sort by season (smallest to largest) and by team (A to Z). (Please note that USA Today has changed the names of previous named teams to their current name - thus the Montreal Expos (1988 - 2004) are named for the entire time period the Washington Nationals; the California Angels (1988 - 1997) and then Anaheim Angels (1998 - 2004) are the Los Angeles Angels for the entire time period; the Florida Marlins (1993 - 2011) are the Miami Marlins. In order for the teams to line up together with the team performance data in step 6, I renamed the teams to their original names in the USA Today payroll Excel data).
Step 3: Calculate relative payroll for the first season by taking each team's total payroll and dividing by that seasons average payroll. So that I do not have to keep changing the denominator for a single season, I use the following formula in Excel for the 1988 MLB team payroll where the total payroll data is in column B. =B2/AVERAGE(B$2:B$27) Now you can copy this formula down the column for that single season and you will be calculating relative payroll for each team correctly for that single season. Then repeat this step for each of the other seasons you are analyzing. Remember to change the range in the denominator each time you change to a new season.
Step 4: Copy the team final regular season standings data for each season of team payroll data into a spreadsheet (I did this in another worksheet in case I made a mistake I will not have to re-copy the payroll data). For the team regular season standings data I used MLB.com. If you choose to use another source for the team regular season data and that data does not have regular season winning percent (they do at mlb.com), make sure you calculate it for each team in the entire time period. Remember winning percent is wins/games played. Again, I also include the season year for each season.
Step 5: For each season of team regular season standings data, sort the data by team name. Or if you have included the season year, then just sort by season (smallest to largest) and add a level to sort by team name (A to Z), and after a little clean-up, you have the each season just as you did with the payroll data.
Step 6: Combine both data sets together (I did this in another worksheet - which I would recommend in case there are any errors you do not have to copy the data again). Make sure that each team for a given season of relative payroll data matches each team for a given season with the winning percent data for each season.
Step 7: Run a linear regression in Excel or (one of many YouTube how-to videos) with regular season winning percent as the dependent variable and relative payroll as the independent variable. I also choose to include a constant term to allow for the possibility of the intercept in the regression to not be forced to equal zero.
Step 8: Interpret the regression results that are displayed in the new worksheet.
Step 8a: Note that the coefficient on relative payroll is positive and statistically significant. Which I interpret as a one unit increase in relative payroll will increase winning percentage by that amount (positive part); and since the t-stat is statistically significant, I interpret this as being different from zero (no effect). I do not disagree that there is a statistical relationship between relative payroll and team winning percentage, but how much of a relationship is what is the issue.
Step 8b: For this, I turn to R-squared (or adjusted R-squared) which I interpret as the amount that the variation in relative payroll "explains" the variation in winning percent, and this is less than half. Thus the variation in relative payroll "explains" less than 50% of the variation in winning percentage, and if we are going to claim that payroll determines winning percentage, this would be much higher.

Some may say that statistically there is no gauge for a lot or little when looking at R-squared, and that is correct. But, if the R-squared is 0.18, that means that 18% of the variation in winning percent is statistically related to the variation in relative payroll then I have a hard time with hanging my hat - so to speak - on 18%, when other variables do much better in explaining the variation in winning percent.