Friday, February 17, 2012

Yesterday, I wrote about the relationship between MLB regular season team payroll and regular season performance. Today, I am going to go through how I calculated this, step by step, for those who are interested in how I came up with yesterday's numbers. Since I have been looking at MLB recently, I will use MLB as the example.
Step 1: Copy the team payroll data for the time period you are interested in analyzing and paste it into a spreadsheet - I use Excel. For the team payroll data I used USA Today's salary database, which covers the 1988 to 2011 seasons. I also add a column named Season and type in the season year so I can easily keep track of each season.
Step 2: For each season of team payroll data, sort the data by team name - or, if you included the season column, sort by season (smallest to largest) and then by team (A to Z). (Please note that USA Today lists teams under their current names rather than their historical ones: thus the Montreal Expos (1988-2004) appear as the Washington Nationals for the entire time period; the California Angels (1988-1997) and Anaheim Angels (1998-2004) appear as the Los Angeles Angels for the entire time period; and the Florida Marlins (1993-2011) appear as the Miami Marlins. So that the teams line up with the team performance data in step 6, I renamed the teams back to their original names in the USA Today payroll data.)
Step 3: Calculate relative payroll for the first season by taking each team's total payroll and dividing it by that season's average payroll. So that I do not have to keep retyping the denominator within a single season, I use the following formula in Excel for the 1988 MLB team payroll, where the total payroll data is in column B: =B2/AVERAGE(B$2:B$27). You can copy this formula down the column and it will calculate relative payroll correctly for each team in that season. Then repeat this step for each of the other seasons you are analyzing, remembering to change the range in the denominator each time you move to a new season.
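If you would rather script steps 1 through 3 than do them by hand, here is a minimal sketch in Python with pandas. The file name payroll.csv and the column names season, team, and payroll are my own assumptions about how the data was saved, not part of the original workbook.

```python
import pandas as pd

# Assumed layout: one row per team-season with columns
# 'season', 'team', and 'payroll' (steps 1 and 2 above).
payroll = pd.read_csv("payroll.csv")
payroll = payroll.sort_values(["season", "team"])

# Step 3: relative payroll = team payroll / that season's average payroll.
# groupby(...).transform("mean") plays the role of the AVERAGE(...) denominator,
# recomputed for every season so it never has to be edited by hand.
payroll["relative_payroll"] = (
    payroll["payroll"] / payroll.groupby("season")["payroll"].transform("mean")
)
print(payroll.head())
```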
Step 4: Copy the final regular season standings for each season of team payroll data into a spreadsheet (I did this in another worksheet so that if I made a mistake I would not have to re-copy the payroll data). For the standings data I used MLB.com. If you use another source and it does not include regular season winning percentage (MLB.com's does), make sure you calculate it for each team over the entire time period; remember that winning percentage is wins divided by games played. Again, I also include the season year for each season.
Step 5: For each season of standings data, sort the data by team name - or, if you included the season year, sort by season (smallest to largest) and add a level to sort by team name (A to Z). After a little clean-up, you have each season arranged just as you did with the payroll data.
Step 6: Combine the two data sets (I did this in yet another worksheet, which I recommend so that if there are any errors you do not have to re-copy the data). Make sure that each team's relative payroll for a given season lines up with that same team's winning percentage for that season.
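Continuing the sketch above, a pandas merge does the same matching job as lining the rows up by hand; standings.csv and the column win_pct are again assumed names.

```python
import pandas as pd

# Step 6: match each team-season of payroll data with the same
# team-season of standings data. An inner merge drops rows that
# fail to line up, which also helps flag naming mistakes.
standings = pd.read_csv("standings.csv")  # assumed columns: season, team, win_pct
data = payroll.merge(standings, on=["season", "team"], how="inner")

# Sanity check: every payroll row should have found a standings row.
assert len(data) == len(payroll), "some team names do not match across files"
```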
Step 7: Run a linear regression in Excel (there are many YouTube how-to videos if you have not done this before) with regular season winning percentage as the dependent variable and relative payroll as the independent variable. I also include a constant term so that the intercept of the regression is not forced to equal zero.
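Outside of Excel, the same regression is a single call; here is a sketch using scipy, picking up the merged data frame from the sketches above (with my assumed column names).

```python
from scipy import stats

# Step 7: OLS of winning percentage on relative payroll.
# linregress always fits an intercept, matching the constant term above.
result = stats.linregress(data["relative_payroll"], data["win_pct"])

print(f"slope:     {result.slope:.4f}")
print(f"intercept: {result.intercept:.4f}")
print(f"R-squared: {result.rvalue ** 2:.4f}")  # share of variation "explained"
print(f"p-value:   {result.pvalue:.4g}")       # tests whether the slope is zero
```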
Step 8: Interpret the regression results that are displayed in the new worksheet.
Step 8a: Note that the coefficient on relative payroll is positive and statistically significant. I interpret this to mean that a one-unit increase in relative payroll increases winning percentage by the size of the coefficient (the positive part), and since the t-statistic is statistically significant, the effect is different from zero (no effect). I do not disagree that there is a statistical relationship between relative payroll and team winning percentage; the issue is how strong that relationship is.
Step 8b: For this, I turn to R-squared (or adjusted R-squared), which I interpret as the share of the variation in winning percentage that the variation in relative payroll "explains" - and here it is less than half. Thus the variation in relative payroll "explains" less than 50% of the variation in winning percentage; if payroll truly determined winning percentage, this figure would be much higher.
Some may say that there is no statistical gauge for what counts as a large or small R-squared, and that is correct. But if the R-squared is 0.18 - meaning that 18% of the variation in winning percentage is statistically related to the variation in relative payroll - then I have a hard time hanging my hat, so to speak, on 18% when other variables do much better at explaining the variation in winning percentage.
Thursday, February 16, 2012
MLB Payroll and Performance
Last season, after the World Series, I wrote about the payroll and performance hypothesis in MLB and questioned how strong this relationship was during the 2011 MLB season. At that time I admitted that looking at only one season was not a long enough time period to draw a definitive conclusion about payroll and performance. So I thought that I would revisit the question over a longer time period in MLB.
So I am going to look at the entire time period for which USA Today has MLB team payroll data, which is 1988 to 2011, and then, using mlb.mlb.com for the team standings data, I am going to estimate the impact that relative payroll has on team regular season performance.
Why do I use relative payroll? It is a statistical reason, but let me see if I can explain. Of the two variables, winning percentage is stationary - in other words, the average winning percentage for each season is 0.500 - but total payroll is non-stationary - in other words, the average total payroll is rising over time. Running a regression with a non-stationary independent variable (total payroll) will produce poorer results when the dependent variable (winning percentage) is stationary. Thus, to make both variables stationary, I convert the non-stationary total payroll variable into a stationary one (relative payroll). Relative payroll is calculated as team i's total payroll divided by season j's average payroll. For example, for the Philadelphia Phillies in 2011, I took Philadelphia's total payroll of $172,976,379 and divided it by the average total payroll for the 2011 MLB season, $92,872,043, which gives a relative payroll for Philadelphia of 1.8625.
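As a quick check on that arithmetic, here is the same calculation as a couple of lines of Python:

```python
# Relative payroll = team payroll / league-average payroll for that season.
phillies_2011 = 172_976_379
mlb_average_2011 = 92_872_043
print(round(phillies_2011 / mlb_average_2011, 4))  # 1.8625
```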
So after running the numbers, I find that relative payroll in Major League Baseball "explains", in a statistical sense, only 17.6% of team winning percentage from 1988 to 2011. Another way of looking at this is that more than 80% of why MLB teams win at different rates is left unexplained by relative payroll. If I fail to take this non-stationarity issue into account and just use total payroll instead of relative payroll, the explanatory power drops to 6.8%.
For this reason, I seriously doubt that payroll is a good indicator of regular season team performance in Major League Baseball.
Still not convinced? Fine - tomorrow I will post a step-by-step procedure for how I calculated this result.
Sunday, February 12, 2012
Adieu Whitney Houston
OK, not the normal blog post, but in honor of Whitney Houston after her recent passing, here is one of my favorite performances: her spectacular rendition of the national anthem at Super Bowl XXV in 1991. Adieu, Whitney.
Thursday, February 9, 2012
2011 NCAA Football Bowl Subdivision and Strength of Schedule
Today I want to focus on the impact of an NCAA FBS team's strength of schedule on its winning percentage. To do this, I will look at a simple model of NCAA FBS production where team performance (as measured by winning percentage) is a function of team points scored and opponents' points scored. A simple linear regression reveals three helpful results. One is that the estimated marginal impact of a point scored is positive and statistically significant, while that of a point surrendered is negative and statistically significant; second, the estimated marginal impacts of a point scored and a point surrendered are equal in absolute value when rounded to three decimal places; and third, this very simple model "explains" about 83% of the variation in winning percentage for the 2011 NCAA FBS season.
Frankly, the first had better be true, since wins are defined to occur when points scored exceed points surrendered. The second is more interesting, in that I conclude scoring more points is just as important as preventing the opponent from scoring in terms of the impact on winning. The third shows that this model is at least useful for evaluating team performance.
Many have argued that the strength of a team's schedule should also be included when modeling team performance and evaluating which teams are better. I have previously disagreed on this point, but I am willing to re-evaluate the impact of strength of schedule on team winning percentage. To do so, I need a measure of strength of schedule for each NCAA FBS team, which I can then use to test the statistical significance of strength of schedule for team winning percentage. Running the numbers using the simple model above, I find that strength of schedule is statistically insignificant (i.e., statistically, strength of schedule has zero impact on NCAA FBS winning percentage during the 2011-2012 football season). You may be thinking that my measure of strength of schedule is incorrect, so I also ran the numbers using Jeff Sagarin's strength of schedule measure for just the 120 NCAA FBS teams and found exactly the same result - strength of schedule is statistically insignificant.
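For readers who want to run this kind of test themselves, here is a minimal sketch using statsmodels. The file name fbs_2011.csv and the column names win_pct, points_for, points_against, and sos are my assumptions about how such data might be laid out, not a description of my actual files.

```python
import pandas as pd
import statsmodels.api as sm

# Assumed layout: one row per FBS team with its winning percentage,
# points scored, points surrendered, and a strength-of-schedule measure.
fbs = pd.read_csv("fbs_2011.csv")

# Base production model: winning percentage on points scored and surrendered.
X = sm.add_constant(fbs[["points_for", "points_against"]])
base = sm.OLS(fbs["win_pct"], X).fit()
print(base.rsquared)  # about 0.83 for the 2011 season described above

# Add strength of schedule and inspect its p-value; a large p-value
# (conventionally above 0.05) is what "statistically insignificant" means.
X_sos = sm.add_constant(fbs[["points_for", "points_against", "sos"]])
with_sos = sm.OLS(fbs["win_pct"], X_sos).fit()
print(with_sos.pvalues["sos"])
```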
But what about teams in different FBS conferences? If we take into account that teams play in different football conferences (grouping the four independent schools as one conference), does that make a difference? I like the way you think - so I ran the numbers adjusting for conference membership along with points scored and points surrendered, and got very similar results to the simple production model above. Statistically, the estimated coefficients (marginal impacts) are all significant at the 99% level of confidence, so I am at least 99% confident that the effects of points scored, points surrendered, and the adjustments for each of the ten NCAA FBS conferences plus the independent teams are different from zero. I did this as a check on the effect that conferences have on winning percentage, and I find that different conferences have different (though small) effects on winning percentage and that these effects are statistically significant.
So now let's add a measure of a team's strength of schedule. Whether I use my own measure or Sagarin's, I still find that strength of schedule is statistically insignificant with respect to team winning percentage.
Thus, I conclude that strength of schedule "does not matter" (since strength of schedule is statistically insignificant) in terms of how well NCAA FBS teams performed in 2011-2012.
Saturday, February 4, 2012
2011 Regular Season NFL Competitive Balance
This weekend is the "big game" (can't use the phrase S***r B**l, as the NFL has tied it up) between the New England Patriots and the New York Giants. That got me thinking about how competitive the NFL has been over the last year, so I downloaded the 2011 regular season standings and calculated the Noll-Scully measure of competitive balance for the AFC, the NFC, and the NFL as a whole.
Here are the results. The AFC had a Noll-Scully of 1.500, while the NFC had a Noll-Scully of 1.806; since a lower Noll-Scully means a more balanced league, the AFC was slightly more competitive than the NFC in terms of winning percentage. The NFL overall had a regular season Noll-Scully of 1.636, which is slightly higher than the NFL's historical average, but not by much.
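For anyone who wants to compute it themselves: the Noll-Scully measure is the actual standard deviation of winning percentage divided by the idealized standard deviation of a perfectly balanced league, 0.5/sqrt(G), where G is the number of games each team plays (16 in the NFL regular season). A minimal sketch, with the winning percentages left as a placeholder to be filled in from the standings:

```python
import statistics

def noll_scully(win_pcts, games_per_team):
    """Ratio of the actual spread in winning percentage to the spread
    an ideally balanced league would show (0.5 / sqrt(games))."""
    actual_sd = statistics.pstdev(win_pcts)      # population std. dev.
    idealized_sd = 0.5 / games_per_team ** 0.5
    return actual_sd / idealized_sd

# Fill in the 16 AFC teams' 2011 winning percentages from the final standings.
afc_win_pcts = []  # placeholder - e.g. [0.813, 0.750, ...]
# print(noll_scully(afc_win_pcts, 16))
```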
Compared to the other three "major" US sports leagues, the NFL is the most competitive and last year was no different.
Friday, February 3, 2012
2011 Conference Defense Ranking
Two days ago I posted on how each NCAA FBS conference ranked over the 2011-2012 NCAA FBS season, and yesterday I posted on how each conference ranked on the offensive (or scoring) side of the ball. Today, I want to break it down on the defensive (or stopping the opponent from scoring) side. So listed below are the NCAA FBS conferences (with the four independents listed as one "conference"). I have measured defensive conference rank just as I measured total rank and offensive rank, except that now I am taking the average of each conference team's defensive rank. Here they are (and, to me, a little surprising):
| Rank | Conference | Defense Rank |
|---|---|---|
| 1 | Big East | 26.875 |
| 2 | SEC | 31.417 |
| 3 | Big 12 | 46.600 |
| 4 | Big 10 | 48.167 |
| 5 | ACC | 49.250 |
| 6 | Ind | 56.750 |
| 7 | Sun Belt | 61.750 |
| 8 | CUSA | 62.636 |
| 9 | Mid American | 74.533 |
| 10 | WAC | 78.375 |
| 11 | Mountain West | 82.500 |
| 12 | Pac 12 | 82.833 |
Thursday, February 2, 2012
2011 Conference Offense Rankings
Yesterday I posted on how each NCAA FBS conference ranked over the 2011-2012 NCAA FBS season. Today, I want to break it down on the offensive (scoring) side. So listed below are the NCAA FBS conferences (with the four independents listed as one "conference"). I have measured offensive conference rank just as I measured total rank, except that now I am taking the average of each conference team's offensive rank rather than total rank. Here they are:
| Rank | Conference | Offense Rank |
|---|---|---|
| 1 | Pac 12 | 46.750 |
| 2 | Big 12 | 49.500 |
| 3 | Ind | 55.250 |
| 4 | Big 10 | 55.667 |
| 5 | WAC | 56.875 |
| 6 | Mountain West | 57.875 |
| 7 | Big East | 60.625 |
| 8 | ACC | 61.167 |
| 9 | SEC | 61.417 |
| 10 | CUSA | 67.273 |
| 11 | Mid American | 72.600 |
| 12 | Sun Belt | 80.125 |
Wednesday, February 1, 2012
2011 Conference Rankings
It is national signing day, so most of the NCAA football world is tuned in to who is signing where. I, on the other hand, am looking back at the last NCAA FBS season, using my NCAA FBS production model to determine which NCAA FBS conference was the best over the whole season.
To measure how well each conference performed, I take the average overall rank of all the teams in each conference (counting the four independents as a conference) and then compare the numbers across conferences. As a reminder, the lower the number, the higher the overall rank of the conference: I am averaging where each conference's teams finished in the overall ranking, where the highest ranked team is #1 and the lowest ranked team is #120, so a lower average means a more productive conference.
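Mechanically this is just a grouped average; here is a sketch assuming a file fbs_ranks.csv with one row per team and columns conference and overall_rank (both names are mine):

```python
import pandas as pd

# One row per FBS team: its conference (independents grouped as "Ind") and
# its finish in the overall production-model ranking (1 = best, 120 = worst).
ranks = pd.read_csv("fbs_ranks.csv")

# Average the overall ranks within each conference; a lower average
# means the conference's teams finished higher overall.
conference_rank = ranks.groupby("conference")["overall_rank"].mean().sort_values()
print(conference_rank)
```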
After running the numbers for the 2011 NCAA FBS regular season and postseason games, here are the results - and if you are like me, they are a little surprising.
| Rank | Conference | Total Rank |
|---|---|---|
| 1 | Big East | 45.125 |
| 2 | SEC | 47.667 |
| 3 | Big 12 | 49.500 |
| 4 | Big 10 | 51.583 |
| 5 | ACC | 57.417 |
| 6 | Ind | 57.500 |
| 7 | Pac 12 | 62.833 |
| 8 | CUSA | 66.000 |
| 9 | Mountain West | 66.500 |
| 10 | WAC | 70.500 |
| 11 | Mid American | 73.867 |
| 12 | Sun Belt | 76.250 |