Friday, November 5, 2010

Explaining the College Football Model - Simple Model

OK, I have posted these top 25 rankings for NCAA FBS teams, and the question is: how do I determine who should be in the top 25? Or better yet, we already have the AP top 25, the USA Today top 25, the Harris voting poll and the BCS rankings, so why do we need another NCAA football top 25? Fair enough.

The top 25 that I am posting does not reflect my personal preferences or who I think should be in the top 25; rather, it is the result of a statistical analysis of NCAA football offense and defense. To start this off, I will first look at the broad picture and begin with a simple model of NCAA football.

Specifically, the simple model of NCAA football is a production function. By production function I mean that it relates output, which for the simple model is each team's winning percentage, to inputs, which for the simple model are each team's points scored and points surrendered. This should be rather familiar, since a team wins a game by scoring more points than it surrenders.

With this definition in mind, I then statistically test this simple production function using each NCAA FBS team's winning percentage, points scored and points surrendered for the 2008 and 2009 NCAA FBS seasons. Since there were 120 NCAA FBS schools in each of those seasons, this results in 240 rows of data: 120 for the 2008 season and 120 for the 2009 season. After running a linear regression on this data, where the dependent variable is winning percentage and the independent variables are points scored and points surrendered, I highlight the following three results of the linear regression.

1. The coefficient (or weight) on points scored = 0.001 and is statistically significant at better than the 99% confidence level.
2. The coefficient (or weight) on points surrendered = -0.001 and is statistically significant at better than the 99% confidence level.
3. The R-squared and adjusted R-squared (for those interested) are both 0.83, which can be interpreted as the variation in points scored and points surrendered "explaining" 83% of the variation in teams' winning percentage over those two years.
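
For readers who want to try this kind of regression themselves, here is a minimal sketch in Python. It does not use my actual data: the 240 rows below are randomly generated stand-ins, and the function name, the made-up season point totals and the noise level are all assumptions chosen only for illustration. The sketch fits winning percentage on points scored and points surrendered by ordinary least squares and reports the two weights and the R-squared.

```python
import numpy as np


def fit_simple_model(points_scored, points_surrendered, win_pct):
    """Fit win_pct = b0 + b1 * points_scored + b2 * points_surrendered by OLS.

    Returns the coefficient vector (b0, b1, b2) and the R-squared.
    """
    # Design matrix: a column of ones for the intercept, then the two inputs.
    X = np.column_stack([np.ones(len(win_pct)), points_scored, points_surrendered])
    coef, _, _, _ = np.linalg.lstsq(X, win_pct, rcond=None)

    # R-squared = 1 - (residual sum of squares / total sum of squares).
    fitted = X @ coef
    ss_res = np.sum((win_pct - fitted) ** 2)
    ss_tot = np.sum((win_pct - win_pct.mean()) ** 2)
    return coef, 1.0 - ss_res / ss_tot


# Synthetic stand-in data (NOT the real 2008-2009 team data).
# Assumption: 240 team-seasons with made-up season point totals.
rng = np.random.default_rng(0)
n_rows = 240
points_scored = rng.normal(350, 90, n_rows)
points_surrendered = rng.normal(350, 90, n_rows)
noise = rng.normal(0, 0.08, n_rows)
win_pct = np.clip(0.5 + 0.001 * (points_scored - points_surrendered) + noise, 0, 1)

coef, r_squared = fit_simple_model(points_scored, points_surrendered, win_pct)
print(f"weight on points scored:      {coef[1]:+.4f}")
print(f"weight on points surrendered: {coef[2]:+.4f}")
print(f"R-squared:                    {r_squared:.2f}")
```

Because the fake data are built around weights of 0.001 and -0.001, the fit should recover values close to those. With real team data you would replace the synthetic arrays with one row per team per season; significance levels would need a statistics package, which this sketch leaves out.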

Given that the model is rather simplistic, why would I be interested in this type of analysis? There are two fundamental reasons. The first is to determine whether offense and defense differ statistically in their effect on winning percentage. As we can see from the estimated coefficients, they are equal in size and opposite in sign, so offense and defense matter equally in determining winning percentage, which makes the analysis easier under the complex model. The second is to show that just using point spread (the difference between points scored and points surrendered) is inferior to the complex model.
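
To see why equal weights tie the simple model so closely to point spread, it can help to write the regression out. The notation here is mine and is only a sketch: W is winning percentage, PS is points scored, PA is points surrendered, and I am assuming the regression includes an intercept.

```latex
W_i = \beta_0 + \beta_1\, PS_i + \beta_2\, PA_i + \varepsilon_i ,
\qquad \hat\beta_1 = 0.001, \quad \hat\beta_2 = -0.001 .

\text{If } \beta_1 = -\beta_2 = \beta, \text{ then }
W_i = \beta_0 + \beta\,(PS_i - PA_i) + \varepsilon_i .
```

In other words, when the two weights are equal in size and opposite in sign, the simple model depends on points only through the difference between points scored and points surrendered.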

While the model does rather well at explaining why teams win, the big problem with this simple model is that it does not allow me to relate what actually happens on the field to how productive the team actually is. In other words, the model just says that if a team scores 3 additional points, that will result on average in a 0.003 increase in winning percentage, and if the team allows its opponent to score 3 more points, that will result on average in a 0.003 decrease in winning percentage.
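
The arithmetic behind those numbers is just the weight times the change in points. Here is a small sketch using the rounded weights reported above; the function and constant names are hypothetical and exist only for this illustration.

```python
# Rounded weights from the simple model's regression results.
COEF_POINTS_SCORED = 0.001
COEF_POINTS_SURRENDERED = -0.001


def predicted_change_in_win_pct(extra_scored=0, extra_surrendered=0):
    """Average change in winning percentage implied by the simple model."""
    return (COEF_POINTS_SCORED * extra_scored
            + COEF_POINTS_SURRENDERED * extra_surrendered)


# Three more points scored, three more points surrendered, and both at once.
print(f"{predicted_change_in_win_pct(extra_scored=3):+.3f}")
print(f"{predicted_change_in_win_pct(extra_surrendered=3):+.3f}")
print(f"{predicted_change_in_win_pct(extra_scored=3, extra_surrendered=3):+.3f}")
```

The third case nets out to zero, which is another way of seeing that the simple model only cares about the difference between points scored and points surrendered, not how those points came about.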

What is missing is what happens if the team throws an interception, allows a sack or recovers a fumble. How do on-field actions impact the production/efficiency of the team's offense or defense?

Periodically, over the rest of this month, I hope to address the "complex invasion sport production function" model - i.e. the model that allows me to rank NCAA FBS teams each week.