Specifically, I am going to look at the marginal value (economists call this marginal product) of a MLB player. To accomplish this involves two tasks; one is the estimate the marginal value of various MLB team statistics with respect to runs scored and the second is to take these values and use them to estimate the marginal productivity of MLB players. Today I am going to go through the first task - estimating the MLB team runs scored model and tomorrow I plan on going over the second task and report the top 50 MLB players for last season using this model. So, let's turn to the first task.
Today, we are estimating MLB team runs scored production. Given that baseball is a much different sport than either football or hockey, we can analyze the model by just using the team's on-field production numbers that significantly affect runs scored. This method is similar to Ferdinard Lane, Pete Palmer's Linear Weights and Asher Blass' Runs Created methodologies. So here is a step-by-step guide as to how to estimate the marginal value of various MLB team statistics.
Step 1: Download (for free) from Sean Lahman's website the latest baseball database, which as of April 2012 is version 5.9.1 and I choose the comma-delimited version, which opens up nicely as Microsoft Excel files.
Step 2: Unzip file file and open up the Teams.csv file. Since the file starts with 1871 season, I just copy the team data that I am going to use (2000-2011) to another Excel file.
Step 3a: I organize the team data. Since I will need the following variables: R, AB, H, 2B, 3B, HR, SB, SF, CS and BB from this file, I put them all next to each other in the new team file.
Step 3b: I will need to calculate the following variables: 1B = H - 2B - 3B - HR; NBB = BB - IBB; GIDPCS = GIDP + CS; and Outs = AB - H. Now I have good news and bad news. The good news is that Mr. Lahman's database has all but two of the variables already entered at the team level. The bad news is that Mr. Lahman's database has all but two (IBB and GIDP) of the variables to make those calculations above already entered at the team level. Thus I will need to get this data at the team level to finish out the calculations.
Step 4: To get the team level IBB (intentional walks) and GIDP (grounded into a double play), I have to open the Batting.csv file that I unzipped in step 2. In the Batting.csv file, column S has the IBB data by player and column W has the GIDP data by player. So to go from the player level to the team level, I will need to aggregate up to the team level. Since the data is sorted alphabetically by player name, this data will need to be sorted for each season and for each team. Thankfully, Excel has an excellent sort function that will do that rather quickly.
Step 5: Sort the player data in the Batting.csv file. I highlight the entire data (click on the upper left of the grid in Excel) and under Data click on Sort, which opens up a box. In that box, I choose at the top right the box next to "My Data has headers", and then in the dropdown menu to the right of "Sort by" I choose YearID. Next I click Add Level - which is at the very top left of this box - and in the new dropdown menu choose lgID; and finally click Add Level again and in the newest dropdown menu choose TeamID. Now click OK near the bottom right of the box.
Step 6: Once step 5 is completed, on row 79251 begins the 2000 season, and the 2011 season ends at row 95196. I just copied this data to another Excel file to work on. At this point I insert a blank row at the end of each team for each season for a place to make the IBB and GIDP calculations, which might take a few minutes. Once that is done, just sum up the columns (by team) for IBB and GIDP starting with the 2000 season and finishing with the 2011 season. In order to save a little time in the next step, I also include each time I calculate a team's IBB and GIDP the season, teamID and League, so that when I sort this data in the next step, it will copy and paste nicely into the Team.csv file in the right order.
Step 7: Sort the IBB and GIDP data by season, league and team (similar to step 5). This should allow you to copy and paste the IBB and GIDP data back into the Team.csv file where each datapoint correctly corresponds to the season, league and team in the Team.csv file. Check on this, or step 6 will be done in vain, and your results will be wrong.
Step 8: Run the following linear regression: R = f(1B, 2B, 3B, HR, NBB, HBP, SB, GIDPCS, SF, Outs). I actually choose to do this in another program, which allowed me to control for heteroskedasticity (unequal scatter in the data). Here are the estimated results for these 12 seasons of data.
Coefficient | Std. Error | t-stat | prob. | |
C | 151.167 | 121.681 | 1.242 | 0.215 |
SINGLE | 0.539 | 0.021 | 25.618 | 0.000 |
DOUBLE | 0.710 | 0.045 | 15.925 | 0.000 |
TRIPLE | 1.147 | 0.134 | 8.571 | 0.000 |
HR | 1.430 | 0.041 | 34.607 | 0.000 |
NBB | 0.329 | 0.021 | 15.793 | 0.000 |
HBP | 0.420 | 0.089 | 4.742 | 0.000 |
SB | 0.100 | 0.041 | 2.401 | 0.017 |
GIDPCS | -0.176 | 0.069 | -2.563 | 0.011 |
SF | 0.604 | 0.176 | 3.424 | 0.001 |
OUTS | -0.146 | 0.028 | -5.128 | 0.000 |
A few observations: first, that other than the constant term, each of the variables is statistically significant at the 99% confidence level and of the correct sign. Only GIDPCS and OUTS are negative, which both are using up the finite and scarce resource in baseball: outs. Second, also that HR's have a higher coefficient than Singles (1B), which means that a HR will on average generate more runs than a single. All the other coefficients seem to make sense as well. Finally, note that the coefficient on GIDPCS is greater in absolute value than SB, which means that if a player made two stolen base attempts, and was successful on one attempt but not on the other, that player on average would have cost his team more than he benefited his team.
Tomorrow I plan on reporting the top 50 MLB batters for the 2011 season.