Wednesday, April 4, 2012

Estimating MLB Player Run Productivity

Yesterday, I went though a step-by-step guide as to how to determine the marginal product of various MLB team statistics with respect to run productivity at the team level. Yesterday, I treated each team as equal - which may not exactly be true since teams play in different ballparks and this may affect run production. I will come back to this later this month and re-do the analysis by breaking out each team in the overall analysis. I might also just look at different leagues as the AL might be more run producing since pitchers do not bat. It should not take long, so hopefully I will have this done soon.

Today, I want to use the estimates at the team level (see the end of the blog for the actual numbers), and use them to estimate the runs scored productivity at the batter level for last (2011) season. This is not all that difficult, so here are the steps.

Step 1: Open the Batting.csv file and copy the batting data to another file. (You do not have to do this, but I do, so I am calling this step 1).

Step 2: Copy the regression results (estimated coefficients) from the MLB team regression to the right of the player data.

Step 3: Create a formula for the first players run productivity. This is done my taking the coefficient of a single times the number of singles for the first player, plus the coefficient of a double times the number of doubles for the first player, plus the coefficient of a triple times the number of triples for the first player. Keep doing this for all ten variables. Once you have this formula, then copy the formula for the first player and paste it for all the players that year. Note: you might want to put a $ sign in front of the row number of the coefficient in the formula so that Excel keeps referring back to that cell. For example if the coefficient of a single is in cell AE3, then type in $AE$3 instead. This will force Excel to always use that cell when calculating the coefficient of a single. Do this for each of the ten variables.

Step 4: Sort the data by whatever you label the cell for the formula above. I called it Runs Scored, and sort by largest to smallest. Doing so yields the following. Note: that the batters name is using Mr. Lahmen's labeling, but you should be able to figure out who is who rather easily. Here are the top 50, with Miguel Cabrera the most productive batter last year, just above Jose Bautista.

Rank playerID teamID lgID Runs Scored
1 cabremi01 DET AL 114.74
2 bautijo02 TOR AL 113.97
3 ellsbja01 BOS AL 111.70
4 kempma01 LAN NL 110.94
5 braunry02 MIL NL 108.43
6 vottojo01 CIN NL 106.54
7 gonzaad01 BOS AL 105.99
8 fieldpr01 MIL NL 103.42
9 grandcu01 NYA AL 102.57
10 pedrodu01 BOS AL 97.77
11 uptonju01 ARI NL 93.97
12 gordoal01 KCA AL 92.97
13 konerpa01 CHA AL 91.48
14 youngmi02 TEX AL 90.95
15 canoro01 NYA AL 90.92
16 pujolal01 SLN NL 88.55
17 berkmla01 SLN NL 88.48
18 kinslia01 TEX AL 86.93
19 ortizda01 BOS AL 86.04
20 teixema01 NYA AL 85.38
21 ramirar01 CHN NL 85.12
22 tulowtr01 COL NL 84.38
23 reyesjo01 NYN NL 82.24
24 mccutan01 PIT NL 82.19
25 morsemi01 WAS NL 81.96
26 stantmi03 FLO NL 81.73
27 cabreme01 KCA AL 80.32
28 zobribe01 TBA AL 79.82
29 phillbr01 CIN NL 78.12
30 napolmi01 TEX AL 77.13
31 avilaal01 DET AL 76.96
32 butlebi03 KCA AL 75.95
33 francje02 KCA AL 75.69
34 gonzaca01 COL NL 75.55
35 swishni01 NYA AL 75.55
36 martivi01 DET AL 75.44
37 castrst01 CHN NL 75.15
38 beltrad01 TEX AL 74.48
39 howarry01 PHI NL 74.10
40 victosh01 PHI NL 73.99
41 hamiljo03 TEX AL 73.48
42 cabreas01 CLE AL 73.33
43 markani01 BAL AL 73.21
44 santaca01 CLE AL 73.07
45 bruceja01 CIN NL 71.89
46 jonesad01 BAL AL 71.66
47 peraljh01 DET AL 71.14
48 penaca01 CHN NL 71.04
49 longoev01 TBA AL 70.86
50 hollima01 SLN NL 70.37