MLB Batting Statistical Analysis

Jesse Berkowitz
5 min read · Apr 30, 2020

For this project, I used a dataset of 2008 MLB position players' batting statistics. I attempted to predict salary with the lowest mean absolute error. The average salary was 4.35 million dollars, and predicting that average for every player yields a baseline mean absolute error of 3.8 million dollars.
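The baseline described above, predicting the league-average salary for every player, can be sketched as follows. The salary values here are made up for illustration; the real figures come from the 2008 dataset.

```python
import numpy as np

# Hypothetical salaries (dollars); the real data is the 2008 MLB
# position-player dataset described above.
salaries = np.array([400_000, 2_500_000, 4_350_000, 10_000_000, 27_500_000])

# Baseline model: predict the average salary for every player.
baseline_pred = salaries.mean()

# Mean absolute error of that constant prediction.
baseline_mae = np.abs(salaries - baseline_pred).mean()
print(f"Baseline MAE: ${baseline_mae:,.0f}")
```

Any model worth keeping has to beat this constant-prediction MAE.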

I used feature engineering to create new features. One of these features I called the “A-Rod Factor”. The idea came about when I viewed the top-paid players, Alex Rodriguez being the highest, and found similarities in various statistical categories. The “A-Rod Factor” comprises plate appearances, RBIs, home runs, total bases, and walks, each weighted. It turns out that simply having a high count of these statistics, regardless of your other batting statistics or your batting efficiency, has a large impact on your salary.
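A feature like this is just a weighted sum of raw counting stats. The weights below are purely illustrative, since the article does not publish the actual values:

```python
import pandas as pd

# Hypothetical weights -- the actual weighting scheme isn't published here,
# so these are placeholders for illustration only.
AROD_WEIGHTS = {"PA": 1.0, "RBI": 2.0, "HR": 3.0, "TB": 1.5, "BB": 1.0}

def arod_factor(df: pd.DataFrame) -> pd.Series:
    """Weighted sum of counting stats: plate appearances, RBIs,
    home runs, total bases, and walks."""
    return sum(w * df[col] for col, w in AROD_WEIGHTS.items())

# Made-up stat lines for two players
players = pd.DataFrame({
    "PA": [708, 650], "RBI": [103, 57], "HR": [35, 9],
    "TB": [306, 210], "BB": [65, 50],
})
players["arod_factor"] = arod_factor(players)
```

Because the inputs are raw counts rather than rates, a durable everyday player accumulates a high score even with mediocre efficiency, which is exactly the effect noted above.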

I engineered another feature called “Team Contribution”. It combines RBIs per plate appearance, runs per plate appearance, RBIs, runs, on-base percentage, and defensive value, each weighted. I believe “Team Contribution” is a relatively precise measure of a player’s impact on their team in terms of batting efficiency, on-base presence, and defensive value.
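This feature mixes rate stats with counting stats. A minimal sketch, again with made-up weights and a hypothetical `DEF` column standing in for defensive value:

```python
import pandas as pd

# Hypothetical weights; the original weighting scheme isn't published.
TC_WEIGHTS = {"rbi_per_pa": 10.0, "r_per_pa": 10.0, "RBI": 0.5,
              "R": 0.5, "OBP": 5.0, "DEF": 1.0}

def team_contribution(df: pd.DataFrame) -> pd.Series:
    """Weighted mix of rate stats (RBI/PA, R/PA), counting stats
    (RBI, R), on-base percentage, and a defensive-value column."""
    parts = pd.DataFrame({
        "rbi_per_pa": df["RBI"] / df["PA"],
        "r_per_pa": df["R"] / df["PA"],
        "RBI": df["RBI"], "R": df["R"],
        "OBP": df["OBP"], "DEF": df["DEF"],
    })
    return sum(w * parts[col] for col, w in TC_WEIGHTS.items())
```

Including both rates and totals means the feature rewards efficiency without ignoring playing time.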

I then used linear regression and random forest regression to find which features would produce the lowest mean absolute error in salary. The graph above shows two linear regression and two random forest regression models. One model of each type uses standard batting metrics (HR, RBI, PA, AVG, and R), and the other uses my engineered features. As you can see, the models with standard batting metrics did not even beat the baseline MAE of 3.8 million dollars: the linear regression MAE was about 4 million dollars, and the random forest regression MAE was about 3.9 million dollars.
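A comparison like this has a standard scikit-learn shape. The data below is synthetic, standing in for the real 2008 batting features, so the printed MAEs are not the ones reported above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data; the real inputs were HR, RBI, PA, AVG, R
# (or the engineered features) from the 2008 dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 4.35e6 + 2e6 * X[:, 0] + rng.normal(scale=1e6, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

maes = {}
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    maes[type(model).__name__] = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{type(model).__name__}: MAE = ${maes[type(model).__name__]:,.0f}")
```

Swapping in a different feature set is just a matter of changing the columns of `X`, which is how the four models above were compared.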

The most successful linear regression model used the “A-Rod Factor” to predict salary and had an MAE of under 3.6 million dollars. The most successful random forest model used “Team Contribution” and had an MAE of about 3.3 million dollars. Both of these models beat the baseline.

The features which I engineered were more accurate predictors of salary than the standard MLB metrics, but it still seemed as though there was something odd happening within this dataset, and I wanted to learn more about why these errors were still relatively high.

The next step I took was to divide the dataset into nine subsets based on salary and test each one individually, comparing its features to the means of those features across the overall dataset. What I found was a bit staggering. The lowest-paid players, with salaries in the 390k-500k dollar range, were outperforming the players being paid 3–6 million dollars in just about every useful statistical category.
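One way to carve salary into equal-sized bands and compare each band against the overall averages is `pd.qcut` plus a groupby. This is a sketch with synthetic data, and `qcut` is an assumption about how the nine subsets were formed:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in frame; the real one holds 2008 batting stats + salary.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "salary": rng.uniform(390_000, 28_000_000, size=200),
    "HR": rng.integers(0, 45, size=200),
    "RBI": rng.integers(0, 130, size=200),
})

# Nine equal-count salary bands, then each band's mean stats
# relative to the overall mean (ratio > 1 means above average).
df["salary_band"] = pd.qcut(df["salary"], q=9, labels=False)
band_means = df.groupby("salary_band")[["HR", "RBI"]].mean()
relative = band_means / df[["HR", "RBI"]].mean()
print(relative.round(2))
```

On the real data, the surprise was that the lowest band's ratios sat above 1 while some mid-salary bands did not.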

As shown in these two graphs, the performance of the 390k-500k salary players was matching or outperforming the 3–6 million dollar players in RBI, HR, “Team Contribution”, and “A-Rod Factor”. At this point I realized that no matter what features I engineered, this performance anomaly within the low salary range would make it nearly impossible to accurately predict salary across the overall dataset.

So I pivoted the objective of the project. I standardized HRs, RBIs, “Team Contribution”, and “A-Rod-Factor”, and calculated what each dollar of salary should produce in these categories. I then found the actual performance for each salary range versus its expected performance.
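The "expected performance" idea amounts to computing league-wide production per salary dollar and scaling it by each band's payroll. A toy version with made-up band totals (the real numbers come from the 2008 dataset):

```python
import pandas as pd

# Toy per-band totals -- band label, combined salary, combined HR.
bands = pd.DataFrame({
    "band": ["390k-500k", "3M-6M", "15M+"],
    "salary": [5_000_000, 40_000_000, 60_000_000],
    "HR": [120, 150, 130],
})

# League-wide production per salary dollar defines "expected" output.
hr_per_dollar = bands["HR"].sum() / bands["salary"].sum()

bands["expected_HR"] = bands["salary"] * hr_per_dollar
bands["perf_ratio"] = bands["HR"] / bands["expected_HR"]  # >1 = overproducing
```

The same ratio can be computed for RBIs, “Team Contribution”, and “A-Rod Factor”; on the real data it is this ratio that exceeded 10 for the lowest band.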

This chart is perhaps the most staggering of all. The lowest paid players are producing 10+ times their worth in all of these categories. The 2nd and 3rd lowest ranges are also overproducing, and then there is a steady and gradual decline in expected performance as the salary ranges increase.

I was curious to understand why the lowest range was overperforming by so much, so I dug deeper and found the year and contract of every high-performing player in the 390k-500k range. These results were also shockingly consistent. The majority of these high-performing, low-paid athletes were on their first contract and in their 3rd or 4th year in the league. There were also a few outliers who had been in the league for 5–6 years with no success, and then peaked in this particular year. Some of the notable 3rd- and 4th-year players in this range were Shane Victorino, Dustin Pedroia, Andre Ethier, and Jacoby Ellsbury.

What started as an attempt to predict salary blossomed into a quest to understand value within the MLB system. I learned two major concepts on this journey. Firstly, no matter how well you design features, if an entire subset of your data is acting as a collective anomaly, your ability to predict will be compromised. Secondly, winning a championship in professional baseball, and potentially in any professional team sport, is highly correlated with the drafting process. By properly assessing talent in rookies and young players, you can gain value without sacrificing payroll to get it. The money saved can then be used to purchase high-caliber existing talent, or the specific position players needed to fill the areas where your roster is weak, giving yourself the best possible chance to become a sustainable winning franchise. I thought I understood this concept before, but seeing the actual data makes it a lot more vivid. I suspect that many of the teams that draft well have a deep understanding of data science, and use this information to maximize their ability to win.

In MLB, unlike perhaps the NBA, a player on his first contract, brought up through the minor league system, will typically be underpaid. When this player reaches his second contract or becomes a free agent, he will then be compensated far more fairly for his production. And so it is in this first contract that team value can be created. By supplementing all-star-caliber players with high-caliber first-contract players, a team can create incredible value and position itself to win championships.
