Baseball has been a passion of mine since I was young. It is an analytical game, where numbers and matchups often rule the decisions on and off the baseball diamond. Professional baseball teams track and store enormous amounts of data. Each team has statisticians combing through mountains of data in search of a needle that gives it an advantage.
I thought it would be fun to take some of this data and use it to predict the dollar value of Major League Baseball (MLB) free agents for 2017. Assessing a player’s value is an important part of the free agency process. Determining that value through predictive analytics gives management a baseline from which to approach free agent negotiations. For example, if Team A determines a player’s value is 10 million dollars but the market says the player is worth 15 million, perhaps Team A does not want to pay a 5-million-dollar premium for this player. This type of information is crucial for the mid- and lower-market baseball teams that do not have the payroll dollars of teams like the New York Yankees. The smaller teams cannot afford to make contract mistakes.
Before we begin, here are a few disclaimers on the model:
- I am only predicting hitters’ value. No pitchers are included in this model.
- I did not build in “Time value of money” for this model.
- There are a few baseball contract elements that I did not include that could affect player value:
  - No-trade clause
  - Loss of draft pick
  - Opt-out clause
Examples of Bad Free Agent Contracts:
Albert Pujols, 1B, Los Angeles Angels
- 10 years, $240 million (2012 – 2021)
- Will be making $30 million at the age of 41
Joe Mauer, C, Minnesota Twins
- 8 years, $184 million (2011 – 2018)
- Injury risk – bilateral leg weakness
- Unable to play catcher, now playing 1B
Alex Rodriguez, 3B, New York Yankees
- 10 years, $275 million (2008 – 2017)
- Yankees paid almost $45 million for seasons he never played in
I used three different tools in building my model. I could have completed this with fewer, but I wanted to use multiple tools for the example.
Oracle Database 12c – This is where I dumped all my statistical baseball data and transformed it into usable tables.
SPSS Statistics – This calculated the projected value of MLB free agents in 2017.
QlikView – I used this tool to visually represent the data.
Designing the Model
- Find statistics that correlate – I had to determine which baseball statistics were most correlated with free agent pay. After several iterations, I arrived at a model with an R squared of around .85, meaning that it explained about 85% of the variation in free agent pay. An R squared of .85 is excellent in this scenario, so I concluded that I had found several stats that predict player pay.
- Create weighted metric – For the type of regression model that I decided to use, I needed to condense these stats into one metric. I call this metric Weighted Wins Above Replacement (WWAR). WWAR is a combination of all the stats from step one, weighted by the player’s performance over the past three seasons, with the most recent season weighted most heavily.
- Use WWAR to predict – I used the WWAR stat inside SPSS to predict free agent pay for 2017.
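To make the three steps concrete, here is a minimal Python sketch of the idea. It is not the SPSS model itself. It assumes a single WAR-like number per season stands in for the combined stats, that the season weights are 0.5, 0.3, and 0.2 (I only stated above that the latest season counts most), and that the regression is a plain one-predictor least-squares fit. The function names and the training rows are made up for illustration.

```python
# Assumed weights for the last three seasons, most recent first (hypothetical values).
SEASON_WEIGHTS = (0.5, 0.3, 0.2)


def weighted_war(season_values, weights=SEASON_WEIGHTS):
    """Collapse three seasons of a WAR-like stat (most recent first) into one number."""
    if len(season_values) != len(weights):
        raise ValueError("expected one value per weighted season")
    return sum(w * v for w, v in zip(weights, season_values)) / sum(weights)


def fit_salary_model(wwar, salary):
    """Least-squares fit of salary = intercept + slope * wwar.

    Returns (slope, intercept, r_squared).
    """
    n = len(wwar)
    mean_x = sum(wwar) / n
    mean_y = sum(salary) / n
    sxx = sum((x - mean_x) ** 2 for x in wwar)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(wwar, salary))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared: share of the variation in salary explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(wwar, salary))
    ss_tot = sum((y - mean_y) ** 2 for y in salary)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up rows: (three seasons of a WAR-like stat, annual salary in $ millions).
history = [
    ((5.0, 4.0, 3.0), 20.0),
    ((2.0, 2.5, 3.0), 9.0),
    ((1.0, 0.5, 1.5), 4.0),
    ((3.5, 3.0, 2.0), 14.0),
]
wwar_values = [weighted_war(seasons) for seasons, _ in history]
salaries = [pay for _, pay in history]
slope, intercept, r_squared = fit_salary_model(wwar_values, salaries)

# Project annual pay for a hypothetical free agent.
projected = intercept + slope * weighted_war((4.0, 3.0, 3.5))
print(f"R^2 = {r_squared:.2f}, projected salary = ${projected:.1f}M")
```

Dividing by the sum of the weights keeps WWAR on the same scale as a single season, so changing the weights does not by itself inflate or shrink the projections.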
I took the output from SPSS, created my projections, and uploaded the data to QlikView. Here are a few things I found:
Let’s start with what the model says are the best contracts for major league teams. At the very top of the list is Justin Turner. My model projects that Mr. Turner’s value during free agency was almost 23 million dollars, but the Dodgers signed him for an annual salary of 16 million, a discount of about 7 million dollars. This discount for the Dodgers has only gotten better, as Justin Turner is having an excellent 2017 season.
At the top of the bad contracts sits Yoenis Cespedes. He was signed during the offseason by the New York Mets for 27.5 million dollars annually. My model has Yoenis valued at almost 19 million dollars, an 8.5-million-dollar overpayment. Based on what Yoenis has done thus far in the season, he is almost perfectly on pace to finish with a value of 19 million.
Just below Yoenis Cespedes is Carlos Beltran. It is no secret that as athletes get closer to 40 years of age, their athleticism declines. My model also knows that age matters, and it projected Beltran’s salary to be just shy of 8 million. The Houston Astros paid 16 million dollars, double my model’s projected dollar amount. Beltran has not played well during the 2017 season and has generated little value for his team. It is mathematically impossible for him to generate 16 million dollars of value with the number of games remaining in the season.
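The comparison behind these rankings is simple arithmetic: actual annual salary minus projected value. The short Python sketch below reproduces it using the rounded figures quoted in this post ("almost 23", "almost 19", and "just shy of 8" are entered as 23, 19, and 8), so the gaps are approximate. The function names are only illustrative.

```python
def contract_gap(projected_m, actual_m):
    """Actual minus projected annual value in $ millions (positive means overpayment)."""
    return actual_m - projected_m


def rank_by_overpayment(contracts):
    """Return (player, gap) pairs sorted from largest overpayment to largest discount."""
    gaps = [(name, contract_gap(proj, actual)) for name, (proj, actual) in contracts.items()]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


# Rounded figures from this post: (projected value, actual annual salary), in $ millions.
contracts = {
    "Justin Turner": (23.0, 16.0),
    "Yoenis Cespedes": (19.0, 27.5),
    "Carlos Beltran": (8.0, 16.0),
}

ranked = rank_by_overpayment(contracts)
for name, gap in ranked:
    label = "overpaid by" if gap > 0 else "discount of"
    print(f"{name}: {label} ${abs(gap):.1f}M per year")
```

A negative gap marks a team-friendly deal, which is why Turner lands at the bottom of this list and at the top of the best-contracts list.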
The model has been extraordinarily accurate for 2017. In upcoming posts, I will be sharing additional results and detailing how the model was built in SPSS.
If you have any questions about SPSS or how this model was built, please feel free to contact me at: