My first project from my summer with the Hyannis Harbor Hawks.
When it comes to Cape League roster attrition, there is only so much you can prepare for. In March, every team’s roster looks great, full of All-Americans and future first-rounders. On Opening Day in June, the managers are probably still pretty happy with where they stand: maybe a pitcher or two dropped out, but the core of the roster is still intact. Soon after, though, the Cape teams are scrambling. The ace of your rotation is getting shut down for the summer. An infielder hits the transfer portal and needs to take visits. A school keeps a high-leverage reliever on campus to work on a strength program, and your top option for replacing him is already in another league. Your first baseman’s team goes on an unexpected run in Omaha, Team USA poaches a few of your very best, and then an MLB team decides to take a flyer on your starting center fielder. This isn’t specifically what happened to the Harbor Hawks, or any Cape team, but it’s representative in both quantity and impact. The reasons may be different for each player and team, but no matter what, every roster looks completely different during the August playoffs than it did in the preseason.
The replacement game is critical for Cape League teams, and our operations/analytics group for Hyannis was empowered to find potential players and suggest them to the front office. Hyannis used 27 batters and 32 pitchers over the course of the season, and the front office seriously considered all of our suggestions; some of those players became critical contributors. After a semester learning how to use machine learning models in Python, I wanted to try using basic NCAA stats to predict Cape League offensive production. Ideally, this model would identify overlooked candidates for further vetting and also give some insight into how a batter’s conference can impact their Cape performance.
Data Preparation
My NCAA data came from baseballr (Bill Petti and Saiem Gilani), where I was able to pull stat lines for every D1 player for the 2021-2023 seasons. Unfortunately, trying to pull 2019 caused errors I couldn’t quite fix or explain, so I had to go with just three seasons. A possible next step for this model would be to use Robert Frey's collegebaseball package instead, once he finishes it. With the current setup, I went year by year and pulled the players from each university in that year. The data came with some column-shifting issues, which I fixed in R and then wrote to a CSV for each year.
Conference strength ratings were a key goal of this project from the very beginning, because with apologies to Gabe Appelbaum, OPSing .900 in the Atlantic 10 is pretty different from doing it in the SEC. Boyd's World was the best resource for this, so I pulled season-end conference ratings for each year I was using. Since schools play fairly different non-conference schedules, a more in-depth approach would be to use a comprehensive strength-of-schedule metric instead.
Cape Cod League stat lines were the final piece of the puzzle, and I calculated those with the play-by-play files I had. I brought the CSVs into Python, where I’m more comfortable and know the machine learning packages. Before merging, I chose to standardize certain names. For example, if a “Zachary” was listed as “Zach” in the NCAA data, I wanted to make sure his data correctly merged with the Cape data that listed him as “Zachary.” ChatGPT helped generate a list of common names and their replacements (Michael → Mike, Steven → Steve, etc.). I also stripped all whitespace to get rid of any inconsistencies there.
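Here’s a minimal sketch of that cleanup step, assuming hypothetical file and column names (the real CSVs came out of the R scraping step):

```python
import pandas as pd

# Hypothetical file names from the R scraping and Cape play-by-play steps.
ncaa = pd.read_csv("ncaa_batting_2021_2023.csv")
cape = pd.read_csv("cape_batting_2021_2023.csv")

# A small slice of the (ChatGPT-generated) nickname map.
nicknames = {"Michael": "Mike", "Steven": "Steve", "Zachary": "Zach"}

for df in (ncaa, cape):
    # Strip stray whitespace, then map names to one common standard.
    df["first_name"] = df["first_name"].str.strip().replace(nicknames)
    df["last_name"] = df["last_name"].str.strip()

merged = ncaa.merge(cape, on=["first_name", "last_name"],
                    suffixes=("_ncaa", "_cape"))
```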
After merging the trifecta of college stats, conference strength and Cape data, I used the 2024 MLB wOBA weights to calculate both NCAA and Cape wOBA. This was certainly not an ideal strategy, and fellow interns Gabe, Aidan and Richard eventually calculated Cape wOBA properly, but at the time this was the best I had. If anyone has NCAA wOBA weights, feel free to reach out, though of course the run environment differs wildly across divisions and even conferences.
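As a sketch of the calculation (the weights below are approximate stand-ins; the exact 2024 constants live on FanGraphs’ Guts! page):

```python
# Approximate 2024 MLB linear weights -- illustrative, not the exact values.
W = {"BB": 0.69, "HBP": 0.72, "1B": 0.88, "2B": 1.25, "3B": 1.58, "HR": 2.03}

def add_woba(df, suffix):
    """Compute wOBA from counting stats; expects columns like H_ncaa, HR_cape."""
    g = lambda col: df[f"{col}{suffix}"]
    singles = g("H") - g("2B") - g("3B") - g("HR")
    numer = (W["BB"] * g("BB") + W["HBP"] * g("HBP") + W["1B"] * singles
             + W["2B"] * g("2B") + W["3B"] * g("3B") + W["HR"] * g("HR"))
    # Standard denominator: AB + BB - IBB + SF + HBP (IBB omitted here).
    denom = g("AB") + g("BB") + g("SF") + g("HBP")
    df[f"woba{suffix}"] = numer / denom

add_woba(merged, "_ncaa")
add_woba(merged, "_cape")
```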
The Cape wOBA was my target (or “Y”) variable, and the NCAA stats (including wOBA and conference rating) were my feature (or “X”) variables. After splitting my data into an 80/20 train/test split to score the models, I went into training.
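In sklearn terms, the setup looked something like this (the column names are my stand-ins from the merge step above):

```python
from sklearn.model_selection import train_test_split

# NCAA-side stats plus conference rating as features; Cape wOBA as target.
features = ["AB_ncaa", "H_ncaa", "2B_ncaa", "3B_ncaa", "HR_ncaa", "SO_ncaa",
            "BB_ncaa", "OBP_ncaa", "SLG_ncaa", "woba_ncaa", "conf_rating"]
X = merged[features]
y = merged["woba_cape"]

# 80/20 split; a fixed random_state keeps the model comparisons repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```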
Model Training
I initially went with four regression models to compare (a quick setup sketch follows the list):
Multiple Linear
Bagging
Random Forest
XGBoost
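Here’s roughly what that comparison looked like, with every model left on its sklearn/xgboost defaults:

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

models = {
    "Multiple Linear": LinearRegression(),
    "Bagging": BaggingRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # Train R^2 vs. test MAE makes the overfitting gap easy to see.
    print(f"{name}: train R2={model.score(X_train, y_train):.3f}, "
          f"test MAE={mean_absolute_error(y_test, preds):.4f}")
```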
Here are their scores without any tuning:
The more complex model types show classic overfitting, which both more data and better tuning can improve. (Side note: what other model scores should I be looking at?) The bagging model was the most promising, so I moved forward with it.
Model Tuning
I used three strategies to tune the bagging regressor model (a sketch of the grid search follows the list):
Grid Search to find the best values for the number of estimators, the maximum share of samples used to train each estimator, and the maximum share of features used to train each estimator
Result: n_estimators=50, max_samples=.6, max_features=.2
Recursive Feature Elimination to select the optimal number and subset of features. Another intern recommended I sum doubles and triples into one column, which helped the model.
Result: AB, H, 2B+3B, HR, SO, BB, OBP, SLG, wOBA, Conference Rating
At-bat qualifiers to ensure a hefty sample size for each player in both of their seasons. After starting with a one-per-game qualification for training and testing (56 for the NCAA season, 44 for the Cape season), I tuned the cutoffs for the best possible model.
Result: NCAA Season minimum 56 AB, Cape season minimum 50 AB
204 rows before the train/test split
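For anyone curious, the grid search step in sklearn looks something like this; the candidate values in the grid are stand-ins, not the exact ones I searched:

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

# Candidate values are illustrative; the search landed on 50 / 0.6 / 0.2.
param_grid = {
    "n_estimators": [10, 25, 50, 100],
    "max_samples": [0.2, 0.4, 0.6, 0.8, 1.0],
    "max_features": [0.2, 0.4, 0.6, 0.8, 1.0],
}

search = GridSearchCV(
    BaggingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)
best_model = search.best_estimator_  # the tuned bagging regressor
```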
Model Accuracy Results:
While still clearly overfit, the tuning process sliced my error metrics in half. With the standard deviation of Cape wOBA at .0563 for these years, I felt the model was fairly good. To put that into perspective, I found that the predicted Cape wOBA was within .09 of the actual value over 92% of the time. So it’s not exactly a pinpoint predictor: if my model predicted a player’s wOBA at .320 (above average), we would be 92% sure he would actually land between a .230 and a .410 wOBA on the Cape, the difference between an abysmal season and a historically great one. I’ll get more into how I actually used the model in a later section.
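That 92% figure came from a simple empirical check on the held-out predictions, along these lines:

```python
import numpy as np

preds = best_model.predict(X_test)  # best_model: the tuned bagging regressor
errors = np.abs(preds - y_test)

# Share of held-out players whose actual Cape wOBA landed within .090
# of the prediction (~92% in my case).
coverage = (errors <= 0.09).mean()
print(f"Within .090: {coverage:.1%}")
```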
It makes sense that NCAA wOBA was the most important feature, and that OBP was the next most. The early assumption that conference strength would be a useful feature also turned out to be true. Our General Manager Nick Johnson always says that on the Cape, “the whiff translates, but the power doesn’t,” and now he’s got some data to back him up: strikeouts were much more important to the model than home runs.
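Since a bagging regressor doesn’t expose feature importances out of the box, one way to build a ranking like this is permutation importance (a sketch, not necessarily the exact method I used):

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and measure how much the score drops;
# bigger drops mean the model leaned on that feature more.
result = permutation_importance(best_model, X_test, y_test,
                                n_repeats=20, random_state=42)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"{X_test.columns[idx]}: {result.importances_mean[idx]:.4f}")
```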
The fun part for me was looking at what the model actually predicted for 2024, and seeing if it passed the hype test. I took the 2024 NCAA data and used it to predict what each player’s Cape wOBA would be. Here are the top 5 predicted Cape wOBAs:
TOP 5 CAPE WOBA PREDICTIONS
This is a list that encompasses the best and worst of my model. At first glance, I was excited to see Charlie Condon and Jac Caglianone ranked highly, as the two best hitters in college baseball this year. I tried to tell the Harbor Hawks to consider taking them, but apparently they were busy with the MLB Draft or something. The two guys above them are more interesting. To me, Mark Shallenberger’s (Evansville) prediction isn’t bad at all. He had an incredible college season with a 1.200 OPS, and he walked 50% more than he struck out. He’s not Condon, but I would love to have him on the Cape. My model doesn’t know about eligibility, though, and he’s exhausted his. He did, however, put up some good stats in the Northwoods and Cape Leagues in previous years before slumping in the MLB Draft League this summer. According to his Twitter, he’s still waiting on undrafted free agent opportunities, and I would love to see a team give him a shot.
Jordan Smith (George Mason) wasn’t as impressive as Shallenberger, and he played in a weaker conference, but I do at least see some of the vision. He strikes out a lot but hits for some solid gap power. He’s done with college baseball as far as I can tell, but I’m not sure where he’ll go next. Preston Shelton (Murray State) is where the model gets way off the mark: a .658 OPS in an average conference? I just can’t grasp what the model saw in him, but maybe a few very similar profiles raked in the Cape League and this is just another symptom of overfitting.
Want to look at the model’s prediction for your favorite player? See the Google Sheet here.
Applications
Due to the low confidence levels of my model and some obvious whiffs on certain players, the predictions became more of a name-finder than anything. When we needed to fill a certain position, I would filter for freshmen and sophomores at that position, sort by their wOBA prediction and go down the list (the filter itself is sketched after the scouting notes below). Isaac Wachsmann (Xavier) was a player I found this way, who I felt was an under-the-radar Cape candidate with a .322 wOBA prediction (73rd percentile). We needed a corner outfielder at the time, and the model liked him, so I dove deeper:
This year: 1.132 OPS with a .700 SLG in 108 PA
Got some BABIP luck but still pretty interesting.
10 HR, really good for his number of PA
24.1% K rate, 11% BB rate
Somewhat passive in terms of swinging: 66% IZSwing, 18% Chase, 30% FPS
vs 90+ mph (n=125 pitches):
.526 SLG
14.3% Chase
11.1% In Zone Whiff
15.6% Whiff (fairly low whiffs)
91.2 avg ExitVelo
111 max ExitVelo
10% Barrel
vs 75+ Breakers (n=121):
.667 SLG
32.8% Chase
16.7% IZ Whiff
36.2% Whiff
93.8 ExitVelo
111.5 Max ExitVelo
22% Barrel
Gonna miss a lot versus breakers and doesn't have great swing decisions overall, but I don't think it's unplayable given what we need on our team at the moment. Exit velos against those good pitches are pretty promising as well. Swing looks quick with nice hands; good line-drive hitter.
I'm going to take a deeper look at defense later; at first look it seems playable but average at best as a corner OF.
The model found the name, but I did a lot more work before I put him in front of my bosses. In the end, we didn’t sign him, but Orleans actually did. He slashed .188/.229/.344 for them in 35 plate appearances, a (low-sample) miss by my model.
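For reference, the name-finder filter itself was nothing fancy; here’s a sketch with hypothetical column names:

```python
# Hypothetical columns on the 2024 prediction table:
# 'position', 'class_year', 'pred_cape_woba'.
def name_finder(preds_2024, position):
    """Surface underclassmen at a position, best predicted wOBA first."""
    pool = preds_2024[
        (preds_2024["position"] == position)
        & (preds_2024["class_year"].isin(["FR", "SO"]))
    ]
    return pool.sort_values("pred_cape_woba", ascending=False)

# e.g. the corner-outfield search that surfaced Isaac Wachsmann
candidates = name_finder(preds_2024, "OF")
```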
Lessons + Extensions
I would love to extend this project with advanced data on each hitter rather than just traditional counting and rate stats. Specifically, I think Whiff%, Chase%, Launch Angle and EV90th (or your exit velocity stat of choice) would take this model even further. Plus, of course, more data is always better.
Although the model didn’t become a go-to tool for Hyannis, the process still taught me plenty. As you can expect in data science anywhere, the most frustrating and time-consuming part was the data scraping and cleaning. Tuning was tough as well, but ultimately rewarding as the model improved with each iteration. Taking this project from raw data to final predictions sharpened my skills across many different Python concepts.
Thanks and Credits
Thank you to the other analytics interns: Aidan Beilke, Gabe Appelbaum, Richard Legler and Tyler Warren. Richard recommended Boyd’s World and Tyler shared his code to convert play-by-play files to player stat lines. All four were incredibly helpful when I had questions about code and model tuning.
This video from Robert Frey helped a lot when it came to using the baseballr package for NCAA data. Excited to see more with the collegebaseball package!
Thanks a ton to Boyd's World for the conference strength ratings, not to mention the other fascinating stuff on the site. My personal favorite is the External Factor Index.
Credit to the creators of the baseballr package, Bill Petti and Saiem Gilani.