Alright, so check it out, today I’m diving into something I’ve been messing around with lately – baseball data! I’m calling it “baseball’s blue” because, well, I was staring at a bunch of baseball stats and felt a bit blue about not understanding them better. So, I decided to DO something about it!

First off, I grabbed some publicly available baseball data. Found a nice dataset online with batting stats, pitching stats, the whole nine yards. It was messy, like REALLY messy. CSV files all over the place, inconsistent formatting, you name it. So, step one: cleaning up that mess. I loaded it all into pandas in Python and started looking for missing values and weird data types. It took ages. I mean, seriously, I spent like a whole evening just fixing dates and making sure numbers were actually numbers.
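
Here’s a rough sketch of what that kind of cleanup can look like in pandas. It’s not the exact code from the project: the sample rows and the column names (`player`, `game_date`, `at_bats`, `hits`) are made up for illustration, and dropping incomplete rows is just one possible way to handle them.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for one of the messy CSV files.
raw_csv = io.StringIO(
    "player,game_date,at_bats,hits\n"
    "Player A,2023-04-01,4,2\n"
    "Player B,2023-04-02,three,1\n"
    "Player C,not a date,5,\n"
)

df = pd.read_csv(raw_csv)

# Coerce unparseable values to NaT/NaN instead of raising,
# so the bad cells are easy to find afterwards.
df["game_date"] = pd.to_datetime(df["game_date"], errors="coerce")
df["at_bats"] = pd.to_numeric(df["at_bats"], errors="coerce")
df["hits"] = pd.to_numeric(df["hits"], errors="coerce")

# Count what is missing per column before deciding what to do about it.
missing_counts = df.isna().sum()
print(missing_counts)

# One simple option (an assumption, not the only choice):
# keep only the rows where every field parsed cleanly.
clean = df.dropna()
print(clean)
```

Using `errors="coerce"` means a bad date or a number typed out as a word turns into a missing value rather than crashing the whole load, which makes the mess a lot easier to count.
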
Next up, I wanted to actually see something. I started by plotting some simple stuff. Like, what’s the distribution of batting averages? Who are the players with the most home runs? Basic stuff, using matplotlib. It was cool to see the visual representation, but it wasn’t telling me anything groundbreaking. Just confirming what I already kind of knew: some guys are good at hitting dingers.
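
For anyone who wants to try the same kind of plots, here’s a minimal sketch with matplotlib. The players and numbers are invented, and the column names (`avg`, `home_runs`) and the output filename are assumptions for the example, not anything from the real dataset.

```python
import matplotlib

matplotlib.use("Agg")  # draw straight to a file, no GUI window needed

import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers, only here so the example runs on its own.
batting = pd.DataFrame(
    {
        "player": ["A", "B", "C", "D", "E", "F"],
        "avg": [0.251, 0.302, 0.275, 0.229, 0.288, 0.264],
        "home_runs": [12, 35, 22, 5, 41, 18],
    }
)

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: distribution of batting averages.
ax_hist.hist(batting["avg"], bins=5)
ax_hist.set_xlabel("Batting average")
ax_hist.set_ylabel("Number of players")

# Right panel: the three players with the most home runs.
top_hr = batting.nlargest(3, "home_runs")
ax_bar.bar(top_hr["player"], top_hr["home_runs"])
ax_bar.set_xlabel("Player")
ax_bar.set_ylabel("Home runs")

fig.tight_layout()
fig.savefig("batting_overview.png")
```
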
Then I got a bit more ambitious. I wanted to see if there were any correlations between different stats. Does a higher on-base percentage really lead to more runs scored? I ran some correlation analyses using pandas. Turns out, yeah, it kinda does. But the correlations weren’t as strong as I expected. That made me think: maybe there are other factors I’m not considering.
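
A correlation check like that is only a couple of lines in pandas. This is a sketch with made-up team numbers (the `obp`, `runs`, and `stolen_bases` columns are hypothetical), so whatever values it prints say nothing about real baseball.

```python
import pandas as pd

# Invented team-level numbers, purely for illustration.
teams = pd.DataFrame(
    {
        "obp": [0.310, 0.325, 0.318, 0.335, 0.302, 0.341],
        "runs": [650, 720, 700, 760, 640, 790],
        "stolen_bases": [90, 60, 110, 75, 80, 95],
    }
)

# Pearson correlation between two columns (pandas' default method).
obp_runs_corr = teams["obp"].corr(teams["runs"])
print(f"OBP vs runs: {obp_runs_corr:.2f}")

# Pairwise correlations for every numeric column at once.
corr_matrix = teams.corr()
print(corr_matrix.round(2))
```

Keep in mind that a correlation only measures how well two stats move together in a straight line. It doesn’t say one causes the other, which is part of why other factors can still matter.
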
This is when I started digging into some more advanced stats. Things like WAR (Wins Above Replacement) and wRC+ (Weighted Runs Created Plus). Trying to understand those formulas was a headache. Honestly, I still don’t fully grasp them. But I figured out how to calculate them using the data I had. It was a lot of multiplying and dividing, and I definitely had to double-check my work a few times.
The real fun came when I started trying to predict something. I decided to see if I could predict a player’s future batting average based on their past performance. I split the data into training and testing sets, and then used a simple linear regression model from scikit-learn. I’m no data scientist, so I kept it simple. The results weren’t amazing, but they were better than just guessing. Plus, I learned a TON about how machine learning models work.
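
Here’s a minimal sketch of that split-then-fit workflow with scikit-learn. The data is synthetic (randomly generated "last season" and "next season" averages), and using a single feature plus a "guess the mean" baseline is my assumption about what "better than just guessing" could look like, not the actual setup from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: next season's average is pulled halfway
# back toward .260 from last season's, plus some random noise.
prev_avg = rng.normal(0.260, 0.025, size=200)
next_avg = 0.260 + 0.5 * (prev_avg - 0.260) + rng.normal(0, 0.015, size=200)

X = prev_avg.reshape(-1, 1)  # scikit-learn expects a 2D feature array
y = next_avg

# Hold out 25% of the players to test on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
model_mae = mean_absolute_error(y_test, model.predict(X_test))

# Baseline for comparison: always guess the training-set mean.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mae = mean_absolute_error(y_test, baseline_pred)

print(f"Linear regression MAE: {model_mae:.4f}")
print(f"Guess-the-mean MAE:    {baseline_mae:.4f}")
```

Comparing against a dumb baseline is the easy way to tell whether the model learned anything at all: if the regression error isn’t lower than the guess-the-mean error on the held-out players, the model isn’t adding much.
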
Finally, I decided to visualize my findings in a more interactive way. I played around with some dashboarding tools, like Tableau and Plotly. I ended up using Plotly because it was easier to embed into a webpage. I created a dashboard where you could select a player and see their stats, compare them to other players, and see how they’ve performed over time. It was pretty cool to see it all come together!
Lessons Learned: Data cleaning is the worst, but absolutely necessary. Visualizations are powerful. Even simple models can give you interesting insights. And baseball stats are way more complicated than I thought!
Overall, “baseball’s blue” was a fun little project. It helped me practice my data analysis skills and learn a bit more about baseball. Now, I’m thinking about tackling pitching stats next. Wish me luck!
