Unlocking MLB Secrets: How Clustering Reveals Trends in Batter Strategies


Summary

Unlocking the secrets of MLB through data analysis offers insights into batter strategies that can transform the game. Understanding these dynamics is essential for fans, analysts, and teams alike.

Key Points:

  • Clustering algorithms like k-means and hierarchical clustering help analyze player performance trends and strategies in baseball.
  • Practical applications of clustering reveal hidden patterns in player behavior, such as pitch selection tendencies and situational strategies.
  • A case study on Bregman and Bichette showcases how clustering can identify specific strengths and weaknesses, informing swing decisions and contact quality.
Ultimately, leveraging clustering techniques provides a powerful lens to enhance baseball analytics, revealing deeper insights into player performance and team strategy.


Unveiling Player Archetypes and Situational Mastery Through Advanced Analytics

Unsupervised machine learning methods such as clustering algorithms can analyze a broad array of hitting metrics, including Statcast data, and categorize players into distinct archetypes based on their batting profiles. This allows for a more nuanced understanding than a single summary metric like wRC+ provides, revealing how players with similar overall performance may possess fundamentally different skill sets. For instance, two players might post comparable wRC+ figures yet differ significantly in their approach to hitting: one excels in contact rate and line drives while the other relies on power and selectivity.

Moreover, examining player performance through the lens of specific game situations is crucial for a comprehensive evaluation. Overall statistics often mask important distinctions in how players perform under pressure or in high-leverage scenarios. Advanced statistical analyses of event-level data illuminate these nuances, providing insight into a player's true effectiveness during critical moments in games. This level of analysis not only enriches our understanding of individual player capabilities but also informs better strategic decisions by teams as they navigate various game contexts.




While traditional metrics like OPS and wOBA provide a useful overview of a player's performance throughout the season, they can be affected by variables that fall outside of the player's influence—such as luck related to batted balls, the dimensions of the ballpark, or the defensive capabilities of opposing teams. For this reason, I chose to omit these statistics from my analysis. Although some players have a tendency to either exceed or fall short of their predicted stats consistently, for most athletes, the impact of batted ball luck tends to balance out over several seasons.
Here is a glimpse of the data I retrieved. My initial plan was to calculate expected Home Run rates by taking the ratio of expected Home Runs (xHR) to plate appearances. However, since Statcast's xHR leaderboard only goes back to 2019, I decided to rely on actual Home Run rates instead. Although this approach is not ideal—given that factors like ballpark dimensions can significantly influence batted ball outcomes—I believe this simplified method will be adequate for my analysis.

After refining the dataset and excluding players with fewer than 200 plate appearances, I normalized the metrics so that each year's league average is set at 100. This approach mirrors the widely used "+" statistics like wRC+. For instance, in Figure 2 below, we can see that Michael Brantley's expected batting average (xBA) in 2017 was 13% higher than the league average (a normalized value of 113), while his Barrel Rate fell short by 39% (a normalized value of 61). Normalizing by season also accounts for year-to-year fluctuations in the run environment, which can stem from evolving hitting strategies, changes to the baseball itself, or rule changes.
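A minimal sketch of this normalization step follows. It is an illustration rather than the code from the analysis: the column names (`season`, `pa`, `xba`, `barrel_rate`) and the toy values are assumptions, and the league average here is a simple unweighted mean over qualified players, which may not match how the original figures were computed.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "season": [2017, 2017, 2017, 2018, 2018],
    "pa": [600, 450, 150, 500, 520],
    "xba": [0.280, 0.240, 0.300, 0.250, 0.270],
    "barrel_rate": [4.0, 9.0, 12.0, 6.0, 10.0],
})


def to_plus_stats(data, metrics, min_pa=200):
    """Drop low-PA players, then scale each metric so every season averages 100."""
    qualified = data[data["pa"] >= min_pa].copy()
    for m in metrics:
        # Per-season mean of the metric, aligned row by row with `qualified`.
        league_avg = qualified.groupby("season")[m].transform("mean")
        qualified[m + "_plus"] = 100 * qualified[m] / league_avg
    return qualified


plus = to_plus_stats(df, ["xba", "barrel_rate"])
print(plus[["player", "season", "xba_plus", "barrel_rate_plus"]].round(1))
```

A value above 100 reads as "better than that season's average by that many percent," which keeps seasons with different run environments comparable.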

Several of the metrics illustrated in Figure 2 exhibit collinearity. For example, Barrel Rate and xSLG are strongly positively correlated, with a correlation coefficient (r) of 0.816. Conversely, O-Zone Swing % and BB% show a strong inverse relationship, with r = -0.753.
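One way to check for this kind of collinearity is a pairwise Pearson correlation matrix. The sketch below uses synthetic data, so the column names and the strength of the relationships are assumptions made purely for illustration and will not reproduce the values quoted above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-ins for normalized metrics; the relationships are made up.
power = rng.normal(size=n)
patience = rng.normal(size=n)
metrics = pd.DataFrame({
    "barrel_rate": power + 0.4 * rng.normal(size=n),
    "xslg": power + 0.4 * rng.normal(size=n),
    "o_swing_pct": -patience + 0.5 * rng.normal(size=n),
    "bb_pct": patience + 0.5 * rng.normal(size=n),
})

# Pearson r ranges from -1 to 1; squaring it gives r^2, which is never negative.
corr = metrics.corr(method="pearson")
print(corr.round(2))
```

Pairs with a large absolute r carry overlapping information, which is the motivation for reducing the variables before clustering.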

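To reduce this collinearity, I applied principal component analysis (PCA) and looked for the point where the explained-variance curve begins to level off. This point is referred to as the "elbow," representing the number of principal components right before the curve flattens. As illustrated in Figure 4 below, the elbow occurs at three principal components, which account for approximately 87% of the total variance within the dataset.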

According to Figure 4, the first principal component (PC1) accounts for 48% of the variance in the data, the second (PC2) for 25%, and the third (PC3) for 14%. To gain deeper insight into these components, Figure 5 illustrates how each original variable contributes to them. Notably, the variables with the most extreme loadings within each principal component are highly correlated with one another, as shown in Figure 3. For instance, the key contributors to PC1 include xSLG, xwOBA, xwOBAcon, Barrel Rate, Hard Hit Rate, and Home Run Rate.
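The explained-variance breakdown and the loadings described above can be obtained along these lines. This is a sketch under stated assumptions: it uses scikit-learn's `PCA` on synthetic data built from three hypothetical latent skills, so the percentages it prints will differ from the 48%, 25%, and 14% reported here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_players, n_latent, n_metrics = 400, 3, 12

# Synthetic data: 12 observed metrics driven by 3 latent skills plus noise.
latent = rng.normal(size=(n_players, n_latent)) * np.array([3.0, 2.0, 1.5])
mixing = rng.normal(size=(n_latent, n_metrics))
X = latent @ mixing + 0.5 * rng.normal(size=(n_players, n_metrics))

# Standardize so that no metric dominates purely because of its scale.
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)
for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {e:.1%} (cumulative {c:.1%})")

# Rows of components_ show how each original metric contributes to a component.
loadings = pca.components_[:3]          # shape (3, 12)
scores = pca.transform(X_std)[:, :3]    # one row of three PC scores per player
```

Plotting `explained` against the component number gives the scree curve used to locate the elbow.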


The players on the left make the highest-quality contact, while those on the right make the weakest contact.

The players on the left make contact most often, while the players on the right make contact least often.

The players on the left exhibit the most aggressive swinging tendencies, while those on the right demonstrate a more measured approach.

Additionally, we can examine the year-over-year consistency of the three principal components. In Figure 9, each dot represents a player's principal component scores in Year t and Year t + 1. Notably, all three principal components show a strong correlation from one year to the next, suggesting that they have predictive value.
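A year-over-year check of this kind can be sketched by pairing each player-season with the same player's following season. The data below is synthetic and the column names (`player`, `season`, `pc1`) are assumptions; only one component is shown, and the same pattern would apply to the other two.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
players = [f"p{i}" for i in range(200)]
skill = rng.normal(size=len(players))  # hypothetical persistent skill level

# Two seasons of scores for one component: persistent skill plus season noise.
frames = []
for season in (2022, 2023):
    noise = 0.5 * rng.normal(size=len(players))
    frames.append(pd.DataFrame({"player": players, "season": season, "pc1": skill + noise}))
scores = pd.concat(frames, ignore_index=True)

# Shift the season back by one so that Year t+1 lines up with Year t on merge.
nxt = scores.assign(season=scores["season"] - 1).rename(columns={"pc1": "pc1_next"})
pairs = scores.merge(nxt, on=["player", "season"])

r = pairs["pc1"].corr(pairs["pc1_next"])
print(f"Year t vs. Year t+1 correlation: r = {r:.2f}")
```

Only players who appear in both seasons survive the merge, so the correlation is computed on matched pairs.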

In my analysis, I use the three principal components in place of the original twelve variables. When selecting a model, I initially explored k-means clustering for grouping batters. However, one significant limitation of k-means is that it performs what is known as "hard" clustering: each data point is assigned to exactly one cluster, with no indication of how close it sits to that cluster's center or to neighboring clusters.
One alternative approach is the Gaussian Mixture Model (GMM). Rather than a single label, a GMM assigns each data point a probability of membership in every cluster, based on how well each cluster's fitted Gaussian distribution explains the point (roughly, its proximity to the cluster centers). For instance, in a model with two clusters, a point that is very close to the center of Cluster 1 might have a probability of 0.99 for belonging to Cluster 1 and only 0.01 for Cluster 2. Conversely, a point situated near the dividing line between Cluster 1 and Cluster 2 could have probabilities of approximately 0.51 for Cluster 1 and 0.49 for Cluster 2. This method offers a more detailed picture of each player's skill set.
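The difference between a point near a cluster center and a point near the boundary can be seen in a small sketch. It uses scikit-learn's `GaussianMixture` on two synthetic groups in a three-dimensional space standing in for the PC scores; the data, the number of components, and the test points are all assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two synthetic, well-separated groups in a 3-D "PC score" space.
group_a = rng.normal(loc=[-2.0, 0.0, 0.0], scale=1.0, size=(150, 3))
group_b = rng.normal(loc=[2.0, 0.0, 0.0], scale=1.0, size=(150, 3))
X = np.vstack([group_a, group_b])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# One point near a cluster center, one roughly halfway between the two groups.
points = np.array([[-2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
probs = gmm.predict_proba(points)  # each row sums to 1 across the clusters
print(probs.round(3))
```

The first row should be dominated by a single cluster, while the second row splits its probability more evenly, which is the information a hard assignment discards.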
To identify the most suitable number of clusters, I compared the Silhouette Score of several Gaussian Mixture Models (GMMs) with varying numbers of clusters. The Silhouette Score measures how distinctly each cluster is separated from the others, with values ranging from -1 to 1; scores approaching 1 indicate dense, well-separated clusters.
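A selection loop of this kind might look like the sketch below. The synthetic three-group data, the candidate range of 2 to 7 clusters, and the `n_init` setting are assumptions. Note that the Silhouette Score works on hard labels, so each point is assigned to its most probable cluster before scoring.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)

# Synthetic data with three well-separated groups (an assumption for the demo).
centers = np.array([[-4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 3)) for c in centers])

results = {}
for k in range(2, 8):
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    labels = gmm.predict(X)  # most probable cluster for each point
    results[k] = silhouette_score(X, labels)

best_k = max(results, key=results.get)
for k, s in results.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

On real player data the peak is usually less sharp than in this toy example, so it helps to look at the full curve rather than only the maximum.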

It’s important to note that the clustering model was based solely on the three principal components. However, for clarity and interpretation, Figure 11 also incorporates average normalized Statcast metrics, normalized plate discipline metrics, and "outcome" metrics.

The notable differences in metrics among the various clusters are quite encouraging. Now, let's delve into their meanings using baseball terminology.

Cluster 1 - Walkers



The standout characteristic of players in this group is their remarkable patience at the plate. While they exhibit below-average metrics in both Contact Quality and Contact Frequency, their low levels of Swing Aggression enable them to earn walks at an impressive rate.

Cluster 2 - Free-Swinging Power Hitters



These players demonstrate consistent performance across the board, showcasing a commendable mix of both batting average and power. Nevertheless, their primary drawback lies in their tendency to chase pitches that are outside the strike zone, resulting in lower walk rates.

Cluster 3 - Patient Contact Hitters



These players are particularly skilled at making contact with the ball, although this comes at the cost of some power. They demonstrate low whiff rates on pitches both within and outside the strike zone, resulting in a high rate of successful hits and very few strikeouts. In addition to their exceptional ability to connect with the ball, these hitters exhibit great patience, as evidenced by their low chase rates and solid walk statistics.

Cluster 4 - "Three True Outcomes" Hitters



This group includes players characterized by their tendency to strike out frequently, draw a significant number of walks, and maintain low batting averages, yet showcase impressive power statistics. The significance of the "Three True Outcomes" has grown in recent times, driven by a heightened focus on metrics like exit velocity and launch angle. These players exemplify this modern approach to the game.

Cluster 5 - Chasers



Players within this category exhibit a lack of plate discipline, resulting in notably high strikeout rates and disappointingly low walk rates. Additionally, they frequently generate poor-quality contact, which ultimately contributes to subpar offensive performance.

Cluster 6 - Contact Seekers



Players within this category aim to make contact with the ball as frequently as possible. Although they often pursue pitches that are outside the strike zone, these players exhibit some of the lowest swinging strike rates, both on balls within and beyond the zone. This unique blend leads to notably low rates of both strikeouts and walks, though it generally results in weaker contact.

Cluster 7 - Superstars



The top hitters in baseball stand out for their exceptional mix of Contact Quality, Contact Frequency, and Swing Aggression.

It's important to keep in mind a couple of key points. First, players can shift from one archetype to another from season to season if their swing habits evolve or if there's a significant change in their contact metrics. Second, during a specific year, a player may exhibit characteristics that reflect multiple archetypes. To illustrate these trends, let’s examine Giancarlo Stanton's performance over the years.

Stanton has consistently demonstrated elite contact quality, while his swing aggression has remained relatively stable from season to season. His contact frequency, however, has varied more, and it is the primary factor driving the significant year-over-year changes in his cluster probabilities.

In contrast, Luis Arraez has showcased a similar batting style throughout his career. He has earned a reputation as a high-contact, low-power hitter with an impressive command of the strike zone, leading to his classification as a "Patient Contact Hitter" year after year. However, in 2023, Arraez began to alter this approach by swinging at more pitches outside the strike zone, as shown by an increase in his Swing Aggression score. While he retained his exceptional contact skills, his overall profile shifted to reflect this newfound aggressiveness at the plate.

Let’s revisit the case of 2023, focusing on Alex Bregman and Bo Bichette.

Considering their respective principal component scores and various other metrics mentioned earlier, what archetypes do you believe Bregman and Bichette each represent? Spoilers are included below!

Dissecting Contact Quality and Swing Decisions: Uncovering the Nuances of Bregman and Bichette's Batting Performance

1. Delving Deeper into Contact Quality: It is important to consider the distribution of batted ball events when evaluating Bregman's contact quality, which was slightly below league average. Analyzing whether his subpar hits are primarily weak grounders or whether he consistently makes hard contact but struggles with direction can reveal specific areas for improvement and highlight particular weaknesses.

2. Exploring the Role of Swing Decisions: The app's emphasis on player archetypes and PC scores underscores the significance of swing decisions in batting performance. Investigating how swing decisions (such as chase rate and walk rate) correlate with various pitching patterns and game situations could uncover whether Bregman's low chase tendency and Bichette's high chase rate serve as effective strategies against certain pitch types or in different game contexts. A thorough analysis of these choices would provide valuable insights for player development and team strategy.

You can access all the code and associated files on my GitHub repository.


D.P.
