Can We Use Chess to Predict Soccer?


There is no way around it. Every soccer fan has had countless, passionate discussions about which teams were going to win the upcoming matches. To back their guesses, most fans ramble about the players, the coaches, and a myriad of factors from the time of the year to the quality of the field. A few others look at historical stats, mentioning the performance of each team over the last few rounds or how those two teams performed in the last times they played each other. Regardless of the argument, though, every fan is trying to gauge the very same thing: which team has the greatest strength?

Official rankings are an attempt to measure and classify teams according to their “quality”, but they have a series of flaws that soccer fans are equally familiar with, and they cannot be relied on alone. In this article, we explore an alternative way of evaluating the quality of teams, taking inspiration from the rating system long used in chess and, over time, adapted to other sports: Elo ratings. Apart from implementing a system from scratch, we also show that Elo ratings are superior to traditional rankings in predicting the outcome of a match.

Theoretical Foundations

The core idea and assumptions

Contrary to common belief, Elo is not an acronym but a name. The Elo system was created in 1967 by Arpad Elo to evaluate the performance of chess players (Elo, 1967). According to Elo, the system is based on one simple idea: it is possible to build a rating scale where the many performance measurements of an individual player will be normally distributed.

In other words, if we observe a single player over multiple games, their performance is likely to fluctuate from one match to the next, but these fluctuations should revolve around a mean value, which is the player’s true level of skill. Following that reasoning, if the performance of two players can be described by two Normal distributions, the chance of player A beating player B is equal to the probability of one random sample from A’s Normal being greater than one random sample from B’s.

At its core, Elo created a system of relative scores in which we use the difference between the ratings of two players (which are, in theory, a reflection of their true skill) to estimate how likely each of them is to win. Another interesting aspect of Elo ratings is that, when determining a player’s level of skill, the system also takes into account the fact that not all victories or losses are equally meaningful. Just think about the fact that, if you heard the news that Manchester City (first division) won a match against Bromley (fourth division), you wouldn’t be surprised. However, if the result were the other way around, not only would you be shocked, but you would also rethink your assessment of how strong both teams are. This dynamic is built into the mechanics of Elo’s system, and unexpected results affect the ratings of the teams involved much more than obvious outcomes.

The mathematical implementation

To implement such a system, we need a way of estimating how likely each team is to win and a way of updating our assessment of their strengths. This is why Elo devised two essential components that work hand in hand: the prediction and updating functions.

For a moment, assume we are in the middle of a soccer season and somehow have a list of all teams and their Elo ratings. The rating is simply a number that measures the quality of a team, and by comparing different ratings, we can infer which team is best. A new match is about to happen, and prior to its beginning, we want to estimate each team’s probability of winning. To do so, we use the prediction function of the Elo system, which is given by the formula

\[ E_H = \frac{1}{1+c^{(R_A - R_H)/d}} \]

Here, E_H is the expected outcome of the home team, a number between 0 and 1 that represents the probability of a home win. The ratings of each team prior to the match are given by R_H and R_A for the home and away clubs, respectively. Last, c and d are free parameters that could take any value but are conventionally set to 10 and 400, as described in Wunderlich & Memmert (2018). You don’t necessarily need to know this, but by setting these values, we imply that a 400-point difference corresponds to a 10x odds ratio between the teams, meaning that the stronger club is expected to win 10 times for every loss. 
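
To make the formula concrete, here is a quick numerical sketch with hypothetical ratings, showing the 10-to-1 expectation implied by a 400-point gap:

c, d = 10, 400
rating_home, rating_away = 1500, 1100  # a hypothetical 400-point gap in favor of the home team

expected_home = 1 / (1 + c ** ((rating_away - rating_home) / d))
print(round(expected_home, 3))  # 0.909 -> roughly 10 expected wins for every loss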

In an ideal universe where draws cannot happen, such as a World Cup final, we could also calculate the expected outcome for the away team easily: E_A = 1 - E_H. In practice, though, this is often not the case, and we will soon explain how to account for draws. But before we do so, let’s finish understanding the original system. Back to Manchester City vs. Bromley we go.

A few days after you predict a winner using their Elo ratings, the game actually happens, one of the teams wins, and we have just acquired new information about how each team is performing and what their current strength is. It’s time to update their ratings so that our system reflects reality as closely as possible. To do so, we use the updating function, which is traditionally defined as

\[ R'_H = R_H + K(S_H - E_H) \]

Here, R’_H is the home team’s new rating, R_H is its rating prior to the match, K is a scaling factor that determines how much influence a result can have in the ratings of a team, S_H is the outcome of the match (1 for victory, 0.5 for draw, and 0 for loss), and E_H is the expected outcome, or the probability that the home team would win, according to the prediction step you inferred before. The formulas for the away team are the same, only needing to swap the subscripts from “H” to “A” and vice versa. In practice, you would use this formula to recalculate the ratings of Manchester City and Bromley, which would then inform your estimations in future matches that these teams play in.

Out of all the parameters in the equations we’ve shown, K is the most important. According to Elo’s original publication, higher values of K give more weight to recent performances, while lower values allow past performances to have a greater influence on a team’s rating. Think of a team that has lost all of its recent matches: it is likely to have a lower rating than everyone else. When that team starts winning again, the greater the value of K in our formula, the faster its rating climbs back up.

One aspect to note is that, in the original article, the value of K depends on how many matches a player has on record. When the rating of a new player was calculated, Elo used a high K that allowed his ratings to change significantly. Over time, this value would decrease slightly until reaching a plateau. In practice, however, hardly anyone modifies the value of K as Elo first suggested, and a widespread default is setting K = 32.
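
To see how the update behaves, here is a small sketch assuming the default K = 32 and the hypothetical favorite from the example above (E_H ≈ 0.909). Note how an upset moves the rating roughly ten times more than an expected win:

K = 32
rating_home, expected_home = 1500, 0.909

# Upset: the favorite loses (S_H = 0) and drops by about 29 points
print(round(rating_home + K * (0 - expected_home), 1))  # 1470.9

# Expected result: the favorite wins (S_H = 1) and gains only about 3 points
print(round(rating_home + K * (1 - expected_home), 1))  # 1502.9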


The problems of applying Elo to soccer

Despite its popularity, the original implementation of the system had significant shortcomings when applied to soccer. First, having been created for a two-player, zero-sum game, it does not directly account for the possibility of a draw. Or, to be more specific, we cannot directly infer the probability of a draw from the prediction step, even though historical data shows that this outcome happens about 26% of the time. Second, Elo works solely with the results of previous matches, meaning it does not incorporate any source of information other than the final outcome, even though other variables could be useful (Hvattum & Arntzen, 2010). Third, the original system, designed for chess, did not consider which player had the white pieces, even though white has a natural edge over black due to the first-move advantage. In soccer, the equivalent is home advantage: every fan knows that a team playing at home has an edge over the visiting side.

Many attempts to solve these problems have been proposed, some of which have become widespread. To derive draw probabilities from the ratings, for example, different approaches have been tested over time, from simple re-normalization techniques using historical draw frequencies (Betfair, 2022) to multinomial logistic regressions (Wunderlich & Memmert, 2018) and formal extensions of the original model (Szczecinski & Djebbi, 2020). There have also been multiple approaches to factoring the home team’s advantage into the model, such as the inclusion of a new parameter in the prediction step. Another interesting modification was the inclusion of information beyond the outcome of the match when recalculating the ratings, such as the goal difference between the teams. To factor that in, some authors added a brand-new term to the update function (Stankovic, 2023), while others simply modified the K parameter (eloratings.net, n.d.; Wunderlich & Memmert, 2018). One solution worth mentioning comes from Hvattum and Arntzen (2010), who proposed

\[ k = k_0(1+\delta)^{\lambda} \]

with δ being the absolute goal difference, and k_0 and λ being fixed parameters greater than zero.
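
As a quick illustration of how this adaptive factor behaves (the values of k_0 and λ below are illustrative, not the ones used in the original paper):

def adaptive_k(k0, goal_diff, lam):
  # k grows with the absolute goal difference, so lopsided results move ratings more
  return k0 * (1 + goal_diff) ** lam

print(adaptive_k(10, 0, 1))  # 10 -> a draw keeps the base factor
print(adaptive_k(10, 3, 1))  # 40 -> a three-goal margin quadruples it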

Last, the reader might ask how long the ratings take to reflect a team’s performance accurately. In the original article, Elo mentions that good statistical practice would require at least 30 games to determine a player’s rating with some confidence. This is in line with famous implementations of the system for soccer: eloratings.net, for example, states that ratings tend to converge to a team’s true strength after around 30 matches. Other approaches tend to be more systematic, especially when more data is available. For example, Wunderlich and Memmert (2018) leave the first two seasons to calibrate the Elo ratings for each team. Then, three additional seasons are used to gather data and create an ordered logit model that gives probabilities for home/draw/away. Last, for the final five seasons in their study, the logit provides the probabilities that make the forecast for each match. We took inspiration from this approach to implement our own.

System implementation

Our assumptions

Our implementation of the Elo system is guided by Wunderlich and Memmert (2018) and Hvattum and Arntzen (2010). First, our prediction function is given by

\[ E_H = \frac{1}{1+c^{(R_A - R_H - \omega)/d}} \]

where c = 10, d = 400, and ω is a home-advantage factor set to 100. From this formula, we can also infer that

\[ E_A = 1 - E_H \]

thus completing the Elo prediction process, even though this is not how we convert ratings into probabilities. The actual probability calculation is performed through a logistic regression, and we use the formulas for E_H and E_A only to derive the variables that are required by the update function. In turn, the update function is given by

\[ R'_H = R_H + k_0(1+\delta)(S_H - E_H) \]

where the standard K factor was replaced by an adaptive scaling factor that takes into account the absolute goal difference in a match (represented by δ). Here, k_0 = 10, and the effective scaling factor increases with the goal difference. The formula for updating the away team’s rating is the same, only swapping the subscripts from “H” to “A”.

In our implementation, ratings are season-agnostic, meaning that a team’s rating at the end of a season carries over to the beginning of the next. This naturally creates a problem, given that new teams we do not have ratings for are promoted every season. To tackle that challenge, we decided that each team in the first division in the very first season of the dataset starts with a rating of 1000 points and that, at the end of each season, every newly promoted team takes over the rating of a demoted team. This mechanism is a more plausible representation of reality than the alternative of assigning brand-new ratings of 1000 points to the promoted teams: at least in the beginning, we expect the teams coming up from a lower division to perform worse than the teams that remained in the top division. Last, we incorporate a multinomial logistic regression that uses the rating difference as its only independent variable to predict which outcome is most likely in every match and, thus, which team will likely win.

The dataset

The dataset we used is originally from https://www.football-data.co.uk/, which gave us permission to use the data for this article, and contains information about all games from the Brazilian Soccer Championship (Brasileirão) between 2012 and 2024.  

Screenshot of the dataset. [Image by author]

The first three seasons of the dataset (2012–2014) are used solely for Elo ratings calibration. The following four seasons (2015–2018) are used for calibrating the logistic function that outputs the probability of each result in a match: apart from continuously updating the Elo ratings after each game, we also create a second dataset with the rating difference between the teams involved and the match’s outcome. This dataset is later used to fit a multinomial logistic regression capable of predicting match outcomes based on rating differences. Last, the final six seasons (2019–2024) are reserved for backtesting the system. Ratings are still updated after every match, and the logistic function is calibrated between seasons with all the data collected up to that point. At every game, based on the rating difference between the two teams involved, we want to predict the most likely outcome according to the logistic regression and observe the results after.
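
In code, this split boils down to three lists of seasons that the following sections reuse (the names mirror the ones used later; backtest_seasons is simply our label for the evaluation years):

calibration_seasons = [2012, 2013, 2014]                 # Elo ratings calibration
logit_seasons = [2015, 2016, 2017, 2018]                 # logistic regression calibration
backtest_seasons = [2019, 2020, 2021, 2022, 2023, 2024]  # backtesting / evaluation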

Code

Step 1: Initial ratings calibration

Once the system is clearly defined, it’s time to dive into the code! We start by implementing the core of every Elo system: the predict and update functions. (For reference, you can see the full implementation here. I have used AI to document the code so that you can follow along more easily.)

def elo_predict(c, d, omega, teams, ratings_dict):
  '''
  Calculates predicted Elo outcome (E_H and E_A)

  Inputs:
    c, d, omega: int
      Free variables for the formula
    teams: list
      Name of both teams in the match
    ratings_dict: dict
      Dictionary with the teams as keys and their Elo score as value

  Outputs:
    expected_home, expected_away: float
      The expected Elo outcome (E_H and E_A) for each team
    rating_difference: float
      The difference in ratings between both teams (used to inform the logistic regression)

  '''
  rating_home = ratings_dict[teams[0]]
  rating_away = ratings_dict[teams[1]]
  rating_difference = rating_home - rating_away

  exponent = (rating_away - rating_home - omega)/d

  expected_home = 1/(1 + c**exponent) # This is E_H in the formula
  expected_away = 1 - expected_home

  return expected_home, expected_away, rating_difference

def elo_update(k0, expected_home, expected_away, teams, goals, outcomes, ratings_dict):
  '''
  Updates Elo ratings for two teams based on the match outcome.

  Inputs:
    k0: int or float
      Base scaling factor used for the rating update
    expected_home, expected_away: float
      The expected outcomes for the home and away teams (E_H and E_A)
    teams: list
      Name of both teams in the match (home team first, away team second)
    goals: list
      Number of goals scored by each team ([home_goals, away_goals])
    outcomes: list
      Actual match outcomes for both teams ([home_outcome, away_outcome])
      Typically 1 for a win, 0.5 for a draw, and 0 for a loss
    ratings_dict: dict
      Dictionary with the teams as keys and their current Elo ratings as values

  Outputs:
    ratings_dict: dict
      Updated dictionary with new Elo ratings for the two teams involved in the match
  '''
  # Unpacks variables
  home = teams[0]
  away = teams[1]
  rating_home = ratings_dict[home]
  rating_away = ratings_dict[away]
  outcome_home = outcomes[0]
  outcome_away = outcomes[1]
  goal_diff = abs(goals[0] - goals[1])

  ratings_dict[home] = rating_home + k0*(1+goal_diff) * (outcome_home - expected_home)
  ratings_dict[away] = rating_away + k0*(1+goal_diff) * (outcome_away - expected_away)

  return ratings_dict

We also create a quick function to convert the real outcome of a match (win, draw, or loss) to the format required by Elo’s formulas (1, 0.5, or 0):

def determine_elo_outcome(row):
  '''
  Determines outcome of a match (S_H or S_A in the formula) according to Elo's standards:
  0 for loss, 0.5 for draw, 1 for victory
  '''
  if row['Res'] == 'H':
    return [1, 0]
  elif row['Res'] == 'D':
    return [0.5, 0.5]
  else:
    return [0, 1]

Another building block we need is a function to perform the process of assigning new ratings to the teams that are promoted at the beginning of every season.

def adjust_teams_interseason(ratings_dict, elo_calibration_df):
  '''
  Implements the process in which promoted teams take the Elo ratings
  of demoted teams in between seasons
  '''
  # Lists all teams in previous and upcoming seasons
  old_season_teams = set(ratings_dict.keys())
  new_season_teams = set(elo_calibration_df['Home'].unique())

  # If any teams were demoted/promoted
  if len(old_season_teams - new_season_teams) != 0:
    demoted_teams = list(old_season_teams - new_season_teams)
    promoted_teams = list(new_season_teams - old_season_teams)

    # Each promoted team takes over the rating of one demoted team
    for promoted, demoted in zip(promoted_teams, demoted_teams):
      ratings_dict[promoted] = ratings_dict.pop(demoted)

  return ratings_dict

def create_elo_dict(df):
  # Creates very first dictionary with initial rating of 1000 for all teams
  teams = df[df['Season'] == 2012]['Home'].unique()
  ratings_dict = {}

  for team in teams:
      ratings_dict[team] = 1000

  return ratings_dict

Finally, all of these pieces come together in a function that performs the first major process we want: running the initial calibration of ratings in the seasons 2012–2014.

def run_elo_calibration(df, calibration_seasons, c=10, d=400, omega=100, k0=10):
  '''
  This function iteratively adjusts team ratings based on match results over multiple seasons.

  Inputs:
    df: pandas.DataFrame
      Dataset containing match data, including columns for season, teams, goals etc.
    calibration_seasons: list
      List of seasons (or years) to be used for the calibration process
    c, d: int or float, optional (default: 10 and 400)
      Free variables for the Elo prediction formula
    omega: int or float (default=100)
      Free variable representing the advantage of the home team
    k0: int or float, optional (default=10)
      Scaling factor used to determine the influence of recent matches on team ratings

  Outputs:
    ratings_dict: dict
      Dictionary with the final Elo ratings for all teams after calibration
  '''
  # Initialize Elo ratings for all teams
  ratings_dict = create_elo_dict(df)

  # Loop through the specified calibration seasons
  for season in calibration_seasons:
    # Filter data for the current season
    season_df = df[df['Season'] == season]

    # Adjust team ratings for inter-season changes
    ratings_dict = adjust_teams_interseason(ratings_dict, season_df)

    # Iterate over each match in the current season
    for index, row in season_df.iterrows():
      # Extract team names and match information
      teams = [row['Home'], row['Away']]
      goals = [row['HG'], row['AG']]

      # Determine the actual match outcomes in Elo terms
      elo_outcomes = determine_elo_outcome(row)

      # Calculate expected outcomes using the Elo prediction formula
      expected_home, expected_away, _ = elo_predict(c, d, omega, teams, ratings_dict)

      # Update the Elo ratings based on the match results
      ratings_dict = elo_update(k0, expected_home, expected_away, teams, goals, elo_outcomes, ratings_dict)

  # Return the calibrated Elo ratings
  return ratings_dict

# Calling the function
calibration_seasons = [2012, 2013, 2014]
ratings_dict = run_elo_calibration(df, calibration_seasons)

After running this function, we will have a dictionary containing each team and its associated Elo rating.


Step 2: Calibrating the logistic regression

In the seasons 2015–2018, we perform two processes at once. First, we keep updating the Elo ratings of all teams at the end of every match, just like before. Second, we start collecting additional data from each match to train a logistic regression at the end of this period. This logistic regression will later be used to generate a probability for each outcome. In code, this translates into the following:

import pandas as pd

def run_logit_calibration(df, logit_seasons, ratings_dict, c=10, d=400, omega=100, k0=10):
  '''
  Runs the logistic regression calibration process for Elo ratings.

  This function calibrates Elo ratings over multiple seasons while collecting data
  (rating differences and outcomes) to prepare for training a logistic regression.
  The logistic regression is later used to make outcome predictions based on rating differences.

  Inputs:
    df: pandas.DataFrame
      Dataset containing match data, including columns for 'Season', 'Home', 'Away', 'HG', 'AG', 'Res', etc.
    logit_seasons: list
      List of seasons (or years) to be used for the logistic regression calibration process
    ratings_dict: dict
      Initial Elo ratings dictionary with teams as keys and their ratings as values
    c, d: int or float, optional (default: 10 and 400)
      Free variables for the Elo prediction formula
    omega: int or float (default=100)
      Free variable representing the advantage of the home team
    k0: int or float, optional (default=10)
      Scaling factor used to determine the influence of recent matches on team ratings

  Outputs:
    ratings_dict: dict
      Updated Elo ratings dictionary after calibration
    logit_df: pandas.DataFrame
      DataFrame containing columns 'rating_diff' (Elo rating difference between teams)
      and 'outcome' (match results) for logistic regression analysis
  '''
  # Ratings continue from the values produced during the initial calibration

  # Initializes an empty DataFrame to store rating differences and outcomes
  logit_df = pd.DataFrame(columns=['season', 'rating_diff', 'outcome'])

  # Loops through the specified seasons for logistic calibration
  for season in logit_seasons:
    # Filters data for the current season
    season_df = df[df['Season'] == season]

    # Adjusts team ratings for inter-season changes
    ratings_dict = adjust_teams_interseason(ratings_dict, season_df)

    # Iterates over each match in the current season
    for index, row in season_df.iterrows():
      # Extracts team names and match information
      teams = [row['Home'], row['Away']]
      goals = [row['HG'], row['AG']]

      # Determines the match outcomes in Elo terms
      elo_outcomes = determine_elo_outcome(row)

      # Calculates expected outcomes and rating difference using the Elo prediction formula
      expected_home, expected_away, rating_difference = elo_predict(c, d, omega, teams, ratings_dict)

      # Updates Elo ratings based on the match results
      ratings_dict = elo_update(k0, expected_home, expected_away, teams, goals, elo_outcomes, ratings_dict)

      # Adds the rating difference and match outcome to the logit DataFrame
      logit_df.loc[len(logit_df)] = {'season': season, 'rating_diff': rating_difference, 'outcome': row['Res']}

  # Returns the updated ratings and the logistic regression dataset
  return ratings_dict, logit_df

# Calling the function
logit_seasons = [2015, 2016, 2017, 2018]
ratings_dict, logit_df = run_logit_calibration(df, logit_seasons, ratings_dict, c=10, d=400, omega=100, k0=10)

Now, not only do we have an updated dictionary with Elo ratings like before, but we also have an additional dataset with rating differences (our independent variable) and match outcomes (our dependent variable). With this data, we create a function to fit a logistic regression, adapting some code provided by Machine Learning Mastery.

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, recall_score, log_loss,
                             balanced_accuracy_score, ConfusionMatrixDisplay)

def fit_logistic_regression(logit_df, max_past_seasons=15, report=True):
  '''
  Fits a multinomial logistic regression that maps Elo rating differences
  to match outcome probabilities (home win, draw, or away win).
  '''

  # Prunes the dataframe, if needed
  most_recent_seasons = sorted(logit_df['season'].unique(), reverse=True)[:max_past_seasons]
  filtered_df = logit_df[logit_df['season'].isin(most_recent_seasons)].copy()

  # Adjust outcome columns from str to int
  label_encoder = LabelEncoder()
  filtered_df['outcome_encoded'] = label_encoder.fit_transform(filtered_df['outcome'])

  # Isolates independent and dependent variables
  X = filtered_df[['rating_diff']].values
  y = filtered_df['outcome_encoded'].values
  # Holds out a split used only for the performance report below
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

  # Defines the multinomial logistic regression model
  model = LogisticRegression(solver='lbfgs')

  # Fits the model on the whole dataset; the metrics reported on the test
  # split are therefore only indicative, not a strict out-of-sample estimate
  model.fit(X, y)

  # report the model performance
  if report:
    # Generate predictions on the test data
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)

    # Compute key metrics
    cm = confusion_matrix(y_test, y_pred)
    recall = recall_score(y_test, y_pred, average='weighted')
    loss = log_loss(y_test, y_prob)
    balanced_acc = balanced_accuracy_score(y_test, y_pred)

    print(f'Recall (weighted): {recall}')
    print(f'Balanced accuracy: {balanced_acc}')
    print(f'Log loss: {loss}')
    print()

    # Display the confusion matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_encoder.classes_)
    disp.plot(cmap="Blues")

  return model

Step 3: Running the system

For the 2019–2024 seasons, we run the system to evaluate its performance. At the beginning of every season, we re-train the logistic regression with the latest data available. At the end of every match, we log whether our prediction was correct or not. 

def run_elo_predictions(df, logit_df, seasons, ratings_dict, plot_title,
                        c=10, d=400, omega=100, k0=10, max_past_seasons=15,
                        report_ml=False):
    '''
    Runs an Elo + logistic regression pipeline to predict match outcomes.

    This function processes matches across multiple seasons, using Elo ratings
    to estimate team strength and logistic regression to predict match outcomes.
    It logs predictions and actual outcomes for performance evaluation.

    Inputs:
      df: pandas.DataFrame
        Dataset with match data: 'Season', 'Home', 'Away', 'HG', 'AG', 'Res', etc.
      logit_df: pandas.DataFrame
        Historical data with Elo differences and match outcomes to train the model.
      seasons: list
        Seasons (or years) to include in the evaluation loop.
      ratings_dict: dict
        Current Elo ratings for all teams.
      c, d: Elo parameters
      omega: Home advantage parameter
      k0: Elo update factor
      max_past_seasons: int
        How many seasons back to include when training logistic regression
      report_ml: bool
        Whether to print model performance each season

    Outputs:
      num_predictions (int): Total number of predictions made
      num_correct (int): Number of predictions that matched the actual outcome
    '''
    # Ratings and training data continue from the earlier calibration steps

    prediction_log = pd.DataFrame(columns=['Season', 'Prediction', 'Actual', 'Correct'])

    for season in seasons:
        if season == seasons[-1]:
            print('\nLogistic regression performance at FINAL SEASON')
            logistic_regression = fit_logistic_regression(logit_df, max_past_seasons, report=True)
        else:
            if report_ml:
                print(f'Logistic regression performance PRE SEASON {season}')
            logistic_regression = fit_logistic_regression(logit_df, max_past_seasons, report=report_ml)

        season_df = df[df['Season'] == season]
        ratings_dict = adjust_teams_interseason(ratings_dict, season_df)

        for index, row in season_df.iterrows():
            teams = [row['Home'], row['Away']]
            goals = [row['HG'], row['AG']]
            elo_outcomes = determine_elo_outcome(row)

            expected_home, expected_away, rating_difference = elo_predict(c, d, omega, teams, ratings_dict)
            yhat = logistic_regression.predict([[rating_difference]])[0]

            prediction = 'A' if yhat == 0 else 'D' if yhat == 1 else 'H'
            actual = row['Res']
            correct = int(prediction == actual)

            prediction_log.loc[len(prediction_log)] = {
                'Season': season,
                'Prediction': prediction,
                'Actual': actual,
                'Correct': correct
            }

            # Update Elo ratings and training data
            ratings_dict = elo_update(k0, expected_home, expected_away, teams, goals, elo_outcomes, ratings_dict)
            logit_df.loc[len(logit_df)] = {'season': season, 'rating_diff': rating_difference, 'outcome': actual}

    # Totals the predictions; the Bayesian analysis itself happens outside this function
    num_predictions = len(prediction_log)
    num_correct = prediction_log['Correct'].sum()

    return num_predictions, num_correct

Now, for every one of the final six seasons, we logged how many correct guesses we had. With this information, we can evaluate the accuracy of the system using Bayesian parameter estimation. 

Evaluating results

If we consider that, at every match, we make a guess about the outcome that can either be right or wrong, the entire process can be described by a Binomial distribution with probability p, where p is the probability that any given guess of ours is correct (or, in other words, our skill in making guesses). We place a Uniform(0, 1) prior on p, which means we have no particular belief about its value before running the model. With the data from the backtested seasons, we use PyMC to estimate the posterior of p, reporting its mean and a 95% credible interval. For reference, the PyMC code is defined as follows.

import arviz as az
import pymc as pm

def fit_pymc(samples, success):
  '''
  Creates a PyMC model to estimate the accuracy of guesses
  made with Elo ratings over a given period of time.
  '''
  with pm.Model() as model:
    p = pm.Uniform('p', lower=0, upper=1) # Prior
    x = pm.Binomial('x', n=samples, p=p, observed=success) # Likelihood

  with model:
    inference = pm.sample(progressbar=False, chains = 4, draws = 2000)

  # Stores key variables
  mean = az.summary(inference, hdi_prob = 0.95)['mean'].values[0]
  lower = az.summary(inference, hdi_prob = 0.95)['hdi_2.5%'].values[0]
  upper = az.summary(inference, hdi_prob = 0.95)['hdi_97.5%'].values[0]

  return mean, [lower, upper]
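
For completeness, this is roughly how the last two pieces fit together: we run the backtest over the six evaluation seasons and feed the pooled totals into the PyMC model (the variable names mirror the earlier snippets, and plot_title is just a label):

# Calling the functions
backtest_seasons = [2019, 2020, 2021, 2022, 2023, 2024]
num_predictions, num_correct = run_elo_predictions(df, logit_df, backtest_seasons,
                                                   ratings_dict, plot_title='Elo backtest')

mean_p, interval_p = fit_pymc(num_predictions, num_correct)
print(f'Estimated accuracy: {mean_p} (95% credible interval: {interval_p})')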

The results are displayed below. In every season, out of 380 total matches, we correctly guessed the outcome of roughly half of them. The credible intervals for the value of p, which represents the predictive power of our system, varied slightly from season to season. However, after the six seasons, there is a 95% probability that the true value of p is between 0.46 and 0.50.

Results from the Elo system we created. Note that, in order to estimate the value of p after all seasons, we pooled the data and ran the PyMC model one last time. This implicitly means that we believe this to be a complete pooling situation. If you’re not familiar with how pooling works, don’t worry. The bottom line is that, by adding the data from all seasons together, we are assuming our system’s predictive capacity does not change over the seasons.

Considering that, in soccer, there are three possible outcomes, the fact that we guessed the correct result roughly half of the time is great news. This means we are not guessing randomly, for example, given that random guesses would result in only around 33% of predictions turning out to be correct. 

However, a more important question arises. Are Elo ratings better at predicting outcomes than traditional rankings?

To answer that question, we also implemented a system that replicates the official leaderboard and always picks the better-ranked team as the winner of each match. We then ran a similar PyMC model to estimate the sharpness (the p parameter of the Binomial) of this alternative method. Once we had both posterior distributions, we drew random samples from them and compared their values to perform a hypothesis test.
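
For reference, here is a minimal sketch of that comparison. Because a Uniform(0, 1) prior with a Binomial likelihood yields a Beta posterior, we can sample both posteriors directly; the function and its arguments are illustrative stand-ins for the counts produced by each system, not the exact code we ran:

import numpy as np

def compare_systems(n_elo, correct_elo, n_rank, correct_rank, draws=10_000, seed=42):
  '''
  Draws samples from the posterior accuracy of both systems and returns the
  share of draws in which the Elo-based system beats the leaderboard-based one.
  '''
  rng = np.random.default_rng(seed)
  p_elo = rng.beta(1 + correct_elo, 1 + (n_elo - correct_elo), size=draws)
  p_rank = rng.beta(1 + correct_rank, 1 + (n_rank - correct_rank), size=draws)
  return (p_elo > p_rank).mean()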

Each bar represents the 95% credible interval of the posterior mean for p in each system. The green color only symbolizes that, indeed, the difference is statistically significant. [Image by author]

The figure above shows the 95% credible intervals, estimating how well each method can predict results. What we see is that using Elo ratings to predict the winner of a match is indeed better than using the traditional leaderboard. From an accuracy point of view, the difference between the two methods is statistically significant (p-value < 0.05), which is quite an achievement.

Conclusion

Although Elo ratings are not enough to guess the winner of a match correctly every time, they certainly perform better than traditional rankings. More importantly, they show that less conventional variables can be useful in measuring the quality of teams, and that soccer fans might benefit from alternative sources of information when evaluating the potential outcomes of matches they’re interested in.

References

A. Elo, The proposed USCF rating system: Its development, theory, and application (1967), Chess Life, 22(8), 242–247.

Betfair, Using an Elo approach to model soccer in R (2022), Betfair Data Scientists.

Eloratings.net, World football Elo ratings (n.d.), Eloratings.net.

F. Wunderlich & D. Memmert, The betting odds rating system: Using soccer forecasts to forecast soccer (2018), PLOS ONE, 13(6).

F. Wunderlich, M. Weigelt, R. Rein & D. Memmert, How does spectator presence affect football? Home advantage remains in European top-class football matches played without spectators during the COVID-19 pandemic (2021), PLOS ONE, 16(3).

L. M. Hvattum & H. Arntzen, Using ELO ratings for match result prediction in association football (2010), International Journal of Forecasting, 26(3), 460–470.

L. Szczecinski & A. Djebbi, Understanding draws in Elo rating algorithm (2020), Journal of Quantitative Analysis in Sports, 16(3), 211–220.

S. Stankovic, Elo rating system (2023), Medium.

Additional notes

A deeper dive into how the model performs

The system we built is not without faults. In order to improve it, we need to understand where it falls short. One of the first aspects we can look into is the regression’s performance. The confusion matrix below shows how the regression guessed outcomes in the final season we evaluated, 2024.

Image by author

There are three aspects we can notice immediately:

  1. The regression is overconfident about home victories, predicting this to be the right outcome 84% of the time when, in fact, this outcome only corresponds to 48% of our data.
  2. The regression is underconfident about away victories, guessing this outcome only 15% of the time when, in reality, it happened in 26% of matches.
  3. Surprisingly, the regression never predicts draws to be the most likely outcome.

The confusion matrix also allows us to explore another metric worth tracking: weighted recall. In essence, recall measures how many actual instances of a category (home victory, draw, or away victory) were guessed correctly, and the weighted version averages the per-category recalls according to how common each category is in the dataset. Of all actual home victories, draws, and away victories, the share guessed correctly was 90%, 0%, and 45%, respectively. Once we account for the fact that the categories are not equally frequent, with home victories, for example, being nearly twice as common as away victories, the weighted recall comes out at 50%. In other words, averaged across matches, the model identifies the true outcome only about half of the time. There is no question that such a performance is suboptimal; rather than capturing the underlying behavior, the regression guesses home victories most of the time because it has learned that this is the most common outcome.

To try to fix this problem, we attempted hyperparameter tuning through a grid search over three key parameters from our functions: the number of past seasons included in the dataset each time the regression is trained; k_0, which influences how much a new result affects the ratings of the teams involved; and ω, which represents the magnitude of the home advantage. For each combination of parameters, we measure the win ratio, which is an in-sample version of accuracy: the proportion of correct guesses made by the regression. The results of this process, however, are underwhelming.
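
Before looking at the results, here is roughly how such a grid search can be wired together with the functions defined earlier; the parameter grids below are illustrative rather than the exact values we tested:

from itertools import product

param_grid = {
    'max_past_seasons': [1, 5, 9],
    'k0': [10, 20, 32],
    'omega': [50, 100, 150],
}

results = []
for max_past, k0, omega in product(*param_grid.values()):
  # Re-runs the whole pipeline (calibration, logit training, backtest) with this combination
  ratings = run_elo_calibration(df, [2012, 2013, 2014], omega=omega, k0=k0)
  ratings, logit_data = run_logit_calibration(df, [2015, 2016, 2017, 2018], ratings,
                                              omega=omega, k0=k0)
  n_preds, n_correct = run_elo_predictions(df, logit_data, [2019, 2020, 2021, 2022, 2023, 2024],
                                           ratings, plot_title='grid search',
                                           omega=omega, k0=k0, max_past_seasons=max_past)

  # Win ratio: the in-sample proportion of correct guesses for this combination
  results.append({'max_past_seasons': max_past, 'k0': k0, 'omega': omega,
                  'win_ratio': n_correct / n_preds})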

Image by author.

The changes to the win ratios (and, consequently, to the estimated sharpness credible intervals, had we calculated them) are minimal regardless of the hyperparameters chosen. This likely means that, irrespective of the specific Elo ratings produced by a given choice of ω and k_0, the system reaches a level of stability that the logistic regression captures just as well. For example, suppose the intrinsic quality of Team A is 40% greater than Team B’s. With the original set of parameters, the difference in ratings between the two teams might have been 10 points; with a new set, it might jump to 50 points. Regardless of the specific number, whenever two teams have a similar difference in intrinsic quality, the regression learns which rating gap represents that difference. Because Elo is a system of relative scores, the system reaches stability, and parameter changes do not influence the regression meaningfully.

Another interesting finding is that, in general, feeding the regression longer stretches of historical data does not improve its quality. The win ratios are mostly similar whether we use one, five, or nine years of history each time we fit the regression. This is probably explained by the large number of observations per season (380): with that many data points, the regression can capture the underlying pattern even from a single season.

Such results leave us with two hypotheses. First, it might be that we have already extracted everything plain Elo ratings have to offer, and making better guesses would require including additional variables in the regression. Alternatively, it may be that adding new terms to the Elo formulas themselves would improve predictive capacity, turning the ratings into an even better reflection of reality. Both hypotheses, however, are yet to be explored.

An important disclaimer

Many people arrive at soccer modeling because of sports betting, ultimately wanting to build an algorithm that can bring them fast and voluminous profits. This is not our motivation here, and we do not support betting activity in any way. We want the reader to engage with the challenge of modeling such a complex sport for the sake of technical learning, since this can serve as great motivation to develop new Data Science skills.
