When we started predicting the results of the FIFA World Cup matches, we did so with cautious optimism. We had experienced success with predicting the outcomes of the 2015 Rugby World Cup and the 2016 Oscars, but football was a whole new ballgame. As our CEO, Jaco Rossouw, said: “We’ve never used our skills as data scientists to predict the outcomes of a football game, and unlike with the Rugby World Cup where we were predicting the point margins between the participating teams, this time we’ll be predicting the exact final scores – a significantly more complex challenge!”
We set out to use this exercise as a training and learning opportunity, and a chance to make data science more relatable to the average person. It was also a nice challenge: we wanted to see whether predictive analytics techniques used successfully in other areas could outperform the best human-made predictions on Superbru.com.
We’re pleased to say that, after the final games, we’ve outranked 99.92% of other participants on the sports prediction site.
Here are our rankings on Superbru after each round, for our most successful model:
|After Group Stages|99.96%|
|After Round of 16|99.98%|
We predicted the outcomes using three different models
The best-performing algorithm uses Bayesian inference. This technique enhances predictions by combining what we already know (determined by looking at historic game results) with a recent sample of data to predict the likely outcome. In this way, recent performance and player statistics are used to enhance the predictions of models that are developed on historical data alone. The model has been automated to adopt a machine learning approach in that it reselects variables and parameters every time it is run, adapting to how the World Cup is unfolding. Results of games from previous rounds inform predictions for the next round, and the model proved more accurate than 98.81% of other Superbru predictions at the end of the tournament.
The machine learning component can take it a step further, though, by also learning from match results within the current round and updating every day. This yields slightly different predictions from the ones the model makes at the start of a round, and Principa also added these predictions to Superbru. Using the latest available data proved a successful strategy: it outperformed the start-of-round predictions by more than one percentage point, beating 99.92% of Superbru participants.
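To make the idea of combining historical knowledge with recent results concrete, here is a minimal sketch of a Bayesian update. It is not our production model: it assumes a conjugate Gamma-Poisson setup for a single team's average goals per match, and the class name `GammaBelief` and all numbers are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GammaBelief:
    """Gamma(shape, rate) belief about a team's average goals per match (assumed model)."""
    shape: float  # behaves like "goals seen so far"
    rate: float   # behaves like "matches seen so far"

    @property
    def mean(self) -> float:
        # Expected goals per match under the current belief.
        return self.shape / self.rate

    def update(self, recent_goals: Sequence[int]) -> "GammaBelief":
        # Conjugate Gamma-Poisson update: add observed goals to the shape
        # and the number of observed matches to the rate.
        return GammaBelief(
            shape=self.shape + sum(recent_goals),
            rate=self.rate + len(recent_goals),
        )


# Hypothetical numbers: 12 goals in 10 historical matches form the prior.
prior = GammaBelief(shape=12.0, rate=10.0)
# Two matches from the current tournament (3 and 2 goals) update the belief.
posterior = prior.update([3, 2])
print(round(prior.mean, 2), round(posterior.mean, 2))
```

Calling `update` again after every match day mirrors the daily refresh described above: each new result shifts the estimate a little further away from the purely historical view.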
The second model was a Poisson regression model, which beat 96% of other Superbru participants. The Poisson distribution is a probability distribution used to model count data, such as the number of goals scored in a football match. It gives us a way of assigning probabilities to the number of goals each team scores in a game, and from these we can find probabilities for different match results. To estimate the expected number of goals, we use regression on variables such as the strength of a team’s attack and the team’s rating.
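The step from expected goals to match-result probabilities can be sketched as follows. This is an illustration under stated assumptions rather than our fitted model: the coefficients and feature values are hypothetical, and the two teams' goal counts are assumed to be independent.

```python
import math


def expected_goals(intercept, coefficients, features):
    """Poisson regression uses a log link: log(expected goals) is linear in the features."""
    linear = intercept + sum(c * x for c, x in zip(coefficients, features))
    return math.exp(linear)


def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number of goals is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def score_matrix(lam_a, lam_b, max_goals=10):
    """matrix[i][j] = P(team A scores i and team B scores j), assuming independence."""
    return [[poisson_pmf(i, lam_a) * poisson_pmf(j, lam_b)
             for j in range(max_goals + 1)]
            for i in range(max_goals + 1)]


def outcome_probabilities(matrix):
    """Sum scoreline probabilities into win/draw/loss from team A's point of view."""
    outcomes = {"win": 0.0, "draw": 0.0, "loss": 0.0}
    for i, row in enumerate(matrix):
        for j, p in enumerate(row):
            key = "win" if i > j else "draw" if i == j else "loss"
            outcomes[key] += p
    return outcomes


def most_likely_score(matrix):
    """Return the (goals_a, goals_b) cell with the highest probability."""
    cells = ((p, (i, j)) for i, row in enumerate(matrix) for j, p in enumerate(row))
    return max(cells)[1]


# Hypothetical coefficients for [own attack strength, opponent defence strength].
INTERCEPT, COEFFICIENTS = 0.1, [0.4, -0.3]
lam_a = expected_goals(INTERCEPT, COEFFICIENTS, [1.0, 0.2])
lam_b = expected_goals(INTERCEPT, COEFFICIENTS, [0.2, 1.0])

matrix = score_matrix(lam_a, lam_b)
outcomes = outcome_probabilities(matrix)
print(most_likely_score(matrix), outcomes)
```

Capping the matrix at ten goals per team drops only a negligible amount of probability for typical football scoring rates, which is why the three outcome probabilities sum to very nearly one.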
The third model was a multinomial logistic regression. This is an extension of binary logistic regression that allows for more than two classes of the dependent variable. We use a variable selection method to choose which variables are significant in predicting the dependent variable, and these become the independent variables of the model. The model then gives us a probability for each class (here, the number of goals scored). If we repeat this for the opposing team, we arrive at a predicted score by choosing, for each team, the class with the highest probability. Using this model, our predictions beat 88% of other predictions, placing in the top 12% at the end of the tournament.
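The prediction step of such a model can be sketched in a few lines. The example below only shows how class probabilities and the predicted score are derived once a model has been fitted; the goal classes, weights, biases and feature values are all hypothetical and do not come from our model.

```python
import math

# Assumed goal classes; grouping three or more goals into one class is an illustrative choice.
GOAL_CLASSES = ["0", "1", "2", "3+"]

# Hypothetical fitted parameters: one weight row and one bias per goal class.
# Features are [own team rating, opponent rating].
WEIGHTS = [[-0.8, 0.6], [0.0, 0.0], [0.5, -0.4], [0.9, -0.7]]
BIASES = [0.2, 0.5, 0.0, -0.6]


def softmax(scores):
    """Turn per-class linear scores into probabilities that sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(features):
    """Probability of each goal class for one team, given its feature values."""
    scores = [bias + sum(w * x for w, x in zip(row, features))
              for row, bias in zip(WEIGHTS, BIASES)]
    return softmax(scores)


def predicted_goals(probabilities):
    """Pick the goal class with the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return GOAL_CLASSES[best]


# Run the model once per team, swapping the roles of team and opponent.
probs_a = class_probabilities([1.0, 0.2])
probs_b = class_probabilities([0.2, 1.0])
predicted = (predicted_goals(probs_a), predicted_goals(probs_b))
print(predicted)
```

Because each team's goal class is chosen separately, the predicted score is simply the pair of most likely classes, which is the approach described above.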
What are the key lessons we’ve learned?
“The most important thing we’ve learnt thus far is that World Cup football is incredibly unpredictable and that not all factors that influence sports outcomes can be quantified, much to the chagrin of our German and Brazilian supporters at Principa!” says Jaco Rossouw, CEO of Principa. “We have, however, been quite successful on Superbru again, with one of our models out-predicting 99.92% of human-made predictions, which makes all of us very proud. Our predictions also create ‘football fever’ in the office – and we all enjoy some added mid-year cheer!”
Some lessons our data scientists have learned, and will be implementing in future projects, include:
- We’ve found that understanding the sport and the environment improves the accuracy of your predictions, mostly because you know where to focus during modelling and which data to include. We see the same in credit and customer analytics environments: the more an environment is understood, the better. Principa are experts in the build and implementation process because we understand the environment well. The same is true for machine learning implementations. And where an environment is unfamiliar to our experts, more time is spent on site to understand it.
- The Semi-Final and Final predictions were made by models trained on historical group stage data. Next time, we’ll look to train the model for each round on data specifically from that round, as experiments we’ve run since show this to be more accurate at predicting results and scores. This could be due to the high pressure of the situation, the intensity in the stadium or simply, as in this World Cup, having one side of the draw with more high-scoring teams and the other side with more lower-scoring teams, which come together at the Semi-Finals and Final.
- The changing environment of the sport always plays a role in whether predictions will be accurate or not. Our models were trained on data from games played without video decisions, whereas this World Cup did include them, and they had an impact on a couple of games, including the final. This is not something our models could accommodate, but it does underline the value of understanding your environment and the rules of the game in which you’re operating.
- We initially planned to overlay extra data as the tournament progressed, but we never had the time to do so. One thing we’ve learned is that if you’re predicting a tournament, do all of the planning ahead of time: once you’re in it, you’re in it! We do believe, though, that adding the ages of the players in the starting line-up or gameplay data (e.g. when in a game teams score on average) would make our models more accurate. The key lesson is that the more data, the better the prediction.
And now with the FIFA World Cup and our predictions at an end, what’s next for us? We’ll likely continue to make data-driven predictions for sports tournaments. But until then, we’ll get back to Working Wonders for our clients.
If you’re looking to use a data-driven approach in your business, get in touch with us.