The FIFA World Cup is approaching. I am excited.

A friend of mine referred me to this beautiful chart by Andrew Yuan. According to his simulations, Brazil is the overwhelming favorite, about three times more likely to win than the next-best candidates. I want to congratulate him. He’s done a wonderful job. I wish more people could (or tried to) share their mathematical and computational efforts in such an elegant and entertaining way.

However, in science, one question needs to be systematically raised. How trustworthy are the results?

To find out, we need to look at his model. In essence, he explains the levels of football teams with two factors. The first is the FIFA ranking. This ranking is derived from recent game results, with points attributed to wins and draws that depend on teams’ opponents. It’s pretty messy, so I’m not going to attempt an explanation of this ranking here. The second is the Home/Away factor. As any football fan knows, playing at home is an advantage, and the history of the World Cup definitely backs up this assertion. Indeed, host nations that won the tournament include Uruguay (1930), Italy (1934), England (1966), West Germany (1974), Argentina (1978) and France (1998). The following figure gives a powerful visual representation of this phenomenon.

Yuan then goes on to estimate the probability that team A beats team B given these two factors, by looking at the history of games played since 1993. Once again, this corresponds to the figure above. It is rather obvious from the figure that better-ranked teams are more likely to win. But Yuan went further and drew the underlying curve that best fits these data. In Yuan’s model, this curve is exactly what predicts how FIFA ranking + Home/Away/Neutral affect A’s probability of beating B.
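To make the idea concrete, here is a minimal sketch of what such a curve and the simulations built on it could look like. I am assuming a logistic curve; the coefficients `RANK_SLOPE` and `VENUE_EFFECT`, the function names and the bracket format are all made up for illustration and are not Yuan’s fitted values or code. Draws and the group stage are ignored.

```python
import math
import random

# Hypothetical coefficients for illustration only; these are NOT Yuan's fitted values.
RANK_SLOPE = 0.03  # weight of each place of FIFA ranking difference
VENUE_EFFECT = {"home": 0.5, "neutral": 0.0, "away": -0.5}


def win_probability(rank_a, rank_b, venue="neutral"):
    """Probability that team A beats team B (draws ignored for simplicity).

    rank_a, rank_b: FIFA ranking positions (1 is best).
    venue: where A plays, one of "home", "neutral", "away".
    """
    # Positive when A is better ranked (smaller rank number) than B.
    rank_gap = rank_b - rank_a
    score = RANK_SLOPE * rank_gap + VENUE_EFFECT[venue]
    return 1.0 / (1.0 + math.exp(-score))


def simulate_knockout(ranks, host=None, n_sims=10000, seed=0):
    """Estimate each team's title chances in a single-elimination bracket.

    ranks: dict mapping team name to FIFA ranking; its order defines the
           bracket, and its size is assumed to be a power of two.
    host:  optional name of the team playing at home.
    """
    rng = random.Random(seed)
    titles = dict.fromkeys(ranks, 0)
    for _ in range(n_sims):
        alive = list(ranks)
        while len(alive) > 1:
            winners = []
            # Pair up neighbours in the bracket: (1st, 2nd), (3rd, 4th), ...
            for a, b in zip(alive[::2], alive[1::2]):
                venue = "home" if a == host else "away" if b == host else "neutral"
                p = win_probability(ranks[a], ranks[b], venue)
                winners.append(a if rng.random() < p else b)
            alive = winners
        titles[alive[0]] += 1
    return {team: count / n_sims for team, count in titles.items()}
```

For example, `simulate_knockout({"A": 1, "B": 20, "C": 40, "D": 60}, host="C")` returns the estimated share of simulated tournaments won by each of four hypothetical teams. Everything the model knows about a team is its ranking and whether it plays at home, which is precisely the limitation discussed next.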

So, what’s wrong with Yuan’s model? I’d say that it’s a wonderful attempt at modeling football games, which already requires a huge amount of work. However, it may still not be detailed enough. Some important factors may be missing. For instance, it’d be interesting to look at how the number of missing players affects a team’s probability of winning. If Argentina has to play without Messi, it’s not going to be the same Argentina (even though the factors FIFA ranking + Home/Away/Neutral are unchanged!). On the other hand, we have to be careful not to add too many factors, as an over-fitted model may pick up meaningless patterns.

One way to avoid increasing the number of factors too much is to question those we already have. Once again, the figure above is particularly good at illustrating what I mean. We clearly see that the FIFA ranking alone is not good enough to explain team A’s probability of beating B. This is particularly true when you consider teams that are separated by fewer than 20 places, which will be the case in most games of the World Cup. In fact, as you can see in the figure, the Home/Away/Neutral effect is, in this case, far more relevant than the FIFA ranking. Crucially, the FIFA ranking may simply not be reliable enough. This unreliability is what explains the huge difference between Yuan’s prediction of Switzerland’s chances of winning and the bookmakers’. In fact, I’ve already criticized the FIFA ranking here, where I point out that it’s not based on any solid mathematical ground and sounds much more like some sort of obscure machinery. Because of that, Yuan’s prediction is a bit like predicting the future of 4-year-old kids based solely on their ability to count to 10. It may yield some indication… But can’t we do better?

I think so. In fact, 8 years ago, I made my own predictions for the 2006 World Cup. The results were (misleadingly) amazingly good. Find out how I did it by reading this article I wrote. Lately, I’ve pondered using a more robust Bayesian approach to this modeling. Maybe my statistician days are not over… For 2014 though, I won’t have time to run simulations!