I love analytics (but I guess you figured that out by now) and I also love football (i.e. soccer, but American football as well), so what I really love-love is sport analytics. I remember reading about Billy Beane and thinking: f*** yeah, there’s so much more beyond regular stats, we could improve decision-making in sport drastically!
Today we won’t do anything fancy, just a small mental/computational exercise to answer a question that crossed my mind. It all started with this tweet that popped up in my Twitter feed…
…and I thought, damn, this guy is a monster. He’s only 26, already level with Romario and closing fast on Ronaldo, two guys who are absolute icons. But what about Pele, can he topple O Rei do Futebol (the King of Football)? (Trivia alert: both of them are products of the Santos academy.) Hmm, interesting question, why not take some pen and paper… or open the laptop.
But what’s the best way to approach this problem? We don’t know how long Neymar will play for the Seleção and we don’t know what his goal rate will be, so we are stuck… but can we somehow approximate this? What if we compile a database of the world’s elite attackers (current and former) and use this data to extract probability density functions for retirement age and goal rate (depending on the player’s age)? Hmm, this might work, we can use these distributions to model future expectations… but model it how? How can we mimic Neymar’s future? Well, there is something: we can simulate the rest of his career with Monte Carlo.
When I say Monte Carlo, I mean the Monte Carlo method or Monte Carlo simulation, a statistical technique used to model probabilistic systems and get a sense of the odds for a variety of outcomes (in this case, whether Neymar can surpass Pele). Monte Carlo simulation uses random sampling and statistical modeling to simulate real-life systems in fields ranging from particle physics to engineering, finance and, well, football. There are two crucial pieces of information needed to make it work:
- a model that represents how inputs turn into an output: in our case the general idea is simple, expected career goals (output) are the product of the number of games played and the goal rate (inputs);
- a reasonable estimate of the variation of the inputs: in our case, probability density functions of retirement age and goal rate (depending on the player’s age).
By using probability density functions, variables can have different probabilities for different outcomes, which provides a much more realistic way of describing the uncertainty in the variables used in modeling. During a Monte Carlo simulation, values are sampled at random from these functions. Each set of samples is called an iteration, and the resulting outcome (output variable) from that sample (input variables) is recorded. Monte Carlo simulation does this hundreds of thousands, sometimes millions of times, and the result is a probability distribution of possible outcomes. In this way, Monte Carlo simulation provides a much more comprehensive view of what may happen. It tells you not only what could happen, but how likely it is to happen.
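To make the sample-and-record loop concrete, here is a minimal Python sketch. It is only an illustration: the two input distributions (a uniform number of remaining games and a roughly normal goal rate) and the threshold are made-up placeholders, not the distributions fitted for this post.

```python
import random

def simulate_goals(n_iterations=100_000, seed=42):
    """Toy Monte Carlo run: goals = games played * goals per game."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_iterations):
        # One iteration: draw each input from its (placeholder) distribution.
        games = rng.randint(20, 80)            # assumed: uniform remaining games
        rate = max(0.0, rng.gauss(0.6, 0.15))  # assumed: goal rate, floored at 0
        # Apply the model (inputs -> output) and record the outcome.
        outcomes.append(games * rate)
    return outcomes

outcomes = simulate_goals()

# The recorded outcomes approximate the output distribution, so the
# probability of an event is simply the share of iterations where it occurs.
threshold = 30  # hypothetical threshold, for illustration only
prob = sum(o > threshold for o in outcomes) / len(outcomes)
print(f"Estimated P(goals > {threshold}): {prob:.2f}")
```

The more iterations you run, the smoother the estimated distribution becomes, at the cost of run time.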
Enough with the Monte Carlo, let’s go back to Jogo Bonito and compare our four Brazilians. The first plot is Goals vs Caps and it’s very illustrative: Pele is by far the most efficient scorer of the lot with a goal rate of 0.84 goals per game, Romario interestingly is in second place (with a goal rate of 0.79), while Ronaldo and Neymar share 3rd place with an equal goal rate of 0.64.
This is in line with our research on the elite attackers’ database, as there is a negative correlation between year of birth and goal rate, i.e. elite attackers of the past usually had slightly higher goal rates than their modern counterparts. However, modern attackers have a chance to play many more official games over their careers. This is also evident if we compare three of our Brazilians: although Neymar, Ronaldo and Pele started playing for the national team at around the same age, Neymar will have a chance to play considerably more games for Brazil.
It’s also interesting to look at the Goals vs Age plot, with Pele and Romario as the two extremes within the quartet. Pele arrived in the national team with a bang, while Romario was a true late bloomer: he started his national team career much later but, incredibly, scored many of his goals in his 30s. Neymar and Ronaldo are once again much alike, at least until the age of 22, when Il Fenomeno stagnated due to an injury (torn ligament) that plagued him for almost two years.
All of this is even more evident in the following plots, where we tried to visualize Goal Rate vs Caps/Age. Pele was incredibly prolific at the start of his career, while Romario, despite the slow start, caught up later on. I remember him vividly from the 1994 World Cup in the USA and his iconic battle with Baggio, but unfortunately this was the last World Cup for him, despite him accumulating more than 30 goals afterwards. It’s worth noting that Ronaldo also had efficiency problems at the beginning of his Seleção career.
All four gentlemen are iconic footballers and lethal goalscorers, but not all goals are created equal, so let’s see who scored theirs on the big stage and who padded their statistics through friendlies and lower-profile matches. In this segment, Ronaldo is the clear winner: he has the biggest share of World Cup goals, which is no surprise since Il Fenomeno is ranked #2 on the all-time World Cup scorers list, behind only Miroslav Klose. On top of this, Ronaldo has the lowest share of goals scored in friendlies.
By comparing our Brazilian quartet we gained a lot of insights, but our original question still remains unanswered: can Neymar topple the King? For this exercise, i.e. the Monte Carlo simulation, we are going to use R. As stated before, variations in the inputs are captured through the retirement age and goal rate distributions from which we are going to sample. (Note: we are also going to consider the probability of missing a game due to injury/suspension.)
However, before we start running iterations and sampling from our distributions, we need to fit them first. In order to fit one or more distributions to a data set, it is first necessary to choose good candidates among a predefined set of distributions. This choice may be guided by knowledge of the stochastic processes governing the modeled variable or, as in our case, in the absence of knowledge regarding the underlying process, by the observation of its empirical distribution. The fitdistrplus package for maximum-likelihood fitting proved particularly useful in this regard. As a first step you can use Cullen and Frey graphs to compare distributions in the space of higher moments (skewness and kurtosis) and reject unlikely candidates, and then use goodness-of-fit measures to select the best of the remaining ones. (Non-zero skewness reveals a lack of symmetry in the empirical distribution, while the kurtosis value quantifies the weight of the tails in comparison to the normal distribution.)
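The two numbers behind a Cullen and Frey graph are easy to compute by hand. The analysis in this post was done in R, so treat the following as a rough Python illustration only: it uses simple moment estimators (which may differ slightly from what a given R package reports) and synthetic samples, not our attackers’ database.

```python
import random
import statistics

def skewness_kurtosis(sample):
    """Return (skewness, kurtosis) from plain central-moment estimators."""
    n = len(sample)
    mean = statistics.fmean(sample)
    m2 = sum((x - mean) ** 2 for x in sample) / n  # variance
    m3 = sum((x - mean) ** 3 for x in sample) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in sample) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2

rng = random.Random(1)
# Synthetic stand-ins (assumptions), chosen only to show two contrasting shapes.
symmetric = [rng.gauss(33, 2) for _ in range(5000)]         # normal-like
right_skewed = [rng.expovariate(1.0) for _ in range(5000)]  # exponential

for name, sample in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    skew, kurt = skewness_kurtosis(sample)
    print(f"{name}: skewness={skew:.2f}, kurtosis={kurt:.2f}")
```

A normal distribution has skewness 0 and kurtosis 3, while an exponential one has skewness 2 and kurtosis 9, so a sample whose moments land far from a candidate’s theoretical values is a hint to drop that candidate before running goodness-of-fit checks.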
After fitting our data to appropriate distributions we are ready to go! We simulated the rest of Neymar’s career 10,000 times (by sampling from the retirement age and goal rate distributions while taking into account the probability of him not playing due to injury/suspension) and the results are as follows…
…or if we plot it as a boxplot we can see the distribution summary much more clearly: the median is 94 goals, the 1st quartile is at 80 and the 3rd quartile at 106 goals. There are some outliers with a very high number of expected goals, which come from iterations of the simulation where Neymar benefited from a very long and productive career (think of Romario), but most of the distribution lies to the left of the 100 goal mark. In case you were wondering, the probability of Neymar reaching the 100 goal mark is around 38%.
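For the curious, the whole procedure (simulate a career many times, then read off quartiles and probabilities) can be sketched in a few lines. This is a Python illustration, not the R code behind the plots, and every parameter in it is an assumption: the triangular retirement-age distribution, the age-dependent goal rate, the games per year, the chance of missing a game, and the starting tally of 55 goals. Pele’s 77 goals is his commonly cited official tally, which is not stated in this post. Do not expect it to reproduce the numbers above.

```python
import random
import statistics

PELE_GOALS = 77  # assumption: commonly cited official tally, not from this post

def simulate_career(rng, current_age=26, current_goals=55,
                    games_per_year=12, p_miss=0.15):
    """Simulate one possible rest-of-career and return the final goal tally."""
    # Assumed retirement-age distribution: triangular(low, high, mode).
    retirement_age = round(rng.triangular(current_age + 1, 38, 33))
    goals = float(current_goals)
    for age in range(current_age, retirement_age):
        # Assumed age-dependent goal rate: flat until 28, then declining.
        mean_rate = 0.65 - 0.03 * max(0, age - 28)
        rate = max(0.0, rng.gauss(mean_rate, 0.10))
        # Each scheduled game is missed (injury/suspension) with prob. p_miss.
        played = sum(rng.random() >= p_miss for _ in range(games_per_year))
        goals += played * rate  # expected goals added in this season
    return goals

def summarize(outcomes, target):
    """Return the three quartiles and the share of outcomes reaching target."""
    q1, median, q3 = statistics.quantiles(outcomes, n=4)
    p_target = sum(o >= target for o in outcomes) / len(outcomes)
    return q1, median, q3, p_target

rng = random.Random(2018)
outcomes = [simulate_career(rng) for _ in range(10_000)]

q1, median, q3, p_100 = summarize(outcomes, target=100)
p_record = sum(o > PELE_GOALS for o in outcomes) / len(outcomes)
print(f"Q1={q1:.0f}, median={median:.0f}, Q3={q3:.0f}")
print(f"P(at least 100 goals)={p_100:.2f}, P(more than Pele)={p_record:.2f}")
```

Swapping the placeholder distributions for ones fitted on real data is the only change needed to turn this toy into the actual exercise.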
Oh, yeah, almost forgot, only 1200 words later the answer you have all been eagerly waiting for: yes! Neymar will most probably succeed Pele on the throne of Brazil’s all-time top goalscorer, and when we say “most probably” we mean that the chances of him doing this are very high, or to be more precise, around 83% based on our simulations. However, we are not saying that Neymar is or will be a better footballer than Pele. As said before, Pele was an incredibly efficient goalscorer with a significantly higher goal rate than Neymar, while Neymar benefits from a much higher number of Seleção games.
That’s all, folks, stay tuned and don’t forget to follow the 2018 World Cup!
P.S. Go Croatia! 🙂