[D] Predicting Fantasy Football Points for German Bundesliga (kicker.de)
Hey guys. This year's fantasy football season is about to start, so I decided to do a little project and try to predict each player's points for the upcoming season.

At first I searched for other people who have done similar stuff. Most of it was for US sports (NFL, NBA, MLB); one reason, imo, is the availability of data. Other projects on football did week-to-week predictions, mostly in the middle of the season. This is not what I am going for.

Speaking of data availability: I had to put a lot of effort into getting historic fantasy points. Eventually I managed to obtain them by scraping the website (kicker.de) and recalculating the points from season 00/01 onwards (about 9k datapoints). The features I have are name, club, position and age. The target value is the fantasy score at the end of the season. The score is heavily influenced by the rating (1.0 to 6, German school grading system: 1.0 is best, 6 is worst), which is assigned by editors of the newspaper. This is a human perception rating, not one derived from statistics!

I mainly used sklearn. My metric to optimize is mean squared error (MSE), and I tried several linear regression methods (Lasso, ElasticNet and BayesianRidge). I got results around 2000 MSE; some hyperparameter tuning later I got to ~1650. Then I thought of giving LGBM a go, but it was actually worse, even with hyperparameter tuning. I thought about trying an LSTM, but I think the dataset is way too small for that.

To be honest, I am DISAPPOINTED with the results. ~1650 MSE corresponds to an RMSE of about 41 points, so a typical miss is roughly that big. That's a lot. So my next idea was to analyse whether the model is bad on all the data or good in particular areas. All of this was done for the 2018/19 season.

(Plot, left to right: Goalkeeper, Defenders, Midfield, Forward.)

This is the mean error per position. GK seems to be really hard to predict. It is common knowledge that, of all regular players, goalies tend to score a lot of points on average.
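For anyone who wants to reproduce these two numbers, here is a minimal sketch (not the original code). It assumes a pandas DataFrame with hypothetical columns `position`, `real` and `predicted`; the toy values are made up purely for illustration. Note that the square root of the MSE is the RMSE, which weights big misses more heavily than a plain mean absolute error, so it can be worth looking at both.

```python
import numpy as np
import pandas as pd


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (points)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae_by_group(df, group_col, real_col="real", pred_col="predicted"):
    """Mean absolute error per group, e.g. per position."""
    abs_err = (df[real_col] - df[pred_col]).abs()
    return abs_err.groupby(df[group_col]).mean()


# The RMSE implied by the reported MSE of ~1650 (roughly 40.6 points).
print(round(float(np.sqrt(1650)), 1))

# Toy data, invented for illustration only (column names are assumptions).
toy = pd.DataFrame({
    "position": ["GK", "GK", "DEF", "DEF", "MID", "FWD"],
    "real": [120, 10, 40, 60, 80, 150],
    "predicted": [60, 50, 45, 50, 60, 100],
})

print(mae_by_group(toy, "position"))
print(rmse(toy["real"], toy["predicted"]))
```

The RMSE is never smaller than the mean absolute error, so a per-position table of absolute errors will usually look a bit friendlier than the headline number.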
Then again, there is just one goalkeeper per team playing, and they are less likely to get injured or to be left out because of fatigue. Defenders seem the easiest, midfielders slightly worse and forwards worse still. Those positions are reasonably good, but still not in the area where I would be confident.

(Plot: y = mean error, x = amount of players.)

For this one I put the data into bins of points (from -75 to 250 in steps of 25). "Real" in this context means the points that were actually scored, and "predicted" the ones the model predicted. So I basically wanted to compare the distribution of the real points with that of my predicted points. As it turns out, the model is putting way too many players in the (26,50) range. Both distributions have roughly the same shape, but the spread of the predicted ones is not big enough.

(Plot: y = mean error, x = bins of points.)

So my idea was that maybe the model is good in some point range. Shockingly, where most of our data for predictions is, there is most of our error. The model is quite good at predicting the top-end players (150+ points) and quite good at sorting out the garbage (<0 points), but in between it's quite horrible.

Here we see the error for all the teams. The two surprising points are the very accurate ones: Augsburg and Nürnberg. Both are in the bottom half of the table, no surprise there. Nürnberg was a promoted team and got relegated immediately; Augsburg survived but fought against relegation all season long. At the other end, Borussia Dortmund played quite a good season, fighting for the championship with Bayern. I assume the big error there is due to several factors. First, they bought some players who performed quite well but had no historic data (Witsel, Paco, Hakimi) or very little (Sancho, Akanji, Delaney). Second, they all performed quite well and above league average. So team-wise there doesn't seem to be much insight, imo.

https://i.redd.it/qo8wegslged31.png

So I was interested in whether there is some sort of correlation between age and points.
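Both the points-bin breakdown and the age question come down to the same kind of grouping, so a rough sketch may help. This is not the original analysis code: the column names (`age`, `real`, `predicted`, `abs_error`), the age bin edges and the toy values are all assumptions made for illustration. Only the points bins (-75 to 250 in steps of 25) come from the post.

```python
import numpy as np
import pandas as pd


def summarize_by_bin(df, by, value, edges):
    """Count, mean and std of `value` within right-closed bins of `by`."""
    bins = pd.cut(df[by], bins=edges)
    # observed=True drops bins that contain no players at all.
    return df.groupby(bins, observed=True)[value].agg(["count", "mean", "std"])


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "age": [19, 23, 27, 27, 31, 34],
    "real": [-5, 40, 160, 30, 90, 20],
    "predicted": [20, 45, 140, 40, 60, 35],
})
toy["abs_error"] = (toy["real"] - toy["predicted"]).abs()

# Mean absolute error per bin of real points: -75, -50, ..., 250.
point_edges = np.arange(-75, 251, 25)
print(summarize_by_bin(toy, by="real", value="abs_error", edges=point_edges))

# Spread of real points per age group (the numbers behind a boxplot).
age_edges = [15, 20, 25, 30, 35, 40]
print(summarize_by_bin(toy, by="age", value="real", edges=age_edges))
```

The `count` column matters as much as the mean here: a bin with very few players can look very good or very bad purely by chance.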
My expectation was something like this: very young players rarely score points, the early to mid-late 20s are the prime (with occasional bumps), and in their 30s players decline. But as the boxplots suggest, the data is pretty much all over the place. We have huge variance right from the beginning; it only really tones down in the 30s. So age also seems like a pretty bad feature to consider.

Additionally, I plotted some age/points graphs for specific players. There are veterans, some on world-class level (Neuer) but also some from midtable teams who never made it to international level. There are also players with injury issues (Reus, Bender), people who seemed like good international material and then completely vanished, and some one-hit wonders. Just have a peek. I put them in an imgur album to not make this post even bigger (https://imgur.com/a/2kSVXdm).

So, my question: what would be your ideas to improve the performance? More data? Feature engineering (but what?)? Maybe train a regressor per position?

P.S.: If there is enough interest in the data and you guys wanna play around with it yourself, I can publish it.

submitted by /u/thecluelessguy90