Data scientist, physicist, and fantasy football champion

Comparing DEF models throughout the 2016 season

What was the best model this year?

How did its accuracy change every week?

How do these numbers compare to the pros?

These are the questions that keep me (and I’m sure you, dear reader[singular]) up at night. Toward the end of the season a few of my models felt more accurate, but I’m not sure if they really were. I remember at least one week toward the end of the season with very poor accuracy, but that can hardly be my fault, right? What I’d like to do here is to run through the entire year again with a few of my models to see how they did every week (without having to just dig through my old results by hand). I have two years’ worth of fantasy DEF data and a year of closing odds from Footballocks that I use to make predictions. Let’s predict!

DEF Model A for weeks 2 through 17

We all remember my team defense model Model A, right? It used factors for team, opponent, predicted team score, predicted opponent score, and whether the team was home or away. It also used only the data from 2016. The first thing I want to do is to check what the accuracy would have been had I been posting these as early as week 2.
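For anyone who wants to play along at home, here's a rough sketch of how a model with those factors could be fit. My actual fitting happens in R, so this Python version is just an illustration, and the column names and numbers below are invented, not my real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical weekly DEF rows; these column names are guesses at a
# plausible schema, not the real one.
df = pd.DataFrame({
    "FanPts":        [12, 3, 8, 15, 6, 10, 4, 9, 7, 11, 5, 13],
    "Team":          ["KC", "NE", "DEN"] * 4,
    "Opponent":      ["SD", "NYJ", "OAK", "MIA"] * 3,
    "PredTeamScore": [27, 24, 21, 23, 17, 26, 20, 28, 22, 25, 19, 30],
    "PredOppScore":  [17, 14, 24, 13, 20, 10, 21, 16, 18, 12, 23, 15],
    "HomeAway":      ["H", "A"] * 6,
})

# Expand the categorical factors into dummy columns, as R's lm() would
X = pd.get_dummies(df[["Team", "Opponent", "HomeAway"]], drop_first=True).astype(float)
X[["PredTeamScore", "PredOppScore"]] = df[["PredTeamScore", "PredOppScore"]].astype(float)
X.insert(0, "Intercept", 1.0)

# Ordinary least squares fit of fantasy points on the Model A factors
coef, *_ = np.linalg.lstsq(X.to_numpy(), df["FanPts"].to_numpy(dtype=float), rcond=None)
print(X.shape)  # (12, 9): 12 games, 9 model terms
```

The point to notice is how quickly the dummy columns pile up: even this toy example with 3 teams and 4 opponents already needs 9 coefficients.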

## [1] "Model A accuracy for week 2 = 104"

Good, that worked, though I hardly would have posted this had it not. Remember that the accuracy score is actually more of an inaccuracy score, where the lowest score is the best, golf-style. I remember thinking that 104 isn’t bad, but isn’t exactly good either. I want to see how this progresses throughout the year.

Note also that R issued me a warning that my model matrix was rank-deficient and might lead to poor results. This makes sense; many of these factors are categorical (Team, Opponent, HomeAway) and that uses up a lot of degrees of freedom. I suspect that it would take a few weeks before this model’s accuracy improves. I predict that models that also use last year’s data will be more accurate initially. We’ll see later whether my prediction is true.
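To see why the warning makes sense, just count coefficients. Here's a quick back-of-the-envelope tally (assuming the factor list above is the whole model; each categorical factor with k levels costs k − 1 coefficients):

```python
# Degrees-of-freedom bookkeeping for the rank-deficiency warning
n_teams = 32                      # NFL teams
df_team     = n_teams - 1         # 31 Team dummies
df_opponent = n_teams - 1         # 31 Opponent dummies
df_homeaway = 2 - 1               # 1 HomeAway dummy
df_numeric  = 2                   # predicted team and opponent scores
total = 1 + df_team + df_opponent + df_homeaway + df_numeric  # +1 intercept
print(total)                      # 66 coefficients to estimate

# Through week 1 there are only ~16 games, i.e. 32 team-rows of data
rows_by_week2 = 32
print(rows_by_week2 < total)      # True: far fewer rows than terms
```

With more coefficients than data rows, there's no way to pin them all down, which is exactly what the warning is complaining about. A few more weeks of games fixes this.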

Let me try and run this for weeks 2 through 17 and see how my accuracy improves week to week:

Ugh, gross. Just look at those 95% confidence intervals. The blue line is the best fit but with 95% confidence we can say that it should fit within the grey ribbon. I was hoping to have a much narrower band than that. Yes, this model improved over time, but I can’t say with (statistical) confidence that it improved by a lot. Still, let’s start by comparing it to one other model. Let’s say Model C, since it uses the same factors as A but includes the previous year’s data. I can include week 1 here. This will be good because at least for the first few weeks next year I’ll probably have to use something like this.

Comparing models A and C

This is actually what I was hoping to see. Model C starts off a little better because it actually has some data, but Model A improves a little more quickly until it starts beating it around the mid-season mark. This makes sense to me. There are some team changes every year (coaches, players, etc.), and some defenses do a little better and others a little worse across years.

Both of these models did well last year. Let’s see what happens when we look at a model that didn’t do so well. Model E was a bit of a crazy shot. It included ‘momentum’ terms (that is, week, week:Team, and week:Opponent terms) and only used data from the previous 6 weeks. It was not the most effective model, but it should hopefully provide a good contrast to models A and C.
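To make the ‘momentum’ idea concrete: in R-formula terms, week:Team is just the week number multiplied into each Team dummy, so recent weeks pull harder on a team's coefficient. Here's a Python sketch of those columns on an invented schedule (again, not my real data or code):

```python
import numpy as np
import pandas as pd

# Invented schedule: 3 teams, 3 opponents, weeks 1-6 (not real data)
rows = []
for week in range(1, 7):
    for i, team in enumerate(["KC", "NE", "DEN"]):
        rows.append({
            "week": week,
            "Team": team,
            "Opponent": ["SD", "NYJ", "OAK"][(week + i) % 3],
            "FanPts": 5 + 2 * i + week % 4,  # placeholder scores
        })
df = pd.DataFrame(rows)

# Model E only looks back 6 weeks from the week being predicted
current_week = 7
recent = df[(df["week"] >= current_week - 6) & (df["week"] < current_week)]

# week:Team and week:Opponent are the dummies scaled by the week number
base = pd.get_dummies(recent[["Team", "Opponent"]], drop_first=True).astype(float)
momentum = base.mul(recent["week"], axis=0).add_prefix("week:")
X = pd.concat([base, momentum], axis=1)
X.insert(0, "week", recent["week"].astype(float))
X.insert(0, "Intercept", 1.0)

coef, *_ = np.linalg.lstsq(X.to_numpy(), recent["FanPts"].to_numpy(dtype=float), rcond=None)
print(X.shape)  # (18, 10): 18 team-weeks, 10 terms
```

Doubling the number of team-related columns while also throwing away all but 6 weeks of data is why this model is data-starved, which foreshadows its results below.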

Comparing Models A, C, and E

(I removed the lines and the confidence intervals because they were distracting and made for a very cluttered figure.)

The momentum terms are weak so it doesn’t make a huge difference, but it’s definitely a little worse than model A by the end of the season. Finally, let’s go ahead and add the remaining models:

  • Model B: Just Team, Opponent, and Opponent score terms, 2016 data only
  • Model F: TruScor model (I really will trademark this, so just back off). Same terms as model A and using data from 2015 and 2016, but with an improvement on the TD score prediction method. Go read about it in the Methodology section.

I’m ignoring Model D. I took a quick look and it just falls almost completely on top of Model E.

Comparison of Models A, B, C, E, and F

There’s a lot to unpack here, and not all of it is really encouraging. First, all of the models improve over the course of the year, as I had hoped they would. Second, the models have a tendency to cluster every week, meaning that in week 9 most of the models did well, in week 10 most did poorly, and so on. This probably has two causes. When I initially explored which factors I should add to the models, I found that only the Team, Opponent, and Opponent Score terms were statistically significant (with the data that I had at the time). Since all of the models use those three terms, all that’s left is minor additions of non-significant terms. It’s probably also due to a few unexpected teams performing very well each week. If we look at the top 3 teams in week 12 (NYG with 23 pts, TB with 19, and KC with 15), I think the only one anyone would have predicted was KC. Even models that put KC at #1 and NYG at #3 would still have scored 16 accuracy points for their effort.

Model F, my favorite for the last few weeks, may have deserved to be my favorite. It distinguished itself from the pack toward the end of the season (if we ignore statistical arguments for now) and tended to score as well as or better than Models A and B each week. But I should urge caution here. When I look at a boxplot of the accuracy scores for the second half of the season (weeks 9 through 17), I get the following:

Model F has the lowest mean accuracy score (not pictured), but actually a marginally higher median accuracy than Model A (the bold line in the middle of the box). It also has the lowest ceiling and floor of any of the models. All of these figures are telling me that Model F was better, but not by much. The statistics agree: I ran a few versions of ANOVA and t-tests and found no statistically significant differences between A and F, or even between E and F.
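For the curious, the flavor of the test is simple. Here's a sketch of a paired t-test on made-up weekly scores (these are not my real numbers, just an illustration) using only the Python standard library:

```python
import math
from statistics import mean, stdev

# Made-up weekly accuracy scores (lower is better) for weeks 9-17;
# NOT the real numbers, just an illustration of the paired test.
model_a = [104, 98, 112, 95, 101, 99, 108, 96, 103]
model_f = [101, 100, 109, 96, 98, 102, 104, 97, 102]

# Pair by week: both models are scored on the same slate of games
diffs = [a - f for a, f in zip(model_a, model_f)]
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Two-sided 5% critical value for 8 degrees of freedom is about 2.306
print(abs(t_stat) < 2.306)  # True here: no significant difference
```

The pairing matters: comparing the same weeks head-to-head removes the large week-to-week swings that hit every model at once, and even then the difference doesn't clear the bar.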

Comparing to the pros

Okay, now for the big one. Was I actually better than the Yahoo pros? I’m picking on Yahoo here because I’m in a Yahoo league and because they keep their predictions available online forever. I picked up the average predictions of five or six of their experts (typically Brandon Funston, Andy Behrens, Brad Evans, Scott Pianowski, and Dalton Del Don, and maybe another guy depending on the week) for each week in 2016. It was easier than it sounds; Yahoo posts this every week so I just picked them all up.

I’m going to compare Yahoo’s predictions to just models A, C, and F to make it a little clearer (I tried it with all the models and it looks a little cluttered):

Alright, that’s good, right? I mean, it’s not statistically significant, but Yahoo’s line is a little higher (less accurate) than Model C throughout the entire year and is much higher than Models A and F in the second half of the season. I can now formally stop listening to their predictions every week! Hooray!

You know, kinda. Yahoo jumps up and down the same as my models every week, but I do marginally better overall. Yahoo was more accurate in weeks 4, 5, 6, 13, and 16, but never by much. And when it was bad it was bad (weeks 10, 11, 12, and 14), and Model F often took the lead. I think Model F looks better, but by so little that it might not be significant. Still, after this analysis I feel more confident trusting myself, so that’s good news at least.


I would consider this a weak success of my not-yet-patented TruScor system. All of the models improved over the year, but by the end (and especially in the 2nd half of the season) Model F was the best. Trying to remove some of the variability of touchdowns improved the accuracy score of the model a little and decreased the variance as compared to Model A.

The week-to-week variability caused by random teams coming up with an extra TD or two is as large or larger than the variability between models. That’s not good, but it’s somewhat expected. There are always successes and flops every week and since TDs are worth 6 points just one can propel your team into the top 15.

Given this analysis, I don’t have a great game plan for next year. On the one hand, Model C seems to be the best at the beginning of the year, but that’s not because it has more previous data, since Model F uses the same data (2015 and 2016). I really think Model F is the way to go across the entire season, but I’ll have to be careful. I’m not sure if I’ll want to use 2015, 2016, and 2017 data or just 2016 and 2017.

The good news is that I’ll have more data, and with more data comes more possibilities. I’ll keep creating models this offseason and by the time the season starts again I’ll have this thing down.

Up next: Kickers and my failure to model them. Stay tuned readers!
