by Marshall Flores
Welcome to Episode 3 of Awards Daily’s Statsgasm. Last week I led a descent into deeper statistical madness by introducing regression analysis; hopefully I wasn’t Darryl Revok and didn’t induce too many Scanners-style exploding heads among you all. Today we’re going to make our furthest probe into the statistical singularity of regression. Yes, that means more math, but we will also see for the first time the primary method that forms the basis of AD’s Oscar prediction models.
Note: before you venture past this point, I do invite you to take the time to read and/or review my first two Statsgasm posts. I of course try to make each post self-contained, but advanced stats (and math in general) does snowball from basic concepts and terminology. Revisiting the first two episodes of Statsgasm may be useful in ensuring that you don’t get too lost today, as I believe this will be the longest and most technical episode in the entire series.
The simple linear regression (SLR) model I introduced in Episode 2 is a very powerful tool, but it's appropriate only for a certain type of data: specifically, when the response variable is continuous (i.e. the response can assume *any* value on the real number line). Many things can be represented by continuous variables, but not everything. For instance, if something has only two possible outcomes (e.g. whether a student passes or fails a course), it makes much more sense to model it with a binary variable (0 for failure, 1 for pass).
So what do binary variables have to do with us? Hmm, let's see… oh yeah, the Oscars are a textbook example of something that can be coded as a binary variable! Each category has only one winner – the rest of the nominees don't win. So we simply assign a 1 to the winner and a 0 to the rest.
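As a quick sketch of that coding, here's the actual 1950 Best Picture field in Python (All About Eve won that year):

```python
# The 1950 Best Picture field: code the winner as 1, every other nominee as 0.
nominees_1950 = ["All About Eve", "Born Yesterday", "Father of the Bride",
                 "King Solomon's Mines", "Sunset Boulevard"]
winner = "All About Eve"
y = [1 if film == winner else 0 for film in nominees_1950]
print(y)  # [1, 0, 0, 0, 0] – exactly one winner per category
```

That 0/1 column is what the regression model treats as its response variable.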
However, if we do want to use a binary variable as a response, we cannot use the SLR model (for a number of theoretical reasons I won’t go over). Fortunately, we can use a kissing cousin of linear regression called logistic regression. The underlying mechanics of logistic regression are difficult to explain without explicit math, and since I’m interested in writing for you guys without exploding your heads, I’m going to handwave a lot of the formalities and focus on showcasing what the model can do. Still, I do want to emphasize one very important thing that distinguishes logistic regression from SLR and makes it a very appropriate method for us to use regarding the Oscars:
The SLR model estimates an average change in the response variable given a change in our predictor(s). Logistic regression effectively does the same thing, but it now calculates how the binary response variable is influenced by the predictors in terms of probability (more specifically, it calculates probabilities when the response variable equals 1). To put it in the context of Oscar predicting, logistic regression can enable us to determine how certain precursors (winning the DGA, having the most nominations, etc.) affect, on average, the chances of winning the Oscar.
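Under the hood, the model fits a linear combination of the predictors on the log-odds scale and squashes it through the logistic function to get a probability between 0 and 1. Here's a minimal sketch – the coefficients below are purely illustrative placeholders, not the fitted values from any actual model:

```python
import math

def win_probability(total_noms, dga_win, b0=-4.0, b_noms=0.24, b_dga=3.9):
    """Logistic regression: P(win) = 1 / (1 + exp(-(b0 + b_noms*x1 + b_dga*x2))).

    b0, b_noms, b_dga are illustrative coefficients, not a real fit.
    dga_win is 1 if the nominee won the DGA, 0 otherwise.
    """
    log_odds = b0 + b_noms * total_noms + b_dga * dga_win
    return 1.0 / (1.0 + math.exp(-log_odds))

# More nominations and a DGA win both push the estimated probability up.
print(win_probability(5, 0), win_probability(12, 1))
```

Whatever the inputs, the output always lands between 0 and 1, which is exactly why the logistic function suits a binary response where a straight line does not.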
With my introduction of logistic regression as a magical mystery box out of the way, let’s see it perform some Abracadabra! Last week we used an SLR model to examine how Best Picture winners with more nominations tend to also take home more Oscars as a result. We’re now going to build a logistic regression model that estimates how the number of nominations a BP nominee receives influences its chances of winning BP. I’ll also throw in the strongest BP predictor historically, the DGA, into the mix as well.
For this example, we're going to be using data from 1950-2012. First things first – always visualize the data we're working with. Here's a histogram showing the distribution of total nominations each BP nominee received over that period:
This distribution doesn't look as skewed as the ones we saw using data from 1980-2012 – in fact, it actually looks pretty normal. Let's see what our logistic regression model comes up with.
Our model p-value is 0.0000 (i.e. vanishingly small), which indicates the logistic model as a whole is statistically significant. Meanwhile, Pseudo R² is an analogue to the R² value in the SLR model that acts as an indicator of goodness-of-fit (although we do not interpret it in exactly the same way). In general, we would be happy with Pseudo R² values ranging between 0.2 and 0.4; the fact that we're getting 0.5077 using just a two-predictor model to explain 63 years' worth of BP history indicates the model is pretty darn good.
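For the curious: the Pseudo R² most software reports is McFadden's version, computed from the log-likelihoods of the fitted model and a null (intercept-only) model. The log-likelihood values below are made-up numbers chosen to land near 0.5, not the actual output of the BP model:

```python
# McFadden's Pseudo R^2 = 1 - LL(fitted model) / LL(null, intercept-only model)
# These log-likelihoods are illustrative, not from the actual BP fit.
ll_null = -170.0
ll_model = -83.7
pseudo_r2 = 1 - ll_model / ll_null
print(round(pseudo_r2, 4))
```

The closer the fitted model's log-likelihood gets to zero relative to the null model's, the closer Pseudo R² gets to 1.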
The p-values for both total_noms and DGA are also very small, indicating their significance as predictors, so let's move on to interpreting their coefficients. Our model estimates that each additional Oscar nomination a BP nominee receives *increases* its odds of winning BP (on average) by 27%. Alternatively, we can interpret this as each additional Oscar nomination making a nominee's odds of winning BP 1.27 times greater. Meanwhile, the model estimates that if a BP nominee wins the DGA, its odds of winning BP are about 50 times greater.
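Those interpretations come straight from exponentiating the model's coefficients: exp(coefficient) is the odds ratio for a one-unit change in that predictor. Working backwards from the odds ratios reported above:

```python
import math

# exp(coefficient) is the odds ratio for a one-unit change in that predictor.
# These log-odds values are reverse-engineered from the reported odds ratios.
b_noms = math.log(1.27)  # each extra nomination multiplies the odds by ~1.27
b_dga = math.log(50)     # a DGA win multiplies the odds by ~50

# Percent change in odds = (odds ratio - 1) * 100
print(round((math.exp(b_noms) - 1) * 100))  # 27 (% per additional nomination)
print(round(math.exp(b_dga)))               # 50 (times greater with a DGA win)
```

This is why logistic regression coefficients are usually discussed in terms of odds ratios rather than raw values – the exponentiated form is what has a natural interpretation.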
Here is a graph that depicts our model’s estimates of BP win probability when considering only total nominations:
where the blue band indicates the 95% confidence interval of the model's probability estimates. As expected, more nominations equate to a higher chance of winning BP (though interestingly enough, there seems to be a drop in win probability going from 12 to 13 nominations!). Our model predicts (with 95% confidence) that if a BP nominee received 12 Oscar nominations, its chances of winning BP are between 56% and 72%.
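Intervals like that are typically built on the log-odds scale, where the model is linear, and then pushed through the logistic function onto the probability scale. The point estimate and standard error below are illustrative numbers chosen to roughly reproduce a 56%–72% band, not the model's actual output:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative linear-predictor (log-odds) estimate and standard error
# at 12 nominations; not taken from the actual fitted model.
z_hat, se = 0.59, 0.18
lower, upper = z_hat - 1.96 * se, z_hat + 1.96 * se

# Transform the interval endpoints from log-odds to probabilities
print(round(sigmoid(lower), 2), round(sigmoid(upper), 2))  # 0.56 0.72
```

Transforming the endpoints (rather than computing the interval directly on the probability scale) guarantees the band stays inside 0 and 1.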
Let’s see how things change when we throw the DGA into the picture:
We now have two bands in our graph: the blue band depicts predicted BP win probability *without* winning the DGA, while the red band indicates win probability *with* a DGA win. As we can clearly see, the DGA is extremely influential in predicting BP, more influential than total nominations alone. This time, the model predicts (with 95% confidence) that a film with 12 nominations that *did not* win the DGA has a BP win probability between 0.2% and 25%, and a BP win probability between 81% and 97% *with* a DGA win – a **huge** difference.
So we now have a good idea of how the number of nominations and the DGA factor into predicting BP in general. But just how well does a model using only those two variables explain past outcomes? After running a logistic regression, we can generate what is known as a classification table for this purpose.
There’s quite a bit going on here, so let’s zero in on two things. Sensitivity indicates the model’s ability to predict true positives within the data, while Specificity indicates the ability to predict true negatives. In other words, these are measures of how good our model is in picking out which films actually won BP and which films did not win. As the table indicates, our model correctly identifies 50/63 (79%) of the BP winners from 1950-2012, and 253/265 (95%) of the BP non-winners. Not too shabby for a model using only two predictors, although there’s always room for improvement.
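Computing those two measures from actual outcomes and model classifications is straightforward. Here's a sketch using toy vectors constructed to match the counts above (50 of 63 winners and 253 of 265 non-winners correctly classified):

```python
def sensitivity_specificity(actual, predicted):
    """Sensitivity = true positives / all actual positives;
    specificity = true negatives / all actual negatives."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy vectors matching the article's classification table:
# 63 winners (50 caught, 13 missed), 265 non-winners (253 caught, 12 missed)
actual = [1] * 63 + [0] * 265
predicted = [1] * 50 + [0] * 13 + [0] * 253 + [1] * 12
sens, spec = sensitivity_specificity(actual, predicted)
print(round(sens, 2), round(spec, 2))  # 0.79 0.95
```

Note that classifying a nominee as a predicted winner usually means its estimated probability crosses some cutoff (0.5 by default), so these numbers depend on that cutoff as well as the model itself.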
And that, my friends, is the bare bones of logistic regression – the spinal column of AD’s Oscar forecasting models. Logistic regression helps quantify the conventional wisdom we long-time Oscar watchers have used in predicting winners. In this case, we now have some idea just how influential number of nominations and the results of the DGA are in the BP race in terms of odds and percents.
This episode will actually be the last one in this series to get into such technical detail, so you all can rest easy. In January, before Oscar nomination morning, I will introduce one of AD's models in its entirety (right now, I'm thinking I will unveil AD's prediction model for Visual Effects). I will also reveal (to some extent) the modifications and adjustments I make when the models generate their predictions. Again, I emphasize that there is a lot of personal "art" that goes into statistical prediction in addition to the science, and the Oscars are certainly no exception to that.
If you are interested in the (many) details I left out regarding logistic regression, I invite you to post in the comments below. And as always, feel free to e-mail me at marshall(dot)flores(at)gmail(dot)com or converse with me on Twitter at @IPreferPi314.