by Marshall Flores
Welcome back for Part 2 of Awards Daily’s Statsgasm – a miniseries of posts that will lead up to the unveiling of AD’s first generation of Oscar forecasting models in January. Last week in the pilot episode, I began our descent into stats mania by showing how useful it is to visualize historical data, to see past trends. Specifically, we analyzed the past 33 Best Picture winners (1980-2012) – how many nominations each received, and how many Oscars each won.
We’re going to revisit this data in order to introduce a very powerful statistical tool that allows us to estimate relationships between two (or more) variables – regression analysis.
First, let’s briefly review how nominations were distributed among the past 33 BP winners:
From this histogram, we determined that the distribution was negatively skewed, i.e. more of the data is clustered on the right. Hence, we hypothesized that BP winners tend to receive more nominations rather than fewer.
Let’s continue exploring this hypothesis by taking a look at another graph. This time, let’s take each of the 33 BP winners and graph its total nomination count against its total win count. This is called a scatter plot, which can help determine if there is a relationship between the two variables.
(Note, there is some overlap in the data, e.g. both Amadeus and Gandhi had 11 nominations and 8 wins, so I induced a small amount of “jitter” in the scatter plot so we can see each data point more clearly).
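For the curious, here is a minimal sketch of what "jitter" means in practice. The `jitter` helper, the noise amount of 0.15, and the fixed seed are all my own illustrative choices, not the settings used for the plot in this post.

```python
import random


def jitter(points, amount=0.15, seed=0):
    """Return a copy of `points` with small uniform noise added to each coordinate.

    `amount` and `seed` are illustrative assumptions, not values from the post.
    """
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Amadeus and Gandhi share the same (nominations, wins) pair.
overlapping = [(11, 8), (11, 8)]
jittered = jitter(overlapping)
print(jittered)
```

The two films now land on slightly different spots, while staying close enough to (11, 8) that the plot is still read correctly.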
As we can see, there is a bit of spread in the data, but there does seem to be a direct/positive correlation between total nomination count and total win count, i.e. more nominations tend to go with more wins. It’s by no means a perfectly linear relationship, but a relationship does exist to a certain extent.
Now, what if we wanted to make a formal estimate of this relationship, i.e. create a mathematical formula that spits out a guess of how many wins are expected for a BP winner, given its total number of nominations? This is where regression analysis comes into play, specifically what is known as simple linear regression.
So what is simple linear regression? Well, allow me to introduce this concept by segueing into a smaller example. Instead of looking at the past 33 BP winners, let’s look at two: Crash (6 nominations and 3 wins) and Return of the King (11 nominations and 11 wins). Let’s call the number of nominations each received a predictor variable and the number of wins the response variable (predictor and response are synonymous with independent and dependent, respectively – I’m sure many of you have heard of independent and dependent variables in your science classes). It’s very easy to determine a perfect linear relationship between nominations and wins with 2 data points, because, well, there are just two data points. Two points are all we need to draw any straight line.
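To make the two-point case concrete, here is a small sketch that computes the one straight line passing through both films. The function name `line_through` is hypothetical; the data points are the two films quoted above.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # solve y = slope * x + intercept for intercept
    return slope, intercept


crash = (6, 3)                  # (nominations, wins)
return_of_the_king = (11, 11)

slope, intercept = line_through(crash, return_of_the_king)
print(f"wins = {slope:.2f} * nominations + ({intercept:.2f})")
```

With only two points there is nothing to estimate: the line is fully pinned down, and it reproduces both films' win counts exactly.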
But now let’s add another BP winner: The Hurt Locker (9 nominations and 6 wins).
Although the direct relationship between total nominations and total wins still seems to hold, we simply can’t draw a straight line through all 3 points now. Still, a linear relationship appears to be a good estimate for all 3 points. Attempting to fit a line through the data would remain a reasonable indicator of how the two variables are related.
But how do we determine this line? Intuitively, it seems the best line would lie somewhere in the middle of the data points. Well, friends, this is *exactly* what a simple linear regression does. A simple linear regression mathematically determines a “best fit” line through the points of a data set, “best fit” meaning that this particular line minimizes the overall deviation between itself and the data (using a criterion called least squares, which minimizes the sum of the squared vertical distances between the line and each data point). In the small three data point set, the best fit (regression) line ends up looking like this:
This regression line represents what we call the expected (average) response for each value of the predictor. In other words, it indicates how many Oscars a BP winner collects, on average, based on how many nominations it has.
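Here is a rough sketch of how the least squares line for the three-film example could be computed by hand, using the textbook formulas for the slope and intercept. The helper names are my own, and this is just an illustration, not the software output used elsewhere in this post.

```python
def least_squares(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Total squared vertical distance between the line and the points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


# (nominations, wins) for Crash, Return of the King, The Hurt Locker
films = [(6, 3), (11, 11), (9, 6)]

slope, intercept = least_squares(films)
best = sum_squared_residuals(films, slope, intercept)

# Any other line, e.g. one with a slightly steeper slope, fits worse.
nudged = sum_squared_residuals(films, slope + 0.1, intercept)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"squared error: best fit = {best:.3f}, nudged line = {nudged:.3f}")
```

The comparison at the end is the whole point of "best fit": nudge the line in any direction and the total squared error goes up.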
With this toy example out of the way, let’s now build a regression model using the data from all 33 BP winners in our time frame. Again, we will use the number of nominations as a predictor variable, the number of wins as the response variable. There are formulas that allow us to determine the regression by hand if we wish, but it’s far easier to use software. My particular weapon of choice is the statistical software package STATA.
After running the regression in STATA, we obtain the following summary table:
There is a lot of technical information in this table that I won’t go over, but I do want to highlight a few things. The first thing we need to check after running any regression is if the model is any good, i.e. if a relationship actually exists between the predictor and the response. The Prob > F part of the table informs us of this by showing what is called the p-value of the overall model. I won’t go into too much detail, but the p-value indicates how likely we would be to see a relationship at least this strong purely by chance if nominations and wins were actually unrelated. If this p-value is high, then we can’t rule out chance and the relationship is considered insignificant. As we can see, the model p-value is very, very small (0.001), indicating that a significant relationship between nominations and wins does exist.
Another important feature I want to briefly explain is R-squared. As we saw in the scatter plot, the data is spread out and cannot fit on a line, but the regression estimates the best possible linear fit with the data. R-squared is a measure of goodness-of-fit, or how well the model fits with the data. The table indicates an R-squared value of 0.93, so we say that the model explains 93% of the variation in the data in our sample. In other words, the model seems to be very good at explaining how many Oscars a BP winner ends up taking home.
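To show where an R-squared number comes from, here is a sketch that computes it for the three-film toy example from earlier (not the full 33-film model). It uses the standard definition, one minus the ratio of unexplained variation to total variation, for a line fitted with an intercept. The function name is hypothetical, and software packages may use a different variant of R-squared for models fitted without an intercept.

```python
def r_squared(points):
    """R-squared of the least squares line (with intercept) through `points`."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Variation left over after fitting the line.
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    # Total variation of the response around its mean.
    ss_total = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_residual / ss_total


# (nominations, wins) for Crash, Return of the King, The Hurt Locker
films = [(6, 3), (11, 11), (9, 6)]
r2 = r_squared(films)
print(f"R-squared = {r2:.3f}")
```

On these three films the value also happens to come out at roughly 0.93, but that is a coincidence of the toy data. With so few points, R-squared says very little.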
Finally, there is the sub-table of regression estimates at the bottom. The coefficient of total_noms indicates the slope of the regression line. We interpret the slope as how the response variable changes (on average) when there is a one-unit change in the predictor. In this case, one additional Oscar nomination is estimated to increase the total number of Oscar wins by approximately 0.59 on average. We can use the information in this table to create an equation that predicts how many Oscars a BP winner is expected to win:
number of Oscar wins = 0.59 * number of Oscar nominations
So, let’s say that a BP winner had 8 nominations. The model would then estimate that the BP winner received a total of 0.59(8) = 4.72 Oscars. Now, of course a film can’t win 4.72 Oscars, so keep in mind that this is just a point estimate that has a certain amount of uncertainty attached to it (due to the specific slice of Oscar history we’re looking at). But we can use the results of our regression model to create what is called a confidence interval (CI). The CI will help show the degree of uncertainty in this estimate.
I won’t explicitly calculate it, but the 95% confidence interval for the average number of Oscar wins a BP winner with 8 nominations receives is 4.30 to 5.21. The 5 BP winners in our sample that received 8 nominations (Platoon, Rain Man, American Beauty, A Beautiful Mind, and No Country for Old Men) all have trophy hauls at or near this range: each won either 4 or 5 Oscars in total. Keep in mind that the interval describes the average haul for such films, not the haul of any individual film.
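The point estimate is easy to reproduce. Here is a minimal sketch that plugs nomination counts into the fitted equation and checks the 8-nomination estimate against the interval quoted above. The slope and interval endpoints are copied from this post; the function and constant names are my own.

```python
SLOPE = 0.59                  # coefficient of total_noms, as reported above
CI_LOW, CI_HIGH = 4.30, 5.21  # 95% CI for the mean at 8 nominations, as reported


def expected_wins(nominations, slope=SLOPE):
    """Point estimate of Oscar wins for a BP winner with this many nominations."""
    return slope * nominations


estimate = expected_wins(8)
inside = CI_LOW <= estimate <= CI_HIGH
print(f"8 nominations -> {estimate:.2f} expected wins (inside the CI: {inside})")

# The same equation applied to a range of nomination counts.
for noms in range(6, 12):
    print(f"{noms} nominations -> {expected_wins(noms):.2f} expected wins")
```

The fractional outputs are a reminder that these are averages over many hypothetical films, not forecasts of a whole number of statuettes.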
Graphically, our regression model looks like this:
where the gray band depicts the 95% confidence interval of the mean estimates. As we can see, quite a few of the data points clearly fall outside this band, especially once we move past 8 nominations. Still, the model appears to be a reasonable guess at explaining the relationship between nominations and wins within the sample. If we analyzed a different era of Oscar history, we would certainly obtain different estimates.
If we wanted to improve the model’s fit with the data, we could re-run the regression using multiple predictors, or even specify a different (i.e. non-linear) relationship between the predictor and the response. But it’s much harder to visualize what’s going on, and the underlying math gets more complex, so I’ve limited today’s introduction to regression analysis to a simple, one-variable model.
And with that, we come to the end of today’s wonky rant on statistics. TL;DR summary: regression analysis allows us to determine possible relationships among variables within data, which we can then use to predict future outcomes. It’s not without its limitations, and there are quite a few details that I left out in this post. But when used properly, regression analysis can be a very powerful tool to have. All of AD’s forecasting models have been built using a form of regression analysis.
Feel free to leave any questions or concerns in the comments below, or contact me directly at email@example.com or on Twitter at @IPreferPi314. Next week I’ll go one step further into the rabbit hole of regression by introducing the actual meat-and-potatoes basis of AD’s Oscar prediction models – logistic regression.