# Statsgasm – Week Two – Regression!

# by Marshall Flores

Welcome back for Part 2 of Awards Daily’s Statsgasm – a miniseries of posts leading up to the unveiling of AD’s first generation of Oscar forecasting models in January. Last week in the pilot episode, I began our descent into stats mania by showing how useful it is to visualize historical data in order to spot past trends. Specifically, we analyzed the past 33 Best Picture winners (1980-2012) – how many nominations each received, and how many Oscars each won.

We’re going to revisit this data in order to introduce a very powerful statistical tool that allows us to estimate relationships between two (or more) variables – regression analysis.

First, let’s briefly review how nominations were distributed among the past 33 BP winners:

From this histogram, we determined that the distribution was negatively skewed, i.e. more of the data is clustered on the right. Hence, we hypothesized that BP winners tend to receive many nominations rather than few.

Let’s continue exploring this hypothesis by taking a look at another graph. This time, let’s take each of the 33 BP winners and graph its total nomination count against its total win count. The result is a scatter plot, which can help us determine whether there is a relationship between the two variables.

(Note: there is some overlap in the data – e.g. both Amadeus and Gandhi had 11 nominations and 8 wins – so I added a small amount of “jitter” to the scatter plot so we can see each data point more clearly.)
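Jitter is nothing more than a tiny random offset added to each point so that overlapping films become distinguishable. A minimal sketch of the idea in Python (the article’s own plot was presumably jittered in Stata; the offset size of 0.15 is my own arbitrary choice):

```python
import random

random.seed(42)  # reproducible jitter

# Two BP winners with identical (nominations, wins) coordinates:
points = [("Amadeus", 11, 8), ("Gandhi", 11, 8)]

# Nudge each coordinate by a small uniform random amount so the
# two films no longer plot on top of each other.
jittered = [
    (name, noms + random.uniform(-0.15, 0.15), wins + random.uniform(-0.15, 0.15))
    for name, noms, wins in points
]

for name, jx, jy in jittered:
    print(f"{name}: ({jx:.2f}, {jy:.2f})")
```

The jitter only affects how the points are drawn; any analysis should of course use the original, un-jittered values.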

As we can see, there is a bit of spread in the data, but there does seem to be a direct/positive correlation between total nomination count and total win count, i.e. more nominations tend to lead to more wins. It’s by no means a perfectly linear relationship, but a relationship does exist to a certain extent.

Now, what if we wanted to make a formal estimate of this relationship, i.e. create a mathematical formula that spits out a guess of how many wins are expected for a BP winner, given its total number of nominations? This is where regression analysis comes into play – specifically, what is known as simple linear regression.

So what is simple linear regression? Well, allow me to introduce this concept by segueing into a smaller example. Instead of looking at the past 33 BP winners, let’s look at two: Crash (6 nominations and 3 wins) and Return of the King (11 nominations and 11 wins). Let’s call the number of nominations each received a predictor variable and the number of wins the response variable (predictor and response are synonymous with independent and dependent, respectively – I’m sure many of you have heard of independent and dependent variables in your science classes). It’s very easy to determine a perfect linear relationship between nominations and wins with 2 data points, because, well, there are just two data points. Two points are all we need to draw any straight line.
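With exactly two points, the slope and intercept of that line follow directly from rise-over-run. A quick Python sketch of the arithmetic (a stand-in for doing it on paper, not anything from the article’s Stata workflow):

```python
# Two BP winners: (nominations, wins)
crash = (6, 3)
return_of_the_king = (11, 11)

x1, y1 = crash
x2, y2 = return_of_the_king

# Slope: change in wins per additional nomination (rise over run).
slope = (y2 - y1) / (x2 - x1)
# Intercept: where the line crosses nominations = 0.
intercept = y1 - slope * x1

print(f"wins = {slope:.2f} * nominations + ({intercept:.2f})")
# → wins = 1.60 * nominations + (-6.60)
```

The negative intercept is a reminder that a fitted line is only meaningful over the range of the data – no film wins a negative number of Oscars at zero nominations.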

But now let’s add another BP winner: The Hurt Locker (9 nominations and 6 wins).

Although the direct relationship between total nominations and total wins still seems to hold, we simply can’t draw a straight line through all 3 points now. Still, a linear relationship appears to be a good estimate for all 3 points. Attempting to fit a line through the data would remain a reasonable indicator of how the two variables are related.

But how do we determine this line? Intuitively, it seems the best line would lie somewhere in the middle of the data points. Well, friends, this is *exactly* what a simple linear regression does. A simple linear regression mathematically determines a “best fit” line through the points of a data set, “best fit” meaning that this particular line minimizes the overall deviation between itself and the data (using a criterion called least squares). For our small three-point data set, the best fit (regression) line ends up looking like this:
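The least-squares slope and intercept can be computed by hand from a handful of sums. Here is a small Python sketch for the three-film example; the fitted numbers it prints are my own calculation, not figures from the article:

```python
# (nominations, wins) for Crash, The Hurt Locker, Return of the King
x = [6, 9, 11]
y = [3, 6, 11]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares: slope = Sxy / Sxx, where Sxy and Sxx measure how the
# variables co-vary and how spread out the predictor is.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)

slope = sxy / sxx
intercept = mean_y - slope * mean_x  # the line passes through the means

print(f"slope ≈ {slope:.3f}, intercept ≈ {intercept:.3f}")
# → slope ≈ 1.553, intercept ≈ -6.789
```

By construction, this line runs through the point of means, right in the middle of the data – exactly the intuition above.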

This regression line represents what we call the expected mean (average) response between the two variables. In other words, it indicates how many Oscars a BP winner collects, on average, based on how many nominations it has.

With this toy example out of the way, let’s now build a regression model using the data from all 33 BP winners in our time frame. Again, we will use the number of nominations as the predictor variable and the number of wins as the response variable. There are formulas that allow us to determine the regression by hand if we wish, but it’s far easier to use software. My particular weapon of choice is the statistical software package Stata.

After running the regression in Stata, we obtain the following summary table:

There is a lot of technical information in this table that I won’t go over, but I do want to highlight a few things. The first thing we need to check after running any regression is whether the model is any good, i.e. whether a relationship actually exists between the predictor and the response. The Prob > F part of the table informs us of this by showing what is called the p-value of the overall model. I won’t go into too much detail, but roughly speaking, the p-value indicates the probability of seeing a relationship this strong by pure chance if none actually existed. If this p-value is high, then the relationship is insignificant. As we can see, the model p-value is very, very small (0.001), indicating that a significant relationship between nominations and wins does exist.
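For a simple linear regression, the overall model test is equivalent to testing the slope: divide the fitted slope by its standard error to get a t statistic, then ask how extreme that statistic is. As a hedged sketch, we can run the whole computation on the three-film toy data without any stats library, because with one residual degree of freedom the t distribution is the standard Cauchy, whose CDF has the closed form 1/2 + arctan(t)/π. These numbers are my own toy calculation, not the article’s 33-film result:

```python
import math

x = [6, 9, 11]  # nominations: Crash, The Hurt Locker, Return of the King
y = [3, 6, 11]  # wins
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual variance, with n - 2 degrees of freedom.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s2 = sum(r * r for r in residuals) / (n - 2)

# t statistic for the slope: estimate divided by its standard error.
se_slope = math.sqrt(s2 / sxx)
t = slope / se_slope

# With df = 1 the t distribution is the standard Cauchy, so the
# two-sided p-value has a closed form.
p_value = 2 * (0.5 - math.atan(abs(t)) / math.pi)

print(f"t ≈ {t:.2f}, p ≈ {p_value:.2f}")
# → t ≈ 3.78, p ≈ 0.16
```

Notice that three data points are too few for significance at the usual 0.05 level, even though the fit looks great. With 33 films, the same test produces the tiny p-value in the Stata table.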

Another important feature I want to briefly explain is R-squared. As we saw in the scatter plot, the data is spread out and cannot fit on a single line, but the regression estimates the best possible linear fit to the data. R-squared is a measure of goodness-of-fit, i.e. how well the model fits the data. The table indicates an R-squared value of 0.93, so we say that the model explains 93% of the variation in the data in our sample. In other words, the model seems to be very good at explaining how many Oscars a BP winner ends up taking home.
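Concretely, R-squared compares the scatter of the residuals around the fitted line to the scatter of the wins around their plain average. A sketch using the same three-film toy data (by coincidence its R-squared also lands near 0.93; the article’s 0.93 comes from all 33 winners):

```python
x = [6, 9, 11]  # nominations: Crash, The Hurt Locker, Return of the King
y = [3, 6, 11]  # wins
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual sum of squares: scatter left over after fitting the line.
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
# Total sum of squares: scatter of the wins around their plain mean.
ss_tot = sum((yi - mean_y) ** 2 for yi in y)

r_squared = 1 - ss_res / ss_tot
print(f"R-squared ≈ {r_squared:.3f}")
# → R-squared ≈ 0.935
```

An R-squared of 1 would mean every point sits exactly on the line; 0 would mean the line explains nothing beyond the plain average.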

Finally, there is the sub-table of regression estimates on the bottom. The coefficient of total_noms indicates the slope of the regression line. We interpret the slope as how the response variable changes (on average) when there is a one-unit change in the predictor. In this case, one additional Oscar nomination is estimated to increase the total number of Oscar wins by approximately 0.59. We can use the information in this table to create an equation that predicts how many Oscars a BP winner is expected to win:

number of Oscar wins = 0.59 * number of Oscar nominations
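In code, this fitted equation is just a one-line function (the helper name here is my own, not anything from the article’s Stata output):

```python
def predicted_wins(nominations: int) -> float:
    """Expected Oscar wins for a BP winner, using the article's fitted slope of 0.59."""
    return 0.59 * nominations

print(predicted_wins(8))  # → 4.72
```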

So, let’s say that a BP winner had 8 nominations. The model would then estimate that the BP winner received a total of 0.59(8) = 4.72 Oscars. Now, of course a film can’t win 4.72 Oscars, so keep in mind that this is just a point estimate that carries a certain amount of uncertainty (due to the specific slice of Oscar history we’re looking at). But we can use the results of our regression model to create what is called a confidence interval (CI). The CI will help show the degree of uncertainty in this estimate.

I won’t explicitly calculate it, but the 95% confidence interval for the average number of Oscar wins a BP winner with 8 nominations receives is 4.30 to 5.21. (Strictly speaking, this interval is for the average number of wins, not any individual film’s haul.) Fittingly, the 5 BP winners in our sample that received 8 nominations (Platoon, Rain Man, American Beauty, A Beautiful Mind, and No Country for Old Men) all have trophy hauls right around this range – each won either 4 or 5 Oscars in total.
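The CI for the mean response combines the residual standard error with how far the chosen nomination count sits from the average. Here is a hedged sketch on the three-film toy data; the hardcoded critical value 12.706 is the 97.5th percentile of the t distribution with 1 degree of freedom (with the article’s 33 films it would be roughly 2.04, which is why its band is so much tighter). All numbers below are my own toy calculation, not the article’s:

```python
import math

x = [6, 9, 11]  # nominations: Crash, The Hurt Locker, Return of the King
y = [3, 6, 11]  # wins
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
intercept = mean_y - slope * mean_x

ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(ss_res / (n - 2))  # residual standard error

x0 = 9                           # nomination count of interest
y_hat = intercept + slope * x0   # point estimate of mean wins at x0

# Standard error of the *mean* response at x0: shrinks with more data,
# grows as x0 moves away from the average nomination count.
se_mean = s * math.sqrt(1 / n + (x0 - mean_x) ** 2 / sxx)

t_crit = 12.706                  # 97.5th percentile of t with n - 2 = 1 df
lower, upper = y_hat - t_crit * se_mean, y_hat + t_crit * se_mean
print(f"95% CI for mean wins at {x0} noms: ({lower:.2f}, {upper:.2f})")
```

With only three points the interval comes out enormous – another reminder that the 33-film sample is doing the real work in the article’s much tighter 4.30-to-5.21 band.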

Graphically, our regression model looks like this:

where the gray band depicts the 95% confidence interval of the mean estimates. As we can see, quite a few of the data points clearly fall outside this band, especially once we move past 8 nominations. Still, the model appears to be a reasonable guess at explaining the relationship between nominations and wins within the sample. If we analyzed a different era of Oscar history, we would certainly obtain different estimates.

If we wanted to improve the model’s fit to the data, we could re-run the regression using multiple predictors, or even specify a different (i.e. non-linear) relationship between the predictor and the response. But it’s much harder to visualize what’s going on, and the underlying math gets more complex, so I’ve limited today’s introduction to regression analysis to a simple, one-variable model.

And with that, we come to the end of today’s wonky rant on statistics. TL;DR summary: regression analysis allows us to determine possible relationships among variables within data, which we can then use to predict future outcomes. It’s not without its limitations, and there are quite a few details that I left out in this post. But when used properly, regression analysis can be a very powerful tool to have. All of AD’s forecasting models have been built using a form of regression analysis.

Feel free to leave any questions or concerns in the comments below, or contact me directly at marshall.flores@gmail.com or on Twitter at @IPreferPi314. Next week I’ll go one step further into the rabbit hole of regression by introducing the actual meat-and-potatoes basis of AD’s Oscar prediction models – logistic regression.

I love this and I almost did it for my college thesis.

Thanks, Zach! “Almost” did it?

Slumdog Millionaire … whatever … still pissed

Yes, I’m afraid the idea fascinated me, but outside of this site, not many people would’ve wanted to read it! But that’s OK, we know it means something. Please, if you have the time and patience, do a Best Picture regression and include variables for nominations in all the other major categories, as well as for things like whether a film has the most nominations, prior BP awards, etc. Too fun to see it play out!

Zach, I have already built a BP model and as I indicated in the comments of my pilot episode, I’m pretty confident in its ability.

As I indicated in the pilot post, I have already built models for 21 of the categories – everything except for the 3 shorts categories. I do plan to introduce one of them in full come January to show everyone how the models work in general. Right now I’m using the first few posts in this series to demonstrate the underlying processes instead of just treating regression analysis as a magical mystery box that is beyond everyone’s comprehension. (“A boat is a boat, but the mystery box could be anything! It could even be a boat!”)

But yeah, everything is set. AD’s models just need to be fed the results of certain events in this year’s Oscar gauntlet. When I have something to report, you’ll see it here at AD.

Oh, I missed Part 1. When we get there, I would love to see what it said (or now says) about Argo, Crash, and Million Dollar Baby. I have a feeling this year will be pretty predictable once the nominees are set and the Globe and SAG winners come in.

Part 1 is available here: http://www.awardsdaily.com/blog/pilot-episode-of-awardsdailys-statsgasm/

I’m not going to post the full BP model and all 8 of its predictors, but I am willing to talk about aspects of it. It does correctly predict every single BP winner in the past 30 years, including Argo, Crash, Million Dollar Baby, Shakespeare in Love etc.

Interesting that a BP winner almost must win a certain amount of other categories. I remember a certain resistance to that idea here last year as Sasha confidently predicted that it would have to win Writing and Editing so that BP wasn’t its only win. I think all four of the other adapted screenplays were far more popular on this site, many preferred the more creative editing of Life of Pi. (Nice twitter handle Marshall! WTF does it mean? Let me say that I was born on March 14, and I prefer Life of Pi.) but Sasha was right.

These statistics speak not only to the tendency of the Big Tech winner that isn’t as heady as Matrix – say Titanic, Gladiator, Gandhi, etc, but also to what we could call the Argo tendency – the voters don’t want to send a film to the winners circle all naked.

Hoo boy, there was a lot of resistance to that last season. AMPAS may change and can/will act unpredictably. But on the whole, they have exhibited clear tendencies – the fact that *every* BP winner since Rebecca has won at least 3 Oscars and also at least took home one of director/screenplay/acting in addition to BP could not be any clearer. Sasha to her great credit stuck to her guns and was ultimately vindicated on that point.

Hahaha. You’re on the right track. IPreferPi is also a palindrome!

My head just exploded

Speaking as a college math teacher who used to teach introductory statistics (linear regression included), fine job explaining this for a general audience!

Hi Dan, thanks! Much appreciated! I’ve been a teaching assistant for introductory statistics courses at the undergraduate level, so that has definitely helped. Although I’m not really sure how my stats professors would ultimately react to my work, haha.

Thanks Marshall! Fascinating, even for this math-challenged guy :).

If Titanic and LOTR won 11 oscars each including Best Picture, then I think Gravity still has a chance to SWEEP this year’s awards. Silly script and all, Gravity is SO much better than those two together. #teamgravity