## A new weekly column by Marshall Flores

This Oscar season, I am excited to announce the introduction of formal predictive models on Awards Daily. These models will attempt to forecast winners in 21 categories based on past data and trends. As far as I know, this will make Awards Daily the most prominent Oscars site to utilize statistical models for predicting the Academy Awards, providing an informed alternative to Nate Silver and his future work.

I won’t claim that statistics can tell us everything we need to know about predicting the Oscars. As far as I’m concerned, predicting with stats is as much art as science, especially with the Oscars, since we’ll never have access to the best possible data for such an undertaking: AMPAS voter preferences. But I certainly do believe that statistics can tell an important part of the story. Statistics can provide empirical validation of conventional wisdom: specifically, which precursors and indicators are significant in predicting Oscar winners, and which are not. Through my analysis so far, I have found that many of the predictive rules of thumb we longtime Oscar watchers have grown up with can actually be supported by statistical evidence.

Over the next few weeks I will review some fundamental statistical concepts and methods, with the purpose of building up enough background to understand the models Awards Daily will be using and how they arrive at their predictions. I’m well aware that many of you may be math-averse and will balk at seeing a lot of symbols, terms, and graphs on a film site. But I will try my very best to keep what I write accessible to a general audience, and will certainly give relevant examples in each post to illustrate the concepts I introduce.

To begin, let’s look at some data on Oscar Best Picture (BP) winners from 1980-2012: how many nominations each BP winner received, and how many Oscars each ultimately won. A table of the past 33 BP winners:

We can use something called a *histogram* to show how nominations and wins are distributed. A histogram is a common way to show the shape of data. It sums up frequencies, that is, how often certain numerical outcomes occur. Visualizing data is an important first step before doing any type of formal statistical analysis: not only can a graph help verify that certain requirements are met before using certain statistical techniques, but it can also help in forming hypotheses that we can later test.

As you can see, the histogram is composed of a series of columns. The height of each column is the count of how often something occurred. For example, in the past 33 years there was only **one** BP winner (The Departed) that received 5 total nominations, there were **two** BP winners (Crash, Ordinary People) that received 6, and so forth. We can see that the distribution has two peaks (bimodal), indicating the most frequent outcomes: there were 5 BP winners that received 8 nominations, and another 5 that received 11 nominations.

More importantly, we can also see that the distribution of nominations is a little **left/negatively skewed**. This means that most of the data are concentrated on the right side of the graph, and the few outcomes on the left side are “dragging” (skewing) the overall shape of the distribution to the left a bit. Because of this shape, we can make the reasonable observation that BP winners tend to receive more nominations rather than fewer. This is an inference that can (and will) be tested in a future post.
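For readers who like to tinker, here is a minimal Python sketch of what a histogram does under the hood: it counts how often each value occurs and then finds the peak(s). The list `toy_nominations` is a made-up illustration, not the actual table of BP winners, and the function names are my own choices for this example.

```python
from collections import Counter


def frequency_table(values):
    """Count how often each value occurs, ordered by value."""
    counts = Counter(values)
    return dict(sorted(counts.items()))


def modes(values):
    """Return every value that shares the highest frequency (the peaks)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def text_histogram(values):
    """Draw one row per value, with one '#' per occurrence."""
    table = frequency_table(values)
    rows = []
    for v in range(min(table), max(table) + 1):
        rows.append(f"{v:>2} | " + "#" * table.get(v, 0))
    return "\n".join(rows)


# Hypothetical nomination totals, invented for illustration only.
toy_nominations = [5, 6, 6, 8, 8, 8, 9, 10, 11, 11, 11, 12]

print(text_histogram(toy_nominations))
print("peaks:", modes(toy_nominations))
```

With this toy list, two values tie for the highest count, which is what "bimodal" means: the histogram has two peaks rather than one.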

Now let’s take a look at how wins are distributed among the past 33 BP winners.

Here, the distribution of wins is quite different from the distribution of nominations. For one thing, there’s only one peak (unimodal distribution): 11 BP winners won exactly 4 Oscars over the past 33 years, so it would appear that winning 4 Oscars is the most frequent outcome, occurring about one-third of the time. Since most of the data are now clustered to the left, we say that the distribution is **right/positively skewed**. As a result, we can infer that AMPAS tends to like spreading the wealth, and that big sweeps of the Return of the King or Slumdog Millionaire sort are rare.
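Skew can also be put into a single number instead of being read off a graph. The sketch below computes one common measure, the moment coefficient of skewness (the average cubed deviation from the mean, divided by the standard deviation cubed). A positive result suggests a tail to the right and a negative result a tail to the left. The list `toy_wins` is again invented for illustration and is not the real win totals.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical win totals, invented for illustration only:
# most values sit low, with a few large "sweeps" on the right.
toy_wins = [2, 3, 3, 4, 4, 4, 4, 5, 5, 7, 8, 11]

# Mirroring the data flips the direction of the tail, and so the sign.
mirrored = [-x for x in toy_wins]

print("toy wins skewness:", round(skewness(toy_wins), 2))
print("mirrored skewness:", round(skewness(mirrored), 2))
```

The sign matters more than the exact value here: a handful of big sweeps pulls the tail to the right and makes the number positive, while a perfectly symmetric set of values gives zero.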

That’s all for now. Next time I’ll go deeper into the statistical rabbit hole by introducing regression analysis.