One of the most important steps in building a statistical model is deciding which data to include. With very large datasets and models that have a high computational cost, impressive efficiency can be realized by identifying the most (and least) useful features of a dataset prior to running a model. Feature selection is the process of identifying the features in a dataset that actually have an influence on the dependent variable.
High dimensionality of the explanatory variables can cause both high computation times and a risk of overfitting the data. Moreover, it’s difficult to interpret models with a high number of features. Ideally we would be able to select the significant features before performing statistical modeling. This reduces training time and makes it easier to interpret the results.
Some techniques to address the “curse of dimensionality” take the approach of creating new variables in a lower-dimensional space, such as Principal Component Analysis (Pearson 1901) or Singular Value Decomposition (Eckart and Young 1936). While these may be easier to run and more predictive than an un-transformed set of predictors, they can be very hard to interpret.
We’d rather—if possible—select from the original predictors, but only those that have an impact. There are a few sophisticated feature selection algorithms such as Boruta (Kursa and Rudnicki 2010), genetic algorithms (Kuhn and Johnson 2013, Aziz et al. 2013) or simulated annealing techniques (Khachaturyan, Semenovsovskaya, and Vainshtein 1981) which are well known but still have a very high computational cost — sometimes measured in days as the dataset multiplies in scale by the hour.
As genuinely curious, investigative minds, we wanted to explore how one of these methods, the Boruta algorithm, performed. Overall, we found that for small datasets, it is a very intuitive and beneficial method to model high dimensional data. Below follows a summary of our approach.
Why such a strange name?
Boruta comes from the mythological Slavic figure that embodies the spirit of the forest. In that spirit, the Boruta R package is based on ranger, which is a fast implementation of the random forests classification method.
How does it work?
We assume you have some knowledge of how Random Forests work—if not, this may be tough.
Let’s assume you have a target vector T (what you care about predicting) and a bunch of predictors P.
The Boruta algorithm starts by duplicating every variable in P—but instead of making a row-for-row copy, it permutes the order of the values in each column. So, in the copied columns (let’s call them P’), there should be no relationship then between the values and the target vector.
Boruta then trains a Random Forest to predict T based on P and P’.
The algorithm then compares the variable importance scores for each variable in P with it’s “shadow” in P’. If the distribution of variable importances is significantly greater in P than it is in P’, then the Boruta algorithm considers that variable significant.
The dataset of interest here were records of doctors’ appointments for insurance-related matters, and the target variable of interest was whether or not the patient showed up for their appointment. Part of our task was to find the most significant interactions, and with fifty jurisdictions and thirty doctor specialties, we already have a space of 1,500 potential interactions to search through—not including many other variables.
The set of features can be visualized by creating a set of boxplots for the variable importances for each potential feature.
The three red boxplots represent the distribution of minimum, mean and maximum scores of the randomly duplicated “shadow” variables. This is basically the range of variable importances that can be achieved through chance.
The blue bars are features that performed worse than the best “shadow” variables and should not be included in the model. Purple bars are features that have the same explanatory power as the best “shadow” variable, and its use in the model is up to the discretion of the analyst. The green bars are variables with importances higher than the maximum “shadow” variable — and are therefore good predictors to include in a future classification model.
model.matrix( ~ 0 + `Doctor Specialty` + `Business Line` +
Jurisdiction, data = show_df, contrasts.arg = lapply( show_df[, c('Doctor Specialty', 'Business Line', 'Jurisdiction')], contrasts, contrasts = FALSE ) )show_mm_st <- cbind(status = show_df$`Appt Status`, show_mm)show_mdf <- as.data.frame(show_mm_st)library(Boruta)b_model <- Boruta(status ~ ., data = show_mdf)cat(getSelectedAttributes(b_model), sep = "\n")
# Doctor SpecialtyChiropractic Medicine
# Doctor SpecialtyNeurology
# Doctor SpecialtyNurse
# Doctor SpecialtyOrthopaedic Surgery
# Doctor SpecialtyOther
# Doctor SpecialtyRadiology
# Business LineDisability
# Business LineFirst Party Auto
# Business LineLiability
# Business LineOther
# Business LineThird Party Auto
# Business LineWorkers Comp
## Importance plot plot(b_model, las =2, cex.axis=0.75)
Businesses often want to better understand their customers by segmenting them along a common set of attributes. In a previous post, we explored how to build segments based on customers’ trajectories of interactions with a brand. In this post, we’ll show how to build segments based purely on the products that customers have purchased; this approach has the added bonus of not just segmenting your customers, but your products too! And this isn’t just a strategic intelligence tool—finding groups of similar customers can lead to advanced recommendations systems that are personalized for each one of your customers.
How one can find groups of similar customers that purchase similar products? How do we define what “similar” means in an assortment of thousands (or tens of thousands) of SKUs? Flip the question on its head: how do we find products that are purchased by similar customers? How do we define similarity between customers?
As you’ll see—asking either one of those questions individually is a bit like asking which blade of the scissors cuts the paper. In a product segmentation, customers are said to be similar when they purchase from the same set of products; and products are similar when they are purchased by the same set of customers.
Ok—let’s dive into how it’s done.
We start with what we’re trying to explain, which is our observed customer-by-product matrix:
To keep things really simple, we’re going to put a 1 in the cell if the customer has ever purchased that product, and 0 if they never have.
Again, let’s remember that we’re trying to explain this data in terms of customers (rows) and products (columns)—sounds like we’re trying to split this matrix in two! In fact we are—the tool that we use to explain our observed data is called non-negative matrix factorization. It is a group of algorithms that simplify the original matrix of data (let’s call it V) by 2 other matrices (W and H), which, when multiplied together, come to (approximately) the original matrix.
So, let’s say you have 10,000 customers and sell 1,000 products. Your customer-by-product matrix is going to be 10,000 rows and 1,000 columns. But, you could factor this matrix into a:
10,000 row by 2[†] column matrix (W), and a
2[†] row by 1,000 column matrix (H)
†2 is arbitrary—but is typically determined by trying a number of different options
This would mean that for each customer, you have two pieces of information that tell you what kinds of products they purchase (instead of 1,000); and for each product, you have two pieces of information that tell you which kinds of customers purchase them.
This approach can be thought of as a multidimensional scaling algorithm that has better features than Principal Component Analysis as it is well defined for non-negative values in the data (and counts of purchases are always non-negative).
Working backwards, if you work out the sums, a single cell of the matrix that you arrive at when you multiply the customer- and product-segment matrices together is:
CS1 * PS1 + CS2 * PS2 + … + CS5 * PS5
Where CS1 is that customer’s score for segment 1, and PS1 is that product’s score for being in segment 1—and so on through to segment 5. So if a customer and product have high scores for the same segments, then our factorization is implying that this cell in the customer-by-product matrix has a high value.
Visually, the results can be shown in the form of a heatmap, which shows each customer’s score by each segment (the charts below use the word “basis” in lieu of segment).
The dark entries (in column 2, for example) mean that those customers preferentially buy products from segment 2. And which products are those? Well, we have a corresponding heatmap for our product segments, that looks like this:
Now we have the two pieces of information together—the customers that tend to purchase similar products, and the products that tend to be purchased by similar customers.
With this information in hand, you can start using these scores as the basis for strategic decisions and marketing enhancements, like:
Developing product-based customer segments to build need-based personas
Deciding which products should be offered together as a bundle
Building a product recommendation engine that uses a customer’s segment to determine which products should be merchandised
Working through such an analysis, a common challenge is having a very sparse dataset. Typically there are many products, and customers tend to only purchase a few of them. This can usually be addressed by applying some expert knowledge to develop a hierarchy of information about the product—from brand, to style, to size (or SKU), and choosing the appropriate level of the hierarchy to use.
In addition, non-negative matrix factorization takes a number of options and parameters—there a number of different algorithms and choices to make for each. One needs to determine which loss function to use and how to determine the starting state of the matrix in the estimation process.
Not to mention, you have to choose the number of segments for your analysis—this is typically done by trying a range of possible segments and comparing how well they explain the data (by comparing their errors) and how well they perform for you, the analyst. Too many segments, and the information is hard to digest; too few, and you are not explaining the data well.
Finally, the computations are expensive and tricky. We use the NMF library in R, which is as performant and flexible as they come—but even still we often encounter tricky and hard-to-diagnose errors all the time.
Segmentation is the art understanding your customer-base so that you may better serve them. It comes from an understanding that customers are diverse, and to serve a large market effectively, you must understand how customers differ —and which differences are important.
One thing our clients often want to understand about their customer-base is the customer journey: all the phases and key points of a customer’s lifecycle of interactions with the brand.
For example, some customers may start in-store and move exclusively online; others may only shop on mobile; still others may be truly omni-channel and buy online when they’re at work, in-store when they’re in the area, and on mobile when they’re on their commute. Some may become more loyal after redeeming a discount; others less so—and so on. But if you have thousands (or millions) of customers, how can you understand the behavior patterns across your customer base?
How can we do this? Sequence analysis—which, at its core, is analyzing a large set of sequences (one from each customer) and drawing meaningful statements about which sequences are similar or different from each other.
Wherever I travel, I ask myself if I could live wherever I’ve found myself. Is the food good? How are the parks? Are the people friendly? And most importantly, can I find a job I like and a decent place to live? Now that we are in prime summer vacation season, perhaps other Americans are asking the same question at their holiday destinations.
We’re curious where and why Americans are packing up and moving. With some creative number crunching with Census Bureau migration data and beautiful mapping applications we built a few tools to visualize migration patterns.
First, we tackle the simple matter of showing which cities are attracting more people than they are losing. Because cities are complex creatures and have an influence far beyond “downtown”, we used migration data from each of the 374 metropolitan areas (called “MSAs”) across the country to calculate the number of people that moved in and out to arrive at a net migration value.
The map below displays net domestic migration for each metro area. There are clear source and destination clusters. Check out the the industrial Midwest and Northeast — massive out migration. Meanwhile, parts of the South, Texas and Pacific Northwest are attracting many new residents from other parts of the country. Don’t forget about Puerto Rico — the debt crisis and economic malaise are driving a lot of Puerto Ricans to the mainland.
What surprises do you see? Columbus, Ohio and Des Moines, Iowa were surely not on my up-and-coming radar although both have large, well-regarded public universities that are attracting the region’s best students and, critically, providing opportunities and amenities to convince them to stay after graduation.
To me, the biggest surprise is the decline in large coastal cities in the Northeast and California. As a resident of Philadelphia and a frequent visitor to cities up and down the Northeast Corridor, LA and SF, the energy on the streets and sky-high housing costs seem to indicate an unstoppable resurgence.
So why are more people leaving large, established cities than are moving in? Well, one hypothesis could be that the sky-high prices and endless energy are deterring more people than they attract. Perhaps people who are moving to cities are wealthier and live in less-dense housing than residents who are leaving. This is definitely a question we will attempt to answer using predictive models and algorithms in the next phase of the project. (We’re really excited about this part!)
Having established which metro areas are growing or shrinking, we really wanted to know: “Ok, so where are Americans moving from?” and the inverse: “Where are Americans moving to?” The interactive map below shows exactly that. Pick your favorite (or least) metro area and select either Incoming or Outgoing migration change. Any surprises?
Most metro areas aren’t attracting too many people from far away. A long-distance move is certainly not the norm. There are a few exceptions, however. Take a look at the incoming migration to Atlanta — thousands of people are fleeing NYC, DC, LA, SF, and Chicago for Atlanta.
This is just the beginning of our look into Americans’ migration patterns. Over the next few weeks we will be modeling the drivers to moving and embark on a deep-dive into one metro area to identify prime business opportunities using a site suitability analysis.
So while enjoying your summer vacation be sure to ask yourself if it’s somewhere you can live. You might end up moving there.
If you want to look under the hood, the source code is available here.
Are honest and integral to a fault: if we say something will be done, then it will be done. We conform our words to reality (honesty) and reality to our words (integrity).
Do more with less: We are frugal and look for ways to avoid spending time and money when it is not needed.
Think win-win: Success is not zero-sum. Our clients, partners, and vendors’ success is our success. We build credible, reliable, and honest relationships with every client, partner, and ally.
Prove themselves wrong: We seek diverse perspectives, look for alternative hypotheses, investigate the details, and stress-test our analyses to ensure that we are right. We never assume we are right.
Are obsessed with constant improvement: Individually, we are always looking to learn new techniques and develop valuable skills sets. We proactively seek feedback to improve our collective performance.
Collaborate and communicate extremely well: We value team contribution over individual contribution. We are excellent team players that go the extra mile to make it easy to work with others.
Deliver results, not work: We don’t value work, we value results. We are always moving toward delivering value to our clients.
Take care of each other: We care for each other’s well-being and celebrate alternative perspectives.
Self-Manage: We take ownership of our work by prioritizing and organizing effectively with our colleagues while acting on behalf of the entire company.
Investigate deeply: We are never satisfied with the first layer of understanding, or fixing symptoms instead of underlying causes. We fix problems so they stay fixed.
Are ambitious risk takers: We push the definition of normal by moving fast and pursuing new, unconventional solutions.
Christie’s Auction House is a landmark institution in New York City and across influential capital cities. Every few weeks, in the run-up to one of their famous auctions, you can waltz right in and enjoy museum quality artwork enjoying a brief public interlude before it disappears back into private hands. Only a few weeks ago, I was in the midst of Lichtensteins, Chagalls, and Basquiats on their two week vacation in the public eye. In these halls, listen carefully and you can hear the whispers of brokers advising their wealthy clients — which more often than not contains an evaluation of the direction of the art market. But how good are these predictions? Can we do better with data?
One thing we love to do at Gradient is to push our skills with novel, underutilized datasets. A few weeks ago, we got the bright idea of applying statistical techniques to an unlikely recipient of quantitative techniques: fine arts. The proof-of-concept project outlined below presented an opportunity to exercise a few core competencies of intense interest to our clients: assembling bespoke datasets through web scraping and building statistical models with unstructured or semi-structured data. From this proof-of-concept, we also uncovered a valuable case study in the value of regularization.
Assembling the Dataset
Christie’s conveniently hosts an online database of the results of their auctions. A typical page looks something like this:
With some inspection of the page’s source code, we can see how the data is organized:
With tools like R’s rvest, we can automate the scraping of data from auctions and specific lots (sales) and begin to assemble a massive dataset through an automated procedure. Gone are the days of copy-and-paste and manual data entry. For each sale, we collected the following data:
The title of the work
The realized price
The estimated pre-sale price range
An essay describing the work
And details on the work’s provenance
As is typical for projects like these, the data is extraordinarily messy — with many fields missing for many entries. A typical entry looks something like this:
$ lot : chr "802"
$ title : chr "A SMALL GREYISH-GREEN JADE 'BUFFALO’"
$ subtitle : chr "LATE SHANG-EARLY WESTERN ZHOU DYNASTY, 12TH-10TH CENTURY BC"
$ price_realised: int 15000
$ low_est : int 4000
$ high_est : int 6000
$ description : chr "A SMALL GREYISH-GREEN JADE 'BUFFALO’\r\nLATE SHANG-EARLY WESTERN ZHOU DYNASTY, 12TH-10TH CENTURY BC\r\nPossibly a necklace clos"| __truncated__
$ essay : chr "Compare the similar jade water buffalo carved in flat relief and dated to the Shang dynasty in the Mrs. Edward Sonnenschein Col"| __truncated__
$ details : chr "Provenance\r\n The Erwin Harris Collection, Miami, Florida, by 1995."
$ saleid : chr "12176"
We downloaded every lot sale from 2017 — a set of 11,577 observations.
Setting up the model
To start with, we needed to simplify the dataset into a target vector and a set of predictors. We are interested in predicting the actual price — but what price exactly? Since Christie’s supplies an estimated range, we decided we needed to “back out” the information already contained in the estimate. To control for the effect of scale, we used the ratio of the actual price to the upper bound of the estimated range. Since the ratio was not normally distributed, we had to employ a Box-Cox transformation to this vector to normalize it
For predictors, we decided to use the abundance of text contained in the dataset. To add some structure to the text, we tokenized the “bag of words” for each sale item, and included only words that were used between 50 and 200 times. This type of dataset is standard in text mining approaches and is called a document-term matrix, where each “document” is a row, and each possible “term” — typically, a stemmed word — is a column, with the number of times that term appears in a given document in the respective cell.
A naïve approach
Our first model was a simple linear regression with the document-term matrix as the set of predictors and our Box-Cox-transformed price-to-estimate ratio as our target vector. What did we get? A data scientist’s dream come true!
Residual standard error: 0.09126 on 8755 degrees of freedom
Multiple R-squared: 0.967, Adjusted R-squared: 0.9613
F-statistic: 168.4 on 1525 and 8755 DF, p-value: < 2.2e-16
You see that R-squared? 0.96!!
Any good analyst worth their salt will throw up an eyebrow at this result, and an inspection of diagnostic plots starts exposes more cracks in this model:
The model underestimates low ratios:
And residuals do not follow a normal distribution:
And let’s come back to that 0.96 R-squared! This is obviously a case of overfitting — in real life we should not expect to be able to predict the actual price of a sale with that kind of accuracy. If we tested this model on data that we did not use to train the model, would we really expect to get it this right?
In addition, this kind of naïve model gives us almost no insight into what words are significant predictors of actual sale price. With 1,526 predictors, we’d have a lot of data to sort through even after we’ve run the analysis.
What’s the solution to all of these issues? Regularization!
A more sophisticated model — L1-regularized regression with cross validation
We love the L1-norm (or LASSO) for regularizing our regression models. In addition to regularizing the model by ensuring that it is applicable to data the model has not yet seen, it also helps make sense of very “wide” datasets — like those with over 1,500 predictors — by shining a light on only those that have a significant impact. By imposing an extra penalty on coefficients, it zeros out coefficients that have no significant impact and restricts the size of those that are non-zero.
Using cross-validation that repeatedly trains a model on sample of a databse and testing it on the held-out portion, we can tune the regression model to pick the exact combination of predictors that maximizes the penalized fit.
Although this is a busy graph (we did have 1,500 predictors after all), this shows that as we increase the penalization (lambda), more and more coefficients shrink and ultimately become zero. At the value of lambda that we selected through cross validation, there are roughly 40 words that actually have some predictive power. Some predict a price higher than the actual, and some predict a lower price.
Positive indicators. If you see these words in the description, bid high.
Oh, and what was our R-squared? Nothing close to 0.96! This model had a pseudo R-squared of around 0.14. Smaller? Yes, but certainly more reflective of the predictive power of the words in the objects’ descriptions.
So, is 0.14 good or bad? Depends on your perspective! In financial markets an R-squared even ever so slightly above zero has huge value, as any kind of edge can make you millions. This kind of model probably could not be used to reliably inform purchase decisions at auctions, but it certainly raises good questions for the astute art observer: why does the word “warehouse” int he description predict a higher than estimated value? Are watercolours disrespected by the experts but then preferred by buyers? Any discerning buyer should be asking these questions.
One step further — image tagging with computer vision
In addition to price data and a text description of the item, we wanted to see how helpful the photos of each item could be in predicting sales price. We have been wanting to try the Microsoft Azure Computer Vision API, so we sent every image file to the service and it returned a set of tags for each photo. The most popular tags are shown below:
We then built a number of binary classifiers to predict whether or not the object exceeded the high end of its predicted price range. There were 365 tags that appeared for at least two images — these tags were used as the predictors in building a classifier.
We divided items into 2 groups:
“1” — received higher than the estimated price
“0” — received lower than the estimated price
We plot the most popular tags across groups. We were surprised that the counts for the group that underperformed were so low, and that some fairly common words, like “vase”, “small”, and “group” appeared only in group 1. (A puzzle, for sure!).
We built three types of modern classifiers on the full set of tags: a Random Forest, a Neural Network, and a Gradient Boosted Tree. None of them performed spectacularly (the max AUC was 0.5744), but we were surprised that there was any signal in this data at all! We would have thought that all of this data would have been completely incorporated by the specialists at Christie’s in their estimate. Here are the three respective ROC curves detailing the performance of the classifiers:
So much more could be done to improve these models. Certainly we could go much further than the works sold in 2017. We could look for lots sold in more than one auction and build a model to account for changes in price over time. We could build a regression to isolate the performance of certain auction locations — like New York, London, and Hong Kong, or test performance measures between live and online auctions.
As a short internal project, this was really fun, and shows what the Gradient team can do with a novel source of data in a few days. Like what you see? Get in touch.
As Gradient’s newest employee, I’m thrilled to be working with such a talented crew on some fascinating assignments. Admittedly, I was a little intimidated by the abundance of brains and expertise, but we have quickly developed a strong camaraderie and settled into a productive cadence. In the first few days alone, I’ve come to appreciate the responsive and efficient approach with which Gradient approaches research. Having had the opportunity to design and execute research for a diverse set of global clients, I know first hand that a lean and flexible research design is essential to getting clients the exact data and analysis they expect.
I’m no stranger to quantitative research, having spent the past 7 years in various capacities using data to help managers make important decisions. I love seeing how the combination of the right research question, data, and targeted analysis can uncover a completely unexpected finding that changes the direction of an initiative, campaign, or upends an established hypothesis. It’s even more gratifying to work on projects with so many different applications, from learning how Sesame Street programming helps young children learn in a developing country to mapping how the population in an upmarket neighborhood has changed over time and investigating the drivers behind it. I feel way too fortunate that I get to spend my days (and the occasional night) learning about people and their habits, motivations, triggers, emotions, and relationships. I used to tell myself that I “understand” people, but I continue to be surprised and have my assumptions overturned, so I’ve learned to approach every new project with a fresh perspective and no expectations.
While Gradient already has an impressive lineup of brainpower and fascinating clients, I hope to draw upon my diverse experience in emerging markets where the growing consumer class is not well understood. Thanks to mobile phones, the marketing tools and mediums available to emerging markets are not drastically different from those of more established economies. That said, the way in which marketers appeal to this new wave of consumers requires foregoing all prior assumptions and investing in extensive formative research to appreciate the values of a new culture before you can even begin to think about analyzing a dataset. I’d like to help Gradient glue together the cultural context and high-level analytics to ultimately produce analyses that are statistically rigorous but not blind to their surroundings. I’m also a lover of maps (see below) and commonly think the best way to make a point that nobody can refute is to toss your arguments on a map. Too often we overlook the importance of spatial relationships and how our environments affect us. There are many promising applications of spatial analysis that I’d love to incorporate into Gradient’s standard methodology, such as analyzing the performance of our predictive models by ZIP code or neighborhood.
I feel quite fortunate that my professional and personal interests align quite nicely, which is a good thing, right? I’m a super adventurous traveler. I’ll go anywhere — the more far-flung, bizarre, unheard-of place the better. I’ve been to more than 40 countries and enjoyed (with one exception) every single one — you can ask me which one over a beer. When I travel, I’ll eat or drink anything and prefer to spend at least one full day with no map (which is hard for me!) and just a good pair of walking shoes to truly get a sense of a new place. My ideal Saturday is one spent exploring a city on foot with no itinerary or objective.
I just graduated from The University of Pennsylvania with a MS in Spatial Analytics, which might sound like years of misery for some, but was as close to academic heaven as I could get. Full disclosure: I really like maps, I can’t get enough of them. So much so that I have a designated map wall at home and have spent enough time staring at satellite imagery that I can recognize the footprint of pretty much every major city in the world. Thankfully everyone at Gradient embraces nerdiness, so I can be open about my strange obsessions. At Gradient, we all have them, I assure you!
As anyone whose read this blog recently can surmise, I’m pretty interested in how this election turned out, and have been doing some exploratory research into the makeup of our electorate. Over the past few weeks I’ve taken the analysis a step further and built a sophisticated regression that goes as far as anything I’ve seen to unpack what happened.
Background on probability distributions
(Skip this section if you’re familiar with the beta and binomial distributions.)
Before I get started explaining how the model works, we need to discuss some important probability distributions.
The first one is easy: the coin flip. In math, we call a coin flip a Bernoulli trial, but they’re the same thing. A flip of a fair coin is what a mathematician would call a “Bernoulli trial with p = 0.5”. The “p = 0.5” part simply means that the coin has a 50% chance of landing heads (and 50% chance of landing tails). But in principle you can weight coins however you want, and you can have Bernoulli trials with p = 0.1, p = 0.75, p = 0.9999999, or whatever.
Now let’s imagine we flip one of these coins 100 times. What is the probability that it comes up heads 50 times? Even if the coin is fair (p = 0.5), just by random chance it may come up heads only 40 times, or may come up heads more than you’d expect – like 60 times. It is even possible for it to come up 100 times in a row, although the odds of that are vanishingly small.
The distribution of possible times the coin comes up heads is called a binomial distribution. A probability distribution is a set of numbers that assigns a value to every possible outcome. In the case of 100 coin flips, the binomial distribution will assign a value to every number between 0 and 100 (which are all the possible numbers of times the coin could come up heads), and all of these values will sum to 1.
Now let’s go one step further. Let’s imagine you have a big bag of different coins, all with different weights. Let’s imagine we grab a bunch of coins out of the bag and then flip them. How can we model the distribution of the number of times those coins will come up heads?
First, we need to think about the distribution of possible weights the coins have. Let’s imagine we line up the coins from the lowest weight to the highest weight, and stack coins with the same weight on top of each other. The relative “heights” of each stack tell us how likely it is that we grab a coin with that weight.
Now we basically have something called the beta distribution, which is a family of distributions that tell us how likely it is we’ll get a number between 0 and 1. Beta distributions are very flexible, and they can look like any of these shapes and almost everything in between:
So if you had a bag like the upper left, most of the coins would be weighted to come up tails, and if you had a bag like the lower right, most of the coins would be weighted to come up heads; if you had a bag like the lower left, the coins would either be weighted very strongly to come up tails or very strongly to come up heads.
You might now be seeing where this is going. While we can’t observe individuals’ voting behavior (other than whether or not they voted), we can look at the talleys at local levels, like counties. And let’s say, some time before the election, you lined up every voter in a county and stacked them the same way you did with coins as before, but instead of the probability of “coming up heads”, you’d be looking at a voter’s probability of voting for one of the two major candidates. That would look like a beta distribution. You could then model the number of votes for a particular candidate in a particular county would as a beta-binomial distribution.
So in our model we can say the number of votes V[i] in county i is distributed beta-binomial with N[i] voters and voters with p[i] propensity to vote for that candidate:
V[i] ~ binomial(p[i], N[i])
But we’re keeping in mind that p[i] is not a single number but a beta distribution with parameters alpha[i] and beta[i]:
p[i] ~ beta(alpha[i], beta[i])
So now we need to talk about alpha and beta. A beta distribution needs two parameters to tell you what kind of shape it has. Commonly, these are called alpha and beta (I know, it’s confusing to have the name of the distribution and one of its parameters be the same), and the way you can think about it is that alpha “pushes” the distribution to the right (i.e. in the lower right above) and that beta “pushes” the distribution to the left (i.e. in the upper left above). Both alpha and beta have to be greater than zero.
Unfortunately, while this helps us understand what’s going on with the shape of the distribution, it’s not a useful way to encapsulate the information if we were to talk about voting behavior. If something (say unemployment) were to “push” the distribution one way (say having an effect on alpha), it would also likely have an effect on beta (because they push in opposite directions). Ideally, we’d separate alpha and beta into two unrelated pieces of information. Let’s see how we can do that.
It’s a property of the beta distribution that its average is:
alpha + beta
So let’s just define a new term called mu that’s equal to this average.
mu = ------------
alpha + beta
And then we can define a new term phi like so
phi = --------
With a few lines of arithmetic, we can solve for everything else:
And if alpha is the amount of “pushing” to the right and beta is the amount of “pushing” to the left in the distribution, then phi is all of the pushing (either left or right) in the distribution. This is a sort of “uniformity” parameter. Large values of phi mean that almost all of the distribution is near the average (think the upper right beta distribution above) – the alpha and beta are pushing up against each other – and small values of phi mean that almost all the values are away from the average (think the beta distribution on the lower left above).
In this parameterization, we can model propensity and polarization independently.
So now we can use county-level information to set up regressions on mu and phi – and therefore on the county’s distribution of voters, and how they ended up voting. Since mu has to be between 0 and 1 we use the logit link function, and since phi has to be greater than zero, we use the exponential link function
logit(mu[i]) = linear function of predictors in county i
log(phi[i]) = linear function of predictors in county i
The “linear functions of predictors” have the format:
Where uninsured[i] is the uninsurance rate in that county and coef[uninsured] is the effect that uninsurance has on the average propensity of voters in that county (in the first equation) or the polarity/centrality of the voting distribution (in the second equation).
For each county, I extracted nine pieces of information:
The proportion of residents that do not have insurance
The rate of unemployment
The rate of diabetes (a proxy for overall health levels)
The median income
The violent crime rate
The median age
The gini coefficient (an index of income heterogeneity)
The rate of high-school graduation
The proportion of residents that are white
Since each of the above pieces of information had two coefficients (one each for the equations for mu and phi) the model I used had twenty parameters against 3111 observations.
The model performs very well on first inspection, especially when we take the log of the actual votes and the prediction (upper right plot), and even more so when we do that and restrict it only to counties with greater than 20,000 votes (lower left plot):
This is actually cheating a bit, since the number of votes for HRC (which the model is fitting) in any county is constrained by the number of votes overall. Here’s a plot showing the estimated proportion vs. the actual proportion of votes for HRC, weighted by the number of votes overall:
Here is the plot of coefficients for mu (the average propensity within a county):
All else being equal, coefficients to the left of the vertical bar helped Trump, and to the right helped Clinton. As we can see, since more Democratic support is concentrated in dense urban areas, there are many more counties that supported Trump, so the intercept is far to the left. Unsurprisingly (but perhaps sadly) whiteness was the strongest predictor overall and was very strong for Trump.
In addition, the rate of uninsurance was a relatively strong predictor for Trump support, and diabetes (a proxy for overall health) was a smaller but significant factor.
Economic factors (income, gini / income inequality, and unemployment) were either not a factor or predicted support for Clinton.
The effects on polarity can be seen here:
What we can see here (as the intercept is far to the right) is that most individual counties have a fairly uniform voter base. High rates of diabetes and whiteness predict high uniformity, and basically nothing except for income inequality predicts diversity in voting patterns (and this is unsurprising).
What is also striking is that we can map mu and phi against each other. This is a plot of “uniformity” – how similar voting preferences are within a county vs. “propensity” – the average direction a vote will go within a county. In this graph, mu is on the y axis, and log(phi) is on the x axis, and the size of a county is represented by the size of a circle:
What we see is a positive relationship between support for Trump and uniformity within a county and vice versa.
And if you’re interested in bayesian inference using gibbs sampling, here are the trace plots for the parameters to show they converged nicely: mu trace / phi trace.
Conclusion and potential next steps
This modeling approach has the advantage of closely approximating the underlying dynamics of voting, and the plots showing the actual outcome vs. predicted outcome show the model has pretty good fit.
It also shows that whiteness was a major driver of Trump support, and that economic factors on their own were decidedly not a factor in supporting Trump. If anything, they predicted support for Clinton. It also provides an interesting way of directly modeling unit-level (in this case, county-level) uniformity / polarity among the electorate. This approach could perhaps be of use in better identifying “swing counties” (or at least a different approach in identifying them).
This modeling approach can be extended in an number of interesting ways. For example, instead of using a beta-binomial distribution to model two-way voting patterns, we could use a dirichlet-multinomial distribution (basically, the extension of beta-binomial to more than 2 possible outcomes) to model voting patterns across all candidates (including Libertarian and Green), and even flexibly model turnout by including not voting as an outcome in the distribution.
We could build similar regressions for past elections and see how coefficients have changed over time.
We could even match voting records across the ’12 and ’16 elections to make inferences about the components of the county-level vote swing: voters flipping their vote, voting in ’12 and not voting in ’16, or not voting in ’12 and then voting in ’16 – and which candidate they came to support.
In the wake of the surprising (to say the least) electoral result I initiated a few projects to try and understand the politics of the country. One thing I wanted to understand was the impact of demographic and underlying situational variables (e.g. health, income, unemployment, etc.) on how people voted. Was the vote about Obamacare? Was it about lost jobs? Was it all education levels? Or was it all racism? Theories have been floated but I haven’t seen a rigorous evaluation of these hypotheses. What’s below is just an exploratory analysis, but the data does point in some interesting directions.
The type of visualization is called a self organized map. Roughly speaking, each hexagon is a group of counties; the map arranges the counties such that similar ones are closer to each other on the map and dissimilar ones further apart:
For any given variable (here – the proportion of residents in the counties that graduated from high school) the map is a heatmap. Redder colors means the counties index higher, bluer means they index lower. Here, the upper right are the counties where fewer people have a high school diploma, and lower left are the most educated.
Below, we look at the voting share for Hillary. The counties are arranged in the same way as above, but since we’re looking at different variable the map is colored differently. (Confusingly, more votes for HRC are red as opposed to the customary blue for liberals, but work with me here). The reason this map is more organized than the rest is that I used this variable to “supervise” the organization (don’t worry about the details of this – basically it just guaranteed that this particular coloring, which is the reference point, would be organized.)
Now that we have the basics in place, we can look at other variables: let’s check a few variables and see if they line up w/ the HRC voting share map. What we can do is draw a boundary around the areas that went strongly for Trump and for HRC like so:
And we’ll keep these annotations throughout.
Health and insurance:
The breakdowns for uninsurance and health variables like obesity and diabetes don’t break down along electoral lines: the split goes in the opposite direction, with the highest uninsurance and low health areas going to both candidates:
These graphs should put the “economic anxiety” argument to rest, as the areas with highest unemployment went strongest to HRC and those with the least went strongest to Trump.
A few graphs line up pretty well: whiteness and ethnic homogeneity. And whiteness and ethnic homogeneity line up basically on top of each other. This would support the hypothesis that the election for Trump was mainly a cultural (and not a policy) event; white enclaves are reacting against a diminishing place in the cultural landscape – hence the making things great again: