Using Random Forests for Segmentation

A common task in marketing is segmentation: finding patterns in customer data and building profiles of customer behavior, usually with a clustering algorithm. The data is – more often than not – a mix of different data types (categorical, ordinal, numerical, etc.). If you’re using survey data, this will be the case 99 times out of 100.

Unfortunately, most clustering algorithms place strict limits on the type of data they can handle. K-means – probably the most popular – requires strictly numerical data, as does model-based clustering. To use these you have to somehow convert non-numeric data to numeric, which is often a kludge. Others, such as spectral or hierarchical clustering, require a notion of distance between two observations. This, too, can be like fitting a square peg into a round hole.

Enter Random Forests. Random Forests are an extremely popular tool for regression and classification, but they can also be used for clustering. In fact, they are a handy tool when you have mixed data sets.

In unsupervised mode, the procedure works as follows:

  • it generates a random target vector of 0s and 1s
  • it builds a Random Forest classifier fitted to the random target vector
  • it counts how often observations end up in the same terminal node
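The three steps above can be sketched with scikit-learn. This is a minimal illustration, not a production implementation: the library choice, the toy mixed data, and the parameter values are all assumptions (scikit-learn trees also require categorical features to be numerically encoded first).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy "mixed" data: one numeric column and one integer-encoded
# categorical column (the encoding itself is an assumption here).
X = np.column_stack([
    rng.normal(size=100),
    rng.integers(0, 2, size=100),
])

# Step 1: a random target vector of 0s and 1s.
y = rng.integers(0, 2, size=100)

# Step 2: fit a Random Forest classifier to the random target.
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Step 3: count how often each pair of observations shares a terminal node.
# apply() returns, for each observation, its leaf index in every tree.
leaves = forest.apply(X)                        # shape: (n_obs, n_trees)
same_leaf = leaves[:, None, :] == leaves[None, :, :]
proximity = same_leaf.mean(axis=2)              # entry (i, j) in [0, 1]
```

The resulting `proximity` matrix is symmetric with ones on the diagonal, since every observation trivially shares a leaf with itself in every tree.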

The last count is the source of a “proximity” measure between observations: record a matrix whose entry (i, j) is the fraction of trees in which observations i and j ended up in the same terminal node. Note that since decision trees split on categorical and numerical features alike, this approach easily handles a mixture of data types.

Also, note that there is no particular reason the target vector has to be random. You can generate proximity matrices from supervised random forests; the clusters that result from these are produced from the dimensions of the data that “matter” to the target, which is an easy way to do supervised clustering.
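The supervised variant uses the same machinery, just with a target of interest in place of the random vector. A small sketch, again with scikit-learn and made-up data (the "high spender" label and all parameter values are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Hypothetical customer data: spend (numeric) and an encoded channel flag.
X = np.column_stack([rng.normal(size=80), rng.integers(0, 2, size=80)])

# A target you care about instead of a random vector -- here a made-up
# "high spender" label derived from the first column.
y = (X[:, 0] > 0).astype(int)

# Fit the forest to the real target, then compute proximities as before.
# Observations the forest routes together *with respect to the target*
# end up "close", which is what supervises the clustering.
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
leaves = forest.apply(X)
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```

On this toy data, pairs of observations with the same label should on average be more proximate than pairs with different labels, because the forest's splits are driven by the target.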

Once you have this proximity matrix, you can do a number of things.

  1. Spectral clustering methods accept a proximity (affinity) matrix as input, so you can use it as-is
  2. You can first convert the proximity matrix to a distance matrix (for example, distance = 1 − proximity), and then use multidimensional scaling to convert the data from observation × observation to observation × dimension format
  3. With data in this format, you can use k-means or model-based clustering as usual
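Options 2 and 3 chain together naturally. A sketch of that pipeline, assuming scikit-learn, the 1 − proximity distance, and arbitrary toy data and cluster counts:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=60), rng.integers(0, 3, size=60)])
y = rng.integers(0, 2, size=60)

# Proximity matrix from an unsupervised (random-target) forest.
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
leaves = forest.apply(X)
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Option 2: convert proximity to distance and embed it with
# multidimensional scaling (dissimilarity="precomputed" tells MDS
# we are passing a distance matrix, not raw features).
distance = 1.0 - proximity
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(distance)

# Option 3: with observations now in observation x dimension form,
# cluster as usual with k-means.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
```

The embedding dimension (2) and cluster count (3) are placeholders; in practice you would choose them with the usual diagnostics (scree plots, silhouette scores, and so on).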

In summary, Random Forests are a handy, flexible tool to perform clustering analysis when the data is mixed (as is the case in almost all marketing settings).

As an added bonus, you can increase the relevance of the segmentation output by supervising the clustering with a target vector of interest (such as sales, category purchases, or income).