Here are the libraries we use for this. The NMF library is the workhorse here.

library(tidyverse)
library(NMF)
library(stringr)
library(magrittr)
library(knitr)

opts_knit$set(
  eval     = TRUE,
  cache    = FALSE, 
  echo     = TRUE, 
  message  = FALSE, 
  warning  = FALSE
)

We have set up our data in advance, but the result is that we have purchase counts in each cell, with customers as rows and products as columns.

# cbp = customer by product
# 30k = sample of 30k customers
cbp <- read_csv("cbp_30k.csv")

# storing customer ids elsewhere as they will gum up the factorization
c_id <- cbp$customer
cbp <- cbp %>% select(-customer)

# getting rid of customers that have purchased nothing
cbp <- cbp[, which(colSums(cbp) != 0)]

# probably not needed but ¯\_(ツ)_/¯
cbp <- as.matrix(cbp) %>% 
  # sorry we're hiding the names of our client's products!
  set_colnames(str_c("P", 1:ncol(.)))

head(cbp[, 1:20])
##      P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19
## [1,]  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   1   0   0
## [2,]  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
## [3,]  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
## [4,]  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
## [5,]  3  1  0  0  0  0  0  0  0   0   1   0   0   0   0   0   0   0   0
## [6,]  0  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
##      P20
## [1,]   0
## [2,]   0
## [3,]   0
## [4,]   0
## [5,]   0
## [6,]   0

As you can see, the data is very sparse. This is a very common challenge for the analysis.

Building the segments

Here is the workhorse code of the whole exercise. In res we are storing the output of the factorization process. We've told nmf to factor the matrix with five segments, and we've given it some options (the tv) to be verbose in its output

# we have chosen 5 segments for this code through
# best practice is to use a bunch and see what has a low error rate and is the most useful
# this may run for a very long time, depending on the size of your matrix
res <- nmf(cbp, 5, .options = 'tv')

Visualizing Segments

For each segment, each customer (and each product) is assigned a score. They are not directly assigned to a segment—a customer can be in more than one segment. But by looking at a customers segment scores in relation to each other, we can see which one (or ones) they most belong to. We do this by using a heatmap, which we show below

Customers

# restricting the dataset to just 100 customers and 100 products
res_subset <- res[sample(100), sample(100)]


# careful! with RStudio you typically have to print this to a file—it won't work in RStudio's Viewports
basismap(res_subset, main = "Customer Segments")

Products

coefmap(res_subset, main = "Product Segments") 

Extracting Customer Scores

And finally, you’re likely going to want to use the scores in other applications, so here’s a few simple lines of code aimed at extracting the scores for each customer (a similar approach can be used for products)

res_basis <- basis(res) # W
res_coef  <- coef(res) # H

customer_scores <- 
  (res_basis / rowSums(res_basis)) %>% 
  as.data.frame %>% 
  set_colnames(str_c("S", 1:5)) 

head(customer_scores)
##             S1           S2           S3           S4           S5
## 1 1.509126e-19 6.106923e-01 3.893077e-01 1.509126e-19 1.509126e-19
## 2 6.783763e-19 6.783763e-19 1.000000e+00 6.783763e-19 6.783763e-19
## 3 3.382403e-01 3.841704e-01 9.426346e-02 2.154641e-19 1.833258e-01
## 4 1.530457e-01 6.193899e-01 2.275643e-01 2.809709e-19 2.809709e-19
## 5 1.124007e-01 2.348966e-02 2.731839e-21 8.444042e-01 1.970547e-02
## 6 7.244301e-01 8.451125e-20 2.180128e-01 5.755718e-02 8.451125e-20