Multi-state churn analysis with a subscription product

Subscriptions are no longer just for newspapers. The consumer product landscape, particularly among e-commerce firms, includes a bevy of subscription-based business models. Internet and mobile phone subscriptions are now commonplace and joining the ranks are dietary supplements, meals, clothing, cosmetics and personal grooming products.

Standard metrics to diagnose a healthy consumer-brand relationship typically include customer purchase frequency and ultimately, retention of the customer demonstrated by regular purchases. If a brand notices that a customer isn’t purchasing, it may consider targeting the customer with discount offers or deploying a tailored messaging campaign in the hope that the customer will return and not “churn”.

The churn diagnosis, however, becomes more complicated for subscription-based products, many of which offer multiple delivery frequencies and the ability to pause a subscription. Brands with subscription-based products need to have some reliable measure of churn propensity so they can further isolate the factors that lead to churn and preemptively identify at-risk customers.

This post shows how to analyze churn propensity for products with multiple states, such as different subscription cadences or a paused subscription.  

Unpacking our box

Assume we have an online subscription-based product that can be bought at set delivery intervals: monthly, quarterly and biannually. The customer also has the option to pause the subscription for any reason.

In our hypothetical example, a customer journey involves five states:

  • State 1: Starts a subscription for the first time
  • State 2: Unsubscribes from receiving promotional emails from the brand
  • State 3: Pauses the subscription because the supply from the previous delivery is not depleted
  • State 4: Unsubscribes from promotional emails and pauses the subscription (combination of States 2 and 3)
  • State 5: Cancels the subscription because they no longer need the product

Customer transition matrix

Like any relationship, the one between a brand and a customer passes through many states and phases. The transitions between states can be represented in a transition matrix, which records which states can be reached from each state:
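With the mstate R package, this structure can be written down directly. Below is a sketch using transMat; the state labels are ours, and the allowed transitions follow the description in the next paragraph:

library(mstate)

# Possible transitions (assumed): State 1 to 2, 3 or 5; States 2 and 3
# to 4 or 5; State 4 to 5; State 5 (churn) is absorbing
trans_mat <- transMat(
  x = list(c(2, 3, 5), c(4, 5), c(4, 5), c(5), c()),
  names = c("Start", "NoEmail", "Paused", "NoEmailPaused", "Churn")
)
trans_mat  # prints the matrix of numbered transitions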

The plot below conveys the various transitions in a graphical format. The nodes represent states, and each arrow represents the direction of a possible transition. Notably, it is possible to move to State 5, churn, from every other state, whereas moving from State 1 to State 4 requires passing through State 2 or State 3. This is just one hypothetical customer journey structure for our subscription-based product.

Putting our data to use

Let’s also assume the brand has a variety of data points about each customer’s journey. Each customer’s record includes:

  • A unique ID
  • The time since an event occurred within a state, measured in days (columns St2, St3, …)
  • Occurrence of an event within said state (columns St2.s, St3.s, …; 0: did not occur, 1: occurred)
  • Demographic and sales data such as year, age, discounts, gender
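Records in this layout can be converted into long-format transition data with mstate’s msprep. A minimal sketch, assuming a customers data frame with the columns above (the covariate column names are ours):

# One row per customer and possible transition; the first entry of
# time/status is NA because State 1 is the starting state
msdata_exp <- msprep(
  time   = c(NA, "St2", "St3", "St4", "St5"),
  status = c(NA, "St2.s", "St3.s", "St4.s", "St5.s"),
  data   = customers,
  trans  = trans_mat,
  id     = "id",
  keep   = c("year", "age", "discount", "gender")
)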

To gauge the number of customers in each state, we can simply calculate the frequency and proportion of transitions across the full customer base. Here we see that there are 533 churn events.
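With mstate, these counts come from the events() helper applied to the prepared data (a sketch, assuming msdata_exp from above):

events(msdata_exp)  # tables of transition frequencies and proportions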

We can also see that among customers who started their journey in State 1, 640 moved to State 2, 777 to State 3, and 160 to State 5. Furthermore, 332 stayed in State 1, which is classified as a non-event.

The most probable transition from State 1 is to State 3, which is shown below as a proportion. We also see that 46% of customers who were in State 3 end up in State 4, and of those, 25% end up churning.

Building a time model

A common approach to modeling time-to-event data is the Cox proportional hazards (PH) model. It identifies the effect of several variables on the time it takes for a specified event to occur. In other words: what is the likelihood that a particular customer will experience a transition event (e.g., moving from State 1 to State 2)?

Don’t forget, however, that the subscription model has more than just an active and an inactive state; there are many possible states that need to be assessed for risk. With a Cox PH model, each transition (trans) is modeled separately and takes into account the time since entering a transition state.

R has a very useful engine for calculating separate statistics for each transition in a Cox PH model. We use a stratified Cox model in which the strata are determined by the transition variable. This gives us eight separate models, one for each transition.

The following code models just time and does not (yet) include the impact of any other variables:

# Baseline model: time only, stratified by transition, no covariates yet
c0 <- coxph(Surv(Tstart, Tstop, status) ~ strata(trans), data = msdata_exp, method = "breslow")

To incorporate per-transition estimates, we apply the msfit workhorse, which estimates the cumulative hazard of each transition; from these hazards we can derive the probability of being in a given state based on the original state and the time elapsed. This is the first approach we use when modeling churn across multiple states.

# Cumulative transition hazards with Greenwood variance estimates
msf0 <- msfit(object = c0, vartype = "greenwood", trans = trans_mat)

So what did we find?

This next plot shows the cumulative hazard of moving from one state to another (the accumulated risk of each transition) with respect to the time since the initial subscription began. Keep in mind, however, that this plot aggregates all customers and does not show the impact of a specific variable, such as gender or product type, on the final hazard slope.
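A plot of this kind can be drawn directly from the msfit object; the axis label below is ours:

# One cumulative hazard curve per transition on a single set of axes
plot(msf0, las = 1, xlab = "Days since initial subscription")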

From this plot we see that at 1,000 days, customers who began their journey in State 1 have a 75% probability of transitioning to State 2 and a 70% probability of transitioning to State 3.

We can also explore the probabilities of a state-to-state transition by creating a probability matrix. This snippet shows the probabilities of customers who started in State 1 at the earliest days of their journey (days 0-6) and the end (days 4560 onward).
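Such a matrix can be computed with mstate’s probtrans; a sketch, where pt[[1]] holds the probabilities for journeys starting in State 1:

pt <- probtrans(msf0, predt = 0)  # transition probabilities from day 0
head(pt[[1]])  # earliest days of the journey
tail(pt[[1]])  # end of the observation window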

By day 5 of the customer journey, there is a 99% likelihood that the customer will still be in State 1 and a less-than-1% probability of being in State 3. As the journey progresses, however, there is only a 16% probability of still being in State 1 by day 4,787. Of crucial importance for the brand is the likelihood of moving to State 5, the end of the relationship: we see a 33% probability of this occurring.

Here’s one more way to visualize the same trend:

The distance between two adjacent curves represents the probability of being in the corresponding state with respect to time.
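mstate can draw this stacked view directly from the probtrans object:

# Stacked state-occupation probabilities over time, starting from State 1
plot(pt, from = 1, type = "filled", xlab = "Days since initial subscription")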

Adding more precision with covariates

A full model can be calculated with the workhorse coxph function by introducing stratification by transition and including all additional explanatory variables. From this model we can extract the relevant covariates that explain the likelihood of moving between states for a given demographic or behavioral variable.
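A sketch of such a model: expand.covs from mstate creates transition-specific copies of each covariate, so each transition gets its own coefficient. The covariates shown (age, discount) are illustrative stand-ins for the full set:

# Expand covariates per transition, e.g. age becomes age.1 ... age.8
msdata_exp <- expand.covs(msdata_exp, covs = c("age", "discount"))

# Full model: stratified by transition, with transition-specific covariates
# (only the age terms are written out here for brevity)
c_full <- coxph(Surv(Tstart, Tstop, status) ~ age.1 + age.2 + age.3 + age.4 +
                  age.5 + age.6 + age.7 + age.8 + strata(trans),
                data = msdata_exp, method = "breslow")
summary(c_full)  # one coefficient per covariate and transition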

In this table, a positive coefficient indicates an increased hazard of moving from one state to an end state. In other words, the larger the coefficient, the more exposed a customer is to the “risk” of transition.

With this model we can predict the distribution of states for a given time since beginning the subscription while taking gender, age or discounts into account. For example, imagine two customers with the following profiles:

Customer A

  • Discount: Yes
  • Gender: Female
  • Joined: 2013-2017
  • Age: Younger than 20

Customer B

  • Discount: No
  • Gender: Male
  • Joined: 2002-2007
  • Age: 20-40

Given these customer profiles, Customer B is more likely over the course of the journey with the brand to churn (State 5) or to pause the subscription and reject promotional emails (State 4). Comparatively, Customer A has a 30% likelihood of remaining an active client with no subscription pauses over the course of 10 years, while Customer B has a 20% likelihood.
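To produce such profile-level predictions, msfit accepts a newdata frame with one row per transition, carrying the profile’s covariate values and a strata column. A heavily simplified sketch, assuming a model with shared covariate effects and hypothetical numeric codings (a model built with expand.covs would need the expanded columns instead):

# Simplified model: same covariate effect across all transitions
c_simple <- coxph(Surv(Tstart, Tstop, status) ~ age + discount + strata(trans),
                  data = msdata_exp, method = "breslow")

# Customer A's profile, repeated for each of the 8 transitions
nd_A <- data.frame(trans = 1:8, strata = 1:8, age = 18, discount = 1)
msf_A <- msfit(c_simple, newdata = nd_A, trans = trans_mat)
pt_A  <- probtrans(msf_A, predt = 0)  # A's state distribution over time
plot(pt_A, type = "filled")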

Summary

Multi-state models are common in business applications. They allow decision makers to see how states are distributed across the customer journey for a variety of customer segments. Armed with this intelligence, brand decision makers can focus their outreach and acquisition efforts on customers who have a higher probability of remaining in an active state for a longer period of time. The analysis also shows where in the journey customers are vulnerable to churn, which can then be used to implement a strategy that preemptively mitigates that vulnerability before it manifests.

Feature selection with the Boruta Algorithm

One of the most important steps in building a statistical model is deciding which data to include. With very large datasets and models that have a high computational cost, impressive efficiency can be realized by identifying the most (and least) useful features of a dataset prior to running a model. Feature selection is the process of identifying the features in a dataset that actually have an influence on the dependent variable.

High dimensionality of the explanatory variables can cause both high computation times and a risk of overfitting the data. Moreover, it’s difficult to interpret models with a high number of features. Ideally we would be able to select the significant features before performing statistical modeling. This reduces training time and makes it easier to interpret the results.

Some techniques to address the “curse of dimensionality” take the approach of creating new variables in a lower-dimensional space, such as Principal Component Analysis (Pearson 1901) or Singular Value Decomposition (Eckart and Young 1936). While these may be easier to run and more predictive than an un-transformed set of predictors, they can be very hard to interpret.

We’d rather, if possible, select from the original predictors, but only those that have an impact. There are a few sophisticated feature selection algorithms, such as Boruta (Kursa and Rudnicki 2010), genetic algorithms (Kuhn and Johnson 2013; Aziz et al. 2013), and simulated annealing techniques (Khachaturyan, Semenovsovskaya, and Vainshtein 1981), which are well known but still carry a very high computational cost, sometimes measured in days on large datasets.

As genuinely curious, investigative minds, we wanted to explore how one of these methods, the Boruta algorithm, performs. Overall, we found that for small datasets it is a very intuitive and beneficial method for modeling high-dimensional data. A summary of our approach follows.

Why such a strange name?

The name Boruta comes from the mythological Slavic figure that embodies the spirit of the forest. In that spirit, the Boruta R package is based on ranger, a fast implementation of the random forest classification method.

How does it work?

We assume you have some knowledge of how random forests work; if not, the following may be tough to follow.

Let’s assume you have a target vector T (what you care about predicting) and a bunch of predictors P.

The Boruta algorithm starts by duplicating every variable in P, but instead of making a row-for-row copy, it permutes the order of the values in each column. In the copied columns (let’s call them P’), there should then be no relationship between the values and the target vector.

Boruta then trains a Random Forest to predict T based on P and P’.

The algorithm then compares the variable importance score of each variable in P with that of its “shadow” in P’. If the distribution of a variable’s importances is significantly greater in P than in P’, the Boruta algorithm considers that variable significant.
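The shadow construction itself is simple. An illustrative sketch of what Boruta does internally, where P is a hypothetical data frame of predictors:

set.seed(42)
# Permute each column independently to break any link with the target,
# then pair the real and "shadow" features for the random forest run
P_shadow <- as.data.frame(lapply(P, sample))
names(P_shadow) <- paste0("shadow_", names(P))
boruta_input <- cbind(P, P_shadow)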

Application

The dataset of interest here consisted of records of doctors’ appointments for insurance-related matters, and the target variable was whether or not the patient showed up for the appointment. Part of our task was to find the most significant interactions, and with fifty jurisdictions and thirty doctor specialties, we already have a space of 1,500 potential interactions to search through, not including many other variables.

The set of features can be visualized by creating a set of boxplots for the variable importances for each potential feature.

The three red boxplots represent the distribution of minimum, mean and maximum scores of the randomly duplicated “shadow” variables. This is basically the range of variable importances that can be achieved through chance.

The blue bars are features that performed worse than the best “shadow” variable and should not be included in the model. Purple bars are features with the same explanatory power as the best “shadow” variable; their use in the model is at the discretion of the analyst. The green bars are variables with importances higher than the maximum “shadow” variable, and they are therefore good predictors to include in a future classification model.

Code

# Build a full dummy-coded model matrix: one indicator column per factor
# level (contrasts = FALSE keeps every level instead of dropping a reference)
show_mm <-
  model.matrix( ~ 0 +
                 `Doctor Specialty` + `Business Line` +
                 Jurisdiction,
               data = show_df,
               contrasts.arg =
                 lapply(
                   show_df[, c('Doctor Specialty',
                               'Business Line',
                               'Jurisdiction')],
                   contrasts,
                   contrasts = FALSE
                  )
               )

# Attach the target variable and convert back to a data frame for Boruta
show_mm_st <- cbind(status = show_df$`Appt Status`, show_mm)
show_mdf <- as.data.frame(show_mm_st)

library(Boruta)
b_model <- Boruta(status ~ ., data = show_mdf)

cat(getSelectedAttributes(b_model), sep = "\n")
# Doctor SpecialtyChiropractic Medicine
# Doctor SpecialtyNeurology
# Doctor SpecialtyNurse
# Doctor SpecialtyOrthopaedic Surgery
# Doctor SpecialtyOther
# Doctor SpecialtyRadiology
# Business LineDisability
# Business LineFirst Party Auto
# Business LineLiability
# Business LineOther
# Business LineThird Party Auto
# Business LineWorkers Comp
# JurisdictionCA
# JurisdictionFL
# JurisdictionMA
# JurisdictionNJ
# JurisdictionNY
# JurisdictionOR
# JurisdictionOther
# JurisdictionTX
# JurisdictionWA

## Importance plot 

plot(b_model, las = 2, cex.axis = 0.75)