Portfolio: CoxPH

Hi there!

I recently came across the subject of Survival Analysis while taking the statistical learning online course from Stanford University. This inspired me to practice what I learned, as it’s a skill I can apply for different purposes. Besides being impactful, it’s also a surprisingly fun and insightful area of study. So, without further ado…

Analysis Structure

The visualization below represents the survival analysis framework that I used to model subscription cancellation risk within a streaming service. It depicts how key predictors such as subscription tier, content rating and weekly active hours can be used to model the risk over time via the proportional hazards model.

Below, I am simulating survival data for subscribers. The event of interest is cancellation, modeled using the Weibull hazard function and its corresponding survival function.

\(h(t|X) = \lambda \gamma t^{\gamma - 1} \exp(\beta X)\)
\(S(t|X) = \exp(-\lambda t^{\gamma} \exp(\beta X))\)

The hazard function is the mathematical core of the simulation: it governs how the time to the event is distributed. The survival function follows directly from it and is used to determine whether an observation is censored.

Where:

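\(h(t|X)\): The hazard function of the Weibull model. It represents the instantaneous rate of cancellation at time \(t\) for a subscriber with a specific set of characteristics \(X\), given that the subscriber has not cancelled before \(t\). It is a rate rather than a probability, so it can exceed 1.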

\(S(t|X)\): The survival function of the Weibull model. It represents the probability of a subscriber surviving (not churning) beyond time \(t\), given their characteristics \(X\).

\(\lambda\) (lambda): This is the scale parameter. It determines the overall baseline risk of churning when all other variables are zero. A larger \(\lambda\) means a higher risk of churning.

\(\gamma\) (gamma): This is the shape parameter. It controls how the risk of churning changes over time. If \(\gamma = 1\), the risk is constant over time; if \(\gamma > 1\), the risk increases over time; and if \(\gamma < 1\), the risk decreases over time.

\(\beta X\): This part of the function represents the predictor variables (X) and their corresponding coefficients (\(\beta\)). A positive \(\beta\) value for a variable increases the risk of churning, and a negative value decreases it.
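To make the role of the shape parameter concrete, here is a minimal Python sketch of the two functions above. It is only an illustration, not the code behind this post; the function names and the parameter values in the loop are made up for the example.

```python
import math

def weibull_hazard(t, lam, gamma, linear_predictor=0.0):
    """h(t|X) = lam * gamma * t^(gamma - 1) * exp(beta X), for t > 0."""
    return lam * gamma * t ** (gamma - 1) * math.exp(linear_predictor)

def weibull_survival(t, lam, gamma, linear_predictor=0.0):
    """S(t|X) = exp(-lam * t^gamma * exp(beta X)), for t >= 0."""
    return math.exp(-lam * t ** gamma * math.exp(linear_predictor))

# Illustrative values only: compare the hazard early and late for three shapes.
for gamma in (0.8, 1.0, 1.5):
    early = weibull_hazard(10, lam=0.01, gamma=gamma)
    late = weibull_hazard(100, lam=0.01, gamma=gamma)
    print(f"gamma={gamma}: h(10)={early:.5f}, h(100)={late:.5f}")
```

With a shape below 1 the hazard at day 100 is lower than at day 10, with a shape of exactly 1 the two are equal, and with a shape above 1 the hazard grows.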

The survival analysis will help us understand how the predictors influence the likelihood of churn over time.

Since this is simulated data, I am going to control the parameters and the effect of each predictor on the churn probability:

Subscription Tier: Basic, Pro and Premium (1-3). Higher tiers of subscription are set to increase churn risk (since these subscribers might be more critical of the service).
Content Rating: Rating from 1-5. Higher ratings given by the user reduce their churn risk.
Weekly Active Hours: Increased service usage is configured to lower the risk of churn.

Consequently, with the exception of subscription tier, the predictors were set to have a negative effect on the probability of churn (see the formulas below). The data was simulated for 500 subscribers over a period of 365 days (the specific business product is not relevant here, as the focus is purely on the statistical perspective). Churn events were simulated based on the following predefined hazard function:

\(h(t|X) = (0.009) \cdot (0.8) \cdot t^{(0.8 - 1)} \cdot \exp(0.5 \cdot \text{subscription tier} - 0.5 \cdot \text{content rating} - 0.05 \cdot \text{weekly active hours})\)
\(S(t|X) = \exp\left( - (0.009) \cdot t^{0.8} \cdot \exp(0.5 \cdot \text{subscription tier} - 0.5 \cdot \text{content rating} - 0.05 \cdot \text{weekly active hours}) \right)\)
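One common way to draw event times from a survival function like this is inverse transform sampling: set \(S(T|X) = U\) with \(U\) uniform on (0, 1] and solve for \(T\). The Python sketch below does that with the parameters from the formulas above. It is an illustration under assumptions, not the code behind this post: the covariate distributions (uniform tiers and ratings, weekly hours uniform on 0 to 20) and the function name are hypothetical, and treating everyone still subscribed at day 365 as censored is one plausible reading of the setup.

```python
import math
import random

LAMBDA, GAMMA = 0.009, 0.8                      # scale and shape from the formulas above
BETA_TIER, BETA_RATING, BETA_HOURS = 0.5, -0.5, -0.05
FOLLOW_UP_DAYS = 365

def simulate_subscriber(rng):
    # Covariate distributions are assumptions made for this sketch.
    tier = rng.randint(1, 3)                    # 1 = Basic, 2 = Pro, 3 = Premium
    rating = rng.randint(1, 5)
    hours = rng.uniform(0, 20)
    lp = BETA_TIER * tier + BETA_RATING * rating + BETA_HOURS * hours

    # Solve S(T|X) = U for T:  T = (-log(U) / (lambda * exp(lp))) ** (1 / gamma)
    u = 1.0 - rng.random()                      # uniform on (0, 1], avoids log(0)
    latent_time = (-math.log(u) / (LAMBDA * math.exp(lp))) ** (1 / GAMMA)

    # Administrative censoring at the end of the observation window.
    churned = latent_time <= FOLLOW_UP_DAYS
    return {
        "subscription_tier": tier,
        "content_rating": rating,
        "weekly_active_hours": hours,
        "eventtime": min(latent_time, FOLLOW_UP_DAYS),
        "churn_event": int(churned),
    }

rng = random.Random(2024)                       # arbitrary seed, for reproducibility only
sim_data = [simulate_subscriber(rng) for _ in range(500)]
n_events = sum(row["churn_event"] for row in sim_data)
print(f"{n_events} churn events among {len(sim_data)} simulated subscribers")
```

Because the covariate distributions are guesses, the number of events this produces will not match the figures reported later in the post.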

After that it is possible to visualize the survival function for subscribers. The survival function was estimated using the Kaplan-Meier (KM) method, which provides a non-parametric estimate of the survival function based on the observed data. Here I am comparing the survival curves for each subscription tier:

Based on the KM curves, the survival probability for subscribers in higher-tier categories appears to be lower, which aligns with the positive parameter I set to model the effect of increased cost and higher user criticality on churn risk.

Of course, the differences between tiers might or might not be statistically significant, and I am going to use the model later to check this observational insight.
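For readers who want to see what the Kaplan-Meier estimate actually computes, here is a small Python sketch of the product-limit formula: at each distinct event time, the running survival probability is multiplied by (1 - events / number at risk). The function name and the toy data are hypothetical, and this is not the code used to draw the curves in this post.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where at least one event occurred.

    times  -- observed time for each subject (event or censoring time)
    events -- 1 if the event was observed at that time, 0 if censored
    """
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        n_events = 0
        n_leaving = 0
        # Group every subject that shares this observed time.
        while i < len(pairs) and pairs[i][0] == t:
            n_events += pairs[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        n_at_risk -= n_leaving
    return curve

# Toy data: five subscribers, two of them censored (event = 0).
toy_curve = kaplan_meier(times=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
for t, s in toy_curve:
    print(f"t={t}: S(t)={s:.2f}")   # survival steps down to 0.80, 0.60, 0.30
```

To compare tiers, the same function would be applied separately to the subscribers of each tier.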

Cox Proportional Hazards Model

Proceeding with the analysis, I decided to run a Cox proportional hazards model to measure the impact of each variable on the churn event. Before fitting the final model, I validated its performance using 5-fold cross-validation, a technique that splits the data into five subsets. The model was trained on four of these subsets and then tested on the fifth, with this process repeating five times so that every subset served as a test set once. Evaluating only on held-out subscribers gives a less optimistic estimate of the model’s predictive power than scoring it on the same data it was trained on.
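The splitting step can be sketched in a few lines of Python. This only shows how the row indices are partitioned; the helper name, the seed and the shuffle-then-stride scheme are choices made for this sketch, not necessarily how the folds in this post were built.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n rows."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled list gives k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for fold_number, (train, test) in enumerate(k_fold_indices(500, k=5), start=1):
    print(f"fold {fold_number}: {len(train)} training rows, {len(test)} test rows")
```

In each iteration the model would be fit on the training rows and scored on the test rows, and the five scores averaged at the end.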

The metric used was the Concordance Index (C-index), which measures the model’s ability to correctly rank customers by their risk of churn. The C-index works by evaluating comparable pairs of subscribers and checking whether the model assigns the higher risk to the one who churned earlier. It plays a role similar to an accuracy metric, but it is specifically designed to handle censored data. Averaging the C-index across all five folds gives a more realistic sense of how the churn score is likely to perform on new subscribers.
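The pairwise logic can be written out directly. The Python sketch below is a simplified version of Harrell's C-index: a pair is counted only when the subscriber with the shorter time actually churned, ties in predicted risk count as half, and pairs with tied times are skipped. The function name and toy data are hypothetical, and library implementations handle ties more carefully.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C: share of comparable pairs ranked correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the "earlier event" in a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier churner got the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied prediction counts as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: the third subscriber is censored at day 15.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
print(concordance_index(toy_times, toy_events, [0.9, 0.7, 0.2, 0.4]))  # perfect ranking
print(concordance_index(toy_times, toy_events, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
```

A perfectly ordered score gives 1.0 and a constant score gives 0.5, which is why 0.5 is the usual baseline for random guessing.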

[1] "Mean C-index across 5 folds: 0.684"

[1] "Standard Deviation of C-index: 0.062"

A C-index of 0.68 indicates that the model has decent discriminatory power, correctly predicting which of a comparable pair of subscribers is at a higher risk of churning approximately 68% of the time. While this score is better than random chance (0.5), it suggests that adding more user characteristics that impact churn could improve the model’s predictive ability. In a real-world scenario, I would prioritize improving this power.

Final Model:

I then fit the model on the full dataset in order to analyze the variables’ coefficients (in other words, to identify which factors are associated with churn):

Call:
coxph(formula = Surv(eventtime, churn_event) ~ subscription_tier_factor + 
    content_rating + weekly_active_hours, data = sim_data_final)

  n= 500, number of events= 105 

                                    coef exp(coef) se(coef)      z
subscription_tier_factorPro      0.73705   2.08976  0.29642  2.487
subscription_tier_factorPremium  1.47474   4.36990  0.27209  5.420
content_rating                  -0.89303   0.40942  0.19573 -4.563
weekly_active_hours             -0.06238   0.93953  0.02561 -2.435
                                Pr(>|z|)    
subscription_tier_factorPro       0.0129 *  
subscription_tier_factorPremium 5.96e-08 ***
content_rating                  5.05e-06 ***
weekly_active_hours               0.0149 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                                exp(coef) exp(-coef) lower .95
subscription_tier_factorPro        2.0898     0.4785    1.1689
subscription_tier_factorPremium    4.3699     0.2288    2.5637
content_rating                     0.4094     2.4425    0.2790
weekly_active_hours                0.9395     1.0644    0.8935
                                upper .95
subscription_tier_factorPro        3.7360
subscription_tier_factorPremium    7.4486
content_rating                     0.6009
weekly_active_hours                0.9879

Concordance= 0.696  (se = 0.026 )
Likelihood ratio test= 56.84  on 4 df,   p=1e-11
Wald test            = 54.5  on 4 df,   p=4e-11
Score (logrank) test = 58.34  on 4 df,   p=6e-12

Based on the model’s output, a higher subscription tier is a statistically significant factor that increases a subscriber’s risk of churning, while a higher content rating reduces it. Specifically, holding the other variables constant, a Pro-tier subscriber has a churn hazard that is 109% higher than a Basic-tier subscriber’s (hazard ratio 2.09), and a Premium-tier subscriber’s hazard is 337% higher (hazard ratio 4.37). For every one-point increase in a subscriber’s content rating, the hazard of churning decreases by approximately 59% (hazard ratio 0.4094). Weekly active hours were also a statistically significant predictor of churn (p = 0.015): for every additional hour, the hazard of churning decreases by approximately 6%.
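The percentages follow from the coefficients in the output: the hazard ratio is exp(coef), and the percent change in hazard is (hazard ratio - 1) × 100. A short Python check, using the coefficients copied from the summary above (the helper name is made up for this sketch):

```python
import math

# Coefficients copied from the coxph summary above.
coefficients = {
    "subscription_tier_factorPro": 0.73705,
    "subscription_tier_factorPremium": 1.47474,
    "content_rating": -0.89303,
    "weekly_active_hours": -0.06238,
}

def percent_change_in_hazard(coef):
    """Percent change in hazard for a one-unit increase (or versus the reference level)."""
    return (math.exp(coef) - 1) * 100

for name, coef in coefficients.items():
    hazard_ratio = math.exp(coef)
    change = percent_change_in_hazard(coef)
    print(f"{name}: hazard ratio = {hazard_ratio:.4f}, change in hazard = {change:+.0f}%")
```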

With this model, it is now possible to calculate a churn score for each customer, which can be used to prioritize retention efforts. Subscribers with higher churn scores are at greater risk of leaving the streaming service, allowing for targeted interventions to improve their experience and reduce churn.
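One simple form of such a score is the relative hazard exp(βX) computed from the fitted coefficients. The Python sketch below uses the coefficients from the summary above; the three subscribers and the function name are hypothetical, and the score is expressed relative to a Basic-tier subscriber with a rating of 0 and 0 weekly hours. Shifting the score by a constant, for example by centering the covariates, would not change the ranking.

```python
import math

# Coefficients copied from the coxph summary above; Basic is the reference tier.
TIER_COEF = {"Basic": 0.0, "Pro": 0.73705, "Premium": 1.47474}
RATING_COEF = -0.89303
HOURS_COEF = -0.06238

def churn_score(tier, content_rating, weekly_active_hours):
    """Relative hazard exp(beta X); higher means higher estimated churn risk."""
    linear_predictor = (
        TIER_COEF[tier]
        + RATING_COEF * content_rating
        + HOURS_COEF * weekly_active_hours
    )
    return math.exp(linear_predictor)

# Hypothetical subscribers, invented for this example.
subscribers = {
    "A": ("Premium", 2, 3.0),
    "B": ("Pro", 4, 10.0),
    "C": ("Basic", 5, 15.0),
}

# Rank from highest to lowest estimated risk to prioritize retention outreach.
ranked = sorted(subscribers, key=lambda name: churn_score(*subscribers[name]), reverse=True)
for name in ranked:
    print(name, f"{churn_score(*subscribers[name]):.4f}")
```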

Conclusion

While the model’s results were a direct reflection of the simulated data, the primary objective of this post was to explore the statistical methodology itself. It was clear that this is a useful approach for generating insights from covariates by quantifying the influence of various factors on churn, which enables more informed and impactful decision-making.