AnalyticsPythonPower BI2025

Cohort Retention & LTV Modeler

Replaced a 16-hour manual process with a Python pipeline feeding Power BI. First time the business had channel-level retention and LTV/CAC it could trust — went straight into leadership reviews.

40%Time cut
30%Accuracy ↑
₹40 CrTracked
01
Chapter

We were doing this in Excel

The retention numbers leadership wanted lived in a Google Sheet, refreshed twice a month by one analyst. It broke any time a row went in out of order. Each refresh cost about 16 hours.

The data was right there in the revenue ledger. It just wasn't reaching anyone who could use it — so every spend conversation that should have leaned on data leaned on opinion instead.

An analyst shouldn't be on call to rebuild a spreadsheet every Sunday.
02
Chapter

What I actually needed

So I tried doing the cohort math inside Power BI first — that's where the reports already lived. Wrong tool.

Cohort matrices need self-joins, and DAX isn't built for that. The realization: cohorts aren't a BI calculation. They're a data pipeline. The BI layer should just read the answer.

The rule I follow now
If a calculation is complex enough to need unit tests, it doesn't belong in a BI tool.
03
Chapter

Rebuilt it in Python

I rebuilt it as a small Pandas pipeline. Every night it computes the weekly cohort matrix and writes a flat table to Postgres. Power BI reads one view — no DAX, no joins, no off-by-one.

Two specific choices made it work. Weekly cohorts — monthly aggregation hides the week 2–6 churn cliff entirely. And the simplest defensible LTV formula: ARPU ÷ (1 − retention rate). I benchmarked it against fancier models; the difference was under 8% for mature cohorts.

Simple and auditable beats clever and opaque when leadership has to trust the number.

Cohort retention · Synthetic data

12-week retention by cohort

CohortW1W2W3W4W5W6W7W8W9W10W11W12
2024-W011007663565046434139383635
2024-W051007966585248454240383736
2024-W091007764575247444240383736
2024-W131008168605550474442403938
Hover a cell for cohort detail.
Lower
Higher

Synthetic data above, same shape as the live model. The public version — running on the GA4 e-commerce sample on BigQuery — is on Kagglearrow_outward. Every filter recomputes the heatmap, channel curves, LTV:CAC, and a payback simulator. Nothing is hardcoded.

04
Chapter

Now I just update the data

Now it's the same routine every month: drop in the new ledger extract, run the notebook, done. The board deck pulls itself from the same view.

The first month we ran it, one channel showed a 30% higher 12-week retention than the next best — a finding that immediately reshaped acquisition spend. Today, ₹40 Cr+ of revenue sits in the LTV/CAC model that drives quarterly reallocation at the leadership level.

LTV / CAC calculator · Live

Adjust inputs, watch ratio

5,000 ₹/month
85 %
3,000
Lifetime value (LTV)
33,333
LTV : CAC ratio
11.11
Strong
Formula: LTV = ARPU / (1 − retention). Ratio = LTV / CAC.
Healthy > 3 · Marginal 1–3 · Loss < 1.

The internal pipeline still uses the static LTV formula. The public Kaggle build adds a polynomial layer that predicts full LTV from first-week spend (R² ≈ 0.5–0.7 on the GA4 sample). Porting it back is next — especially for cohorts under 8 weeks old, where the static formula underestimates.