Cohort Retention & LTV Modeler
Replaced a 16-hour manual process with a Python pipeline feeding Power BI. First time the business had channel-level retention and LTV/CAC it could trust — went straight into leadership reviews.
We were doing this in a spreadsheet
The retention numbers leadership wanted lived in a Google Sheet, refreshed twice a month by one analyst. It broke any time a row went in out of order. Each refresh cost about 16 hours.
The data was right there in the revenue ledger. It just wasn't reaching anyone who could use it — so every spend conversation that should have leaned on data leaned on opinion instead.
An analyst shouldn't be on call to rebuild a spreadsheet every Sunday.
What I actually needed
So I tried doing the cohort math inside Power BI first — that's where the reports already lived. Wrong tool.
Cohort matrices need self-joins, and DAX isn't built for that. The realization: cohorts aren't a BI calculation. They're a data pipeline. The BI layer should just read the answer.
Rebuilt it in Python
I rebuilt it as a small Pandas pipeline. Every night it computes the weekly cohort matrix and writes a flat table to Postgres. Power BI reads one view — no DAX, no joins, no off-by-one.
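The core of that pipeline can be sketched in a few lines. This is a minimal version, not the production code — the column names (`customer_id`, `order_date`) are placeholders for whatever the ledger extract actually uses:

```python
import pandas as pd

def weekly_cohort_matrix(ledger: pd.DataFrame) -> pd.DataFrame:
    """Percent of each weekly cohort still active N weeks after first purchase.

    Assumes a ledger with `customer_id` and `order_date` columns
    (illustrative names, not the real schema).
    """
    df = ledger.copy()
    df["order_week"] = pd.to_datetime(df["order_date"]).dt.to_period("W")
    # Cohort = the week of each customer's first purchase.
    df["cohort_week"] = df.groupby("customer_id")["order_week"].transform("min")
    # Period subtraction yields a Week offset; .n is the integer week count.
    df["weeks_since"] = (df["order_week"] - df["cohort_week"]).apply(lambda p: p.n)
    # Distinct active customers per (cohort, weeks-since-acquisition) cell.
    active = (
        df.groupby(["cohort_week", "weeks_since"])["customer_id"]
          .nunique()
          .unstack(fill_value=0)
    )
    # Normalize each row by its cohort size (the week-0 column) -> retention %.
    return active.div(active[0], axis=0).mul(100).round(1)
```

The output is already the flat shape Power BI wants — one row per cohort, one column per week offset — so the BI layer really does just read the answer.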
Two specific choices made it work. Weekly cohorts — monthly aggregation hides the week 2–6 churn cliff entirely. And the simplest defensible LTV formula: ARPU ÷ (1 − retention rate). I benchmarked it against fancier models; the difference was under 8% for mature cohorts.
Simple and auditable beats clever and opaque when leadership has to trust the number.
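That formula is just the geometric-series expectation: with a constant retention rate r, expected customer lifetime is 1 / (1 − r) periods. A minimal sketch with made-up numbers (these are not the production inputs):

```python
def simple_ltv(arpu: float, retention_rate: float) -> float:
    """Static LTV: ARPU / (1 - retention rate).

    With constant period-over-period retention r, expected lifetime is
    1 / (1 - r) periods, so lifetime revenue is ARPU times that.
    """
    if not 0 <= retention_rate < 1:
        raise ValueError("retention_rate must be in [0, 1)")
    return arpu / (1 - retention_rate)

# Illustrative: ARPU of 500 per period at 75% retention -> 4 expected
# periods -> LTV of 2000.
print(simple_ltv(500, 0.75))  # 2000.0
```

Anyone can check this on a calculator, which is exactly the point of choosing it over an opaque model.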
12-week retention by cohort (%)
| Cohort | W1 | W2 | W3 | W4 | W5 | W6 | W7 | W8 | W9 | W10 | W11 | W12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-W01 | 100 | 76 | 63 | 56 | 50 | 46 | 43 | 41 | 39 | 38 | 36 | 35 |
| 2024-W05 | 100 | 79 | 66 | 58 | 52 | 48 | 45 | 42 | 40 | 38 | 37 | 36 |
| 2024-W09 | 100 | 77 | 64 | 57 | 52 | 47 | 44 | 42 | 40 | 38 | 37 | 36 |
| 2024-W13 | 100 | 81 | 68 | 60 | 55 | 50 | 47 | 44 | 42 | 40 | 39 | 38 |
Synthetic data above, same shape as the live model. The public version — running on the GA4 e-commerce sample on BigQuery — is on Kaggle. Every filter recomputes the heatmap, channel curves, LTV:CAC, and a payback simulator. Nothing is hardcoded.
Now I just update the data
Now it's the same routine every month: drop in the new ledger extract, run the notebook, done. The board deck pulls itself from the same view.
The first month we ran it, one channel showed a 30% higher 12-week retention than the next best — a finding that immediately reshaped acquisition spend. Today, ₹40 Cr+ of revenue sits in the LTV/CAC model that drives quarterly reallocation at the leadership level.
Adjust inputs, watch ratio
Healthy > 3 · Marginal 1–3 · Loss < 1.
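The bands map to a trivial classifier. The thresholds are the ones above; everything else here is illustrative:

```python
def ltv_cac_band(ltv: float, cac: float) -> str:
    """Classify an LTV:CAC ratio using the dashboard's bands:
    Healthy > 3, Marginal 1-3, Loss < 1."""
    ratio = ltv / cac
    if ratio > 3:
        return "Healthy"
    if ratio >= 1:
        return "Marginal"
    return "Loss"

# Illustrative: LTV 2000 against CAC 600 -> ratio ~3.3 -> Healthy.
print(ltv_cac_band(2000, 600))  # Healthy
```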
The internal pipeline still uses the static LTV formula. The public Kaggle build adds a polynomial layer that predicts full LTV from first-week spend (R² ≈ 0.5–0.7 on the GA4 sample). Porting it back is next — especially for cohorts under 8 weeks old, where the static formula underestimates.
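The polynomial layer amounts to a curve fit from first-week spend to observed full LTV. A sketch with NumPy on synthetic data — the degree and single-feature setup are assumptions for illustration, not the published Kaggle model:

```python
import numpy as np

def fit_ltv_curve(first_week_spend: np.ndarray,
                  observed_ltv: np.ndarray,
                  degree: int = 2) -> np.poly1d:
    """Fit a polynomial mapping first-week spend -> full LTV.

    Degree 2 is an assumption; the real build would pick it by validation.
    """
    coeffs = np.polyfit(first_week_spend, observed_ltv, deg=degree)
    return np.poly1d(coeffs)

# Synthetic data: full LTV is roughly 3x first-week spend plus noise.
rng = np.random.default_rng(0)
spend = rng.uniform(10, 200, size=200)
ltv = 3 * spend + rng.normal(0, 20, size=200)

model = fit_ltv_curve(spend, ltv)
# model(x) now predicts LTV for a young cohort from its first-week spend.
```

For cohorts under 8 weeks old this is exactly the gap the static formula leaves: you have first-week behavior but not enough history for the retention curve to settle.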