Pooled PCA — Interactive Lab

A pedagogical companion to Pooled PCA for Building Development Indicators Across Time

Why pooled PCA? The yardstick problem.

You want a single development index that tracks 153 South American regions in both 2013 and 2019. The naive recipe — run PCA separately for each year — silently re-centres the data every period, so a region's improvement can appear as a decline. Pooled PCA fixes the yardstick: it standardises and computes eigenvector weights from the stacked panel (all 306 observations), producing one set of weights that applies to both periods.
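The pooled recipe described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the post's script.py: the function name `pooled_pca_weights`, the random panel, and the period shift are all assumptions made for the example.

```python
import numpy as np

def pooled_pca_weights(periods):
    """periods: list of (n_regions, n_indicators) arrays, one per year."""
    stacked = np.vstack(periods)               # e.g. 153 + 153 = 306 rows
    mean = stacked.mean(axis=0)                # one mean per indicator, all years
    sd = stacked.std(axis=0)                   # one SD per indicator, all years
    z = (stacked - mean) / sd                  # same transformation for every row
    corr = (z.T @ z) / len(z)                  # correlation matrix of the stacked panel
    eigvals, eigvecs = np.linalg.eigh(corr)    # eigenvalues in ascending order
    w = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue
    if w[0] < 0:                               # sign convention: leading entry positive
        w = -w
    share = eigvals[::-1] / eigvals.sum()      # variance explained, PC1 first
    return w, mean, sd, share

# Synthetic stand-in for the panel (hypothetical values, three indicators).
rng = np.random.default_rng(0)
common = rng.normal(size=(153, 1))
p1 = common + 0.5 * rng.normal(size=(153, 3))
p2 = p1 + np.array([0.3, 0.1, -0.2]) + 0.1 * rng.normal(size=(153, 3))

w, mean, sd, share = pooled_pca_weights([p1, p2])
score_2013 = ((p1 - mean) / sd) @ w            # both years scored with one yardstick
score_2019 = ((p2 - mean) / sd) @ w
print("weights:", w.round(3), "PC1 share:", share[0].round(3))
```

Because the mean, SD, and weights are computed once, the difference `score_2019 - score_2013` for a region reflects only changes in that region's indicators.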

This app walks you through the choice in four tabs. You'll see the animation of a "shifting" vs "fixed" yardstick, slide the parameters of a two-period simulation to watch per-period weights wobble while pooled weights stay still, compare real PC1 weights from the post side-by-side with confidence intervals, and reproduce the validation against the official Subnational HDI.

Shifting yardstick (per-period) vs fixed yardstick (pooled)

The animation below shows the same coefficient under two regimes as the "penalty knob" sweeps — read it as how much the standardisation baseline can move between periods. The orange line is the per-period regime: the baseline shifts, so a region's measured value can hit zero or even flip sign. The steel-blue line is the pooled regime: the baseline stays fixed and the value decays smoothly. Pooled keeps genuine improvements visible; per-period can hide them.

Tab 2

PCA Simulator

Move the income-shock and education-gain sliders to control a two-period data-generating process (DGP). Compare pooled and per-period PC1 weights live, and watch the per-period weights drift while the pooled weights stay fixed.

Tab 3

Weight Comparison

The real numbers from the post: PC1 weights for Education, Health, and Income under pooled vs per-period 2013 vs per-period 2019, with 95% intervals.

Tab 4

SHDI Validation

Which method better tracks the official Subnational HDI? Compare R² for levels (0.9823 vs 0.9750) and changes (0.9964 vs 0.9913) at a glance.

Glossary (open a card if a term is unfamiliar)

PCA (Principal Component Analysis)
A linear technique that finds new axes capturing maximum variance. PC1 is the direction along which the data spreads the most. Used here to compress Education, Health, and Income into a single composite development index.
Pooled standardisation
Compute the mean and SD using all periods stacked, then apply the same transformation to every observation. The yardstick stays fixed across years.
Per-period standardisation
Compute the mean and SD separately for each period. The 2019 mean replaces the 2013 mean, so the yardstick shifts — a region's z-score can appear better simply because the cohort got worse.
PC1 weight
The entries of the first eigenvector, e.g. [0.5642, 0.5448, 0.6204] for Education, Health, Income in this post. PCA's data-driven recipe for combining indicators into one score.
Variance explained
Fraction of total variance captured by each principal component. In this post: PC1 = 72.4%, PC2 = 18.8%, PC3 = 8.8%.
Sign convention
Eigenvectors are unique only up to sign. We force the leading entry positive so PC1 increases with Education, Health, and Income — "higher = better".
Spearman rank correlation
A correlation computed on ranks instead of raw values. Robust to outliers. Used to compare how two methods order the same regions.
Subnational HDI (SHDI)
A sub-national Human Development Index from the Global Data Lab, built as a geometric mean of Education, Health, and Income. The official benchmark our PCA-based index is validated against.
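As a small illustration of the two standardisation entries in the glossary, the sketch below uses four made-up regions. Region 0 keeps the same indicator value in both periods while the rest of the cohort declines. The values and the helper name `zscores` are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical indicator values; region 0 is unchanged, the others decline.
p1 = [0.80, 0.70, 0.60, 0.50]
p2 = [0.80, 0.60, 0.50, 0.40]

def zscores(values, reference):
    """Standardise `values` with the mean and SD of `reference`."""
    m, s = mean(reference), pstdev(reference)
    return [(v - m) / s for v in values]

# Per-period: each year is measured against its own mean and SD.
per_period_change = zscores(p2, p2)[0] - zscores(p1, p1)[0]

# Pooled: both years are measured against the stacked mean and SD.
stacked = p1 + p2
pooled_change = zscores(p2, stacked)[0] - zscores(p1, stacked)[0]

print(f"per-period change in z: {per_period_change:+.3f}")
print(f"pooled change in z:     {pooled_change:+.3f}")
```

Under per-period standardisation region 0 appears to improve even though its value did not move; under pooled standardisation its change is zero.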

PCA Simulator — two periods, two recipes

The simulator generates two periods of synthetic development data with three indicators (mimicking Education / Health / Income). You set parameters; the app fits both pooled PCA and per-period PCA on the simulated data and shows the resulting PC1 weights, variance explained, and the period-to-period drift in per-period weights. The pooled weights should stay nearly identical when you re-seed. The per-period weights should wobble — that's the bug pooled PCA fixes.

Number of regions in each year. Total observations = 2 × n.
Shift applied to Income mean in period 2. Negative = decline (mimics South America 2013→2019).
Shift applied to Education mean in period 2. Positive = improvement.
Idiosyncratic variation across regions. More noise = lower variance explained by PC1.

Pooled PCA

Standardise on stacked data (300 obs), one set of weights.

w(Education)
w(Health)
w(Income)
PC1 variance explained
PC1 mean shift (P2 − P1)

Per-period PCA

Standardise on each period separately, two sets of weights.

w(Edu)   P1 → P2
w(Hea)   P1 → P2
w(Inc)   P1 → P2
Avg weight drift |Δw|
PC1 mean shift (P2 − P1)

What to look for

  • Pooled weights are stable across re-seeds and across parameter changes. Try sliding the income shock from −0.40 to +0.20: pooled weights barely move.
  • Per-period weights drift between P1 and P2. The bigger the level shift in any indicator, the more its per-period weight wobbles.
  • PC1 mean shift is informative under pooled (positive for improvement, negative for decline) but is exactly zero under per-period by construction. That's how per-period hides real changes.
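The last point in the list above can be checked on a toy simulation. The sketch below assumes a simple one-factor DGP with hypothetical parameter names (`income_shock`, `education_gain`, `noise`); the app's own generator may differ.

```python
import numpy as np

def pc1(z):
    """Leading eigenvector of the correlation matrix, first entry positive."""
    _, vecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    w = vecs[:, -1]
    return w if w[0] > 0 else -w

def standardise(x, ref):
    """Z-score x using the mean and SD of the reference sample."""
    return (x - ref.mean(axis=0)) / ref.std(axis=0)

def simulate(n=150, income_shock=-0.4, education_gain=0.2, noise=0.5, seed=0):
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n, 1))                    # common development factor
    p1 = latent + noise * rng.normal(size=(n, 3))       # columns: Edu, Health, Income
    shift = np.array([education_gain, 0.0, income_shock])
    p2 = latent + shift + noise * rng.normal(size=(n, 3))
    return p1, p2

p1, p2 = simulate()
stacked = np.vstack([p1, p2])

# Pooled: one yardstick and one weight vector for both periods.
w_pool = pc1(standardise(stacked, stacked))
pooled_shift = (standardise(p2, stacked) @ w_pool).mean() \
             - (standardise(p1, stacked) @ w_pool).mean()

# Per-period: each period is re-centred on its own mean before scoring.
w1, w2 = pc1(standardise(p1, p1)), pc1(standardise(p2, p2))
per_period_shift = (standardise(p2, p2) @ w2).mean() \
                 - (standardise(p1, p1) @ w1).mean()

print(f"pooled PC1 mean shift:     {pooled_shift:+.4f}")
print(f"per-period PC1 mean shift: {per_period_shift:+.4f}")
```

Each per-period score is a weighted sum of z-scores that average to zero within their own period, so the per-period mean shift vanishes up to floating-point error regardless of the parameters.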

Bias vs variance over many simulations

Single runs are noisy. Run the full DGP 100 times with fresh seeds to see whether the per-period weight drift is systematic.
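A repeated-simulation loop of this kind might look like the sketch below. The DGP, the fixed shifts, and the names `one_run` and `pc1_weights` are assumptions for illustration, not the app's actual code.

```python
import numpy as np

def pc1_weights(x):
    """PC1 of the correlation matrix of x, first entry positive."""
    _, vecs = np.linalg.eigh(np.corrcoef(x, rowvar=False))
    w = vecs[:, -1]
    return w if w[0] > 0 else -w

def one_run(seed, n=150, noise=0.5):
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n, 1))
    p1 = latent + noise * rng.normal(size=(n, 3))
    p2 = latent + np.array([0.3, 0.0, -0.4]) + noise * rng.normal(size=(n, 3))
    # Per-period PCA: the correlation matrix already removes each period's mean and SD.
    drift = np.abs(pc1_weights(p2) - pc1_weights(p1)).mean()
    # Pooled PCA: correlation matrix of the stacked panel.
    return drift, pc1_weights(np.vstack([p1, p2]))

runs = [one_run(seed) for seed in range(100)]
drifts = np.array([d for d, _ in runs])
pooled = np.array([w for _, w in runs])

print(f"mean per-period drift |dw|: {drifts.mean():.3f}")
print("SD of pooled weights across seeds:", pooled.std(axis=0).round(3))
```

Comparing the average drift with the seed-to-seed spread of the pooled weights gives a rough sense of how much of the wobble is sampling noise.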

Real PC1 weights — pooled vs per-period 2013 vs per-period 2019

These weights come straight from the post's script.py run on the 153-region South American panel. Each weight tells you how much PC1 weighs that indicator. Toggle methods and indicators to see how the per-period weights shift between 2013 and 2019 while the pooled weights sit in between as a fixed compromise. Confidence intervals are approximate (±1.96 × bootstrap SE).
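For readers who want to see how an interval of the form ±1.96 × bootstrap SE can be produced, here is a minimal sketch on synthetic data. It resamples rows with replacement, which is an assumption; the post's script may resample differently (for example, by region so that both years of a region stay together). The names `bootstrap_ci` and `pc1_weights` are hypothetical.

```python
import numpy as np

def pc1_weights(x):
    """PC1 of the correlation matrix of x, first entry positive."""
    _, vecs = np.linalg.eigh(np.corrcoef(x, rowvar=False))
    w = vecs[:, -1]
    return w if w[0] > 0 else -w

def bootstrap_ci(x, n_boot=500, seed=0):
    """Normal-approximation interval: estimate +/- 1.96 * bootstrap SE."""
    rng = np.random.default_rng(seed)
    n = len(x)
    est = pc1_weights(x)
    draws = np.array([
        pc1_weights(x[rng.integers(0, n, size=n)])   # resample rows with replacement
        for _ in range(n_boot)
    ])
    se = draws.std(axis=0, ddof=1)
    return est, est - 1.96 * se, est + 1.96 * se

# Synthetic stand-in for a 306-row stacked panel with three indicators.
rng = np.random.default_rng(1)
latent = rng.normal(size=(306, 1))
panel = latent + 0.6 * rng.normal(size=(306, 3))

est, lo, hi = bootstrap_ci(panel)
for name, e, l, h in zip(["Education", "Health", "Income"], est, lo, hi):
    print(f"{name:9s} {e:.3f}  [{l:.3f}, {h:.3f}]")
```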

What to look for

  • Education's weight drops from 0.583 (2013) to 0.541 (2019) under per-period — a real shift of −0.043. The pooled weight (0.564) sits between them.
  • Health's weight rises from 0.510 (2013) to 0.566 (2019) — a jump of +0.056. Pooled gives 0.545, again the compromise.
  • Income's weight is the most stable across methods (0.620–0.633): all three approaches agree Income carries the heaviest weight.
  • The forest plot makes the recipe-instability of per-period PCA visible at a glance. Pooled PCA fixes one recipe across both years.

Indicators

Methods

Why does Income carry the heaviest weight?

In the South American panel the three indicators are positively but unequally correlated: Education–Income r = 0.68, Health–Income r = 0.63, Education–Health r = 0.44. Income has the two strongest correlations, so it shares the most variance with the other indicators. PCA's first eigenvector loads most heavily on the variable that is most strongly correlated with the rest, hence w(Income) = 0.620 > w(Education) = 0.564 > w(Health) = 0.545.
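The ordering can be checked directly from the three correlations quoted above. The sketch below builds the 3 × 3 correlation matrix from those rounded values and extracts its leading eigenvector; because the inputs are rounded to two decimals, the result should match the reported weights only approximately.

```python
import numpy as np

# Rounded pairwise correlations from the text; order: Education, Health, Income.
R = np.array([
    [1.00, 0.44, 0.68],
    [0.44, 1.00, 0.63],
    [0.68, 0.63, 1.00],
])

eigvals, eigvecs = np.linalg.eigh(R)     # eigenvalues in ascending order
w = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
w = w if w[0] > 0 else -w                # sign convention: leading entry positive
share = eigvals[-1] / eigvals.sum()      # PC1 variance explained

print("PC1 weights:", w.round(3))
print("PC1 variance explained:", round(float(share), 3))
```

The weights come out close to [0.564, 0.545, 0.620] and the PC1 share close to 72.4%, consistent with the pooled figures reported in the post.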

Connecting back to Tab 2

The per-period weight drift you produced with the sliders in Tab 2 is exactly what shows up here on real data:

  • Education weight: 0.583 (2013) → 0.541 (2019), drift = −0.043
  • Health weight: 0.510 (2013) → 0.566 (2019), drift = +0.056
  • Pooled holds the recipe fixed at [0.564, 0.545, 0.620] across both years.

The post's message becomes visible twice: once on synthetic data you control, and once on the original panel of 306 region-period observations.

Validation — which PCA tracks the official SHDI better?

The Global Data Lab publishes an official Subnational HDI (SHDI) using a geometric mean methodology similar to the UNDP's. Both pooled and per-period PCA correlate strongly with it — but pooled PCA wins on both the level fit and the change fit. The differences are small in absolute terms (≈ 0.5–0.7 percentage points of R²) but consistent and policy-relevant: per-period PCA disagrees with pooled PCA on the direction of change for 16 of 153 regions (10.5%).

Cross-sectional fit (levels)

Correlation between PCA-based HDI and official SHDI across all 306 region-period observations.

Pooled PCA R²: 0.9823
Per-period PCA R²: 0.9750
Pooled advantage: +0.0073

Dynamic fit (changes)

Correlation between PCA-based HDI change (2019 − 2013) and official SHDI change.

Pooled PCA R²: 0.9964
Per-period PCA R²: 0.9913
Pooled advantage: +0.0051

Direction-of-change disagreement

Even when the methods agree on average (Spearman ρ for HDI change ranks = 0.9818), they disagree on the sign of the change for a non-trivial slice of the sample:

Direction disagreements: 16 / 153 (10.5% of regions; per-period says down, pooled says up)
Spearman ρ (change ranks): 0.9818 (methods agree on rank, disagree on sign)
City of Buenos Aires: +0.019 vs −0.040 (pooled vs per-period HDI change)
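High rank agreement and sign disagreement can coexist, and a short sketch makes the mechanics concrete. The two helpers below (`spearman`, `direction_disagreements`) and the synthetic change vectors are illustrative assumptions; only the Buenos Aires pair (+0.019 vs −0.040) comes from the post.

```python
import numpy as np

def spearman(a, b):
    """Spearman rho as the Pearson correlation of ranks (assumes no ties)."""
    def rank(v):
        return np.argsort(np.argsort(v))
    return np.corrcoef(rank(a), rank(b))[0, 1]

def direction_disagreements(change_a, change_b):
    """Number of units whose change has a different sign under the two methods."""
    return int(np.sum(np.sign(change_a) != np.sign(change_b)))

# Buenos Aires from the post: pooled says up, per-period says down.
ba_flip = direction_disagreements(np.array([0.019]), np.array([-0.040]))

# Synthetic changes for 153 regions: per-period is pooled minus a constant offset,
# mimicking a shifted baseline, plus a little noise (hypothetical numbers).
rng = np.random.default_rng(0)
pooled_change = rng.normal(0.02, 0.03, size=153)
per_period_change = pooled_change - 0.02 + rng.normal(0, 0.003, size=153)

rho = spearman(pooled_change, per_period_change)
n_flip = direction_disagreements(pooled_change, per_period_change)
print(f"Spearman rho = {rho:.4f}, sign disagreements = {n_flip} / 153")
```

A near-constant offset barely changes the ordering of regions, which keeps ρ high, yet it flips the sign for every region whose pooled change is smaller than the offset.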

Buenos Aires — the running example

Argentina's capital improved in Education (0.926 → 0.946) and Health (0.858 → 0.872), with a modest Income decline (0.850 → 0.832). Pooled PCA correctly reports a modest improvement of +0.019. Per-period PCA reports a decline of −0.040. The shifting yardstick of per-period standardisation is what flips the sign.

A policymaker using per-period PCA might conclude that Buenos Aires "fell behind". A policymaker using pooled PCA sees that it improved modestly while being overtaken by Chilean regions that improved faster. Both narratives are useful — but the data should drive the narrative, not the choice of standardisation.