PCA Interactive Lab — Building a Health Index

A pedagogical companion to Introduction to PCA Analysis for Building Development Indicators

Why PCA? Why a Health Index?

In development economics, no single number captures progress. To rank countries on health, you might consider life expectancy, infant mortality, hospital beds, and disease prevalence — variables in different units that you cannot simply add together. Principal Component Analysis (PCA) compresses many correlated indicators into one elegant index by finding the direction in which countries differ most. That direction is the composite indicator.

This app lets you turn the dials yourself. In four tabs you will: watch a rotating cloud of countries until PCA aligns its arrow with the long axis of the data; vary the correlation between two health indicators and see what fraction of variance PC1 absorbs; rank 50 simulated countries by their actual Health Index from the post; and compare the eigenvector loadings against scikit-learn.

PC1 finds the long axis of the country cloud

The animation below sweeps a candidate direction (orange arrow) around the origin while a steel-blue gauge tracks the variance projected onto that direction. The variance peaks at the diagonal — exactly where PCA places PC1. The teal arrow is the perpendicular PC2, which always captures the leftover variance.
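The sweep described above can be reproduced offline. The following is a minimal sketch, not the app's actual code: the simulated cloud, its correlation of 0.9, the sample size, and the seed are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical cloud: two positively correlated indicators (assumed r near 0.9).
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
Z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500)
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)  # z-score each column

# Sweep candidate directions over half a turn; a direction and its negative
# give the same projected variance, so 0..179 degrees covers every case.
angles_deg = np.arange(0, 180)
angles = np.deg2rad(angles_deg)
directions = np.column_stack([np.cos(angles), np.sin(angles)])  # one unit vector per row
projected_var = (Z @ directions.T).var(axis=0)  # variance of the cloud along each direction

best_angle = int(angles_deg[np.argmax(projected_var)])
r = np.corrcoef(Z, rowvar=False)[0, 1]
print(f"variance peaks at {best_angle} degrees")
print(f"peak variance = {projected_var.max():.4f}, 1 + r = {1 + r:.4f}")
```

For standardised data the projected variance along angle θ is 1 + r·sin(2θ), so with positive r the peak sits on the 45° diagonal and its height equals the first eigenvalue.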

Tab 2

PCA Simulator

Vary the correlation between life expectancy and infant survival. Watch PC1's variance explained slide from 50% (no correlation) toward 100% (perfect alignment).

Tab 3

Country Rankings

The post's actual 50-country leaderboard. Sort, hover, and see how raw indicators translate into a single 0-to-1 Health Index.

Tab 4

Loadings & Variance

The closed-form result for two standardised variables: whenever the indicators are positively correlated, the PC1 loadings are [0.7071, 0.7071] (up to an overall sign), whatever the size of r. See why, and how scikit-learn agrees.

Glossary (open a card if a term is unfamiliar)

Polarity adjustment
Flipping the sign of "more is bad" indicators (multiply by −1) so all variables point in the same direction. Infant mortality → infant survival.
Standardisation (z-score)
Subtract the mean, divide by the standard deviation. Each variable now has mean 0 and SD 1. Makes years and rates directly comparable.
Covariance matrix Σ
A square symmetric matrix with variances on the diagonal and covariances off it. For standardised data, the off-diagonal is the correlation r.
Eigenvector v
A direction in data space that Σ stretches without rotating. The first eigenvector is the PC1 loading — the recipe for combining variables.
Eigenvalue λ
The stretch factor along an eigenvector. Equals the variance of the data projected onto that direction.
Variance explained
λ_k divided by the sum of all eigenvalues. PC1 explained ≈ 98% in the post's two-variable example.
Component score
Z @ v — the projection of standardised data onto PC1. One number per country, ready to rank.
Min-max normalisation
(score − min) / (max − min). Rescales PC1 scores into the [0, 1] Health Index that policymakers expect.

PCA Simulator — how correlation drives compression

Generate simulated countries with two indicators. The latent health factor base_health drives both, but you control how strongly. Slide the correlation knob from 0 (independent indicators — PCA cannot compress) toward ±1 (perfectly redundant — PCA squeezes everything into PC1). The post's setting is r ≈ 0.96, where PC1 absorbs 98% of variance.
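One way to build such a simulator is sketched below. The function name `simulate_countries`, the `strength` parameter, and the mixing rule are assumptions made for illustration; the app's own data-generating code may differ. Only non-negative correlations are covered here.

```python
import numpy as np

def simulate_countries(n=50, strength=0.96, seed=0):
    """Draw two indicators that share one latent factor (hypothetical design).

    Each indicator is sqrt(strength) * base_health + sqrt(1 - strength) * noise,
    so the population correlation between the two indicators equals `strength`
    (valid for 0 <= strength <= 1).
    """
    rng = np.random.default_rng(seed)
    base_health = rng.standard_normal(n)           # latent health factor
    w, noise_w = np.sqrt(strength), np.sqrt(1.0 - strength)
    life_expectancy = w * base_health + noise_w * rng.standard_normal(n)
    infant_survival = w * base_health + noise_w * rng.standard_normal(n)
    return np.column_stack([life_expectancy, infant_survival])

for strength in (0.0, 0.5, 0.96):
    X = simulate_countries(n=2000, strength=strength)
    R = np.corrcoef(X, rowvar=False)               # correlation matrix of the sample
    eigvals = np.linalg.eigvalsh(R)                # ascending order
    share = eigvals[-1] / eigvals.sum()
    print(f"target r={strength:.2f}  sample r={R[0, 1]:+.3f}  PC1 share={share:.1%}")
```

Raising `n` pulls the sample correlation closer to the target, which is the stabilising effect the sample-size control refers to.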

More countries make the eigenvectors more stable.
Strength of relationship between life expectancy and infant survival.
Idiosyncratic variation around the latent health factor.
Slide to draw a fresh sample with the same parameters.
  • Sample correlation (after polarity adjustment)
  • λ₁, the PC1 eigenvalue (closed form: 1 + |r|)
  • λ₂, the PC2 eigenvalue (closed form: 1 − |r|)
  • PC1 variance explained: λ₁ / (λ₁ + λ₂)

What to look for

  • At r ≈ 0, the cloud is a fuzzy ball and PC1 explains only 50% — PCA offers no compression. The two indicators carry independent information.
  • At |r| ≈ 0.5, PC1 explains 75% — useful but you would still need PC2 to summarise countries faithfully.
  • At r ≈ ±0.96 (the post's setting), PC1 absorbs ≈ 98%. The cloud is a thin diagonal cigar — one number captures almost everything.
  • The closed form λ₁ = 1 + |r|, λ₂ = 1 − |r| only holds for two standardised variables. With 3+ variables the eigenvalues are not so clean — PCA earns its keep.
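The figures in the list above follow directly from the 2×2 correlation matrix. A short check, using a helper name (`pc1_share`) that is purely illustrative:

```python
import numpy as np

def pc1_share(r):
    """Eigenvalues (descending) and PC1 variance share for two standardised variables."""
    R = np.array([[1.0, r], [r, 1.0]])
    lam = np.linalg.eigvalsh(R)[::-1]   # eigvalsh returns ascending order; reverse it
    return lam, lam[0] / lam.sum()

for r in (0.0, 0.5, -0.5, 0.96):
    lam, share = pc1_share(r)
    print(f"r={r:+.2f}  lambda1={lam[0]:.4f}  lambda2={lam[1]:.4f}  PC1 share={share:.1%}")

# r=+0.00  lambda1=1.0000  lambda2=1.0000  PC1 share=50.0%
# r=+0.50  lambda1=1.5000  lambda2=0.5000  PC1 share=75.0%
# r=-0.50  lambda1=1.5000  lambda2=0.5000  PC1 share=75.0%
# r=+0.96  lambda1=1.9600  lambda2=0.0400  PC1 share=98.0%
```

The sign of r does not affect the eigenvalues, which is why the closed form is written with |r|.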

Monte-Carlo stability test

How much does the variance-explained estimate move from sample to sample? Run 100 fresh simulations at the current settings and watch the histogram.
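A stability test of this kind might look like the sketch below. The sample size of 50, the population correlation of 0.96, the bivariate-normal generator, and the seed are assumptions chosen to mirror the post's setting, not the app's actual implementation.

```python
import numpy as np

def variance_explained_once(rng, n=50, rho=0.96):
    """Draw one sample of n countries and return PC1's share of variance."""
    cov = [[1.0, rho], [rho, 1.0]]
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    r = np.corrcoef(X, rowvar=False)[0, 1]
    return (1.0 + abs(r)) / 2.0   # closed form for two standardised variables

rng = np.random.default_rng(42)   # arbitrary seed
shares = np.array([variance_explained_once(rng) for _ in range(100)])

counts, edges = np.histogram(shares, bins=10)   # the data behind a histogram plot
print(f"mean={shares.mean():.4f}  sd={shares.std(ddof=1):.4f}")
print(f"min={shares.min():.4f}  max={shares.max():.4f}")
```

Rerunning with a larger `n` should narrow the spread, since the sample correlation itself becomes less noisy.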

The post's 50-country Health Index — interactively

These numbers come straight from health_index_results.csv in the post's folder: the same PC1 scores and Health Index used in §10 and §11. Toggle the indicator filter to see how countries with low life expectancy cluster at the bottom, and note that the rank order is identical for raw PC1 and the 0-to-1 Health Index, because min-max normalisation is a monotonic rescaling.

What to look for

  • The top 10 (Health Index > 0.78) all combine high life expectancy (≥ 79 years) with low infant mortality (≤ 17 per 1,000) — no country reaches the top by excelling on only one indicator.
  • Country_05 has Health Index = 0.00 by construction — it is the worst performer, and min-max normalisation pins the minimum to zero.
  • Country_28 has Health Index ≈ 0.0015 — its bar is invisible at this scale, not missing data. Life expectancy 57.7 years and infant mortality 58.7 per 1,000 place it almost as low as the floor.
  • Hover any bar to read the raw indicators and PC1 score for that country.
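The floor behaviour noted in the list above is a property of min-max normalisation itself. In the small sketch below the five scores are made up for illustration; only the two endpoints echo the PC1 range quoted in the post.

```python
import numpy as np

# Hypothetical PC1 scores; the second value sits just above the minimum,
# mimicking a country whose bar is almost invisible.
pc1 = np.array([-2.39, -2.38, -0.40, 0.85, 2.37])

health_index = (pc1 - pc1.min()) / (pc1.max() - pc1.min())

# Rank 0 = best. A double argsort turns scores into ranks.
rank_pc1 = np.argsort(np.argsort(-pc1))
rank_index = np.argsort(np.argsort(-health_index))

print(np.round(health_index, 4))
print("worst country index:", health_index.min())   # pinned to 0.0
print("best country index:", health_index.max())    # pinned to 1.0
print("same ranking:", np.array_equal(rank_pc1, rank_index))  # True
```

Because the transformation only shifts and rescales, it cannot reorder countries; it changes the units, not the leaderboard.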

Sort by

Filter (life expectancy)

Hide countries below this threshold.

How the Health Index is built: six steps at a glance

  1. Polarity: flip infant mortality → infant survival (multiply by −1).
  2. Standardise: z-score both indicators (mean 0, SD 1).
  3. Covariance: compute the 2×2 matrix (off-diagonal = r = 0.9595).
  4. Eigen-decompose: recover eigenvectors [0.7071, 0.7071] and [0.7071, −0.7071] with eigenvalues 1.9595 and 0.0405.
  5. Project: PC1 score = 0.7071·z_LE + 0.7071·z_IS. Range [−2.39, +2.37].
  6. Normalise: (PC1 − min) / (max − min) → Health Index in [0, 1].
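The six steps above can be sketched in NumPy as follows. The raw data here are simulated under assumed parameters (means, slopes, noise levels, seed) because the post's CSV is not reproduced on this page, so the printed numbers will not match the post's.

```python
import numpy as np

rng = np.random.default_rng(7)   # arbitrary seed
n = 50

# Hypothetical raw indicators driven by one latent factor.
base_health = rng.standard_normal(n)
life_expectancy = 70 + 6 * base_health + rng.normal(0, 1.5, n)     # years
infant_mortality = 40 - 12 * base_health + rng.normal(0, 3.0, n)   # per 1,000 births

# 1. Polarity: flip the "more is bad" indicator.
infant_survival = -infant_mortality

# 2. Standardise: z-score each column (population SD, ddof=0).
X = np.column_stack([life_expectancy, infant_survival])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 3. Covariance of standardised data: 1 on the diagonal, r off it.
cov = (Z.T @ Z) / n

# 4. Eigen-decompose (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(cov)
v1 = eigvecs[:, -1]
if v1.sum() < 0:          # eigenvector sign is arbitrary; make "higher = healthier"
    v1 = -v1

# 5. Project onto PC1.
pc1 = Z @ v1

# 6. Min-max normalise to [0, 1].
health_index = (pc1 - pc1.min()) / (pc1.max() - pc1.min())

print("r =", round(cov[0, 1], 4))
print("loadings =", np.round(v1, 4))
print("eigenvalues =", np.round(eigvals[::-1], 4))
```

The sign check in step 4 matters in practice: without it, a solver may return the negated eigenvector and the index would rank the least healthy country first.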

Loadings & Variance — a verification dashboard

The eigenvector entries, called loadings, are the weights PCA multiplies each standardised variable by to compute PC1. For two standardised variables with any nonzero correlation, both loadings have magnitude exactly 1/√2 ≈ 0.7071; they share the same sign when r > 0 (as here, after polarity adjustment) and take opposite signs when r < 0. Only at r = 0 is the direction undefined, because every direction then carries the same variance. The variance proportions, however, depend strongly on r, and that is what makes PCA useful.
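The 1/√2 result can be derived in a few lines. The sketch below assumes two standardised variables with correlation r ≠ 0 and uses the glossary's symbols Σ, λ, and v.

```latex
\begin{align*}
\Sigma &= \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},
&
\det(\Sigma - \lambda I) &= (1-\lambda)^2 - r^2 = 0
\;\Longrightarrow\; \lambda = 1 \pm r, \\[4pt]
\lambda = 1 + r:&\quad
\begin{pmatrix} -r & r \\ r & -r \end{pmatrix} v = 0
&&\Longrightarrow\; v = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \\[4pt]
\lambda = 1 - r:&\quad
\begin{pmatrix} r & r \\ r & r \end{pmatrix} v = 0
&&\Longrightarrow\; v = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}.
\end{align*}
```

The eigenvectors do not involve r at all; r only decides which of the two has the larger eigenvalue. With r > 0 the larger eigenvalue is 1 + r, so PC1 is the equal-weights direction, and in general λ₁ = 1 + |r| and λ₂ = 1 − |r|.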

PC1 loadings (this post)

Life expectancy: 0.7071
Infant survival: 0.7071
‖v₁‖ = 1.0000

Both indicators contribute equally — PC1 is essentially a scaled average of the two z-scores.

PC2 loadings (this post)

Life expectancy: +0.7071
Infant survival: −0.7071
‖v₂‖ = 1.0000

Perpendicular to PC1. Captures countries where the two indicators disagree — only 2% of variance.

Variance explained by component

PC1 captures 97.97% of the total variance in the standardised data, which is exactly the compression PCA is meant to deliver. PC2 captures the remaining 2.03%, almost all of it noise from the data-generating process.

Manual vs. scikit-learn — perfect agreement

The post computes PC1 in two ways: by hand (six explicit steps: polarity → standardise → covariance → eigen-decompose → project → normalise) and via scikit-learn's PCA(n_components=1). The maximum absolute difference between the two PC1-score vectors is 1.33 × 10⁻¹⁵, which is at the level of floating-point rounding error. The correlation between them is 1.000000 to six decimal places. Same eigenvectors, same scores, same Health Index rankings.
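A comparison along these lines can be run on any standardised two-column dataset. The sketch below uses simulated data (an assumption, since the post's CSV is not reproduced on this page), so the exact difference it prints will not be the post's 1.33 × 10⁻¹⁵, only a value of similar magnitude.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)   # arbitrary seed
n = 50

# Hypothetical data: two indicators sharing a latent factor.
latent = rng.standard_normal(n)
X = np.column_stack([
    latent + 0.2 * rng.standard_normal(n),
    latent + 0.2 * rng.standard_normal(n),
])
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # z-scores

# Manual route: eigen-decompose the covariance of Z, project onto the top eigenvector.
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False, ddof=0))
pc1_manual = Z @ eigvecs[:, -1]

# scikit-learn route.
pc1_sklearn = PCA(n_components=1).fit_transform(Z)[:, 0]

# Eigenvector signs are arbitrary, so align the two vectors before comparing.
if np.dot(pc1_manual, pc1_sklearn) < 0:
    pc1_sklearn = -pc1_sklearn

max_abs_diff = np.max(np.abs(pc1_manual - pc1_sklearn))
corr = np.corrcoef(pc1_manual, pc1_sklearn)[0, 1]
print(f"max |manual - sklearn| = {max_abs_diff:.2e}")
print(f"correlation = {corr:.6f}")
```

The sign-alignment step is the one detail that is easy to miss: the two routes can legitimately return scores that are exact negatives of each other.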

  • Max |PC1_manual − PC1_sklearn|: 1.33 × 10⁻¹⁵ (essentially zero)
  • Correlation: 1.000000 (all 50 points on the 45° line)
  • Total variance: 2.0000 (= λ₁ + λ₂ = number of variables)

Connecting back to Tabs 2 and 3

  • Tab 2's simulator illustrates the closed form λ₁ = 1 + |r|, λ₂ = 1 − |r|: vary r and read the eigenvalues live.
  • Tab 3's rankings are computed from the very loadings shown above: 0.7071 × z_LE + 0.7071 × z_IS, then min-max normalised.
  • The equal-loading result is a quirk of 2 variables. With 3+ indicators the loadings generally differ, with PC1 giving more weight to indicators that share more variance with the others, and that is where PCA truly earns its keep.
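To see loadings move away from equal weights, a third indicator is enough. The correlation matrix below is invented for illustration: two indicators that track each other closely (r = 0.9) plus a third that is only weakly related to both (r = 0.3).

```python
import numpy as np

# Hypothetical 3-indicator correlation matrix (not from the post).
R = np.array([
    [1.0, 0.9, 0.3],
    [0.9, 1.0, 0.3],
    [0.3, 0.3, 1.0],
])

eigvals, eigvecs = np.linalg.eigh(R)   # ascending eigenvalues
v1 = eigvecs[:, -1]
if v1.sum() < 0:                       # fix the arbitrary sign
    v1 = -v1

shares = eigvals[::-1] / eigvals.sum()
print("PC1 loadings:", np.round(v1, 3))
print("variance shares:", np.round(shares, 3))
```

In this particular matrix the two tightly correlated indicators receive equal, larger loadings (roughly 0.66 each) and the third a smaller one (roughly 0.37), and the eigenvalues no longer follow a simple 1 ± r pattern.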