Why PCA? Why a Health Index?
In development economics, no single number captures progress. To rank countries on health, you might consider life expectancy, infant mortality, hospital beds, and disease prevalence: variables in different units that you cannot simply add together. Principal Component Analysis (PCA) compresses many correlated indicators into a single index by finding the direction in which countries differ most. Each country's position along that direction is its composite indicator.
This app lets you turn the dials yourself. In four tabs you will: watch a rotating cloud of countries until PCA aligns its arrow with the long axis of the data; vary the correlation between two health indicators and see what fraction of variance PC1 absorbs; rank 50 simulated countries by their actual Health Index from the post; and compare the eigenvector loadings against scikit-learn.
PC1 finds the long axis of the country cloud
The animation below sweeps a candidate direction (orange arrow) around the origin while a steel-blue gauge tracks the variance projected onto that direction. The variance peaks at the diagonal — exactly where PCA places PC1. The teal arrow is the perpendicular PC2, which always captures the leftover variance.
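The sweep described above can be reproduced numerically. The sketch below is only an illustration under stated assumptions: the data are hypothetical (two standardised indicators driven by a shared latent factor), and the one-degree grid is an arbitrary choice rather than the app's actual animation step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: two positively correlated indicators.
n = 500
latent = rng.normal(size=n)
x = latent + 0.3 * rng.normal(size=n)
y = latent + 0.3 * rng.normal(size=n)
Z = np.column_stack([x, y])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # standardise (ddof=0)

# Sweep candidate unit directions over half a turn; a direction and its
# opposite give the same projected variance.
angles = np.linspace(0.0, np.pi, 181)      # 1-degree steps
directions = np.column_stack([np.cos(angles), np.sin(angles)])
projected_var = (Z @ directions.T).var(axis=0)

best_angle_deg = np.degrees(angles[np.argmax(projected_var)])
print(f"variance peaks at {best_angle_deg:.0f} degrees")
```

For standardised data the projected variance at angle θ is 1 + r·sin(2θ), so with positive r the peak sits on the 45° diagonal and its height equals the largest eigenvalue.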
PCA Simulator
Vary the correlation between life expectancy and infant survival. Watch PC1's variance explained slide from 50% (no correlation) toward 100% (perfect alignment).
Country Rankings
The post's actual 50-country leaderboard. Sort, hover, and see how raw indicators translate into a single 0-to-1 Health Index.
Loadings & Variance
The closed-form result for two standardised variables: with positive correlation, PC1 loadings are [0.7071, 0.7071] whatever the strength of r (for negative r the second loading changes sign). See why, and how scikit-learn agrees.
Glossary (open a card if a term is unfamiliar)
Polarity adjustment
Standardisation (z-score)
Covariance matrix Σ
Eigenvector v
Eigenvalue λ
Variance explained
Component score
Min-max normalisation
PCA Simulator — how correlation drives compression
Generate simulated countries with two indicators. The latent health factor
base_health drives both, but you control how strongly. Slide
the correlation knob from 0 (independent indicators — PCA cannot
compress) toward ±1 (perfectly redundant — PCA squeezes everything into PC1).
The post's setting is r ≈ 0.96, where PC1 absorbs 98% of variance.
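One way to generate such data is sketched below. It is a hypothetical generator, not necessarily the one the app uses: the function name `simulate_indicators`, the square-root weighting, and the sample size are all assumptions chosen so that the population correlation equals the requested r.

```python
import numpy as np

def simulate_indicators(n, r, seed=0):
    """Return two standardised indicators whose population correlation is r.

    Hypothetical generator: both indicators load on a shared latent factor
    (named base_health after the text) with weight sqrt(|r|).
    """
    rng = np.random.default_rng(seed)
    base_health = rng.normal(size=n)
    w = np.sqrt(abs(r))          # weight on the shared factor
    u = np.sqrt(1.0 - abs(r))    # weight on indicator-specific noise
    a = w * base_health + u * rng.normal(size=n)
    b = np.sign(r) * w * base_health + u * rng.normal(size=n)
    Z = np.column_stack([a, b])
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)

for r in (0.0, 0.5, 0.96):
    Z = simulate_indicators(20_000, r)
    sample_r = np.corrcoef(Z.T)[0, 1]
    share = (1 + abs(sample_r)) / 2   # PC1 share for two standardised variables
    print(f"target r={r:.2f}  sample r={sample_r:+.3f}  PC1 share={share:.1%}")
```

Each indicator has unit variance by construction (w² + u² = 1), and the shared factor contributes covariance sign(r)·w² = r.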
What to look for
- At r ≈ 0, the cloud is a fuzzy ball and PC1 explains only 50% — PCA offers no compression. The two indicators carry independent information.
- At |r| ≈ 0.5, PC1 explains 75% — useful but you would still need PC2 to summarise countries faithfully.
- At r ≈ ±0.96 (the post's setting), PC1 absorbs ≈ 98%. The cloud is a thin diagonal cigar — one number captures almost everything.
- The closed form λ₁ = 1 + |r|, λ₂ = 1 − |r| only holds for two standardised variables. With 3+ variables the eigenvalues are not so clean — PCA earns its keep.
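The closed form in the last bullet is easy to check numerically. This short sketch compares NumPy's eigenvalues of the 2×2 correlation matrix against 1 ± |r| for a few values of r; the particular values are chosen to match the bullets above.

```python
import numpy as np

for r in (0.0, 0.5, -0.5, 0.9595):
    corr = np.array([[1.0, r], [r, 1.0]])
    lam2, lam1 = np.linalg.eigvalsh(corr)   # eigvalsh returns ascending order
    # Closed form for two standardised variables.
    assert np.isclose(lam1, 1 + abs(r)) and np.isclose(lam2, 1 - abs(r))
    print(f"r={r:+.4f}  lambda1={lam1:.4f}  lambda2={lam2:.4f}  "
          f"PC1 share={lam1 / 2:.2%}")
```

Because the two eigenvalues always sum to 2, the PC1 share is simply (1 + |r|) / 2.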
Monte-Carlo stability test
How much does the variance-explained estimate move from sample to sample? Run 100 fresh simulations at the current settings and watch the histogram.
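A stability test of this kind might look like the sketch below. The settings (50 countries, r = 0.96, bivariate normal draws) are assumptions for illustration and are not read from the app.

```python
import numpy as np

def pc1_share(X):
    """PC1's share of variance after standardising the columns of X."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return eigvals[-1] / eigvals.sum()

rng = np.random.default_rng(42)
n_countries, r, n_runs = 50, 0.96, 100   # assumed settings
cov = np.array([[1.0, r], [r, 1.0]])

# One PC1 share per freshly simulated sample of countries.
shares = np.array([
    pc1_share(rng.multivariate_normal([0.0, 0.0], cov, size=n_countries))
    for _ in range(n_runs)
])

print(f"mean={shares.mean():.4f}  sd={shares.std(ddof=1):.4f}  "
      f"min={shares.min():.4f}  max={shares.max():.4f}")
```

Plotting `shares` as a histogram shows the sample-to-sample spread; with only two variables, each share is just (1 + |r̂|) / 2 for that sample's estimated correlation r̂.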
The post's 50-country Health Index — interactively
These numbers come straight from health_index_results.csv in the
post's folder, the same PC1 scores and Health Index used in §10 and §11.
Toggle the indicator filter to see how countries with low life expectancy
cluster at the bottom, and how the rank order is identical between
raw PC1 and the 0-to-1 Health Index, because min-max normalisation only
shifts and rescales the scores.
What to look for
- The top 10 (Health Index > 0.78) all combine high life expectancy (≥ 79 years) with low infant mortality (≤ 17 per 1,000) — no country reaches the top by excelling on only one indicator.
- Country_05 has Health Index = 0.00 by construction — it is the worst performer, and min-max normalisation pins the minimum to zero.
- Country_28 has Health Index ≈ 0.0015 — its bar is invisible at this scale, not missing data. Life expectancy 57.7 years and infant mortality 58.7 per 1,000 place it almost as low as the floor.
- Hover any bar to read the raw indicators and PC1 score for that country.
How the Health Index is built: six steps, one line each
- Polarity: flip infant mortality → infant survival (multiply by −1).
- Standardise: z-score both indicators (mean 0, SD 1).
- Covariance: compute the 2×2 matrix (off-diagonal = r = 0.9595).
- Eigen-decompose: recover eigenvectors [0.7071, 0.7071] and [0.7071, −0.7071] with eigenvalues 1.9595 and 0.0405.
- Project: PC1 score = 0.7071·zLE + 0.7071·zIS. Range [−2.39, +2.37].
- Normalise: (PC1 − min) / (max − min) → Health Index in [0, 1].
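The six steps above can be sketched in a few lines of NumPy. The data here are a hypothetical stand-in, not the post's CSV: the coefficients in the generator, the use of population standard deviations (ddof=0), and the sign convention for the eigenvector are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data for 50 countries.
n = 50
base_health = rng.normal(size=n)
life_exp = 70 + 8 * base_health + 2 * rng.normal(size=n)       # years
infant_mort = 35 - 10 * base_health + 3 * rng.normal(size=n)   # per 1,000

# 1. Polarity: make "higher = better" for both indicators.
infant_surv = -infant_mort
X = np.column_stack([life_exp, infant_surv])

# 2. Standardise (assumed ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 3. Covariance of standardised data (equals the correlation matrix).
S = (Z.T @ Z) / n

# 4. Eigen-decompose; eigh returns ascending eigenvalues, so PC1 is last.
eigvals, eigvecs = np.linalg.eigh(S)
v1 = eigvecs[:, -1]
if v1.sum() < 0:   # eigenvector sign is arbitrary; orient toward "healthier"
    v1 = -v1

# 5. Project onto PC1.
pc1 = Z @ v1

# 6. Min-max normalise to [0, 1].
health_index = (pc1 - pc1.min()) / (pc1.max() - pc1.min())

print("loadings:", np.round(v1, 4))
print("variance explained by PC1:", round(eigvals[-1] / eigvals.sum(), 4))
```

The sign flip in step 4 matters for interpretation: without it, an eigen-solver may return the negated vector, which would rank the healthiest country last.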
Loadings & Variance — a verification dashboard
The eigenvector entries, called loadings, are the weights PCA multiplies each standardised variable by to compute PC1. For two standardised variables, both loadings always have magnitude exactly 1/√2 ≈ 0.7071, whatever the strength of the correlation: for r > 0 both are positive, as in this post, and for r < 0 they take opposite signs. The variance proportions, however, depend strongly on r, and that is what makes PCA useful.
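The equal-loadings result above follows from a short derivation on the 2×2 correlation matrix, written here with the same symbols as the glossary (Σ, λ, v):

```latex
\Sigma = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},
\qquad
\det(\Sigma - \lambda I) = (1-\lambda)^2 - r^2 = 0
\;\Longrightarrow\;
\lambda = 1 \pm r .

\Sigma \begin{pmatrix} 1 \\ 1 \end{pmatrix}
  = (1 + r) \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
\Sigma \begin{pmatrix} 1 \\ -1 \end{pmatrix}
  = (1 - r) \begin{pmatrix} 1 \\ -1 \end{pmatrix}.

v = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ \pm 1 \end{pmatrix},
\qquad
\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1 + |r|}{2}.
```

The eigenvectors do not depend on r at all; only the eigenvalues do. For r > 0 the larger eigenvalue 1 + r belongs to (1, 1)/√2, and for r < 0 the larger eigenvalue 1 − r = 1 + |r| belongs to (1, −1)/√2.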
PC1 loadings (this post)
Both indicators contribute equally — PC1 is essentially a scaled average of the two z-scores.
PC2 loadings (this post)
Perpendicular to PC1. Captures countries where the two indicators disagree — only 2% of variance.
Variance explained by component
PC1 captures 97.97% of all variation in the standardised data, which is the point of PCA here: one component stands in for both indicators. PC2 captures the leftover 2.03%, almost all of it noise from the data-generating process.
Manual vs. scikit-learn — perfect agreement
The post computes PC1 in two ways: by hand (six explicit steps:
polarity → standardise → covariance → eigen-decompose → project →
normalise) and via scikit-learn's PCA(n_components=1). The
maximum absolute difference between the two PC1-score vectors is
1.33 × 10⁻¹⁵ — machine precision. The correlation is
exactly 1.000000. Same eigenvectors, same scores, same
Health Index rankings.
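A comparison along these lines can be sketched as follows. The data are hypothetical stand-ins rather than the post's CSV, and using `StandardScaler` before `PCA` is an assumption about the workflow; the sign alignment is needed because an eigenvector and its negation are equally valid.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Hypothetical stand-in data: 50 countries, two "higher is better" indicators.
base_health = rng.normal(size=50)
X = np.column_stack([
    70 + 8 * base_health + 2 * rng.normal(size=50),        # life expectancy
    -(35 - 10 * base_health + 3 * rng.normal(size=50)),    # infant survival
])

Z = StandardScaler().fit_transform(X)   # z-scores (ddof=0)

# Manual route: leading eigenvector of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
manual = Z @ eigvecs[:, -1]

# scikit-learn route.
sk = PCA(n_components=1).fit_transform(Z)[:, 0]

# Align signs before comparing.
if np.dot(manual, sk) < 0:
    sk = -sk

max_abs_diff = np.max(np.abs(manual - sk))
print(f"max |manual - sklearn| = {max_abs_diff:.2e}")
print(f"correlation = {np.corrcoef(manual, sk)[0, 1]:.6f}")
```

The two routes use different numerical methods (eigen-decomposition here, SVD inside scikit-learn), so any difference should be at the level of floating-point rounding.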
Connecting back to Tabs 2 and 3
- Tab 2's simulator illustrates the closed form λ₁ = 1 + |r|, λ₂ = 1 − |r|: vary r and read the eigenvalues live.
- Tab 3's rankings are computed from the very loadings shown above: 0.7071 × zLE + 0.7071 × zIS, then min-max normalised.
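- The equal-loading result is specific to the two-variable case. With 3+ indicators the loadings generally differ: PC1 gives more weight to indicators that are strongly correlated with the others, while an indicator carrying mostly unique information loads on later components. That is where PCA truly earns its keep.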