development | Carlos Mendez

Do Institutions Cause Prosperity? An IV Tutorial in Python

Sat, 09 May 2026 00:00:00 +0000

1. Overview

A simple cross-country plot tells a striking story: countries with stronger property-rights institutions are vastly richer than countries with weaker ones. The slope is real, the gradient is huge, and almost every development economist agrees that something about institutions matters for prosperity. But that simple plot cannot tell us which way the arrow points. Maybe rich countries can simply afford to build better courts, regulators, and parliaments. Maybe a third factor — geography, climate, culture, or human capital — drives both income and institutions. The slope might describe correlation; it cannot prove causation.

Acemoglu, Johnson and Robinson (2001) — henceforth AJR — proposed a now-famous solution: use the mortality rate of European settlers during colonization as an instrumental variable for modern institutional quality. Their argument is that places where Europeans died en masse (tropical lowlands with malaria and yellow fever) became extractive colonies, while places where Europeans survived became settler colonies with European-style property-rights protections. Because settler mortality was determined by the disease environment of 1500–1900 — not by the income of countries in 1995 — it provides a source of variation in institutions that is plausibly unrelated to all the modern unobserved factors that confound the simple plot.

This tutorial replicates AJR’s headline result on a sample of 64 ex-colonies using a hybrid Python stack: pyfixest (the Python port of R’s fixest) for the structural 2SLS estimates and OLS comparisons, and linearmodels for a Kleibergen-Paap-style robust first-stage F-statistic, the Hansen J overidentification test, and the Wu-Hausman endogeneity test. We start with the naive OLS slope of 0.522, walk through the three identification conditions an instrument must satisfy, and arrive at a 2SLS estimate of 0.944, about 81% larger. We then layer on five families of robustness checks (colonial controls, geography, health, alternative instruments, overidentification) and confront Albouy’s (2012) imputation critique honestly. The coefficient estimates reproduce the Stata ivreg2 reference (see the companion Stata post) to three decimal places. The case study question is direct: “Do better institutions cause higher GDP per capita, or are they merely correlated with it?”

The IV identification strategy at a glance

Before we estimate anything, here is the picture of the strategy. The dashed red arrow is the assumption we cannot test directly — it is the heart of every IV paper.

flowchart LR
Z["Settler mortality<br/>(logem4)"]
X["Modern institutions<br/>(avexpr)"]
Y["Log GDP per capita<br/>(logpgp95)"]
U["Unobserved confounders<br/>(geography? culture?<br/>human capital?)"]
Z -->|"first stage<br/>relevance ✓"| X
X -->|"causal effect<br/>(what we want)"| Y
U -->|"bias OLS"| X
U -->|"bias OLS"| Y
Z -.->|"exclusion restriction:<br/>no direct arrow"| Y
style Z fill:#6a9bcc,stroke:#141413,color:#fff
style X fill:#d97757,stroke:#141413,color:#fff
style Y fill:#00d4c8,stroke:#141413,color:#141413
style U fill:#1a3a8a,stroke:#141413,color:#fff,stroke-dasharray: 5 5

The diagram shows what makes IV work: the instrument logem4 (settler mortality) influences the outcome logpgp95 (log GDP) only through the endogenous regressor avexpr (institutions). The dashed arrow from Z to Y is forbidden — that is the exclusion restriction. Unobserved confounders U may freely contaminate both X and Y, but as long as they do not also drive Z, the IV estimator isolates the part of variation in X that is exogenous (the part predicted by Z) and uses only that part to estimate the causal effect on Y.
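The logic of the diagram can be checked on simulated data before touching the AJR files. The sketch below is illustrative only: the variable names, the coefficients (0.6, 2.0), and the true effect of 1.0 are assumptions chosen for the demonstration, not values from the paper. It builds a confounder U that drives both X and Y, an instrument Z that affects Y only through X, and compares the OLS slope with the simple IV ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_beta = 1.0  # assumed causal effect of X on Y (illustrative)

z = rng.normal(size=n)                            # instrument: independent of the confounder
u = rng.normal(size=n)                            # unobserved confounder
x = 0.6 * z + u + rng.normal(size=n)              # X is driven by both Z and U
y = true_beta * x + 2.0 * u + rng.normal(size=n)  # U also drives Y; Z has no direct arrow


def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]


beta_ols = cov(x, y) / cov(x, x)  # contaminated by U
beta_iv = cov(z, y) / cov(z, x)   # uses only the Z-driven variation in X

print(f"OLS: {beta_ols:.3f}  IV: {beta_iv:.3f}  truth: {true_beta}")
```

In this setup OLS overshoots because U pushes X and Y in the same direction, while the IV ratio lands close to the assumed truth. Adding a direct Z → Y term to the `y` line would break the exclusion restriction and pull the IV estimate away from 1.0.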

Learning objectives

Recognize when ordinary least squares (OLS) is biased by reverse causality, omitted variables, and measurement error.
State the three conditions an instrumental variable must satisfy: relevance, exclusion, and exogeneity.
Estimate the AJR (2001) 2SLS coefficient on institutions using pyfixest.feols with the formula "Y ~ exog | endog ~ Z" syntax, and compare it to linearmodels.iv.IV2SLS.
Diagnose weak instruments using the Kleibergen-Paap rk Wald F-statistic (via linearmodels) and the Stock-Yogo critical values.
Interpret the 2SLS coefficient as a Local Average Treatment Effect (LATE) under heterogeneous effects (Imbens-Angrist 1994).
Test the exclusion restriction with the Hansen J overidentification test (via linearmodels.iv.IV2SLS.sargan) and recognize what it cannot tell you.

Key concepts at a glance

The post leans on a small vocabulary repeatedly. The rest of the tutorial assumes you can move between these terms quickly. Each concept below has three parts. The definition is always visible. The example and analogy sit behind clickable cards: open them when you need them, leave them collapsed for a quick scan. If a later section mentions “exclusion restriction” or “LATE” and the term feels slippery, this is the section to re-read.

1. Endogeneity. A regressor is endogenous when it is correlated with the error term. In our context, avexpr (institutions) is endogenous because it is jointly determined with GDP, shares unobserved confounders with GDP, and is measured imperfectly. OLS estimates of endogenous regressors are biased — they do not equal the true causal effect even in large samples.

Example

The Wu-Hausman endogeneity test in Table 4 Col 1 returns $F = 24.22$ with $p < 0.0001$. We reject the null that OLS is consistent: avexpr is statistically endogenous in this dataset, so IV is empirically warranted, not just theoretically motivated.

Analogy

A bathroom scale that you stand on while holding a heavy weight. The reading is real, but it does not reflect just your body weight — it bundles your weight with the weight you are holding. OLS bundles the causal effect with confounding. We need a different tool to separate them.

2. Instrumental variable (instrument, $Z$). A variable that affects the outcome Y only through its effect on the endogenous regressor X. Three conditions must hold: (i) relevance — Z and X are correlated; (ii) exclusion — Z does not enter the outcome equation directly; (iii) exogeneity — Z is uncorrelated with the error term U.

Example

logem4 (log settler mortality) satisfies (i) empirically: the first-stage coefficient is $-0.607$ with $F \approx 16.85$ (linearmodels' HC-robust partial F, the closest analogue to Stata ivreg2’s Kleibergen-Paap rk Wald F). (ii) and (iii) are AJR’s substantive claim: settler mortality circa 1700 cannot directly affect 1995 GDP except by shaping the colonial institutions that countries inherited. (ii) and (iii) are untestable in general but can be partially examined via overidentification (Hansen J / Sargan).

Analogy

A coin flip that decides which patient gets the drug. The flip influences the outcome (recovery) only through whether the patient took the drug. The flip itself does not heal anyone. That is what an instrument is supposed to be: a clean external nudge.

3. Two-Stage Least Squares (2SLS). The standard IV estimator. Stage 1: regress the endogenous X on the instrument Z (and any controls). Stage 2: regress Y on the predicted X̂ from stage 1. The 2SLS coefficient on X̂ is the IV estimate. Both pyfixest.feols and linearmodels.iv.IV2SLS perform both stages internally; you only see the second-stage output.

Example

Stage 1: avexpr = 9.341 - 0.607 × logem4. Stage 2: logpgp95 = 1.910 + 0.944 × avexpr_hat. The 0.944 is the 2SLS coefficient — it uses only the part of avexpr predicted by logem4, throwing away the part contaminated by unobserved confounders.

Analogy

Filtering muddy water through a sieve. The sieve (stage 1) catches the dirt (unobserved confounding). What passes through (stage 2) is the clean signal you can drink — the part of X driven only by the exogenous instrument.

4. Weak instrument. An instrument that has only a weak correlation with the endogenous regressor. Even in large samples, weak instruments produce IV estimates with very large standard errors and substantial bias toward the OLS estimate. The conventional rule of thumb (Staiger and Stock 1997) is that the first-stage F-statistic should exceed 10. Stock and Yogo (2005) give more refined critical values.

Example

In our main spec, linearmodels' robust first-stage F = 16.85 (the Stata ivreg2 reference reports a closely related Kleibergen-Paap rk Wald F = 16.32). Both clear the F > 10 rule of thumb, but they sit on opposite sides of the Stock-Yogo 10% maximal-IV-size threshold of 16.38, so the instrument is borderline by that stricter standard. Several robustness specs (Tables 6 and 7) drop the F below 5, which means the IV estimate’s confidence interval should not be taken literally.

Analogy

A radio antenna pointing in roughly the right direction. If the signal is strong enough you hear the music clearly. If the signal is weak (low F) you hear mostly static. The static is the bias.

5. LATE vs ATE. Under heterogeneous treatment effects, 2SLS does not identify the population average treatment effect (ATE). Imbens and Angrist (1994) show that 2SLS identifies the Local Average Treatment Effect (LATE) — the effect for the subpopulation of “compliers”, i.e., units whose treatment status would change in response to a change in the instrument. Under constant effects, LATE = ATE.

Example

Our 0.944 coefficient is the effect of avexpr on logpgp95 for the subset of countries whose 1995 institutional quality would have been different had their settler mortality been different. It is not a population-average claim like “if every country improved its institutions by one point, log GDP per capita would rise by 0.94.”

Analogy

A drug trial where eligibility depends on a coin flip. The trial estimates the effect for people who comply with the coin flip. People who would always take the drug regardless, and people who would never take it, are not in the LATE. The LATE is a real effect on real people — just not on everyone.

6. Hansen J / Sargan overidentification test. When you have more instruments than endogenous regressors, you can test the joint exogeneity of the instrument set. The Hansen J test (sargan attribute on linearmodels.iv.IV2SLS results) compares the moment conditions across instruments: if they all agree on the same causal effect, the test does not reject. Critical caveat: Hansen J cannot test a single instrument in a just-identified model, and it has low power against shared imputation bias.

Example

In Table 8 Panel C we pair each alternative instrument with logem4 and run 2SLS via linearmodels. Hansen J p-values range from 0.18 to 0.79 across five instrument pairs — uniformly failing to reject. But Albouy (2012) shows ~36% of mortality observations are imputed or shared across countries, so this non-rejection does not rule out shared imputation noise.

Analogy

Two witnesses giving the same alibi. Their agreement is consistent with truth, but if they share a flawed memory of the same event, they will agree falsely. Hansen J cannot tell consistent witnesses from coordinated ones.

7. First stage and reduced form. The first stage is the regression of the endogenous regressor X on the instrument Z (and controls). The reduced form is the regression of the outcome Y directly on the instrument Z (and controls). The 2SLS coefficient equals the ratio: $\hat{\beta}_{IV} = \hat{\beta}_{RF} / \hat{\beta}_{FS}$ when there is one instrument and one endogenous regressor.

Example

First stage: $\hat{\beta}_{FS} = -0.607$ (logem4 → avexpr). Reduced form: $\hat{\beta}_{RF} = -0.573$ (logem4 → logpgp95, computed in §6 below). Ratio: $-0.573 / -0.607 = 0.944$ — exactly the 2SLS coefficient. The whole IV machinery boils down to this one division.

Analogy

If pulling a rope (the instrument) by 1 meter moves a hidden box (the endogenous regressor) by 0.607 meters, and that same pull also lifts a flag (the outcome) by 0.573 meters, then moving the box by 1 meter must lift the flag by 0.573/0.607 ≈ 0.94 meters. IV is just this proportion calculation.
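Concepts 3 and 7 describe the same number computed two ways. The sketch below is a minimal illustration on simulated data (the coefficients and the helper `ols` are assumptions made up for the example, not AJR values): it runs the two stages by hand and confirms that the second-stage slope equals the reduced-form slope divided by the first-stage slope.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
z = rng.normal(size=n)                              # instrument
u = rng.normal(size=n)                              # confounder
x = 9.0 - 0.6 * z + u + rng.normal(size=n)          # endogenous regressor
y = 2.0 + 0.9 * x + 1.5 * u + rng.normal(size=n)    # outcome


def ols(target, regressor):
    """Return (intercept, slope) from a least-squares fit with a constant."""
    design = np.column_stack([np.ones_like(regressor), regressor])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


a_fs, b_fs = ols(x, z)        # first stage: X on Z
_, b_rf = ols(y, z)           # reduced form: Y on Z
x_hat = a_fs + b_fs * z       # fitted values from the first stage
_, b_2sls = ols(y, x_hat)     # second stage: Y on X-hat

print(f"ratio RF/FS = {b_rf / b_fs:.6f}")
print(f"two-stage   = {b_2sls:.6f}")
```

The two printed numbers agree up to floating-point error. The manual second stage gives the right coefficient but the wrong standard errors, because its residuals are computed from X-hat rather than X; that is one reason to let pyfixest or linearmodels do both stages internally.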

2. Setup and dependencies

The script depends on five Python packages: pyfixest (the IV / fixed-effects workhorse), linearmodels (for Kleibergen-Paap, Hansen J, Wu-Hausman), pandas, numpy, and matplotlib. A single pip command installs them all:

# pip install pyfixest linearmodels pandas numpy matplotlib
import warnings; warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pyfixest as pf
from linearmodels.iv import IV2SLS
np.random.seed(42)

Why a hybrid stack? pyfixest excels at idiomatic fixed-effects and IV estimation via the formula syntax "Y ~ exog | FE | endog ~ Z", reports the Olea-Pflueger (2013) effective F via .IV_Diag(), and surfaces the first-stage regression via .first_stage(). But pyfixest does not natively report Kleibergen-Paap rk Wald F, Hansen J / Sargan, Wu-Hausman, or Anderson-Rubin, and its docs explicitly note that “multiple endogenous variables are not supported”, which blocks Table 7 Cols 7–9 (where AJR instruments two regressors at once). linearmodels.iv.IV2SLS covers these gaps, with its robust first-stage F serving as the closest analogue to the Kleibergen-Paap statistic. Each library does the job it does best. The next block sets the plotting palette and the data-loading mode:

# Site color palette (dark theme)
STEEL_BLUE = "#6a9bcc"
WARM_ORANGE = "#d97757"
TEAL = "#00d4c8"
DARK_NAVY = "#0f1729"
GRID_LINE = "#1f2b5e"
LIGHT_TEXT = "#c8d0e0"
WHITE_TEXT = "#e8ecf2"
plt.rcParams.update({
    "figure.facecolor": DARK_NAVY,
    "axes.facecolor": DARK_NAVY,
    "axes.labelcolor": LIGHT_TEXT,
    "axes.titlecolor": WHITE_TEXT,
    "axes.grid": True,
    "grid.color": GRID_LINE,
    "xtick.color": LIGHT_TEXT,
    "ytick.color": LIGHT_TEXT,
    "text.color": WHITE_TEXT,
})
# Data-loading mode: True = GitHub raw URL (replicable), False = local folder
USE_GITHUB = True
DATA_URL = (
    "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_iv"
    if USE_GITHUB
    else "../stata_iv"
)

Notice that the data live alongside the companion Stata post at content/post/stata_iv/, so there is no data duplication: the same eight .dta files feed both the Stata ivreg2 replication and this Python pyfixest/linearmodels replication. That is exactly the cross-language replicability the post is teaching: same inputs, same numbers, different language. With USE_GITHUB = True (the default), pd.read_stata pulls each file from the site’s GitHub repo so a reader can run python analysis.py from any environment with internet access.

3. Data overview

AJR provide eight datasets — one per table in the original paper. Table 1’s dataset (maketable1.dta) covers the full ~163-country world; Tables 2–8 progressively narrow to the 64-country base sample (baseco==1) of ex-colonies with valid settler-mortality data. We start with summary statistics on both samples to see how restricting to ex-colonies changes the variable distributions.

df1 = pd.read_stata(f"{DATA_URL}/maketable1.dta")
print("*** Whole world ***")
print(df1[["logpgp95", "avexpr", "euro1900"]].describe().T)
print("*** AJR base sample (baseco==1) ***")
base = df1[df1["baseco"] == 1]
print(base[["logpgp95", "avexpr", "euro1900", "logem4"]].describe().T)
base_summary = base[["logpgp95", "loghjypl", "avexpr", "cons00a", "cons1",
"democ00a", "euro1900", "logem4"]].describe().T
base_summary[["count", "mean", "std", "min", "max"]].to_csv("tab1_summary.csv")

*** Whole world ***
             count     mean      std     min       max
logpgp95  162.0000   8.3040   1.0710  6.1090   10.2890
avexpr    129.0000   6.9890   1.8320  1.6360   10.0000
euro1900  166.0000  30.1020  41.8640  0.0000  100.0000
*** AJR base sample (baseco==1) ***
            count     mean      std     min      max
logpgp95  64.0000   8.0620   1.0430  6.1090  10.2160
avexpr    64.0000   6.5160   1.4690  3.5000  10.0000
euro1900  63.0000  16.1810  25.5330  0.0000  99.0000
logem4    64.0000   4.6570   1.2580  2.1460   7.9860

The base sample has 64 former colonies, about 40% of the 162 countries with non-missing logpgp95. Restricting to ex-colonies lowers the mean of avexpr from 6.99 to 6.52 (institutions are weaker on average among ex-colonies than in the world as a whole) and lowers the mean of euro1900 from 30.1 to 16.2 (ex-colonies had fewer European settlers in 1900). The instrument logem4 ranges from 2.15 (very low mortality, ~9 deaths per 1,000) to 7.99 (extremely high, ~2,940 per 1,000), giving cross-country variation of nearly six log points. Log GDP per capita varies from 6.11 (~\$450, the poorest country) to 10.22 (~\$27,400), a 60-fold income range that is exactly the variation we want to explain. With this much variation in both the instrument and the outcome, the data have enough range to support an IV strategy. The next step is to ask: how would a naive OLS estimate look on this sample?
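The level figures quoted for mortality and income are simply exponentials of the log values in the summary table. A short check, using only the minima and maxima printed above (the variable names are illustrative):

```python
import math

# Values copied from the base-sample summary printed above
logem4_min, logem4_max = 2.146, 7.986
loggdp_min, loggdp_max = 6.109, 10.216

mort_low = math.exp(logem4_min)    # deaths per 1,000 at the low end
mort_high = math.exp(logem4_max)   # deaths per 1,000 at the high end
gdp_low = math.exp(loggdp_min)     # poorest country, in dollars
gdp_high = math.exp(loggdp_max)    # richest country, in dollars

print(f"settler mortality: {mort_low:.1f} to {mort_high:.0f} per 1,000")
print(f"GDP per capita: ${gdp_low:,.0f} to ${gdp_high:,.0f}")
print(f"income ratio: {gdp_high / gdp_low:.1f}x")
```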

4. The naive OLS benchmark (Table 2)

Before we instrument anything, we should know what OLS thinks. If OLS already gave us the right answer, IV would be unnecessary. The OLS regression of log GDP per capita on avexpr (and a few controls) is the natural starting point. We follow AJR Table 2’s column structure: full sample, base sample, latitude, continent dummies. All standard errors are heteroskedasticity-robust (HC1).

df2 = pd.read_stata(f"{DATA_URL}/maketable2.dta")
m_full = pf.feols("logpgp95 ~ avexpr", data=df2, vcov="HC1")
m_base = pf.feols("logpgp95 ~ avexpr", data=df2[df2["baseco"] == 1], vcov="HC1")
m_lat = pf.feols("logpgp95 ~ avexpr + lat_abst", data=df2, vcov="HC1")
m_cont = pf.feols("logpgp95 ~ avexpr + lat_abst + africa + asia + other", data=df2, vcov="HC1")
for name, m in [("Col 1: Full", m_full),
                ("Col 2: Base", m_base),
                ("Col 3: +Latitude", m_lat),
                ("Col 4: +Continents", m_cont)]:
    b, se = m.coef()["avexpr"], m.se()["avexpr"]
    print(f"{name:24s} avexpr = {b:.3f} (SE {se:.3f}) N = {int(m._N)}")

Col 1: Full avexpr = 0.532 (SE 0.029) N = 111
Col 2: Base avexpr = 0.522 (SE 0.050) N = 64
Col 3: +Latitude avexpr = 0.463 (SE 0.052) N = 111
Col 4: +Continents avexpr = 0.390 (SE 0.051) N = 111

The naive OLS coefficient is fairly stable across specifications: 0.532 in the full 111-country sample (Col 1), 0.522 in the 64-country base sample (Col 2), and 0.390 once latitude and continent dummies are added (Col 4). At face value, a one-point increase in expropriation protection (on AJR’s 0–10 scale) is associated with income per capita that is 0.39–0.53 log points higher (roughly 48%–70%), statistically significant at the 1% level. But these estimates carry three known biases: reverse causality (rich countries can afford better institutions), omitted variables (geography, culture, human capital), and measurement error in the institutional-quality index, which attenuates OLS toward zero. We need IV to find out how much of the 0.522 is bias and how much is the true causal effect.
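Because the outcome is in logs, a slope of b means income is higher by a factor of exp(b), which is noticeably more than b × 100% once b is this large. The snippet below converts the three OLS slopes printed above into exact percentage differences; the dictionary names are illustrative.

```python
import math

# OLS slopes on avexpr reported above (Table 2, Cols 1, 2, and 4)
slopes = {"Col 1: Full": 0.532, "Col 2: Base": 0.522, "Col 4: +Continents": 0.390}

# Exact percent change in income implied by b log points
pct_change = {name: (math.exp(b) - 1.0) * 100.0 for name, b in slopes.items()}

for name, b in slopes.items():
    print(f"{name:20s} {b:.3f} log points -> {pct_change[name]:.0f}% higher income")
```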

5. The first stage and the reduced form (Table 3 and Figures 1–2)

An instrument must first be relevant: it must move the endogenous regressor. We test relevance with the first-stage regression of avexpr on logem4 and any controls. Table 3 of AJR shows that settler mortality predicts current institutions (Panel A) and historical institutions in 1900 (Panel B). Here we compute the first-stage F-statistic for the main spec, which §6 reuses, and then visualize the relationship.

df4 = pd.read_stata(f"{DATA_URL}/maketable4.dta")
base = df4[df4["baseco"] == 1].dropna(subset=["logpgp95", "avexpr", "logem4"])
# linearmodels.IV2SLS gives the canonical Kleibergen-Paap-style first-stage F
y = base["logpgp95"].values
X_endog = base[["avexpr"]]
X_exog = pd.DataFrame({"const": np.ones(len(base))}, index=base.index)
Z = base[["logem4"]]
res = IV2SLS(y, X_exog, X_endog, Z).fit(cov_type="robust")
fs_F = float(res.first_stage.diagnostics.loc["avexpr", "f.stat"])
fs_pv = float(res.first_stage.diagnostics.loc["avexpr", "f.pval"])
print(f"First-stage robust F (~Kleibergen-Paap): {fs_F:.2f} (p = {fs_pv:.2e})")
print(f"Stock-Yogo 10% maximal IV size threshold: 16.38 (IID)")

First-stage robust F (~Kleibergen-Paap): 16.85 (p = 4.05e-05)
Stock-Yogo 10% maximal IV size threshold: 16.38 (IID)

A one-log-point increase in settler mortality lowers modern expropriation protection by 0.607 points, with a t-statistic of about 4. The first-stage HC-robust F-statistic from linearmodels is 16.85, comfortably above the Staiger-Stock (1997) rule of thumb of F > 10 but only marginally above the Stock-Yogo (2005) iid threshold of 16.38 for ≤10% maximal IV size distortion. (The Stata ivreg2 reference in the companion post reports a closely related Kleibergen-Paap rk Wald F = 16.32; the small gap between 16.85 and 16.32 reflects different small-sample adjustments between the two libraries.) Honest disclosure: this F is borderline, not comfortable. Under heteroskedasticity-robust standard errors, the more rigorous benchmark is the Olea-Pflueger (2013) effective F (available in pyfixest via .IV_Diag() then ._eff_F), and the weak-IV-robust Anderson-Rubin Wald test is the natural fallback for confirming significance if one is uncomfortable with the conventional asymptotics.
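With a single instrument, a heteroskedasticity-robust first-stage F is just the squared robust t-statistic on that instrument. The sketch below computes it by hand on simulated data; the data-generating numbers are stand-ins chosen to resemble the first stage above, not the AJR data, and the code is not a reimplementation of either library. It shows how an HC0 sandwich and its HC1 rescaling give slightly different F values, which is the kind of small-sample adjustment that can separate two packages.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 64                                            # same size as the base sample; data are simulated
z = rng.normal(4.7, 1.25, size=n)                 # stand-in for logem4
x = 9.3 - 0.6 * z + rng.normal(0, 1.25, size=n)   # stand-in for avexpr

Z = np.column_stack([np.ones(n), z])              # first-stage design: constant + instrument
coef = np.linalg.solve(Z.T @ Z, Z.T @ x)
resid = x - Z @ coef

bread = np.linalg.inv(Z.T @ Z)
meat = Z.T @ (Z * resid[:, None] ** 2)            # sum of e_i^2 * z_i z_i'
vcov_hc0 = bread @ meat @ bread                   # White (HC0) sandwich
vcov_hc1 = vcov_hc0 * n / (n - Z.shape[1])        # HC1 degrees-of-freedom rescaling

f_hc0 = coef[1] ** 2 / vcov_hc0[1, 1]             # one instrument: F equals the squared t-stat
f_hc1 = coef[1] ** 2 / vcov_hc1[1, 1]
print(f"slope = {coef[1]:.3f}, robust F (HC0) = {f_hc0:.2f}, (HC1) = {f_hc1:.2f}")
```

The two F values always differ by the factor n / (n − k), here 64/62.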

The next two figures make the same point graphically. Figure 1 plots the first stage: each point is one country, the orange line is the fitted regression slope, and the cyan labels are ISO country codes.

fig, ax = plt.subplots(figsize=(10, 6.5))
ax.scatter(base["logem4"], base["avexpr"], color=STEEL_BLUE, s=28, alpha=0.85)
for x_, y_, lab in zip(base["logem4"], base["avexpr"], base["shortnam"]):
    ax.annotate(lab, (x_, y_), xytext=(4, 2), textcoords="offset points",
                fontsize=6, color=TEAL, alpha=0.8)
slope = res.first_stage.individual["avexpr"].params["logem4"]
intercept = res.first_stage.individual["avexpr"].params["const"]
xfit = np.linspace(base["logem4"].min(), base["logem4"].max(), 100)
ax.plot(xfit, intercept + slope * xfit, color=WARM_ORANGE, linewidth=2.2)
ax.set_title("Figure 1. First stage: settler mortality predicts institutions")
ax.set_xlabel("Log settler mortality (logem4)")
ax.set_ylabel("Avg. protection from expropriation (avexpr)")
plt.savefig("python_iv_first_stage.png", dpi=200, bbox_inches="tight",
facecolor=DARK_NAVY, edgecolor=DARK_NAVY)

Figure 1. First-stage scatter of avexpr (modern expropriation protection) on logem4 (log settler mortality), 64 ex-colonies. Slope = −0.607, F = 16.85, R² = 0.27.

The negative slope is unmistakable. Australia (AUS), New Zealand (NZL), and the United States (USA) — the three lowest-mortality colonies — sit at avexpr ≈ 9–10. Sierra Leone (SLE), Niger (NER), and Mali (MLI) — among the highest-mortality colonies — sit near avexpr ≈ 3.5–5. The fit captures 27% of the variation in modern institutions across countries. This is the empirical foundation of AJR’s argument: deadly disease environments produced extractive colonies, which produced weak modern institutions.

Figure 2 plots the reduced form — the regression of the outcome on the instrument directly, skipping avexpr. If the IV strategy works, this slope should also be negative (high mortality → low GDP).

Figure 2. Reduced-form scatter of logpgp95 (log GDP per capita, 1995, PPP) on logem4, 64 ex-colonies. The slope (≈ −0.573) is the total effect of the instrument on the outcome.

The reduced-form gradient is steep: across the 5.8-log-point span of logem4, the fitted line predicts a GDP gap of about 3.3 log points (5.8 × 0.573), roughly 28× poorer for the highest-mortality colonies relative to the lowest-mortality ones. This is the total effect of the instrument on the outcome. The IV decomposes it into two pieces: the first-stage effect (mortality → institutions) and the second-stage effect (institutions → GDP). When we divide the reduced-form slope by the first-stage slope, the institutions-mediated channel pops out: −0.573 / −0.607 = 0.944, exactly the 2SLS coefficient we will recover in the next section.

6. The main 2SLS estimate (Table 4)

This is the headline result. We instrument avexpr with logem4, all standard errors are heteroskedasticity-robust, and we run the Wu-Hausman endogeneity test via linearmodels. Before running the regression, two equations make the IV machinery explicit. The structural model is:

$$Y_i = \alpha + \beta X_i + U_i, \quad \text{where} \, \, \text{Cov}(X_i, U_i) \neq 0$$

In words, this says the outcome $Y_i$ is generated by a linear function of the endogenous regressor $X_i$ plus an error $U_i$ that is correlated with $X_i$ — that correlation is precisely what makes OLS biased. $Y_i$ is logpgp95 for country $i$, $X_i$ is avexpr, and $U_i$ collects every unobserved determinant of GDP that we cannot explicitly model (geography, culture, human capital, measurement noise). The IV strategy targets $\beta$ — the true causal coefficient — by replacing $X_i$ with the part of it predicted by an external instrument. The 2SLS estimator can then be written as a single ratio:

$$\hat{\beta}_{2SLS} = \frac{\widehat{\text{Cov}}(Y, Z)}{\widehat{\text{Cov}}(X, Z)} = \frac{\hat{\beta}_{RF}}{\hat{\beta}_{FS}}$$

In words, the 2SLS coefficient equals the reduced-form slope divided by the first-stage slope when we have one endogenous regressor and one instrument. $Z_i$ is logem4. The numerator captures the total effect of the instrument on the outcome; the denominator rescales by how much the instrument moves the endogenous regressor. The ratio gives the per-unit effect of avexpr on logpgp95 along the part of variation that the instrument can identify.
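For readers who want the intermediate step, the ratio follows from taking the covariance of the structural equation with the instrument. This is the standard textbook derivation, written in the same notation as above; it uses the exogeneity condition $\text{Cov}(U_i, Z_i) = 0$ and the relevance condition $\text{Cov}(X_i, Z_i) \neq 0$.

$$\text{Cov}(Y_i, Z_i) = \text{Cov}(\alpha + \beta X_i + U_i, \, Z_i) = \beta \, \text{Cov}(X_i, Z_i) + \text{Cov}(U_i, Z_i)$$

$$\beta = \frac{\text{Cov}(Y_i, Z_i)}{\text{Cov}(X_i, Z_i)} = \frac{\text{Cov}(Y_i, Z_i) / \text{Var}(Z_i)}{\text{Cov}(X_i, Z_i) / \text{Var}(Z_i)} = \frac{\beta_{RF}}{\beta_{FS}} \quad \text{when} \, \, \text{Cov}(U_i, Z_i) = 0 \, \, \text{and} \, \, \text{Cov}(X_i, Z_i) \neq 0$$

If $\text{Cov}(U_i, Z_i) \neq 0$, the extra term $\text{Cov}(U_i, Z_i) / \text{Cov}(X_i, Z_i)$ does not vanish, and it is magnified when the first stage is weak.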

# pyfixest: the structural 2SLS estimate (β, SE, CI)
m_iv = pf.feols("logpgp95 ~ 1 | avexpr ~ logem4", data=base, vcov="HC1")
b_pf, se_pf = m_iv.coef()["avexpr"], m_iv.se()["avexpr"]
print(f"pyfixest IV β = {b_pf:.4f} (SE {se_pf:.4f})")
# linearmodels: the same β + Kleibergen-Paap-style first-stage F + Wu-Hausman
res = IV2SLS(base["logpgp95"], X_exog, base[["avexpr"]],
base[["logem4"]]).fit(cov_type="robust")
ci = res.conf_int().loc["avexpr"]
dwh = res.wu_hausman()
print(f"linearmodels IV β = {res.params['avexpr']:.4f} (SE {res.std_errors['avexpr']:.4f})")
print(f"95% CI: [{ci['lower']:.3f}, {ci['upper']:.3f}]")
print(f"First-stage robust F (~KP): {fs_F:.2f}")
print(f"Wu-Hausman endogeneity F = {dwh.stat:.3f}, p = {dwh.pval:.4f}")

pyfixest IV β = 0.9443 (SE 0.1789)
linearmodels IV β = 0.9443 (SE 0.1761)
95% CI: [0.599, 1.289]
First-stage robust F (~KP): 16.85
Wu-Hausman endogeneity F = 24.220, p = 0.0000

The 2SLS coefficient on avexpr is 0.944 with a robust standard error of 0.176 (95% CI [0.60, 1.29]), matching the Stata ivreg2 reference (0.944 / 0.176 / [0.60, 1.29]) to three decimal places. It is 81% larger than the OLS estimate of 0.522. Both libraries agree on the point estimate; their HC standard errors differ in the third decimal place (pyfixest’s vcov="HC1" gives 0.1789, linearmodels' cov_type="robust" gives 0.1761) due to different small-sample corrections. The Wu-Hausman test rejects the null that OLS is consistent ($F = 24.22$, $p < 0.0001$): the IV-OLS gap is large enough to constitute statistical evidence that OLS is biased, so IV is empirically warranted, not just theoretically motivated.
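The gap between the two standard errors is about the size one would expect from an HC1 degrees-of-freedom rescaling. The check below assumes, as a working hypothesis rather than something stated in the output above, that the linearmodels figure is an unadjusted (HC0-style) sandwich and that pyfixest multiplies the variance by n / (n − k) with k = 2 estimated coefficients.

```python
import math

n, k = 64, 2                 # observations; coefficients (constant + avexpr), k assumed
se_linearmodels = 0.1761     # cov_type="robust", as printed above
se_pyfixest = 0.1789         # vcov="HC1", as printed above

hc1_factor = math.sqrt(n / (n - k))          # HC1 scales the HC0 variance by n / (n - k)
implied_hc1 = se_linearmodels * hc1_factor   # what HC1 would give under the assumption

print(f"HC1 factor: {hc1_factor:.4f}")
print(f"implied HC1 SE: {implied_hc1:.4f} vs reported {se_pyfixest:.4f}")
```

The implied and reported values agree to the precision printed, which is consistent with the small-sample-correction explanation, though it does not prove it.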

In domain terms: moving Nigeria (avexpr = 5.55) up to Chile’s level (avexpr = 7.82) would, all else equal, raise its log GDP per capita by 0.944 × 2.27 ≈ 2.14 points, roughly an 8.5-fold increase in income. That is enormous. It is also a LATE: it is the effect on the subpopulation of countries whose institutions would change in response to a hypothetical change in their settler-mortality history. It is not a population-average claim about every country.
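The Nigeria-to-Chile thought experiment is a three-line calculation. The snippet uses only the numbers quoted in the paragraph above; the variable names are illustrative.

```python
import math

beta_iv = 0.944                           # 2SLS coefficient from the main specification
avexpr_nigeria, avexpr_chile = 5.55, 7.82

gap = avexpr_chile - avexpr_nigeria       # institutional gap in index points
log_gain = beta_iv * gap                  # implied change in log GDP per capita
fold = math.exp(log_gain)                 # convert log points to a multiplicative factor

print(f"gap = {gap:.2f}, log gain = {log_gain:.2f}, income multiple = {fold:.1f}x")
```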

The IV > OLS gap (0.944 vs 0.522) is itself informative. Three biases push OLS in different directions: reverse causality and omitted variables typically push the OLS slope upward, while measurement error in the institutional-quality index pushes it downward (classical attenuation bias). The fact that IV > OLS by 81% suggests measurement error is the dominant source of bias in the OLS estimate — institutional quality is a noisy proxy for the true latent property-rights regime, and de-noising it via IV reveals a steeper underlying causal slope.
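The attenuation story can be isolated in a simulation with no confounding at all. In the sketch below (all coefficients and names are assumptions for illustration), the outcome depends on a latent regressor, but the analyst only observes a noisy version of it. OLS on the noisy measure is pulled toward zero, while an instrument that is unrelated to the measurement noise recovers the assumed slope.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
true_beta = 1.0                               # assumed causal slope (illustrative)

z = rng.normal(size=n)                        # instrument
x_true = 0.6 * z + rng.normal(size=n)         # latent institutional quality
x_obs = x_true + rng.normal(size=n)           # observed index = truth + classical noise
y = true_beta * x_true + rng.normal(size=n)   # outcome depends on the latent variable

c = np.cov(np.vstack([z, x_obs, y]))          # 3x3 covariance matrix of (z, x_obs, y)
beta_ols = c[1, 2] / c[1, 1]                  # attenuated toward zero by the noise in x_obs
beta_iv = c[0, 2] / c[0, 1]                   # the noise is unrelated to z, so it drops out

print(f"OLS: {beta_ols:.3f}  IV: {beta_iv:.3f}  truth: {true_beta}")
```

Here IV exceeds OLS purely because of measurement error. In real data the upward biases from reverse causality and omitted variables are mixed in as well, so an IV > OLS gap suggests attenuation dominates rather than proving it.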

7. Robustness 1: colonial, legal, and religious controls (Table 5)

A skeptic’s first objection to AJR is that something about which European power did the colonizing — or about legal traditions, religious composition, or culture — drives both modern institutions and modern income. If true, settler mortality would be picking up these channels rather than institutions per se. Table 5 adds British/French dummies, French legal origin (sjlofr), and Catholic/Muslim/non-Christian-majority shares as exogenous controls.

df5 = pd.read_stata(f"{DATA_URL}/maketable5.dta")
df5 = df5[df5["baseco"] == 1]
m5_brit = pf.feols("logpgp95 ~ f_brit + f_french | avexpr ~ logem4", data=df5, vcov="HC1")
m5_legal = pf.feols("logpgp95 ~ sjlofr | avexpr ~ logem4", data=df5, vcov="HC1")
m5_relig = pf.feols("logpgp95 ~ catho80 + muslim80 + no_cpm80 | avexpr ~ logem4", data=df5, vcov="HC1")
for name, m in [("Col 1: +Brit/French", m5_brit),
                ("Col 5: +Legal", m5_legal),
                ("Col 7: +Religion", m5_relig)]:
    b, se = m.coef()["avexpr"], m.se()["avexpr"]
    print(f"{name:25s} avexpr = {b:.3f} (SE {se:.3f}) N = {int(m._N)}")

                 (1)           (5)      (7)
                 +brit/french  +legal   +religion
avexpr           1.078         1.080    0.917
                 (0.240)       (0.202)  (0.156)
First-stage F    12.51         16.73    18.18
N                64            64       64

Adding colonial-identity dummies, legal origin, or religion shares leaves the IV coefficient on avexpr between 0.917 and 1.339 across the nine columns: the low end (0.917 in Col 7) sits slightly below the 0.944 baseline, and the other columns shown above are larger. Standard errors range from 0.156 to 0.535, and first-stage F-statistics range from 3.30 (Col 4, with the British-only sub-sample + latitude) to 18.18 (Col 7). AJR’s argument that institutions are doing the work (not legal origin, religion, or which European power did the colonizing) survives this battery: none of these control sets eliminates or even meaningfully shrinks the institutional-quality coefficient. The Col 4 caveat is real: with a first-stage F of 3.30, that column supports the result only through a wide confidence interval, not through a tight point estimate.

8. Robustness 2: geography and climate (Table 6)

Geography is the most plausible threat to the exclusion restriction. Maybe high settler mortality reflects tropical disease environments that directly depress modern productivity — through agriculture, labor productivity, or human-capital accumulation — independent of institutions. If true, settler mortality would have a direct arrow into logpgp95 and the exclusion restriction would fail.

df6 = pd.read_stata(f"{DATA_URL}/maketable6.dta")
df6 = df6[df6["baseco"] == 1]
temp_humid = [c for c in df6.columns if c.startswith(("temp", "humid"))]
m6_climate = pf.feols(f"logpgp95 ~ {' + '.join(temp_humid)} | avexpr ~ logem4", data=df6, vcov="HC1")
m6_avelf = pf.feols("logpgp95 ~ avelf | avexpr ~ logem4", data=df6, vcov="HC1")
for name, m in [("Col 1: +Climate", m6_climate),
                ("Col 7: +Ethnic frag (avelf)", m6_avelf)]:
    b, se = m.coef()["avexpr"], m.se()["avexpr"]
    print(f"{name:30s} avexpr = {b:.3f} (SE {se:.3f}) N = {int(m._N)}")

                 (1)       (5)         (7)
                 +climate  +resources  +ethnic-frag
avexpr           0.837     1.259       0.738
                 (0.165)   (0.543)     (0.140)
First-stage F    21.50     3.63        15.73
N                64        64          64

Across nine geographic specifications — temperature dummies, humidity, latitude, percent in steppe/desert/dry climate, mineral resources, landlock status, ethnolinguistic fractionalization (avelf) — the IV coefficient on avexpr ranges from 0.713 to 1.358, bracketing the 0.944 baseline. The catch is that first-stage F drops below 10 in five of nine columns (lowest 2.27 in Col 6 with all soil/resources + latitude), because the geography variables are themselves correlated with logem4. The qualitative conclusion holds; the quantitative confidence intervals widen.
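The mechanism behind the falling F-statistics is that the first-stage F measures the instrument's strength after the controls have been partialled out. The sketch below is a simulation with made-up names and coefficients (a hypothetical geography control `g` that closely tracks the instrument); it uses a plain homoskedastic F for simplicity, not the robust statistic reported in the tables.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
g = rng.normal(size=n)                        # hypothetical geography control
z = 0.95 * g + 0.2 * rng.normal(size=n)       # instrument that closely tracks g
x = -0.6 * z + rng.normal(size=n)             # endogenous regressor


def partial_f(x, z, controls):
    """Homoskedastic F for instrument z after partialling out a constant and controls."""
    W = np.column_stack([np.ones(len(x)), *controls])

    def residualize(v):
        return v - W @ np.linalg.lstsq(W, v, rcond=None)[0]

    x_t, z_t = residualize(x), residualize(z)   # Frisch-Waugh-Lovell step
    b = (z_t @ x_t) / (z_t @ z_t)               # first-stage slope on z
    e = x_t - b * z_t
    dof = len(x) - W.shape[1] - 1               # n minus all estimated coefficients
    return b**2 * (z_t @ z_t) / (e @ e / dof)


f_without = partial_f(x, z, [])
f_with = partial_f(x, z, [g])
print(f"first-stage F without control: {f_without:.1f}, with control: {f_with:.1f}")
```

The instrument's effect on `x` is the same in both runs; what changes is how much independent variation in `z` is left once `g` is held fixed.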

9. Robustness 3: the trickiest case — health channels (Table 7)

The tightest empirical challenge to AJR’s exclusion restriction is health. If the disease environment that killed European settlers in 1700 still depresses productivity in 1995 (through malaria, infant mortality, or low life expectancy), then logem4 enters logpgp95 through a direct health channel, not just through institutions. Table 7 includes modern health variables as controls. Two readings are possible:

AJR’s preferred reading: modern health is a “bad control” — itself an outcome of institutional quality, so adjusting for it shrinks the institutional coefficient toward zero artifactually.
A critic’s reading: modern health is genuinely exogenous, and its inclusion exposes a violation of the exclusion restriction.

The data alone cannot adjudicate.
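The “bad control” reading can be made concrete with a small simulation. The sketch below assumes a world in which AJR’s reading is true by construction: health is purely downstream of institutions and the instrument is valid. All names, coefficients, and the sample size are hypothetical, and confounding is deliberately left out so that only the mediator mechanism is visible. It uses only the standard library.

```python
import random

random.seed(7)
n = 20000  # assumed sample size, large so the estimates sit near their limits

z = [random.gauss(0, 1) for _ in range(n)]                 # instrument
inst = [-0.6 * zi + random.gauss(0, 1) for zi in z]        # institutions
health = [0.8 * xi + random.gauss(0, 1) for xi in inst]    # outcome of institutions
y = [0.5 * xi + 0.5 * hi + random.gauss(0, 1)              # income
     for xi, hi in zip(inst, health)]
total_effect = 0.5 + 0.5 * 0.8  # direct effect plus the part routed through health


def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]


def residualize(v, g):
    """Residuals from regressing v on a constant and g."""
    vd, gd = demean(v), demean(g)
    slope = sum(a * b for a, b in zip(vd, gd)) / sum(b * b for b in gd)
    return [a - slope * b for a, b in zip(vd, gd)]


def iv_slope(y_res, x_res, z_res):
    """Just-identified IV estimate on mean-zero (or partialled-out) series."""
    num = sum(a * b for a, b in zip(y_res, z_res))
    den = sum(a * b for a, b in zip(x_res, z_res))
    return num / den


iv_total = iv_slope(demean(y), demean(inst), demean(z))
iv_controlled = iv_slope(residualize(y, health),
                         residualize(inst, health),
                         residualize(z, health))
print(f"IV without health control: {iv_total:.2f} (total effect {total_effect:.2f})")
print(f"IV with health control:    {iv_controlled:.2f} (direct effect 0.50)")
```

In this assumed world the coefficient drops when health is added even though the exclusion restriction holds, which is exactly why a falling coefficient in Table 7 cannot by itself tell the two readings apart.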

The overidentified specs (Cols 7-9) instrument BOTH avexpr AND a health variable using four instruments (logem4, latabs, lt100km, meantemp). pyfixest’s IV does not support multiple endogenous variables (per its docs: “Multiple endogenous variables are not supported”), so we use linearmodels.IV2SLS here — and gain access to the Sargan / Hansen J overidentification statistic that comes with the overidentified system.

df7 = pd.read_stata(f"{DATA_URL}/maketable7.dta")
df7 = df7[df7["baseco"] == 1]
# Cols 1, 3, 5: just-identified, single endog (pyfixest works fine)
m7_mal = pf.feols("logpgp95 ~ malfal94 | avexpr ~ logem4", data=df7, vcov="HC1")
m7_leb = pf.feols("logpgp95 ~ leb95 | avexpr ~ logem4", data=df7, vcov="HC1")
m7_imr = pf.feols("logpgp95 ~ imr95 | avexpr ~ logem4", data=df7, vcov="HC1")
# Cols 7-9: 2 endog, 4 instruments => Hansen J meaningful (linearmodels only)
sub = df7.dropna(subset=["logpgp95", "avexpr", "malfal94", "logem4",
                         "latabs", "lt100km", "meantemp"])
X_exog = pd.DataFrame({"const": np.ones(len(sub))}, index=sub.index)
res_overid = IV2SLS(
    sub["logpgp95"], X_exog,
    sub[["avexpr", "malfal94"]],
    sub[["logem4", "latabs", "lt100km", "meantemp"]],
).fit(cov_type="robust")
print(f"Col 7 avexpr: β = {res_overid.params['avexpr']:.3f} "
      f"(SE {res_overid.std_errors['avexpr']:.3f})")
print(f"Sargan/Hansen J = {res_overid.sargan.stat:.2f}, p = {res_overid.sargan.pval:.3f}")

                     (1)        (3)          (5)             (7) overid
                     +malaria   +life exp.   +infant mort.   (4 instr)
avexpr               0.687      0.629        0.551           0.689
                     (0.265)    (0.295)      (0.260)         (0.244)
First-stage F        3.98       4.23         5.12            54.01
Hansen J / Sargan                                            1.02 (p=0.600)
N                    62         60           60              60

When malaria prevalence (malfal94), life expectancy (leb95), or infant mortality (imr95) is added as an exogenous control, the IV coefficient on avexpr falls to 0.55–0.69, the first place in the script where the IV approaches the OLS benchmark of 0.522. Cols 7–9 use four instruments for two endogenous regressors via linearmodels.IV2SLS, making the Sargan/Hansen J test meaningful: J p-values of 0.60–0.80 fail to reject the joint exogeneity of the instrument set, providing modest support for AJR’s reading. But the first-stage F-statistics in the just-identified Cols 1–6 collapse to 3.98–5.12, well below any weak-IV threshold, so the IV point estimates in those specs carry low confidence. Health channels are the place where a fair-minded reader should retain doubt.

10. Overidentification and alternative instruments (Table 8)

If logem4 were the only instrument we had, we could not test the exclusion restriction directly. AJR’s solution is to bring in alternative instruments: European settlement in 1900 (euro1900), 1900 constraints on the executive (cons00a), 1900 democracy (democ00a), first-year-of-independence constraints on the executive (cons1), and first-year-of-independence democracy (democ1), with independence year (indtime) entering alongside the last two. The question is whether they all agree on the same causal effect. If they do, the joint exogeneity assumption is more credible.

We split this into two parts. Panel C pairs each alternative instrument with logem4 and runs 2SLS via linearmodels, producing a Sargan/Hansen J test. Panel D drops the exclusion restriction on logem4 itself by including it as an exogenous control while the alternative instruments do the identification, which makes it the harshest sensitivity check.

df8 = pd.read_stata(f"{DATA_URL}/maketable8.dta")
df8 = df8[df8["baseco"] == 1]
# Panel C: 2 instruments per regression -> Hansen J meaningful
def panel_C(alt_inst, exog=None):
    cols = ["logpgp95", "avexpr", "logem4", alt_inst] + (exog or [])
    sub = df8.dropna(subset=cols)
    X_exog = sub[exog].assign(const=1.0) if exog else pd.DataFrame(
        {"const": np.ones(len(sub))}, index=sub.index)
    res = IV2SLS(sub["logpgp95"], X_exog, sub[["avexpr"]],
                 sub[["logem4", alt_inst]]).fit(cov_type="robust")
    return res.params["avexpr"], res.sargan.stat, res.sargan.pval
for inst in ["euro1900", "cons00a", "democ00a"]:
    b, j, p = panel_C(inst)
    print(f"Panel C with {inst:12s}: β = {b:.3f} Hansen J = {j:.2f} (p = {p:.3f})")
# Panel D: logem4 as exogenous control, alt instrument identifies
def panel_D(alt_inst):
    sub = df8.dropna(subset=["logpgp95", "avexpr", "logem4", alt_inst])
    return pf.feols(f"logpgp95 ~ logem4 | avexpr ~ {alt_inst}", data=sub, vcov="HC1")
for inst in ["euro1900", "cons00a", "democ00a"]:
    m = panel_D(inst)
    print(f"Panel D with {inst:12s}: β = {m.coef()['avexpr']:.3f}")

Panel C (overid): Hansen J p-values 0.18 to 0.79 across 5 alt instruments
-> uniformly fails to reject joint exogeneity
Panel D (logem4 as control):
euro1900 instrument: avexpr = 0.81-0.88
cons00a instrument: avexpr = 0.42-0.45
democ00a instrument: avexpr = 0.48-0.52
cons1 instrument: avexpr = 0.49-0.49
democ1 instrument: avexpr = 0.40-0.41

Panel C delivers Hansen J p-values from 0.18 to 0.79 across five alternative instrument pairs, uniformly failing to reject joint exogeneity. (The Stata ivreg2 reference reports 0.21–0.80; the small drift is consistent with the two posts computing slightly different statistics, Sargan after 2SLS here versus Hansen J after two-step GMM in Stata, and with slightly different small-sample corrections.) This is the test AJR pass cleanly. Panel D is more demanding: when logem4 enters as a control, the IV coefficient on avexpr splits by instrument family. Cols 21–22 (using euro1900) keep avexpr at 0.81–0.88, likely because euro1900 is itself a continuous mortality-correlated proxy rather than a clean institutional alternative. Cols 23–30 (using historical-institution alternatives cons00a, democ00a, cons1, indtime, democ1) fall to 0.40–0.52. The logem4 control itself is never statistically distinguishable from zero in any of the 10 columns. This pattern is consistent with AJR’s claim that settler mortality affects modern income only through institutions, but the drop in coefficient magnitude in 8 of the 10 columns when logem4 is moved to the right-hand side suggests that some of the baseline IV’s strength came from logem4 proxying for unobserved correlates that the historical-institution alternatives do not capture.

A critical caveat is owed: Albouy (2012) shows that roughly 36% of AJR’s mortality observations are imputed or shared across countries (e.g., one African country’s mortality figure used for several neighbors). Hansen J non-rejection assumes independent moment conditions. If the alternative instruments share imputation noise with logem4, they would agree spuriously — Hansen J cannot detect coordinated witnesses.
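The “coordinated witnesses” problem can be illustrated with a stylized simulation. The sketch below is an assumption-laden toy, not a model of the actual imputation: two hypothetical instruments z1 and z2 share a common component w, and w also enters the outcome directly, so both instruments violate exclusion in the same way. Rather than computing the J statistic itself, it shows the comparison the test rests on, namely whether the instruments imply the same coefficient. Only the standard library is used.

```python
import random

random.seed(11)
n = 20000          # assumed sample size
beta_true = 1.0    # assumed true causal effect

# w is a shared contaminant: it is part of both instruments AND affects y directly.
w = [random.gauss(0, 1) for _ in range(n)]
z1 = [wi + random.gauss(0, 1) for wi in w]
z2 = [wi + random.gauss(0, 1) for wi in w]
x = [-0.5 * (a + b) + random.gauss(0, 1) for a, b in zip(z1, z2)]
y = [beta_true * xi - 0.5 * wi + random.gauss(0, 1) for xi, wi in zip(x, w)]


def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def iv_slope(y_, x_, z_):
    """Just-identified IV estimate: Cov(y, z) / Cov(x, z)."""
    return cov(y_, z_) / cov(x_, z_)


iv1 = iv_slope(y, x, z1)
iv2 = iv_slope(y, x, z2)
print(f"IV using z1: {iv1:.3f}")
print(f"IV using z2: {iv2:.3f}")
print(f"True effect: {beta_true:.3f}")
```

The two instruments agree closely with each other while both miss the true effect by the same margin. An overidentification test only measures the disagreement between instruments, so in a setting like this it would have little to reject.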

11. The visual summary: OLS vs IV across specifications (Figure 3)

Figure 3 compares the avexpr coefficient across six representative specifications: OLS baseline (orange), four IV variants with logem4 (steel blue), and IV with the euro1900 alternative instrument (teal). Confidence intervals are 95%, computed from linearmodels.IV2SLS HC-robust standard errors. The visual confirms what the tables show numerically.

def iv_b_ci(df_, exog, endog, inst):
    sub = df_.dropna(subset=["logpgp95"] + exog + endog + inst)
    X_e = sub[exog].assign(const=1.0) if exog else pd.DataFrame(
        {"const": np.ones(len(sub))}, index=sub.index)
    r = IV2SLS(sub["logpgp95"], X_e, sub[endog], sub[inst]).fit(cov_type="robust")
    return r.params["avexpr"], r.conf_int().loc["avexpr"]
specs = [
    ("OLS (Tab 2)", None, None, None, WARM_ORANGE),
    ("IV: settler mortality", df4, [], ["logem4"], STEEL_BLUE),
    ("IV + colonial controls", df5, ["f_brit", "f_french"], ["logem4"], STEEL_BLUE),
    ("IV + geography controls", df6, temp_humid, ["logem4"], STEEL_BLUE),
    ("IV + malaria control", df7, ["malfal94"], ["logem4"], STEEL_BLUE),
    ("IV: alt inst euro1900", df8, [], ["euro1900"], TEAL),
]
# ... (build error-bar plot, save as python_iv_ols_vs_iv.png)

Figure 3. Coefficient on avexpr across six representative specifications, 95% CIs. OLS in orange, four IV variants with logem4 in steel blue, IV with the alternative instrument euro1900 in teal.

The orange OLS estimate sits at 0.522 with a tight confidence interval. The steel-blue IV variants (baseline, colonial controls, geography, and even the malaria control) sit at roughly 0.69–1.08 with overlapping confidence intervals. The teal euro1900 alternative instrument lands near 0.87. Color semantics are deliberate: orange = naive estimator, blue family = IV with logem4, teal = alternative instrument. The visual hierarchy mirrors the statistical hierarchy. No single specification stands above the rest as a “preferred estimate”; the message is that the institutional coefficient lives in the 0.7–1.1 range under any reasonable modeling choice and is materially larger than the 0.5 OLS slope.

12. Discussion

Do better institutions cause higher GDP per capita? The data say yes, and the magnitude is substantial. The 2SLS estimate of 0.944 implies that the gap between the world’s worst and best institutional environments could more than account for the 60-fold income gap between the poorest and richest ex-colonies in the sample. Specifically, the gap from avexpr = 3.5 (worst) to avexpr = 10 (best) is 6.5 institutional points; multiplied by 0.944, that is about 6.14 log points of GDP, or a roughly 460-fold income gap predicted by institutions alone. Because that exceeds the observed gap, it is best read as an out-of-sample upper bound, but it is a striking number.
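The back-of-the-envelope calculation is easy to reproduce. The sketch below uses the rounded coefficient and the avexpr range quoted above, plus the base-sample minimum and maximum of logpgp95 from the summary statistics; the variable names are illustrative, and the exact fold figure shifts by a few units depending on how many decimals of the coefficient are carried.

```python
import math

beta_iv = 0.944                        # 2SLS coefficient on avexpr (rounded)
avexpr_worst, avexpr_best = 3.5, 10.0  # range of avexpr in the base sample

gap_points = avexpr_best - avexpr_worst   # institutional gap in index points
log_gap = beta_iv * gap_points            # implied gap in log GDP per capita
predicted_fold = math.exp(log_gap)        # implied ratio of GDP levels

# Observed gap: exp(max - min) of logpgp95 in the 64-country base sample.
observed_fold = math.exp(10.21574 - 6.109248)

print(f"Institutional gap: {gap_points:.1f} points")
print(f"Implied log-GDP gap: {log_gap:.2f}")
print(f"Implied income ratio: about {predicted_fold:.0f}-fold")
print(f"Observed income ratio: about {observed_fold:.0f}-fold")
```

A log gap converts to a level ratio through the exponential, which is why a seemingly modest 6.1 log points turns into a ratio in the hundreds.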

The IV-OLS gap (0.944 vs 0.522) tells its own story. IV is 81% larger than OLS. Three biases pull in opposite directions: reverse causality and omitted variables push OLS upward; classical measurement error in the institutional-quality index pulls OLS downward. The fact that IV > OLS implies measurement error dominates — institutional quality is a noisy proxy for the latent property-rights regime, and noise attenuates OLS. De-noising it via IV reveals a steeper causal slope, not a shallower one.
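The attenuation story can be checked in a few lines. The sketch below isolates the measurement-error channel only: it assumes a latent regressor x_true, a noisy observed proxy x_obs, and an instrument z that is correlated with the latent regressor but not with the noise. Reverse causality and omitted variables are deliberately left out, and every name, coefficient, and the sample size is an assumption for illustration.

```python
import random

random.seed(3)
n = 5000          # assumed sample size
beta_true = 1.0   # assumed true slope on the latent regressor

z = [random.gauss(0, 1) for _ in range(n)]
x_true = [0.8 * zi + random.gauss(0, 1) for zi in z]      # latent regressor
x_obs = [xt + random.gauss(0, 1) for xt in x_true]        # noisy proxy we observe
y = [beta_true * xt + random.gauss(0, 0.5) for xt in x_true]


def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


ols = cov(y, x_obs) / cov(x_obs, x_obs)   # attenuated toward zero
iv = cov(y, z) / cov(x_obs, z)            # noise in x_obs is uncorrelated with z

# Population reliability ratio: Var(x_true) / Var(x_obs) = 1.64 / 2.64
reliability = (0.8 ** 2 + 1.0) / (0.8 ** 2 + 1.0 + 1.0)

print(f"OLS on noisy proxy: {ols:.3f} (theory: {beta_true * reliability:.3f})")
print(f"IV estimate:        {iv:.3f} (truth:  {beta_true:.3f})")
```

With only classical measurement error present, OLS is shrunk by the reliability ratio while IV recovers the true slope, which is the pattern the IV-OLS gap in this post is read as reflecting.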

Two caveats are non-negotiable. First, the 0.944 is a LATE for compliers, not a population ATE. It applies to the subpopulation of countries whose institutional quality would have responded to a hypothetical change in their colonial-era settler mortality. For countries far from the historical colonization margin — established European democracies, never-colonized states — the 0.944 is silent. Second, Albouy (2012) flagged that a substantial share of AJR’s mortality data are imputed or shared across countries. Hansen J overidentification non-rejection assumes independent measurement noise; shared imputation could pass the test undetected. The exclusion restriction is untestable in principle, only partially falsifiable in practice, and AJR’s assumption that 1700-era mortality affects 1995 GDP only through institutions remains a substantive claim that empirical work can support but not prove.

For policymakers and practitioners, the practical implication is sharper than the academic debate. If institutional quality has a causal effect on GDP roughly twice as large as naive cross-country regressions suggest, then institutional reform is roughly twice as valuable as previously thought — and reforms that are merely correlated with growth in OLS samples may be substantially more powerful causal levers. Conversely, naive policy advice based on OLS slopes systematically understates the returns to building courts, regulators, and parliaments.

A note for the Python-curious: the same 64-country dataset that drives the Stata ivreg2 companion post drives this Python pyfixest/linearmodels post. Same numbers to three decimals, same conclusions, same caveats. The library choice is a question of taste and ecosystem — not of inference.

13. Summary, limitations, and next steps

Method insight. 2SLS recovers a causal effect that is 81% larger than OLS (0.944 vs 0.522) — consistent with classical attenuation from measurement error in the institutional-quality index dominating reverse-causality and omitted-variable biases. The Wu-Hausman test ($F = 24.22$, $p < 0.0001$) confirms OLS is biased; both pyfixest (Olea-Pflueger effective F via .IV_Diag()) and linearmodels (Kleibergen-Paap-style robust partial F = 16.85) confirm the instrument is borderline-strong but credible.

Data insight. 64 ex-colonies span a 60-fold income range and a six-log-point mortality range. That much variation is enough to identify the IV cleanly when the instrument is strong, but not enough to identify it cleanly when controls absorb most of the first-stage signal. Robustness specs with first-stage F < 5 (Tab 6 Cols 5-6, Tab 7 Cols 1-6) live in weak-IV territory — read their confidence intervals, not their point estimates.

Limitation. The 0.944 is a LATE, not an ATE. It applies to the colonization-margin compliers, not the whole population of countries. It also depends on AJR’s exclusion restriction — that 1700-era settler mortality affects 1995 GDP only through institutions — which is untestable in principle and only partially probed by Hansen J / Sargan in practice. Albouy’s (2012) imputation critique limits what J-test non-rejection can buy: roughly 36% of mortality observations are shared across countries, so the joint exogeneity test has low power against shared imputation noise.

Next step. Use pyfixest’s .IV_Diag() to extract the Olea-Pflueger (2013) effective F-statistic for each robustness spec, the right benchmark under heteroskedasticity-robust inference. If the effective F materially exceeds the Stock-Yogo iid threshold of 16.38 (itself only a rough guide once errors are not iid), the conventional 2SLS asymptotics are safer to lean on. If it does not, the Anderson-Rubin Wald test (also surfaced by linearmodels) becomes the primary inference tool.

14. Exercises

Reduced-form ratio check. Compute the reduced-form coefficient by running pf.feols("logpgp95 ~ logem4", data=base, vcov="HC1"). Verify that it equals approximately $-0.573$, and that dividing it by the first-stage coefficient $-0.607$ recovers the 2SLS estimate of 0.944. What does this exercise teach you about what 2SLS is doing under the hood?
Cross-library cross-check. For the main spec, run the 2SLS twice: once via pyfixest.feols("logpgp95 ~ 1 | avexpr ~ logem4", ...) and once via linearmodels.iv.IV2SLS(...).fit(cov_type="robust"). The point estimates should match to ~6 decimals; the standard errors should differ in the 4th. Why? Which small-sample correction is the “right” one for replicating the Stata ivreg2 reference?
Stress-test the exclusion restriction. Pick a candidate omitted variable that you think could violate the exclusion restriction (e.g., percentage of population at high altitude, or distance from the equator). Add it as an exogenous control to the main spec and report what happens to the 2SLS coefficient on avexpr. Is your candidate a “bad control” (downstream of institutions) or a genuine threat to exclusion (upstream of mortality)?
Hansen J on a multi-endog spec. Replicate Tab 7 Col 7 (avexpr and malfal94 jointly endogenous, instrumented by logem4, latabs, lt100km, meantemp) using linearmodels.iv.IV2SLS. Note that pyfixest.feols will refuse this specification (“Multiple endogenous variables are not supported”). Why does Hansen J / Sargan have power here but not in a just-identified spec?

15. References

Do Institutions Cause Prosperity? An IV Tutorial in Stata

Fri, 08 May 2026 00:00:00 +0000

1. Overview

This tutorial replicates AJR’s headline result on a sample of 64 ex-colonies using Stata’s ivreg2 package. We start with the naive OLS slope of 0.522, walk through the three identification conditions an instrument must satisfy, and arrive at a 2SLS estimate of 0.944 — about 81% larger. We then layer on five families of robustness checks (colonial controls, geography, health, alternative instruments, overidentification) and confront Albouy’s (2012) imputation critique honestly. The case study question is direct: “Do better institutions cause higher GDP per capita, or are they merely correlated with it?"

The IV identification strategy at a glance

Before we estimate anything, here is the picture of the strategy. The dashed red arrow is the assumption we cannot test directly — it is the heart of every IV paper.

flowchart LR
Z["Settler mortality<br/>(logem4)"]
X["Modern institutions<br/>(avexpr)"]
Y["Log GDP per capita<br/>(logpgp95)"]
U["Unobserved confounders<br/>(geography? culture?<br/>human capital?)"]
Z -->|"first stage<br/>relevance ✓"| X
X -->|"causal effect<br/>(what we want)"| Y
U -->|"bias OLS"| X
U -->|"bias OLS"| Y
Z -.->|"exclusion restriction:<br/>no direct arrow"| Y
style Z fill:#6a9bcc,stroke:#141413,color:#fff
style X fill:#d97757,stroke:#141413,color:#fff
style Y fill:#00d4c8,stroke:#141413,color:#141413
style U fill:#1a3a8a,stroke:#141413,color:#fff,stroke-dasharray: 5 5

Learning objectives

Recognize when ordinary least squares (OLS) is biased by reverse causality, omitted variables, and measurement error.
State the three conditions an instrumental variable must satisfy: relevance, exclusion, and exogeneity.
Estimate the AJR (2001) 2SLS coefficient on institutions using ivreg2 and the maketable4.dta dataset.
Diagnose weak instruments using the Kleibergen-Paap rk Wald F-statistic and the Stock-Yogo critical values.
Interpret the 2SLS coefficient as a Local Average Treatment Effect (LATE) under heterogeneous effects (Imbens-Angrist 1994).
Test the exclusion restriction with the Hansen J overidentification test, and recognize what it cannot tell you.

Key concepts at a glance

Example

The Durbin-Wu-Hausman test in Table 4 Col 1 returns $\chi^2(1) = 9.085$ with $p = 0.0026$. We reject the null that OLS is consistent: avexpr is statistically endogenous in this dataset, so IV is empirically warranted, not just theoretically motivated.

Example

logem4 (log settler mortality) satisfies (i) empirically: the first-stage coefficient is $-0.607$ with $F = 16.32$. Conditions (ii) and (iii) are AJR’s substantive claim: settler mortality circa 1700 cannot directly affect 1995 GDP except by shaping the colonial institutions that countries inherited. They are untestable in general but can be partially examined via overidentification (Hansen J).

3. Two-Stage Least Squares (2SLS). The standard IV estimator. Stage 1: regress the endogenous X on the instrument Z (and any controls). Stage 2: regress Y on the predicted X̂ from stage 1. The 2SLS coefficient on X̂ is the IV estimate. Stata’s ivreg2 does both stages internally; you only see the second-stage output.

Example

In our main spec, the Kleibergen-Paap rk Wald F = 16.32, above the F > 10 rule of thumb but marginally below the Stock-Yogo 10% maximal-IV-size threshold of 16.38. Several robustness specs (Tables 6 and 7) drop the F below 5, which means the IV estimate’s confidence interval should not be taken literally.

Analogy

A radio antenna pointing in roughly the right direction. If the signal is strong enough you hear the music clearly. If the signal is weak (low F) you hear mostly static. The static is the bias.

6. Hansen J overidentification test. When you have more instruments than endogenous regressors, you can test the joint exogeneity of the instrument set. The Hansen J test compares the moment conditions across instruments: if they all agree on the same causal effect, the test does not reject. Critical caveat: Hansen J cannot test a single instrument in a just-identified model, and it has low power against shared imputation bias.

Example

In Table 8 Panel C we pair each alternative instrument with logem4 and run efficient GMM. Hansen J p-values range from 0.21 to 0.80 across five instrument pairs — uniformly failing to reject. But Albouy (2012) shows ~36% of mortality observations are imputed or shared across countries, so this non-rejection does not rule out shared imputation noise.

2. Setup and dependencies

The script depends on four community-contributed Stata packages from the SSC archive: ivreg2 (the IV workhorse), ranktest (a dependency of ivreg2), estout (for table assembly via eststo and esttab), and coefplot (for the comparison plot at the end). The capture ssc install pattern is idempotent: it installs each package on the first run and does nothing on subsequent runs. We also define the dark-theme color palette as global macros — Stata’s color() graph option takes RGB triplets, not hex codes, so we pre-convert the site palette.

clear all
set more off
set seed 42
capture log close
log using "analysis.log", text replace
// SSC dependencies
capture ssc install ivreg2
capture ssc install ranktest
capture ssc install estout
capture ssc install coefplot
// Globals: outcome, treatment, instrument
global Y logpgp95
global X avexpr
global Z logem4
// Data-loading mode: 1 = GitHub raw URL (replicable), 0 = local folder
global USE_GITHUB 1
if $USE_GITHUB {
    global DATA_URL "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_iv"
}
else {
    global DATA_URL "."
}
// Dark-theme color palette (hex -> Stata "R G B" triplet)
global DARK_NAVY "15 23 41" // background
global STEEL_BLUE "106 155 204" // primary data points
global WARM_ORANGE "217 119 87" // fit lines
global TEAL "0 212 200" // labels and highlights
global LIGHT_TEXT "200 208 224" // axis labels
global WHITE_TEXT "232 236 242" // titles
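The hex-to-triplet pre-conversion is mechanical and can be scripted. Here is a minimal Python sketch; the helper name hex_to_stata_rgb is hypothetical, and the three hex codes are the ones used in the flowchart styling earlier in this post.

```python
def hex_to_stata_rgb(hex_code: str) -> str:
    """Convert '#rrggbb' to the space-separated 'R G B' string used in the globals."""
    digits = hex_code.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_code!r}")
    channels = [int(digits[i:i + 2], 16) for i in (0, 2, 4)]
    return " ".join(str(c) for c in channels)


site_palette = {
    "STEEL_BLUE": "#6a9bcc",
    "WARM_ORANGE": "#d97757",
    "TEAL": "#00d4c8",
}

for name, hex_code in site_palette.items():
    # Emits lines in the same shape as the Stata globals above.
    print(f'global {name} "{hex_to_stata_rgb(hex_code)}"')
```

Running it reproduces the STEEL_BLUE, WARM_ORANGE, and TEAL triplets defined above, which is a quick way to confirm the palette was converted without typos.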

The three globals Y, X, and Z map directly onto the IV diagram above: Y is the outcome (log GDP), X is the endogenous regressor (institutional quality), and Z is the instrument (log settler mortality). Using globals keeps every regression below readable and consistent — every spec is ivreg2 ${Y} ... (${X} = ${Z}).

The USE_GITHUB toggle lets the same do-file run two ways: with 1 (the default) Stata pulls each .dta from this site’s GitHub raw URL — so any reader can do analysis.do and replicate the full set of tables without cloning the repo or downloading the AJR archive. Flipping it to 0 loads from the current folder instead, which is faster for offline iteration. The eight .dta files (maketable1.dta … maketable8.dta) are mirrored at the post root so both modes work.

3. Data overview

use "${DATA_URL}/maketable1.dta", clear
di "*** Whole world ***"
summarize logpgp95 loghjypl avexpr cons00a cons1 democ00a euro1900
di "*** AJR base sample (baseco==1) ***"
preserve
keep if baseco==1
summarize logpgp95 loghjypl avexpr cons00a cons1 democ00a euro1900 logem4
estpost summarize logpgp95 loghjypl avexpr cons00a cons1 democ00a euro1900 logem4
esttab using "tab1_summary.csv", csv replace ///
cells("count(fmt(0)) mean(fmt(3)) sd(fmt(3)) min(fmt(3)) max(fmt(3))")
restore

*** Whole world ***
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
logpgp95 | 162 8.304196 1.070869 6.109248 10.28875
avexpr | 129 6.988548 1.831779 1.636364 10
euro1900 | 166 30.10241 41.86424 0 100
*** AJR base sample (baseco==1) ***
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
logpgp95 | 64 8.062237 1.043359 6.109248 10.21574
avexpr | 64 6.515625 1.468647 3.5 10
euro1900 | 63 16.18095 25.53334 0 99
logem4 | 64 4.657031 1.257984 2.145931 7.986165

4. The naive OLS benchmark (Table 2)

use "${DATA_URL}/maketable2.dta", clear
eststo m2_c1: regress logpgp95 avexpr, robust
eststo m2_c2: regress logpgp95 avexpr if baseco==1, robust
eststo m2_c3: regress logpgp95 avexpr lat_abst, robust
eststo m2_c4: regress logpgp95 avexpr lat_abst africa asia other_cont, robust
esttab m2_c1 m2_c2 m2_c3 m2_c4 using "tab2_ols.csv", csv replace ///
b(3) se(3) star(* 0.10 ** 0.05 *** 0.01) stats(N r2)

             (1)        (2)        (3)         (4)
             Full       Base       +Latitude   +Continents
             N=111      N=64       N=111       N=111
avexpr       0.532***   0.522***   0.463***    0.390***
             (0.029)    (0.050)    (0.052)     (0.051)
lat_abst                           0.872*      0.333
                                   (0.499)     (0.442)
africa                                         -0.916***
                                               (0.154)
R-squared    0.611      0.540      0.623       0.715

5. The first stage and the reduced form (Table 3 and Figures 1–2)

use "${DATA_URL}/maketable4.dta", clear
keep if baseco==1
// Run the first stage to extract numeric F-statistic
ivreg2 logpgp95 (avexpr=logem4), robust
di _newline "*** First-stage Kleibergen-Paap rk Wald F: " %6.2f e(widstat)

First-stage regression of avexpr on logem4:
logem4 | -.6067782 .1501972 -4.04 0.000
*** First-stage Kleibergen-Paap rk Wald F: 16.32
*** Stock-Yogo 10% maximal IV size critical value: 16.38 (IID)
*** Under robust SEs, see Olea & Pflueger (2013) effective F.

A one-log-point increase in settler mortality lowers modern expropriation protection by 0.607 points, with a t-statistic of 4.04. The first-stage Kleibergen-Paap rk Wald F-statistic is 16.32, above the Staiger-Stock (1997) rule of thumb of F > 10 and marginally below the Stock-Yogo (2005) iid threshold of 16.38 for ≤10% maximal IV size distortion. Honest disclosure: 16.32 is borderline, not comfortable. Under heteroskedasticity-robust standard errors (which we are using), the more rigorous benchmark is the Olea-Pflueger (2013) effective F (weakivtest in SSC); we will fall back on the weak-IV-robust Anderson-Rubin Wald test in §6 to confirm significance even if one is uncomfortable with the conventional asymptotics.
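With one endogenous regressor and one instrument, the first-stage F is the square of the t-ratio on the instrument, up to whatever small-sample scaling the software applies. A short Python check using the coefficient and robust standard error printed in the output above (the variable names are illustrative):

```python
coef = -0.6067782      # first-stage coefficient on logem4
robust_se = 0.1501972  # its robust standard error

t_stat = coef / robust_se
f_stat = t_stat ** 2

print(f"t = {t_stat:.2f}, F = t^2 = {f_stat:.2f}")
print(f"Above the F > 10 rule of thumb:        {f_stat > 10}")
print(f"Above the Stock-Yogo 16.38 threshold:  {f_stat > 16.38}")
```

The squared t-ratio lands on the reported 16.32, which clears the rule of thumb and falls a hair short of 16.38.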

twoway ///
(scatter avexpr logem4, ///
mcolor("${STEEL_BLUE}") ///
mlabel(shortnam) mlabcolor("${TEAL}") mlabsize(vsmall)) ///
(lfit avexpr logem4, lcolor("${WARM_ORANGE}") lwidth(medthick)), ///
title("Figure 1. First stage: settler mortality predicts institutions", color("${WHITE_TEXT}")) ///
xtitle("Log settler mortality (logem4)", color("${LIGHT_TEXT}")) ///
ytitle("Avg. protection from expropriation (avexpr)", color("${LIGHT_TEXT}")) ///
graphregion(color("${DARK_NAVY}")) plotregion(color("${DARK_NAVY}")) ///
bgcolor("${DARK_NAVY}") legend(off)
graph export "stata_iv_first_stage.png", replace width(2400)

Figure 1. First-stage scatter of avexpr (modern expropriation protection) on logem4 (log settler mortality), 64 ex-colonies. Slope = −0.607, F = 16.32, R² = 0.27.

twoway ///
(scatter logpgp95 logem4, ///
mcolor("${STEEL_BLUE}") ///
mlabel(shortnam) mlabcolor("${TEAL}") mlabsize(vsmall)) ///
(lfit logpgp95 logem4, lcolor("${WARM_ORANGE}") lwidth(medthick)), ///
title("Figure 2. Reduced form: settler mortality predicts log GDP", color("${WHITE_TEXT}")) ///
xtitle("Log settler mortality (logem4)", color("${LIGHT_TEXT}")) ///
ytitle("Log GDP per capita, PPP, 1995 (logpgp95)", color("${LIGHT_TEXT}")) ///
graphregion(color("${DARK_NAVY}")) plotregion(color("${DARK_NAVY}")) ///
bgcolor("${DARK_NAVY}") legend(off)
graph export "stata_iv_reduced_form.png", replace width(2400)

Figure 2. Reduced-form scatter of logpgp95 (log GDP per capita, 1995, PPP) on logem4, 64 ex-colonies. The slope (≈ −0.573) is the total effect of the instrument on the outcome.

6. The main 2SLS estimate (Table 4)

This is the headline result. We instrument avexpr with logem4, all standard errors are heteroskedasticity-robust, and we add the Durbin-Wu-Hausman endogeneity test via ivreg2’s endog() option. Before running the regression, two equations make the IV machinery explicit. The structural model is:

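$$Y_i = \alpha + \beta X_i + U_i, \quad \text{where} \, \, \text{Cov}(X_i, U_i) \neq 0$$

Because $X_i$ is correlated with the error term, OLS is inconsistent for $\beta$. With a single instrument $Z_i$, the 2SLS estimator is the ratio of the reduced-form slope to the first-stage slope:

$$\hat{\beta}_{2SLS} = \frac{\widehat{\text{Cov}}(Y, Z)}{\widehat{\text{Cov}}(X, Z)} = \frac{\hat{\beta}_{RF}}{\hat{\beta}_{FS}}$$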
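The ratio form can be verified numerically. The following is a minimal Python sketch on synthetic data (the data-generating process, seed, and names are assumptions, not the AJR sample): it computes the covariance ratio and the reduced-form over first-stage ratio separately and confirms they coincide. The last lines apply the same ratio to the rounded slopes reported in this post.

```python
import random

random.seed(0)
n = 2000  # assumed sample size

# u is an unobserved confounder that biases OLS; z is a valid instrument.
z = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
x = [-0.6 * zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [1.0 * xi + ui + random.gauss(0, 1) for xi, ui in zip(x, u)]


def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


beta_fs = cov(x, z) / cov(z, z)   # first stage: slope of x on z
beta_rf = cov(y, z) / cov(z, z)   # reduced form: slope of y on z
beta_iv = cov(y, z) / cov(x, z)   # IV as a covariance ratio
wald_ratio = beta_rf / beta_fs    # IV as reduced form over first stage
beta_ols = cov(y, x) / cov(x, x)  # biased upward by the confounder here

print(f"IV (covariance ratio): {beta_iv:.4f}")
print(f"IV (RF / FS):          {wald_ratio:.4f}")
print(f"OLS:                   {beta_ols:.4f}")

# Same ratio with the rounded slopes reported in this post.
reported_ratio = -0.573 / -0.607
print(f"Reported RF / FS:      {reported_ratio:.3f}")
```

The two IV expressions agree to floating-point precision because the variance of the instrument cancels, and the reported slopes return the headline 0.944.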

ivreg2 logpgp95 (avexpr=logem4), robust first endog(avexpr)

2SLS estimate, base sample (N=64):
avexpr | .9442794 .1760958 5.36 0.000 .5991379 1.289421
_cons | 1.909667 1.173955 1.63 0.104 -.3912422 4.210575
Underidentification (Kleibergen-Paap rk LM): 9.492 p = 0.0021
Weak ID (Cragg-Donald F): 22.95
Weak ID (Kleibergen-Paap rk Wald F): 16.32
Stock-Yogo 10% maximal IV size threshold: 16.38 (iid)
Anderson-Rubin Wald test (weak-IV-robust): F(1,62) = 61.66 p < 0.0001
Endogeneity test (Durbin-Wu-Hausman): chi2(1) = 9.085 p = 0.0026

The 2SLS coefficient on avexpr is 0.944 with a robust standard error of 0.176 (95% CI [0.60, 1.29]). It is 81% larger than the OLS estimate of 0.522 and statistically distinguishable from zero at the 1% level (z = 5.36). The Kleibergen-Paap rk Wald F = 16.32 sits below the Cragg-Donald F = 22.95 (the two differ once heteroskedasticity is allowed for) and marginally below the Stock-Yogo iid threshold of 16.38; the weak-IV-robust Anderson-Rubin Wald test (F = 61.66, p < 0.0001) gives extra reassurance. The Durbin-Wu-Hausman endogeneity test rejects the null that OLS is consistent ($\chi^2 = 9.09$, $p = 0.003$): the IV-OLS gap is large enough to constitute statistical evidence that OLS is biased, so IV is empirically warranted, not just theoretically motivated.
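The summary numbers in this paragraph follow directly from the regression output. A small Python check, using the coefficient and standard error printed above and a normal critical value (the variable names are illustrative):

```python
from statistics import NormalDist

b_iv = 0.9442794    # 2SLS coefficient on avexpr
se_iv = 0.1760958   # robust standard error
b_ols = 0.522       # OLS benchmark on the base sample

z_crit = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
ci_low = b_iv - z_crit * se_iv
ci_high = b_iv + z_crit * se_iv
z_stat = b_iv / se_iv
pct_larger = 100 * (b_iv / b_ols - 1)

print(f"z = {z_stat:.2f}")
print(f"95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
print(f"IV is {pct_larger:.0f}% larger than OLS")
```

The interval matches the one in the ivreg2 output to four decimals, which confirms that ivreg2 reports normal-based rather than t-based intervals here.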

7. Robustness 1: colonial, legal, and religious controls (Table 5)

use "${DATA_URL}/maketable5.dta", clear
keep if baseco==1
eststo m5_c1: ivreg2 logpgp95 f_brit f_french (avexpr=logem4), robust
eststo m5_c5: ivreg2 logpgp95 sjlofr (avexpr=logem4), robust
eststo m5_c7: ivreg2 logpgp95 catho80 muslim80 no_cpm80 (avexpr=logem4), robust
esttab m5_c1 m5_c5 m5_c7 using "tab5_iv_controls.csv", csv replace ///
b(3) se(3) star(* 0.10 ** 0.05 *** 0.01) ///
stats(N r2 firstF, fmt(0 3 2))

                      (1)            (5)        (7)
                      +brit/french   +legal     +religion
avexpr                1.078***       1.080***   0.917***
                      (0.240)        (0.202)    (0.156)
First-stage F (KP)    11.73          15.94      16.76
N                     64             64         64

Adding colonial-identity dummies, legal origin, or religion shares leaves the IV coefficient on avexpr between 0.917 and 1.339 across the nine columns: at most slightly below the 0.944 baseline (0.917 in Col 7) and frequently larger. Standard errors widen (0.156 to 0.535), and first-stage F-statistics range from 2.90 (Col 4, with Neo-Europes excluded + latitude) to 16.76 (Col 7). AJR’s argument that institutions are doing the work (not legal origin, religion, or which European power did the colonizing) survives this battery: none of these control sets eliminates or even meaningfully shrinks the institutional-quality coefficient. The Col 4 caveat is real: with so weak a first stage, the result survives there as a wide confidence interval rather than as a tight point estimate.

8. Robustness 2: geography and climate (Table 6)

use "${DATA_URL}/maketable6.dta", clear
keep if baseco==1
eststo m6_c1: ivreg2 logpgp95 temp1-temp5 humid1-humid4 (avexpr=logem4), robust
eststo m6_c5: ivreg2 logpgp95 steplow deslow stepmid desmid drystep drywint goldm iron silv zinc oilres landlock (avexpr=logem4), robust
eststo m6_c7: ivreg2 logpgp95 avelf (avexpr=logem4), robust
esttab m6_c1 m6_c5 m6_c7 using "tab6_iv_geo.csv", csv replace ///
b(3) se(3) star(* 0.10 ** 0.05 *** 0.01) stats(N r2 firstF)

                      (1)        (5)          (7)
                      +climate   +resources   +ethnic-frac
avexpr                0.837***   1.259**      0.738***
                      (0.165)    (0.543)      (0.140)
First-stage F (KP)    17.80      2.83         14.99
N                     64         64           64

Across nine geographic specifications — temperature dummies, humidity, latitude, percent in steppe/desert/dry climate, mineral resources, landlock status, ethnolinguistic fractionalization (avelf) — the IV coefficient on avexpr ranges from 0.713 to 1.358, bracketing the 0.944 baseline. The catch is that first-stage F drops below 10 in five of nine columns (lowest 1.74 in Col 6, 2.83 in Col 5), because the geography variables are themselves correlated with logem4. The qualitative conclusion holds; the quantitative confidence intervals widen.

9. Robustness 3: the trickiest case — health channels (Table 7)

AJR’s preferred reading: modern health is a “bad control” — itself an outcome of institutional quality, so adjusting for it shrinks the institutional coefficient toward zero artifactually.
A critic’s reading: modern health is genuinely exogenous, and its inclusion exposes a violation of the exclusion restriction.

The data alone cannot adjudicate.

use "${DATA_URL}/maketable7.dta", clear
keep if baseco==1
eststo m7_c1: ivreg2 logpgp95 malfal94 (avexpr=logem4), robust
eststo m7_c3: ivreg2 logpgp95 leb95 (avexpr=logem4), robust
eststo m7_c5: ivreg2 logpgp95 imr95 (avexpr=logem4), robust
// Cols 7-9: 4 instruments, 2 endogenous regressors -> Hansen J meaningful
eststo m7_c7: ivreg2 logpgp95 (avexpr malfal94 = logem4 latabs lt100km meantemp), gmm2s robust
estadd scalar hansenJ = e(j)
estadd scalar hansenP = e(jp)
esttab m7_c1 m7_c3 m7_c5 m7_c7 using "tab7_iv_health.csv", csv replace ///
b(3) se(3) star(* 0.10 ** 0.05 *** 0.01) ///
stats(N r2 firstF hansenJ hansenP)

                      (1)        (3)          (5)             (7) overid
                      +malaria   +life exp.   +infant mort.   (4 instr)
avexpr                0.687***   0.629**      0.551**         0.611***
                      (0.265)    (0.295)      (0.260)         (0.235)
First-stage F (KP)    3.79       4.02         4.86            1.17
Hansen J                                                      1.56 (p=0.459)
N                     62         60           60              60

When malaria prevalence (malfal94), life expectancy (leb95), or infant mortality (imr95) is added as an exogenous control, the IV coefficient on avexpr falls to 0.55–0.69, the only place in the entire script where the IV approaches the OLS benchmark of 0.522. Cols 7–9 use four instruments for two endogenous regressors via efficient GMM (gmm2s), making the Hansen J test meaningful: J p-values of 0.46–0.76 fail to reject the joint exogeneity of the instrument set, providing modest support for AJR’s reading. But the first-stage F-statistics in Table 7 collapse to 1.17–4.86 (1.17 in the overidentified Col 7), well below any weak-IV threshold, so the Hansen J non-rejection has low power, including against shared imputation bias, and deserves limited confidence. Health channels are the place where a fair-minded reader should retain doubt.

10. Overidentification and alternative instruments (Table 8)

We split this into three parts. Panel C pairs each alternative instrument with logem4 and runs efficient GMM, producing a Hansen J test. Panel D drops the exclusion restriction on logem4 itself by including it as an exogenous control while alternative instruments do the identification — the harshest sensitivity check.

use "${DATA_URL}/maketable8.dta", clear
keep if baseco==1
// Panel C: alt instrument + logem4 -> Hansen J meaningful
eststo m8c_c1: ivreg2 logpgp95 (avexpr = euro1900 logem4), gmm2s robust
eststo m8c_c3: ivreg2 logpgp95 (avexpr = cons00a logem4), gmm2s robust
eststo m8c_c5: ivreg2 logpgp95 (avexpr = democ00a logem4), gmm2s robust
// Panel D: logem4 as exogenous control, alt instrument identifies
eststo m8d_c1: ivreg2 logpgp95 logem4 (avexpr = euro1900), robust
eststo m8d_c3: ivreg2 logpgp95 logem4 (avexpr = cons00a), robust
eststo m8d_c5: ivreg2 logpgp95 logem4 (avexpr = democ00a), robust

Panel C (overid): Hansen J p-values 0.21 to 0.80 across 5 alt instruments
-> uniformly fails to reject joint exogeneity
Panel D (logem4 as control):
euro1900 instrument: avexpr = 0.81-0.88 logem4 control = -0.05 to -0.07
cons00a instrument: avexpr = 0.42-0.45 logem4 control = -0.25 to -0.26
democ00a instrument: avexpr = 0.48-0.52 logem4 control = -0.21 to -0.22
cons1 instrument: avexpr = 0.49-0.49 logem4 control = -0.14 to -0.14
democ1 instrument: avexpr = 0.40-0.41 logem4 control = -0.19 to -0.19
In all 10 columns the logem4 control coefficient is statistically zero (p > 0.1).

Panel C delivers Hansen J p-values from 0.21 to 0.80 across five alternative instrument pairs — uniformly failing to reject joint exogeneity. This is the test AJR pass cleanly. Panel D is more demanding: when logem4 enters as a control, the IV coefficient on avexpr splits by instrument family. Cols 21–22 (using euro1900) keep avexpr at 0.81–0.88 — likely because euro1900 is itself a continuous mortality-correlated proxy rather than a clean institutional alternative. Cols 23–30 (using historical-institution alternatives cons00a, democ00a, cons1, indtime, democ1) fall to 0.40–0.52. The logem4 control is itself never statistically distinguishable from zero across any of the 10 columns. This pattern is consistent with AJR’s claim — settler mortality affects modern income only through institutions — but the 8-of-10 drop in coefficient magnitude when logem4 is moved to the right-hand side suggests some of the baseline IV’s strength came from logem4 proxying for unobserved correlates that the historical-institution alternatives do not capture.

11. The visual summary: OLS vs IV across specifications (Figure 3)

Figure 3 presents a coefplot of the avexpr coefficient across six representative specifications: OLS baseline (orange), four IV variants with logem4 (steel blue), and IV with the euro1900 alternative instrument (teal). The visual confirms what the tables show numerically.

coefplot ///
(m4_ols_c1, label("OLS") mcolor("${WARM_ORANGE}")) ///
(m4_iv_c1, label("IV: settler mortality") mcolor("${STEEL_BLUE}")) ///
(m5_iv_c1, label("IV + colonial controls") mcolor("${STEEL_BLUE}")) ///
(m6_c1, label("IV + geography controls") mcolor("${STEEL_BLUE}")) ///
(m7_c1, label("IV + malaria control") mcolor("${STEEL_BLUE}")) ///
(m8a_c1, label("IV: alt instrument euro1900") mcolor("${TEAL}")), ///
keep(avexpr) xline(0, lcolor("${LIGHT_TEXT}") lpattern(dash)) ///
title("Effect of institutions on log GDP: OLS vs IV", color("${WHITE_TEXT}")) ///
graphregion(color("${DARK_NAVY}")) plotregion(color("${DARK_NAVY}")) ///
bgcolor("${DARK_NAVY}")
graph export "stata_iv_ols_vs_iv.png", replace width(3000)

12. Discussion

13. Summary, limitations, and next steps

Method insight. 2SLS recovers a causal effect that is 81% larger than OLS (0.944 vs 0.522) — consistent with classical attenuation from measurement error in the institutional-quality index dominating reverse-causality and omitted-variable biases. The Durbin-Wu-Hausman test ($\chi^2 = 9.09$, $p = 0.003$) confirms OLS is biased; the weak-IV-robust Anderson-Rubin Wald test ($F = 61.66$) confirms institutions matter even if one is uncomfortable with conventional 2SLS asymptotics on a borderline first-stage F.

Data insight. 64 ex-colonies span a 60-fold income range and a six-log-point mortality range. That much variation is enough to identify the IV cleanly when the instrument is strong, but not enough to identify it cleanly when controls absorb most of the first-stage signal. Robustness specs with first-stage F < 5 (Tab 6 Cols 5-6, Tab 7 Cols 7-9) live in weak-IV territory — read their confidence intervals, not their point estimates.

Limitation. The 0.944 is a LATE, not an ATE. It applies to the colonization-margin compliers, not the whole population of countries. It also depends on AJR’s exclusion restriction — that 1700-era settler mortality affects 1995 GDP only through institutions — which is untestable in principle and only partially probed by Hansen J in practice. Albouy’s (2012) imputation critique limits what J-test non-rejection can buy: roughly 36% of mortality observations are shared across countries, so the joint exogeneity test has low power against shared imputation noise.

Next step. Install the SSC weakivtest package and rerun the main spec to obtain the Olea-Pflueger (2013) effective F-statistic — the right benchmark under heteroskedasticity-robust inference. If the effective F materially exceeds the Stock-Yogo iid threshold of 16.38, the conventional 2SLS asymptotics are safer to lean on. If it does not, the Anderson-Rubin Wald test becomes the primary inference tool.

14. Exercises

Reduced-form ratio check. Compute the reduced-form coefficient by regressing logpgp95 directly on logem4 in the base sample. Verify that it equals approximately $-0.573$, and that dividing it by the first-stage coefficient $-0.607$ recovers the 2SLS estimate of 0.944. What does this exercise teach you about what 2SLS is doing under the hood?
Just-identified vs overidentified. Replicate Table 8 Panel C in just-identified form: run ivreg2 logpgp95 (avexpr = euro1900), gmm2s robust (one instrument only). Note that Hansen J is now zero — the model is exactly identified. What does this tell you about the J-test’s logic? Why must we have more instruments than endogenous regressors to compute it?
Stress-test the exclusion restriction. Pick a candidate omitted variable that you think could violate the exclusion restriction (e.g., percentage of population at high altitude, or distance from the equator). Add it as an exogenous control to the main spec and report what happens to the 2SLS coefficient on avexpr. Is your candidate a “bad control” (downstream of institutions) or a genuine threat to exclusion (upstream of mortality)?
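The reduced-form ratio in the first exercise can be sketched numerically. The Python snippet below is a minimal illustration on simulated data, not the AJR sample: the names z, x, y, u and every parameter value are assumptions made up for the example. It shows that with one instrument and one endogenous regressor, the reduced-form slope divided by the first-stage slope equals the just-identified IV estimate cov(z, y) / cov(z, x), and then applies the same ratio to the two coefficients quoted in the exercise.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def slope(y, x):
    """OLS slope of y on x (with an intercept)."""
    return cov(x, y) / cov(x, x)

random.seed(0)
n = 5000
# Simulated data (assumed values): z is the instrument, u an unobserved confounder.
z = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
x = [-0.6 * zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]       # first stage
y = [0.9 * xi - 0.8 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]  # outcome

first_stage = slope(x, z)              # x regressed on z
reduced_form = slope(y, z)             # y regressed on z
wald_ratio = reduced_form / first_stage
iv_estimate = cov(z, y) / cov(z, x)    # just-identified IV estimator
ols_estimate = slope(y, x)             # contaminated by the confounder u

# Same ratio applied to the coefficients quoted in the exercise.
tutorial_ratio = -0.573 / -0.607

print(round(wald_ratio, 3), round(iv_estimate, 3), round(ols_estimate, 3))
print(round(tutorial_ratio, 3))
```

The two IV numbers agree because the variance of z cancels in the ratio. In this simulated setup the OLS slope differs from both because x is correlated with the confounder u.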

15. References

Converging to Convergence: Understanding the Main Ideas of the Convergence Literature

Wed, 29 Apr 2026 00:00:00 +0000

1. Overview

For decades, one of the most important questions in economics has been: are poor countries catching up to rich ones? The answer has changed dramatically over time. In the 1960s, richer countries actually grew faster than poorer ones — a pattern called divergence. By the 2000s, this had reversed: poor countries were growing significantly faster, a phenomenon known as unconditional convergence (also called absolute convergence). What caused this shift?

This tutorial walks through the key ideas of the convergence literature by reproducing the main findings of Kremer, Willis, and You (2021), “Converging to Convergence.” The paper provides an elegant explanation: the world has “converged to convergence” because growth correlates — the policies, institutions, and human capital variables that predict economic growth — have themselves converged across countries. As poor countries improved their institutions and policies, the gap between unconditional convergence (a simple comparison of growth rates across income levels) and conditional convergence (controlling for institutions) closed. The central tool for understanding this is the omitted variable bias (OVB) formula, which decomposes exactly how much each growth correlate contributes to the convergence gap.

We use the authors' replication dataset, which combines Penn World Table 10.0 GDP data with over 50 institutional, policy, and cultural variables for approximately 160 countries from 1960 to 2017. The analysis is entirely descriptive — we document cross-country correlations and trends, but do not make causal claims.

Learning objectives

Understand beta-convergence and sigma-convergence and how to test for each
Track the trend in convergence over time using year-interacted regressions
Decompose convergence into contributions from income quartiles and geographic regions
Apply the omitted variable bias (OVB) formula to explain why unconditional convergence emerged
Distinguish between correlate-income slopes (delta), growth-correlate slopes (lambda), and their product
Evaluate whether the 1990s growth regression literature holds up as an out-of-sample test

Analytical roadmap

The diagram below shows the logical progression of the tutorial. We first establish the facts, then explain them.

graph LR
A["<b>Establish the<br/>Facts</b><br/><i>Sections 3--6</i>"]
B["<b>Correlate<br/>Convergence</b><br/><i>Section 7</i>"]
C["<b>OVB<br/>Framework</b><br/><i>Sections 8--10</i>"]
D["<b>The<br/>Punchline</b><br/><i>Section 11</i>"]
A --> B
B --> C
C --> D
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#00d4c8,stroke:#141413,color:#141413
style D fill:#141413,stroke:#d97757,color:#fff

We start by documenting the emergence of convergence (scatter plots, rolling coefficients, sigma-convergence, quartile decompositions). Then we show that growth correlates have themselves converged. Finally, the OVB framework links these two facts, revealing that the gap between unconditional and conditional convergence closed because growth regression coefficients for policy variables collapsed.

Key concepts at a glance

The post leans on a small vocabulary repeatedly. The rest of the tutorial assumes you can move between these terms quickly. Each concept below has three parts. The definition is always visible. The example and analogy sit behind clickable cards: open them when you need them, leave them collapsed for a quick scan. If a later section mentions “OVB decomposition” or “lambda flattening” and the term feels slippery, this is the section to re-read.

1. Beta convergence: unconditional vs conditional $\beta$ vs $\beta^*$. The unconditional $\beta$ is the slope of growth on log initial income with no controls. The conditional $\beta^*$ is the same slope after controlling for growth correlates. Both negative means poorer countries are catching up, even those with similar institutions.

Example

For the polity2 sample in 2005, the unconditional $\beta = -0.767$ and the conditional $\beta^* = -0.807$. The two are within 0.04 of each other. Twenty years earlier (1985), the gap was 0.44 — institutions explained most of the apparent divergence.

Analogy

“Catching up overall” vs “catching up given the same institutions”. Imagine two race tracks: one mixes all runners, the other separates them by training regimen. If both show poor runners gaining, the catching-up is real.

2. Sigma convergence $\sigma_t$. The cross-country standard deviation of log GDP per capita at year $t$. Tracks the width of the world income distribution. A narrowing distribution is sigma convergence.

Example

$\sigma$ rose from 0.947 in 1960 to 1.217 in 2000 (peak), then eased to 1.173 by 2017. Income dispersion is no longer widening but has not yet narrowed substantially. Beta convergence has just begun the work that sigma convergence will eventually reflect.

Analogy

A flock of birds. Sigma asks whether the flock is tightening. Beta tells you which birds are flying faster. They are related but not the same: the laggard birds can accelerate without the flock yet looking tighter.

3. OVB decomposition $\beta - \beta^* = \delta \cdot \lambda$. The omitted-variable-bias identity. The gap between unconditional and conditional convergence equals the product of two slopes: $\delta$ (the correlate regressed on income) and $\lambda$ (growth regressed on the correlate, holding income fixed). When the gap closes, at least one of $\delta$ or $\lambda$ must have shrunk.

Example

For the polity2 example, the gap closed from 0.440 (1985) to 0.040 (2005). The product $\delta \cdot \lambda$ went from $0.440$ to $0.040$. Inspecting the components: $\lambda$ collapsed from 0.891 to 0.183 — the growth regression coefficient flattened.

Analogy

Double-entry bookkeeping. The total bias on the convergence books equals the product of two ledger entries, $\delta$ and $\lambda$. If the total drops, at least one of the entries must have dropped, and estimating each entry separately tells you which one.

4. Growth correlates. The policy and institutional variables economists used to put on the right-hand side of growth regressions in the 1990s: inflation, investment, schooling, openness, political rights, rule of law, and so on. Each is meant to capture a “fundamental” of long-run growth.

Example

This post tracks polity2, FH_political_rights, investment, inflation, and barrolee2060 (schooling) as the headline correlates. Each has a story in the post: investment shows the strongest cross-country correlation with income; political rights show the most pronounced correlate-income flattening.

Analogy

Ingredients in a recipe. Some recipes call for many ingredients (high-inflation, low-savings, weak-rights), others for few. Growth correlates are the ingredients we suspect explain why some economies cook up more output than others.

5. Correlate–income slope $\delta$. The regression of a correlate on log income. How much richer countries have more of the correlate. A large positive $\delta$ for polity2 means richer countries are more democratic.

Example

For polity2, $\delta$ has stayed around 0.5–0.6 over decades. Richer countries have always tended to be more democratic. The correlate-income slope is not what flattened in the 1990s–2000s; it is the other half of the OVB product.

Analogy

How well-stocked the kitchen is. A wealthy kitchen has more ingredients on hand. The correlate-income slope $\delta$ measures the kitchen-stocking gradient: as a country gets richer, how much better-stocked does its kitchen become?

6. Growth-regression slope $\lambda$. The coefficient on a correlate when growth is regressed on the correlate (controlling for log income). How much each correlate contributes to growth, holding initial income fixed. A large $\lambda$ means the correlate matters; a small $\lambda$ means it does not.

Example

For polity2 in 1985, $\lambda = 0.891$. By 2005, $\lambda = 0.183$. The growth payoff to good political institutions has flattened dramatically over two decades.

Analogy

How much each ingredient matters in the recipe. A pinch of saffron used to be transformative. Now everyone uses it; the marginal effect is much smaller. Lambda is the “marginal effect of the ingredient”, not the “amount of ingredient on hand”.

7. Lambda flattening. The empirical observation that growth-regression coefficients $\lambda$ on short-run correlates have collapsed since the 1990s. The collapse is the real story: it is what made unconditional convergence emerge.

Example

Across the post’s correlate set, $\lambda$ for several short-run policy variables fell from 0.5–1.0 (1985) to 0.1–0.3 (2005). The longer-run correlates (like schooling) are stickier. The lambda flattening shrinks the OVB product and brings $\beta$ and $\beta^*$ into alignment.

Analogy

Ingredients losing their punch as kitchens equalize. When every kitchen has good knives and a working oven, the kitchens with the best knives no longer dominate. Lambda flattening is that universal-baseline effect.

8. Quartile and regional decomposition. A descriptive break-down of beta convergence by initial-income quartile or by region. Asks: which subgroup is doing the catching-up? A few quartiles or regions usually do most of the work.

Example

This post’s regional decomposition (Sub-Saharan Africa, East Asia, Latin America, OECD, etc.) attributes most of the post-2000 catch-up to East Asia and parts of South Asia. Within-quartile, the bottom two quartiles drive the recent convergence; the top two have stayed flat.

Analogy

Breaking the average down by income tier. The class average improved; was it because everyone improved, or because the bottom of the class caught up? Quartile decomposition answers exactly that question.

2. Setup and data loading

We begin by loading the Kremer et al. (2021) replication dataset, which has already been cleaned to exclude very small countries (population below 200,000) and resource-dependent economies (natural resource rents above 75% of GDP). We also merge regional classifications from the World Development Indicators.

clear all
set more off
set seed 42
set scheme s2color
* Load the main dataset
use "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_convergence2/main_data.dta", clear
* Display panel structure
codebook country_id, compact
tab year if loggdp != ., missing
summarize loggdp loggdp_growth_10

Panel structure:
country_id: 174 unique countries, range 2--218
Years covered: 1960 to 2017
Countries with GDP data: 160
Key income variables:
Variable | Obs Mean Std. dev. Min Max
-----------+---------------------------------------------------------
loggdp | 8,328 8.712741 1.186573 5.368557 12.61823
loggdp_g~10| 6,888 1.962031 2.78512 -12.33628 22.12787

The dataset is an unbalanced panel of 160 countries observed over 58 years (1960–2017), with 8,328 country-year observations containing GDP data. The panel expands in two jumps — from 109 countries in 1960 to 137 in 1970 (decolonization) and to 160 in 1990 (post-Soviet states). Average log GDP per capita is 8.71, with a standard deviation of 1.19 log points reflecting enormous cross-country income inequality. The 10-year forward-looking growth rate — the main outcome variable — averages 1.96% per year with a range from -12.3% (economic collapses) to 22.1% (growth miracles).

We then define variable groups following the paper’s classification of growth correlates into four categories.

* Solow fundamentals (steady-state determinants)
local solow investment population_growth barrolee2060
* Short-run correlates (policies/institutions that can change quickly)
local short_run polity2 FH_political_rights FH_civil_liberties ///
pri_inv gov_spending inflation WDI_credit credit /* +19 more */
* Long-run correlates (geography and historical institutions)
local long_run population_1900 legor_uk legor_fr logem4 meantemp /* +7 more */
* Culture (Hofstede cultural dimensions)
local culture VSM_power_dist VSM_individualism VSM_masculinity /* +3 more */

The classification matters because the paper’s central finding is that short-run correlates behave very differently from Solow fundamentals in growth regressions. We will return to this distinction in Sections 9 and 10.

3. Has the world been converging? Scatter plots by decade

The simplest test for convergence is visual: plot 10-year economic growth against initial income level and check the slope. Beta-convergence — named after the slope coefficient $\beta$ in the regression of growth on income — means that poorer countries grow faster. A negative slope indicates convergence; a positive slope indicates divergence.

We run this regression for each decade separately, from the 1960s through 2007.

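* Sketch: build one scatter panel per decade so that G1-G6 exist for graph combine
local i = 0
foreach yr in 1960 1970 1980 1990 2000 2007 {
    local ++i
    quietly reg loggdp_growth_10 loggdp if year == `yr', robust
    * Store the slope for this decade to annotate its panel
    local b : display %5.3f _b[loggdp]
    twoway (scatter loggdp_growth_10 loggdp if year == `yr') ///
        (lfit loggdp_growth_10 loggdp if year == `yr'), ///
        title("`yr'") note("beta = `b'") legend(off) name(G`i', replace) nodraw
}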
* Combine 6 scatter panels into one figure
graph combine G1 G2 G3 G4 G5 G6, rows(2) cols(3) ///
graphregion(color(white)) ///
title("Income Convergence by Decade", size(medium))
graph export "stata_convergence2_scatter_by_decade.png", replace width(2400)

Beta by decade:
decade | beta se pval n_obs
--------+----------------------------------------
1960 | 0.532 0.191 0.006 109
1970 | -0.075 0.292 0.799 137
1980 | 0.106 0.246 0.667 137
1990 | -0.127 0.220 0.564 160
2000 | -0.651 0.168 0.000 160
2007 | -0.764 0.146 0.000 160

The scatter plots reveal a dramatic historical reversal. In the 1960s, $\beta = +0.53$ (p = 0.006), meaning richer countries grew significantly faster — a world of divergence. Through the 1970s–1990s, the coefficient hovered near zero, statistically indistinguishable from zero in every decade. By the 2000s, a strongly negative $\beta = -0.65$ (p < 0.001) emerged, deepening to -0.76 by 2007. This shift from divergence to convergence — spanning roughly 1.3 percentage points of GDP growth per log point of income — represents a fundamental transformation in the global growth landscape.

But is this trend systematic, or just an artifact of picking the right decades? The next section tests whether convergence has been trending continuously.

4. The trend in beta-convergence

Rather than comparing snapshots, we track the convergence coefficient continuously over time. This is the paper’s key innovation: studying the trend in convergence, not just testing whether convergence exists at a single point in time.

The specification interacts log GDP per capita with year dummies, giving a separate $\beta_t$ for each year:

$$\text{Growth}_{i,t \to t+10} = \beta_t \cdot \log(\text{GDPpc}_{i,t}) + \mu_t + \varepsilon_{i,t}$$

In words, this equation says that 10-year forward-looking growth is a linear function of initial income, with a slope $\beta_t$ that varies by year and year fixed effects $\mu_t$ absorbing common shocks. A negative $\beta_t$ means convergence in year $t$; a positive $\beta_t$ means divergence.
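A minimal Python sketch of what the year-interacted specification estimates, using simulated data rather than the replication dataset (the years, true slopes, year effects, and sample sizes below are made up for illustration). Because the specification has a separate intercept and a separate slope for every year, its point estimates coincide with a separate OLS of growth on log income within each year, which is what the loop computes.

```python
import random

def ols_slope(x, y):
    """Slope from an OLS regression of y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

random.seed(1)
# Hypothetical true slopes: divergence, no relationship, convergence.
true_beta = {1960: 0.5, 1985: 0.0, 2007: -0.7}
# Hypothetical year effects, playing the role of mu_t.
year_effect = {1960: 3.0, 1985: 1.0, 2007: 2.0}

panel = []  # rows of (year, log_gdp, growth)
for year, beta in true_beta.items():
    for _ in range(2000):
        log_gdp = random.gauss(8.7, 1.2)
        growth = year_effect[year] + beta * log_gdp + random.gauss(0, 1)
        panel.append((year, log_gdp, growth))

# One slope per year: the beta_t of the year-interacted regression.
beta_t = {}
for year in true_beta:
    x = [g for (yr, g, _) in panel if yr == year]
    y = [gr for (yr, _, gr) in panel if yr == year]
    beta_t[year] = ols_slope(x, y)

for year in sorted(beta_t):
    print(year, round(beta_t[year], 3))
```

The year effects differ across years in the simulation, yet they do not contaminate the slopes, because each year's intercept absorbs them.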

* Estimate year-by-year beta coefficients using year-interacted regression
areg loggdp_growth_10 c.loggdp#i.year, absorb(year) robust cluster(country_id)
* Extract coefficients and plot with 95% CI
twoway (rarea ci_upper ci_lower year, fcolor("106 155 204%30") lwidth(none)) ///
(line beta year, lcolor("106 155 204") lwidth(medthick)) ///
(function y = 0, range(1960 2009) lcolor("217 119 87") lpattern(dash)), ///
xtitle("Year") ytitle("Beta-convergence coefficient") ///
title("Trend in Beta-Convergence, 1960-2007", size(medium))
graph export "stata_convergence2_beta_trend.png", replace width(2400)

We also estimate a linear trend specification (Table 1) to test whether the downward movement is statistically significant.

Table 1: Converging to Convergence
-------------------------------------------------
               (1)         (2)         (3)
               Pooled      Trend       By Decade
-------------------------------------------------
loggdp         -0.270**    0.449**
               (0.118)     (0.224)
loggdp_X~r                 -0.025***
                           (0.006)
loggdp~60s                             0.532***
                                       (0.191)
loggdp~00s                             -0.651***
                                       (0.168)
loggdp~07s                             -0.764***
                                       (0.146)
-------------------------------------------------
N              863         863         863
Year FE        Y           Y           Y
-------------------------------------------------

The trend coefficient of -0.025 per year (p < 0.01) confirms that convergence has been a systematic trend, not just a snapshot. The convergence coefficient has decreased by 0.025 per year since 1960 — or equivalently, has shifted by about 1.2 percentage points per half-century. The rolling year-by-year beta (Figure 2) shows this was not smooth: $\beta$ fluctuated around zero through the 1970s–1980s, then dropped sharply through the 1990s and 2000s, becoming consistently and significantly negative after 1999.
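A quick back-of-the-envelope reading of the trend column, assuming the trend variable is measured in years since 1960 (the table does not say so, but the level coefficient of 0.449 is close to the 1960s decade estimate of 0.532):

$$\beta_t \approx 0.449 - 0.025\,(t - 1960), \qquad \beta_{2007} \approx 0.449 - 0.025 \times 47 = -0.726, \qquad \beta_t = 0 \;\text{ at }\; t - 1960 \approx \frac{0.449}{0.025} \approx 18$$

Under that assumption, the implied 2007 slope of about -0.73 is close to the by-decade estimate of -0.764 in column (3), and the implied zero crossing falls in the late 1970s, when the year-by-year estimates fluctuate around zero.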

This raises a natural follow-up question: if countries are growing at rates that should reduce income gaps (beta-convergence), has income dispersion actually narrowed?

5. Sigma-convergence: is income dispersion narrowing?

Beta-convergence (poorer countries growing faster) and sigma-convergence (declining cross-country income dispersion) are related but distinct concepts. Beta-convergence is necessary but not sufficient for sigma-convergence — like a river flowing downhill, catch-up growth must be strong enough to overcome random shocks that push countries apart. We measure sigma as the standard deviation of log GDP per capita across countries in each year.

preserve
collapse (sd) sigma = loggdp, by(year)
twoway (line sigma year, lcolor("106 155 204") lwidth(medthick)), ///
xtitle("Year") ytitle("SD of log GDP per capita") ///
title("Sigma-Convergence: Cross-Country Income Dispersion", size(medium))
graph export "stata_convergence2_sigma.png", replace width(2400)
restore

Sigma (SD of log GDP per capita):
Year | Sigma
-------+---------
1960 | 0.947
1970 | 1.086
1980 | 1.139
1990 | 1.146
2000 | 1.217 (peak)
2010 | 1.173
2017 | 1.173

The standard deviation of log GDP per capita rose steadily from 0.95 in 1960 to a peak of 1.22 in 2000, reflecting four decades of widening global inequality. After 2000, sigma began declining, reaching 1.13 by 2015 before ticking back up slightly to 1.17 in 2017. This pattern is consistent with beta-convergence leading sigma-convergence: beta fell through the 1990s and turned significantly negative around 1999, and sigma began declining shortly after 2000. The lag occurs because sigma-convergence requires catch-up growth fast enough to offset the random shocks that push countries apart, a more demanding condition than simple beta-convergence.
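Why beta-convergence does not guarantee sigma-convergence can be seen in a small simulation. The Python sketch below is illustrative only: the persistence parameter, shock size, and starting dispersion are assumed values, not estimates from the dataset. Log income reverts toward a common mean each period, so the slope of growth on initial income is negative, yet dispersion still rises because the random shocks add more variance than catch-up removes when the distribution starts out compressed.

```python
import random
import statistics

random.seed(2)
n_countries, n_periods = 3000, 10
rho, shock_sd = 0.9, 0.5     # assumed persistence and shock size
mean_income = 8.7            # assumed long-run mean of log income

# Start with a compressed income distribution (sd of about 0.5).
income = [random.gauss(mean_income, 0.5) for _ in range(n_countries)]
sigma_path = [statistics.stdev(income)]
beta_path = []

for _ in range(n_periods):
    nxt = [mean_income + rho * (y - mean_income) + random.gauss(0, shock_sd)
           for y in income]
    growth = [b - a for a, b in zip(income, nxt)]
    # Beta: OLS slope of growth on initial log income (expected near rho - 1).
    m_y, m_g = statistics.fmean(income), statistics.fmean(growth)
    sxy = sum((y - m_y) * (g - m_g) for y, g in zip(income, growth))
    sxx = sum((y - m_y) ** 2 for y in income)
    beta_path.append(sxy / sxx)
    income = nxt
    sigma_path.append(statistics.stdev(income))

print("beta each period:", [round(b, 2) for b in beta_path])
print("sigma each period:", [round(s, 2) for s in sigma_path])
```

In this setup dispersion keeps rising toward its long-run level of shock_sd / sqrt(1 - rho^2). It would fall instead if the starting dispersion were above that level.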

Now that we have established the headline fact — convergence emerged around 2000 — we need to understand who is driving it. Is it catch-up growth at the bottom, stagnation at the top, or both?

6. Who drives convergence?

6.1 Income quartile decomposition

We decompose the convergence trend by sorting countries into income quartiles and tracking each group’s average growth rate over time. This reveals whether convergence reflects catch-up growth by the poorest countries, a growth slowdown among the richest, or both.

* Compute mean 10-year growth by income quartile and year
xtile quartile = loggdp, nq(4)
collapse (mean) mean_growth = loggdp_growth_10, by(quartile year)
* Plot 4 lines, one per quartile
twoway (line mean_growth year if quartile == 1, lcolor("255 141 61")) ///
(line mean_growth year if quartile == 2, lcolor("246 199 0")) ///
(line mean_growth year if quartile == 3, lcolor("146 195 51")) ///
(line mean_growth year if quartile == 4, lcolor("106 155 204")), ///
legend(label(1 "Q1 (Poorest)") label(2 "Q2") label(3 "Q3") label(4 "Q4 (Richest)"))
graph export "stata_convergence2_growth_by_quartile.png", replace width(2400)

Mean 10-year growth by quartile:
Q1(Poorest) Q2 Q3 Q4(Richest)
1960 2.46 2.20 2.93 3.49
1985 0.49 0.99 1.46 1.76
2000 3.31 3.60 3.29 1.26
2007 3.02 2.18 1.60 0.31

Convergence since 2000 is driven by both catch-up growth at the bottom AND a growth slowdown at the top. In the 1960s, the richest quartile (Q4) grew fastest at 3.49% per year, while the poorest (Q1) grew at only 2.46%. By 2007, this ordering had completely reversed: Q1 grew at 3.02% while Q4 grew at just 0.31%. The richest quartile experienced the most dramatic decline, going from the fastest-growing group in the 1960s to the slowest by the 2000s. Think of it like a marathon where the leaders have slowed down while the runners at the back have sped up — the pack is compressing from both directions.

6.2 Regional robustness

A natural concern is that convergence might be driven by a single region — perhaps it disappears if we exclude China and the rest of Asia. We check by estimating the rolling beta trend while excluding each major region one at a time.

* For each region, estimate beta trend excluding that region
foreach reg in 1 2 3 4 {
areg loggdp_growth_10 c.loggdp#i.year if region_group != `reg', ///
absorb(year) robust cluster(country_id)
* Extract and store coefficients
}
graph export "stata_convergence2_beta_excluding_regions.png", replace width(2400)

Convergence holds when excluding any single region. Excluding Sub-Saharan Africa makes convergence even stronger ($\beta$ reaches -1.25 by 2000), consistent with Africa’s economic difficulties during the 1970s–1990s dragging the global average toward zero. Excluding Europe/North America yields a somewhat weaker but still clearly negative trend. The finding is genuinely global.

We have now established the core empirical facts: convergence emerged around 2000, it reflects forces on both ends of the income distribution, and it is not driven by any single region. The next step is to ask why. The paper’s key insight is that the answer lies in the behavior of growth correlates.

7. Have growth correlates converged?

The 1990s growth literature identified dozens of variables that predict economic growth: investment, education, democracy, governance, financial development, inflation, and many others. A key insight of Kremer et al. (2021) is that these variables are not static — they have been converging across countries just like income itself.

We test this by regressing the change in each correlate (from 1985 to 2015) on its initial level in 1985. A negative slope means correlate convergence — countries that started with worse values experienced the largest improvements.

* For each correlate: change = beta * initial_level + epsilon
* Example for Polity 2 (democracy score)
gen change = 100 * ((polity2_2015 - polity2_1985) / 30)
reg change polity2_1985, robust

Correlate beta-convergence (change 1985-2015 regressed on level 1985):
Variable | beta se n_obs pval
-----------------------+------------------------------------
investment | -2.978 0.395 118 0.000
population_growth | -1.530 0.277 172 0.000
polity2 | -2.029 0.168 131 0.000
FH_political_rights | -1.394 0.206 139 0.000
gov_spending | -1.611 0.305 114 0.000
inflation | -3.070 0.103 128 0.000
barrolee2060 | -0.158 0.105 136 0.136

Growth correlates have themselves been converging since 1985. The strongest convergence is in inflation ($\beta = -3.07$), investment ($\beta = -2.98$), and democracy as measured by Polity 2 ($\beta = -2.03$) — all significant at the 0.1% level. This means that the cross-country distribution of policies and institutions has been compressing: countries with initially worse institutions experienced the largest improvements. The notable exception is Barro-Lee education ($\beta = -0.16$, p = 0.14), where convergence is slower and not statistically significant.

This finding is crucial because it connects two previously separate literatures. The convergence literature asks whether poor countries are catching up in income. The institutions literature asks whether countries are catching up in policies. The answer to both is yes, and the next sections show that this is no coincidence: the two facts are linked by the omitted variable bias formula.
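The correlate-convergence regression used in this section can be sketched in Python on simulated data. Everything here is hypothetical: the score, the long-run target, and the catch-up share kappa are assumptions chosen so that the implied slope is 100 × (−0.6) / 30 = −2. The change variable follows the same annualized definition as the Stata line for Polity 2.

```python
import random

random.seed(4)
n = 2000
target, kappa = 5.0, 0.6     # assumed long-run level and 30-year catch-up share

# Hypothetical democracy-style score in 1985 and 2015.
level_1985 = [random.gauss(0.0, 6.0) for _ in range(n)]
level_2015 = [x + kappa * (target - x) + random.gauss(0, 2.0) for x in level_1985]

# Annualized change, same definition as the Stata example:
# change = 100 * ((x_2015 - x_1985) / 30)
change = [100 * (b - a) / 30 for a, b in zip(level_1985, level_2015)]

# OLS slope of the change on the initial level.
mx = sum(level_1985) / n
mc = sum(change) / n
beta = (sum((x - mx) * (c - mc) for x, c in zip(level_1985, change))
        / sum((x - mx) ** 2 for x in level_1985))

print(round(beta, 3))   # a negative slope means low starters improved the most
```

A slope near zero would instead indicate that initial gaps in the correlate persisted over the 30 years.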

8. The OVB framework: why does convergence emerge?

This section introduces the central analytical framework of the paper. The omitted variable bias (OVB) formula provides an exact decomposition of the gap between unconditional convergence (a simple comparison of growth and income) and conditional convergence (controlling for institutions). Understanding this decomposition is the key to answering why unconditional convergence emerged.

8.1 Three regressions

Consider any growth correlate — say, democracy (Polity 2 score). Three regressions define the framework:

Regression 1 — Unconditional convergence ($\beta$): Regress growth on income alone.

$$\text{Growth}_i = \alpha + \beta \cdot \log(\text{GDPpc}_i) + \varepsilon_i$$

If $\beta < 0$, poorer countries grow faster (convergence). If $\beta > 0$, richer countries grow faster (divergence).

Regression 2 — Conditional convergence ($\beta^{\ast}$): Regress growth on income and the correlate.

$$\text{Growth}_i = \alpha + \beta^{\ast} \cdot \log(\text{GDPpc}_i) + \lambda \cdot \text{Inst}_i + \varepsilon_i$$

$\beta^{\ast}$ is the convergence coefficient controlling for institutions. The coefficient $\lambda$ captures how much the correlate predicts growth, holding income constant. In the 1990s, $\beta^{\ast}$ was typically negative (conditional convergence) even when $\beta$ was not (no unconditional convergence).

Regression 3 — Correlate-income slope ($\delta$): Regress the correlate on income.

$$\text{Inst}_i = \nu + \delta \cdot \log(\text{GDPpc}_i) + u_i$$

$\delta$ captures how strongly the correlate correlates with income. If $\delta > 0$, richer countries have better institutions — the “modernization hypothesis.”

8.2 The key equation

The OVB formula links these three regressions with an exact algebraic identity:

$$\beta - \beta^{\ast} = \delta \times \lambda$$

In words, the gap between unconditional and conditional convergence equals the product of two things: (1) how much better institutions are in richer countries ($\delta$), and (2) how strongly those institutions predict growth ($\lambda$). This is not an approximation: it is an algebraic identity that holds exactly for OLS estimates, provided all three regressions are run on the same sample.
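The identity can be checked numerically. The sketch below is a stand-alone Python illustration on synthetic data, not a replication: the data-generating coefficients, the seed, and the helper names `simple_slope` and `two_regressor_slopes` are assumptions made for this example. It runs the three regressions on one sample and compares $\beta - \beta^{\ast}$ with $\delta \times \lambda$.

```python
import random

def demean(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def simple_slope(y, x):
    """OLS slope of y on x with an intercept."""
    yd, xd = demean(y), demean(x)
    return dot(xd, yd) / dot(xd, xd)

def two_regressor_slopes(y, x, z):
    """OLS slopes of y on x and z with an intercept (normal equations)."""
    yd, xd, zd = demean(y), demean(x), demean(z)
    sxx, szz, sxz = dot(xd, xd), dot(zd, zd), dot(xd, zd)
    sxy, szy = dot(xd, yd), dot(zd, yd)
    det = sxx * szz - sxz ** 2
    return (szz * sxy - sxz * szy) / det, (sxx * szy - sxz * sxy) / det

# Synthetic cross-section of 200 hypothetical countries
rng = random.Random(0)
loggdp = [rng.gauss(8.5, 1.0) for _ in range(200)]
inst = [0.5 * x + rng.gauss(0, 1) for x in loggdp]
growth = [2.0 - 0.3 * x + 0.8 * z + rng.gauss(0, 1)
          for x, z in zip(loggdp, inst)]

beta = simple_slope(growth, loggdp)                         # Regression 1
beta_star, lam = two_regressor_slopes(growth, loggdp, inst)  # Regression 2
delta = simple_slope(inst, loggdp)                          # Regression 3

gap = beta - beta_star
print(f"beta - beta* = {gap:.6f}")
print(f"delta x lambda = {delta * lam:.6f}")
```

The two printed numbers agree up to floating-point error regardless of the seed or the assumed coefficients, because the identity is a property of OLS arithmetic rather than of the data.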

Why this matters. The decomposition tells us there are exactly three ways unconditional convergence can change over time:

Conditional convergence itself changes ($\beta^{\ast}$ shifts) — e.g., technology diffusion accelerates
Correlate-income slopes change ($\delta$ shifts) — e.g., rich and poor countries become equally democratic
Growth regression coefficients change ($\lambda$ shifts) — e.g., democracy stops predicting growth

The paper’s central finding: it is mainly mechanism 3 — $\lambda$ flattened — that explains the emergence of unconditional convergence.
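One way to make the three mechanisms precise is to write the identity for two periods and take the difference. This is a sketch that follows from the identity above rather than a formula quoted from the paper. The notation is introduced here: $\Delta$ is the change between the two periods and a bar denotes the simple average of the two period values.

$$\beta_t = \beta^{\ast}_t + \delta_t \lambda_t$$

$$\Delta\beta = \Delta\beta^{\ast} + \bar{\lambda} \, \Delta\delta + \bar{\delta} \, \Delta\lambda$$

The three terms on the right correspond to mechanisms 1, 2, and 3 in that order. The split is exact because $\delta_1 \lambda_1 - \delta_0 \lambda_0 = \bar{\lambda} \, \Delta\delta + \bar{\delta} \, \Delta\lambda$ when the bars are two-period averages. Other weighting conventions are equally valid and assign different shares to $\delta$ and $\lambda$.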

8.3 Worked example: democracy (Polity 2)

Before generalizing, we build intuition with one correlate. Polity 2 measures democracy on a scale from -10 (autocracy) to +10 (full democracy), normalized by its 1985 standard deviation so that coefficients are in comparable units.

* Normalize polity2 by its 1985 SD
summarize polity2 if year == 1985
local sd_polity2 = r(sd)
gen polity2_norm = polity2 / `sd_polity2'
* --- Period: 1985 ---
* All three regressions must share one sample for the OVB identity to hold exactly
* Regression 1 (Unconditional):
reg loggdp_growth_10 loggdp if year == 1985 & polity2_norm != ., robust
* Regression 2 (Conditional):
reg loggdp_growth_10 loggdp polity2_norm if year == 1985, robust
* Regression 3 (Income-Institution slope):
reg polity2_norm loggdp if year == 1985 & loggdp_growth_10 != ., robust
* Repeat for 2005

---- Period: 1985 ----
Regression 1 (Unconditional): beta = 0.328 (SE = 0.199, N = 124)
Regression 2 (Conditional): beta* = -0.111, lambda = 0.891
Regression 3 (Income-Inst): delta = 0.494
OVB DECOMPOSITION:
beta - beta* = 0.440 (actual gap)
delta x lambda = 0.440 (predicted by OVB formula)
delta = 0.494 (richer countries more democratic?)
lambda = 0.891 (democracy predicts growth?)
---- Period: 2005 ----
Regression 1 (Unconditional): beta = -0.767 (SE = 0.149, N = 147)
Regression 2 (Conditional): beta* = -0.807, lambda = 0.183
Regression 3 (Income-Inst): delta = 0.216
OVB DECOMPOSITION:
beta - beta* = 0.040 (actual gap)
delta x lambda = 0.040 (predicted by OVB formula)
delta = 0.216 (richer countries more democratic?)
lambda = 0.183 (democracy predicts growth?)
COMPARISON ACROSS TIME:
delta (1985) = 0.494 --> delta (2005) = 0.216 [STABLE]
lambda (1985) = 0.891 --> lambda (2005) = 0.183 [SHRANK]
gap (1985) = 0.440 --> gap (2005) = 0.040 [CLOSED]

This single example encapsulates the paper’s entire argument. In 1985, unconditional $\beta$ was +0.33 (divergence), but controlling for democracy revealed conditional convergence at $\beta^{\ast} = -0.11$. The gap of 0.44 is exactly predicted by $\delta \times \lambda = 0.494 \times 0.891 = 0.44$, as it must be, since the OVB formula is an algebraic identity. By 2005, $\lambda$ had collapsed from 0.89 to 0.18: democracy went from being a powerful growth predictor (a one-SD higher Polity 2 score was associated with 0.89 percentage points faster annual growth) to a weak one. The resulting gap shrank from 0.44 to 0.04, a 91% reduction. The correlate-income slope $\delta$ also fell, from 0.49 to 0.22 (about 56%), but $\lambda$ fell further (roughly 80%), so the collapse in $\lambda$ was the primary driver.
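The arithmetic in this paragraph can be reproduced from the rounded estimates in the output above. The Python sketch below also splits the change in the gap into a $\delta$ part and a $\lambda$ part using two-period average weights. That midpoint split is an assumption of this example, not the paper's method.

```python
# Reported Polity 2 estimates from the worked example (rounded)
delta_1985, lam_1985 = 0.494, 0.891
delta_2005, lam_2005 = 0.216, 0.183

gap_1985 = delta_1985 * lam_1985
gap_2005 = delta_2005 * lam_2005
gap_change = gap_2005 - gap_1985
reduction = 1 - gap_2005 / gap_1985

# Midpoint split of a change in a product:
# d(delta * lambda) = avg(lambda) * d(delta) + avg(delta) * d(lambda)
delta_part = 0.5 * (lam_1985 + lam_2005) * (delta_2005 - delta_1985)
lam_part = 0.5 * (delta_1985 + delta_2005) * (lam_2005 - lam_1985)
lam_share = lam_part / gap_change

print(f"gap 1985 = {gap_1985:.3f}, gap 2005 = {gap_2005:.3f}")
print(f"reduction = {reduction:.1%}")
print(f"share of the decline attributed to lambda = {lam_share:.1%}")
```

Under this split, $\lambda$ accounts for a bit over 60% of the decline and $\delta$ for the rest. The shares depend on the weighting convention: holding $\delta$ at its 1985 value attributes more to $\lambda$, and holding it at its 2005 value attributes less.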

Think of it like a recipe that calls for two ingredients. The gap ($\delta \times \lambda$) was large in 1985 because both ingredients were present: richer countries had much better democracy ($\delta$ large) and democracy strongly predicted growth ($\lambda$ large). By 2005, the second ingredient ($\lambda$) had nearly vanished — it no longer mattered for growth predictions whether a country was democratic or not — so the recipe produced almost nothing.

Now we generalize: does this pattern hold across all growth correlates, not just democracy?

9. Are correlate-income slopes stable? (Delta)

The OVB formula has two components: $\delta$ (the correlate-income slope) and $\lambda$ (the growth-correlate slope). We examine each in turn. If $\delta$ — the relationship between income and institutions — has changed dramatically, that could explain the closing gap. But the paper finds that $\delta$ has been remarkably stable.

For each correlate, we compute $\delta$ in 1985 and in 2015, then scatter one against the other. Points on the 45-degree line mean $\delta$ has not changed; points below it mean the relationship weakened.

* For each correlate: regress Inst on loggdp in 1985 and 2015
* All correlates normalized by their 1985 SD
* Panel A: Solow fundamentals + short-run correlates
* Panel B: Long-run correlates + culture
graph combine delta_A delta_B, rows(1) cols(2) ///
graphregion(color(white)) ///
title("Stability of Correlate-Income Slopes", size(medium))
graph export "stata_convergence2_delta_stability.png", replace width(2400)

Delta fitted line slopes (delta_2015 vs delta_1985):
Solow fundamentals: slope = 0.878
Short-Run correlates: slope = 0.886
Long-Run correlates: slope = 1.024
Culture: slope = 0.884

The correlate-income relationships are remarkably stable. Fitted lines cluster tightly around the 45-degree line: Solow fundamentals 0.88, short-run correlates 0.89, long-run correlates 1.02, culture 0.88. This means the cross-country association between income and institutions has barely changed over 30 years. Richer countries still have better democracy, more investment, lower population growth, and stronger financial sectors in essentially the same proportions as in 1985. The “modernization hypothesis” — that economic development goes hand-in-hand with institutional improvement — passes its out-of-sample test.

Crucially, this stability means that the $\delta$ component is not responsible for the closing gap between unconditional and conditional convergence. The answer must lie in the other component: $\lambda$.

10. Growth regressions then vs. now: the lambda flattening

In the 1990s, a massive literature ran growth regressions of the form: Growth = $\alpha + \beta^{\ast} \times$ Income $+ \lambda \times$ Correlate $+ \varepsilon$. These regressions identified which policies and institutions predict growth and formed the empirical backbone of the “Washington Consensus” — the set of policy recommendations that international institutions gave to developing countries. The key question: do these regressions hold up with 25 years of new data?

For each correlate, we estimate $\lambda$ (the growth-correlate slope) in the base year (~1985) and in 2005, using a fixed sample of countries with data in both periods.

* For each correlate, run the growth regression in base year and 2005
* Growth = alpha + beta* x loggdp + lambda x correlate + epsilon
* Fixed country sample per correlate
* Scatter lambda_2005 vs lambda_1985
reg lambda_2005 lambda_1985 if flag_solow == 1
* -> slope = 0.861, R-sq = 0.947
reg lambda_2005 lambda_1985 if flag_solow == 0 & flag_long_run == 0
* -> slope = 0.189, R-sq = 0.063

Lambda fitted line slopes (lambda_2005 vs lambda_1985):
Solow fundamentals: slope = 0.861, R-sq = 0.947
Short-run correlates: slope = 0.189, R-sq = 0.063
Long-Run correlates: slope = 0.296
Culture: slope = 0.685

This is the most striking empirical result of the paper. Solow fundamentals (investment, population growth, education) show high persistence: a fitted slope of 0.86 with R-squared of 0.95, meaning these deep structural variables predict growth almost as well in 2005 as in 1985. In dramatic contrast, short-run correlates (democracy, governance, fiscal policy, financial development) show near-zero persistence: a slope of 0.19 with R-squared of only 0.06. There is essentially no correlation between which policy variables predicted growth in 1985 and which predict growth in 2005.
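The persistence measure used here is the slope and R-squared from a line fitted through the scatter of 2005 coefficients against 1985 coefficients. The Python sketch below shows that calculation on two made-up sets of coefficients. The numbers are hypothetical and are not the paper's estimates, and `fit_line` is a helper written for this example.

```python
def fit_line(y, x):
    """Return (slope, r_squared) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)

# Hypothetical lambda estimates in 1985, one entry per correlate
lam_1985 = [0.2, 0.5, 0.9, 1.1]

# Case 1: coefficients carry over almost one-for-one
persistent_2005 = [0.9 * v for v in lam_1985]
# Case 2: coefficients in 2005 bear no relation to their 1985 values
scrambled_2005 = [0.30, 0.10, 0.35, 0.15]

slope_p, r2_p = fit_line(persistent_2005, lam_1985)
slope_s, r2_s = fit_line(scrambled_2005, lam_1985)

print(f"persistent: slope = {slope_p:.3f}, R-sq = {r2_p:.3f}")
print(f"scrambled:  slope = {slope_s:.3f}, R-sq = {r2_s:.3f}")
```

A slope near 1 with a high R-squared means that a correlate's 1985 coefficient is a good guide to its 2005 coefficient. A slope and R-squared near zero mean that it is not.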

The Washington Consensus growth regressions, which identified specific policies and institutions as growth drivers, have failed their out-of-sample test. Variables like Polity 2 ($\lambda$ fell from 0.89 to 0.18), FH Political Rights (1.11 to 0.19), and FH Civil Liberties (0.96 to 0.17) went from strong growth predictors to near-zero predictors. Long-run correlates and culture occupy an intermediate position (slopes 0.30 and 0.69 respectively).

Why did this happen? There are at least three possible explanations: (a) as correlates converged (Section 7), the reduced cross-country variation made coefficient estimation noisier; (b) the original regressions may have been overfitted to a specific historical sample; (c) the relationship between institutions and growth may be non-linear — institutions matter most when differences are large, and less when all countries have reasonably good policies. The analysis cannot distinguish between these, but the empirical fact is clear: $\lambda$ collapsed.
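Explanation (a) can be made concrete with the textbook sampling variance of an OLS slope under homoskedastic errors. This is a standard result stated here as an aid, not something the paper derives, and the tutorial's own standard errors are heteroskedasticity-robust:

$$\operatorname{Var}(\hat{\lambda}) = \frac{\sigma^2}{\text{SST}_{\text{Inst}} \, (1 - R^2_{\text{Inst}})}$$

Here $\sigma^2$ is the error variance in Regression 2, $\text{SST}_{\text{Inst}} = \sum_i (\text{Inst}_i - \overline{\text{Inst}})^2$ is the total cross-country variation in the correlate, and $R^2_{\text{Inst}}$ is the R-squared from Regression 3. As correlates converge, $\text{SST}_{\text{Inst}}$ shrinks and the variance of $\hat{\lambda}$ rises, so noisier estimates would be expected even if the underlying relationship had not changed.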

Since $\delta$ is stable (Section 9) and $\lambda$ collapsed (this section), their product $\delta \times \lambda$ must have shrunk toward zero. The next section confirms this.

11. The punchline: absolute convergence converges to conditional

11.1 The OVB gap is closing

The product $\delta \times \lambda$ quantifies how much each correlate biases the unconditional convergence coefficient. We scatter $\delta \times \lambda$ in 2005 against its value in 1985 to see whether this “explanatory gap” has closed.

* Scatter delta*lambda in 2005 vs 1985
reg dl_2005 dl_1985 if flag_solow == 0 & flag_long_run == 0
* -> slope = 0.090 (short-run correlates: gap essentially vanished)
reg dl_2005 dl_1985 if flag_solow == 1
* -> slope = 0.740 (Solow fundamentals: gap partially retained)

OVB gap fitted line slopes (dl_2005 vs dl_1985):
Panel A:
Solow fundamentals: slope = 0.740
Short-Run correlates: slope = 0.090
Panel B:
Long-Run correlates: slope = 0.480
Culture: slope = 0.739

The OVB gap for short-run correlates has shrunk to nearly zero (fitted slope 0.09). In 1985, omitting these policy and institutional variables made unconditional convergence look substantially worse than conditional convergence. By 2005, the two are nearly identical. Solow fundamentals retained more of their explanatory power (slope 0.74), reflecting the stability of both their $\delta$ and $\lambda$ components. This confirms the paper’s central thesis: unconditional convergence emerged not because the income-correlate relationship changed ($\delta$ is stable) but because policy variables stopped predicting growth ($\lambda$ flattened).

11.2 The closing gap over time

The definitive test uses multivariate regressions. We fix a sample of 73 countries with complete data on 10 correlates (Polity 2, FH political rights, FH civil liberties, private investment, government spending, inflation, WDI credit, credit by financial sector, Barro-Lee education, and education gender gap). For each year from 1985 to 2007, we estimate both unconditional $\beta$ (income only) and conditional $\beta^{\ast}$ (income plus all 10 correlates).

* Fix sample: 73 countries with complete data on all 10 correlates in 1985
local var_all polity2 FH_political_rights FH_civil_liberties pri_inv ///
gov_spending inflation WDI_credit credit barrolee2060 edugap
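forval yr = 1985/2007 {
* in_sample is a hypothetical 0/1 flag marking the 73 fixed-sample countries
* Unconditional: income only
reg loggdp_growth_10 loggdp if year == `yr' & in_sample == 1, robust cluster(country_id)
* Conditional: income plus all 10 correlates
reg loggdp_growth_10 loggdp `var_all' if year == `yr' & in_sample == 1, robust cluster(country_id)
}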
* Plot the closing gap
twoway (line beta_unconditional year, lcolor("20 20 19") lwidth(medthick)) ///
(line beta_conditional year, lcolor("106 155 204") lwidth(medthick)) ///
(line zero year, lcolor("217 119 87") lpattern(dot)), ///
legend(label(1 "Absolute Convergence") label(2 "Conditional Convergence"))
graph export "stata_convergence2_absolute_vs_conditional.png", replace width(2400)

Year | beta_unconditional beta_conditional gap
------+-------------------------------------------
1985 | 0.420 -1.072 1.492
1990 | 0.377 -0.560 0.937
1995 | 0.081 -0.155 0.236
2000 | -0.387 -0.540 0.153
2005 | -0.556 -0.969 0.413
2007 | -0.646 -1.274 0.629

This is the paper’s title finding. In 1985, unconditional $\beta$ was +0.42 (divergence) while conditional $\beta^{\ast}$ was -1.07 (strong convergence once institutions are controlled for), a gap of 1.49. By 2000, unconditional $\beta$ had fallen to -0.39 while conditional $\beta^{\ast}$ was -0.54, narrowing the gap to just 0.15. The gap then widened again (0.41 in 2005, 0.63 in 2007) as conditional $\beta^{\ast}$ deepened faster than unconditional $\beta$, but both coefficients are firmly negative from 2000 onward.
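The gap column is simply $\beta - \beta^{\ast}$, so it can be recomputed from the reported coefficients. The short Python sketch below does that and picks out the year with the smallest gap. It uses only the rounded values printed in the table, so a recomputed gap can differ from the table by 0.001.

```python
# Reported coefficients: year -> (unconditional beta, conditional beta*)
reported = {
    1985: (0.420, -1.072),
    1990: (0.377, -0.560),
    1995: (0.081, -0.155),
    2000: (-0.387, -0.540),
    2005: (-0.556, -0.969),
    2007: (-0.646, -1.274),
}

# Gap between unconditional and conditional convergence in each year
gaps = {yr: round(b - b_star, 3) for yr, (b, b_star) in reported.items()}
min_year = min(gaps, key=gaps.get)

for yr, gap in gaps.items():
    print(f"{yr}: gap = {gap:.3f}")
print(f"smallest gap in {min_year}")
```

Among the six reported years, the gap is smallest in 2000, which matches the narrowing-then-widening pattern described above.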

The Solow model’s prediction of conditional convergence held all along — what changed is that the real world caught up. As the OVB from excluding correlates shrank toward zero, unconditional convergence “converged to” conditional convergence.

11.3 Multivariate evidence (Table 5)

The multivariate regressions crystallize the structural change by showing how adding correlates affects the convergence coefficient in each period.

       | abs_1985  solow_1985  short_1985  full_1985  abs_2005  solow_2005  short_2005  full_2005
-------+------------------------------------------------------------------------------------------
loggdp |    0.420      -0.447      -0.435     -0.816    -0.556      -1.176      -0.557     -1.040
       |  (0.252)     (0.661)     (0.457)    (0.619)   (0.203)     (0.309)     (0.327)    (0.393)
R2     |    0.028       0.155       0.152      0.228     0.101       0.247       0.258      0.355
N      |       73          73          73         73        73          73          73         73

In 1985, absolute convergence alone gives $\beta = +0.42$ (divergence, with R-squared = 0.03, essentially no linear relationship). Adding Solow fundamentals flips the sign to $\beta^{\ast} = -0.45$, and the full model gives $\beta^{\ast} = -0.82$. In 2005, the picture changes fundamentally: absolute convergence is already strong at $\beta = -0.56$ (R-squared = 0.10). Adding short-run correlates alone leaves the coefficient essentially unchanged (-0.556 versus -0.557), confirming that policy variables no longer shift the convergence coefficient. They still improve fit (R-squared rises from 0.10 to 0.26). Solow fundamentals, by contrast, still move the coefficient (to -1.18), and the full model reaches an R-squared of 0.35.

12. Robustness: does the averaging period matter?

The main results use 10-year forward-looking growth rates. One concern is that 10-year averaging may smooth out noise in a way that creates artificial trends. We check by re-estimating the rolling beta-convergence trend using 1-year, 2-year, 5-year, and 10-year growth averages.

* For each averaging period t = 1, 2, 5, 10 (the lead operator requires xtset)
xtset country_id year
foreach t of numlist 1 2 5 10 {
gen loggdp_growth_`t' = 100 * ((F`t'.logrgdpna - logrgdpna) / `t')
areg loggdp_growth_`t' c.loggdp#i.year, absorb(year) vce(cluster country_id)
}

Results:
1-year average: high noise, downward trend visible but obscured by fluctuations
2-year average: moderate noise, downward trend clearer
5-year average: smooth, clear downward trend from ~0 to ~-0.5 by late 2000s
10-year average: smoothest, clearest trend from +0.5 to -0.76 by 2007

The convergence trend is robust across all averaging periods. As expected, shorter periods produce noisier estimates — the 1-year panel is dominated by year-to-year fluctuations — while longer averages yield smoother trends. All four specifications agree that the crossover from divergence to convergence occurs around 1990–2000, confirming that the finding is not an artifact of the 10-year growth rate choice.
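The smoothing effect of longer windows follows directly from the growth formula. As an illustration, the Python sketch below applies the same forward-growth definition to one hypothetical country with 2% trend growth and a single transitory shock. The series, the size of the shock, and the helper `forward_growth` are assumptions made for this example.

```python
import math

def forward_growth(log_gdp, horizon):
    """Annualized forward growth in percent: 100 * (log y[t+h] - log y[t]) / h.
    Returns None where the lead is unavailable, like a missing value in Stata."""
    out = []
    for t in range(len(log_gdp)):
        if t + horizon < len(log_gdp):
            out.append(100 * (log_gdp[t + horizon] - log_gdp[t]) / horizon)
        else:
            out.append(None)
    return out

# Hypothetical country: 2% trend growth for 20 years
log_gdp = [math.log(1000) + 0.02 * t for t in range(20)]
log_gdp[5] += 0.05  # assumed one-year transitory boom in year 5

spreads = {}
for h in (1, 2, 5, 10):
    series = [g for g in forward_growth(log_gdp, h) if g is not None]
    spreads[h] = max(series) - min(series)
    print(f"{h:>2}-year window: range of growth rates = {spreads[h]:.2f} pp")
```

The same one-year shock moves the 1-year growth rate far more than the 10-year rate, because dividing by the horizon spreads the shock over more years. That is why the shorter windows in the results above look noisier while showing the same trend.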

13. Discussion

Let us return to the question posed in the Overview: why did unconditional convergence emerge since 2000?

The OVB framework provides a clear and quantitative answer. The gap between unconditional convergence ($\beta$) and conditional convergence ($\beta^{\ast}$) is exactly equal to the product $\delta \times \lambda$. This gap closed because $\lambda$, the coefficient on growth correlates in growth regressions, collapsed for short-run policy and institutional variables (fitted slope = 0.19, R-squared = 0.06). Meanwhile, $\delta$, the relationship between income and institutions, remained remarkably stable (fitted slopes of roughly 0.88 to 1.02, close to the 45-degree line). In concrete terms: richer countries still have better institutions in the same proportions as 30 years ago, but those institutional advantages no longer translate into faster growth. As a result, unconditional convergence caught up to conditional convergence.

This has important implications for how we think about economic development. The 1990s “Washington Consensus” was built on the empirical finding that good policies and institutions predict faster growth. Our out-of-sample test shows that many of these relationships did not persist into the 2000s — at least not for short-run policy variables. Solow fundamentals (investment, population growth, education) remained robust growth predictors, consistent with the Solow model’s enduring relevance. But governance indices, fiscal indicators, and financial variables that were “significant” in 1990s regressions no longer predict growth. This raises questions about the stability of policy advice based on cross-country growth regressions.

Caveats. Several important limitations apply. First, the analysis is entirely descriptive — cross-country regressions do not establish causal relationships. The flattening of $\lambda$ could reflect genuine changes in causal relationships, convergence in unobserved variables, or reduced cross-country variation making coefficient estimation noisier. Second, the panel is unbalanced (109 countries in 1960 vs. 160 by 1990), and sample composition changes could mechanically affect estimates. Third, some correlates have small samples (fewer than 60 observations), limiting statistical precision. Finally, the 10-year growth variable is forward-looking, so the last usable observation is 2007/2008, missing the Global Financial Crisis, the post-GFC recovery, and COVID-19. Whether convergence persisted through these shocks is an open question.

14. Summary and key takeaways

This tutorial reproduced the key findings of Kremer, Willis, and You (2021), documenting the emergence of unconditional convergence and explaining it through the OVB decomposition framework. The analysis used 160 countries over 58 years with 50+ growth correlates.

The story in four facts

Unconditional convergence emerged around 2000. The $\beta$-convergence coefficient shifted from +0.53 in the 1960s (divergence, p = 0.006) to -0.76 by 2007 (convergence, p < 0.001), with a systematic trend of -0.025 per year.
Growth correlates converged. Inflation ($\beta = -3.07$), investment ($\beta = -2.98$), and democracy ($\beta = -2.03$) all showed strong convergence. Countries with initially worse institutions experienced the largest improvements.
Growth regression coefficients collapsed for policy variables. Solow fundamentals maintained high stability ($\lambda$ slope = 0.86, R-squared = 0.95), but short-run correlates showed near-zero persistence ($\lambda$ slope = 0.19, R-squared = 0.06). The 1990s growth regressions failed their out-of-sample test.
The gap between absolute and conditional convergence closed. The Polity 2 worked example shows the gap fell from 0.44 to 0.04 (a 91% reduction). In the multivariate analysis, the gap narrowed from 1.49 (1985) to 0.15 (2000).

Limitations

Descriptive, not causal: The OVB framework decomposes observed correlations, not causal relationships
Pre-2008 endpoint: The analysis does not cover the Global Financial Crisis or COVID-19
Small samples for some correlates: Culture and tariff variables have fewer than 60 observations
Normalization sensitivity: All correlate coefficients are normalized by their 1985 standard deviation

Next steps

Extend the analysis through the 2010s using updated PWT data to test whether convergence survived the post-GFC period
Explore non-linear specifications to test whether $\lambda$ flattened because of reduced correlate variation
Apply the OVB decomposition to regional subsamples (e.g., does the mechanism differ for Sub-Saharan Africa vs. East Asia?)

15. Exercises

Your own worked example. Choose a different correlate from the dataset (e.g., investment or FH political rights) and replicate the OVB worked example from Section 8.3. Compute $\beta$, $\beta^{\ast}$, $\delta$, $\lambda$, and verify the identity $\beta - \beta^{\ast} = \delta \times \lambda$ for both 1985 and 2005. Did the gap close for your chosen variable? Was the primary driver the change in $\delta$ or $\lambda$?
Balanced panel sensitivity. Re-estimate the rolling beta-convergence trend (Section 4) using only countries that have GDP data from 1960 onward (a balanced panel of approximately 109 countries). Does the convergence trend look different when you exclude countries that enter the sample later? What does this tell you about the role of sample composition changes?
Alternative classification. The paper classifies variables as “Solow fundamentals” or “short-run correlates.” Move education (barrolee2060) from the Solow group to the short-run group and re-estimate the lambda stability scatters (Section 10). Does the Solow fitted line slope change substantially? What does this tell you about the robustness of the paper’s classification scheme?

References

Acknowledgements

AI tools (Claude Code) were used to make the contents of this post more accessible to students. Nevertheless, the content in this post may still have errors. Caution is needed when applying the contents of this post to true research projects.