Treatment Effects in Stata — Interactive Lab

A pedagogical companion to Treatment Effects in Stata: A Beginner's Tour of Six Estimators

Six routes to the same causal question

Does maternal smoking during pregnancy cause lower birth weight, or do smokers and non-smokers simply differ on other characteristics that happen to predict birth weight? The naive gap is −275 g. Adjust for observable confounders and the gap shrinks to about −230 g. That 45-gram correction is exactly what causal-inference machinery buys us.

This app lets you watch the six estimators in the post in action. Slide a confounder strength, regenerate data, and see how regression adjustment, IPW, doubly robust estimators, and matching all converge on a similar answer when the identification assumptions hold — and how they pull apart when they don't.

The shrinking penalty — L1 vs. L2 (a useful analogy)

In this example, adjusting for confounders shrinks the naive estimate toward zero. The animation below shows two ways of shrinking a coefficient: L1 (LASSO) pulls it all the way to exactly zero once the penalty is large enough, while L2 (Ridge) scales it down smoothly without ever reaching zero. The same intuition applies to causal adjustment: each method pulls the naive estimate toward the truth, by a different amount and through a different route.
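The two shrinkage rules can be sketched in a few lines. This is a minimal illustration, assuming the textbook orthonormal-design case (where both penalties have a closed form) and one common scaling of the penalty; the function names and the values of `beta` and `lam` are hypothetical and are not taken from the app.

```python
def lasso_shrink(beta, lam):
    """L1 (soft-thresholding): move beta toward zero by lam, stopping at exactly zero."""
    magnitude = max(abs(beta) - lam, 0.0)
    return magnitude if beta >= 0 else -magnitude


def ridge_shrink(beta, lam):
    """L2 (proportional shrinkage): scale beta by 1 / (1 + lam); never exactly zero."""
    return beta / (1.0 + lam)


if __name__ == "__main__":
    beta = 2.0  # hypothetical unpenalized coefficient
    for lam in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(lam, lasso_shrink(beta, lam), round(ridge_shrink(beta, lam), 3))
```

Once `lam` reaches the size of the coefficient, the L1 rule returns zero, while the L2 rule keeps returning a smaller but nonzero value.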

Tab 2

Confounding Lab

See how a back-door confounder distorts a naive comparison, and how propensity-score overlap diagnoses the damage.

Tab 3

Estimator Simulator

Slide confounder strength and watch Naive, RA, IPW, and AIPW converge — or diverge — on the true effect.

Tab 4

Forest Plot

The post's headline figure, interactive. Toggle ATE vs ATT, drop NNM, hover for SEs, CIs, and covariate counts.

Glossary (open a card if a term is unfamiliar)

Potential outcomes Y(0), Y(1)
For every mother, the two parallel-universe birth weights — one if she smoked, one if she didn't. We observe one; the other is counterfactual and must be estimated.
ATE vs ATT
ATE = average effect over everyone. ATT = average effect on the treated. They diverge when smokers' covariate profile differs from the population average.
Conditional independence (CIA)
After conditioning on observed X, treatment D is "as good as random" with respect to the potential outcomes. The strong, untestable identifying assumption.
Overlap (positivity)
For every X-profile, both treated and control units exist. Diagnosed visually with teffects overlap. Without overlap, no comparable counterfactual exists.
Propensity score e(X)
P(D=1|X) — the conditional probability of treatment given covariates. Rosenbaum & Rubin (1983): matching on the scalar e(X) balances every X that entered the model.
Regression Adjustment (RA)
Fit an outcome model for treated and another for control. Predict both potential outcomes for every unit. Average the gap. Models the outcome only.
Inverse-Probability Weighting (IPW)
Reweight by 1/ê(X) for treated, 1/(1−ê(X)) for control. Recovers a pseudo-randomized sample. Models the treatment only.
Doubly robust (IPWRA, AIPW)
Combine outcome and treatment models. Consistent if either is right. When both are right, AIPW also attains the semiparametric efficiency bound. Belt and suspenders.
Nearest-Neighbor Matching (NNM)
Find each treated unit's statistical twin in covariate space (Mahalanobis distance). No parametric model. The most assumption-light of the six, but it pays for that in precision.
Propensity-Score Matching (PSM)
Same matching idea as NNM, but match on the scalar propensity ê(X). One-dimensional matching distance. Still needs the propensity model to be right.
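For reference, the ATE versions of the RA, IPW, and AIPW cards above can be written compactly. The notation is an assumption layered on the glossary: \hat\mu_d(X) is the fitted outcome model for arm d, \hat e(X) is the fitted propensity score, and n is the sample size. These are standard textbook forms; software may normalize differently (for example, rescaling the IPW weights to sum to one within each arm).

```latex
\hat\tau_{\mathrm{RA}} = \frac{1}{n}\sum_{i=1}^{n}\bigl[\hat\mu_1(X_i) - \hat\mu_0(X_i)\bigr]

\hat\tau_{\mathrm{IPW}} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{D_i\,Y_i}{\hat e(X_i)} - \frac{(1-D_i)\,Y_i}{1-\hat e(X_i)}\right]

\hat\tau_{\mathrm{AIPW}} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat\mu_1(X_i) - \hat\mu_0(X_i)
  + \frac{D_i\,\bigl(Y_i - \hat\mu_1(X_i)\bigr)}{\hat e(X_i)}
  - \frac{(1-D_i)\,\bigl(Y_i - \hat\mu_0(X_i)\bigr)}{1-\hat e(X_i)}\right]
```

Read this way, AIPW is the RA estimate plus an inverse-probability-weighted average of the outcome model's residuals, which is where the "either model" protection comes from.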

Confounding Lab — see the back-door path

A confounder X drives both the treatment decision D and the outcome Y. In the post, X is maternal age / education / marital status, D is smoking, Y is birth weight. Slide the confounder strength and watch the naive comparison drift further from the true effect. The overlap histogram below shows whether smokers and non-smokers occupy similar regions of X-space — the prerequisite for any adjustment to work.

n (sample size): Larger n shrinks confidence intervals but does not remove bias.
γ (confounder → treatment): How strongly X drives D. γ = 0 ⇒ random treatment (no confounding). γ large ⇒ severe selection bias.
δ (confounder → outcome): How much X affects Y (independent of D). The back-door pathway from X to Y.
true ATE (set by us)
−200
grams (the answer we're trying to recover)
naive gap Ȳ₁ − Ȳ₀
what an unadjusted study would report
selection bias
naive − true. The gap adjustment must close.
% smokers in sample
at γ = 0 this hovers near 50%; large γ skews it.

What to look for

  • Set γ = 0. The two propensity-score densities collapse onto each other. Treatment is unconfounded and the naive gap recovers the true effect, up to sampling noise.
  • Crank γ up. The two densities pull apart. Smokers' propensity-score density shifts right, non-smokers' left. The naive gap grows in magnitude — pure selection bias.
  • Watch the bias counter. At γ = 1.5, δ = 120, the bias is on the order of 50–100 g of fake effect contributed by the confounder. That is the gap RA, IPW, and matching are designed to close.
  • If the two densities barely overlap, no adjustment will save you. That is what the post calls a violation of the overlap assumption.
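The checklist above can be reproduced offline with a small simulation. This is a sketch under stated assumptions, not the app's code: the data-generating process (a single standard-normal confounder, a logistic treatment rule, a baseline of 3400 g, noise of 100 g) and the names `simulate`, `naive_gap`, and `overlap_share` are all hypothetical, so the bias figures it prints will not match the app's counter.

```python
import math
import random


def simulate(n, gamma, delta, tau=-200.0, seed=0):
    """Draw n units from a hypothetical DGP with one confounder X.

    X ~ N(0, 1) is a risk index; P(D=1 | X) = logistic(gamma * X);
    Y = 3400 + tau * D - delta * X + noise, so X raises smoking and lowers weight.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-gamma * x))  # true propensity score
        d = 1 if rng.random() < p else 0
        y = 3400.0 + tau * d - delta * x + rng.gauss(0.0, 100.0)
        rows.append((x, d, y, p))
    return rows


def naive_gap(rows):
    """Difference in mean outcomes, treated minus control."""
    y1 = [y for _, d, y, _ in rows if d == 1]
    y0 = [y for _, d, y, _ in rows if d == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


def overlap_share(rows, lo=0.1, hi=0.9):
    """Share of units whose true propensity lies in [lo, hi]: a crude overlap check."""
    return sum(1 for _, _, _, p in rows if lo <= p <= hi) / len(rows)


if __name__ == "__main__":
    for gamma in (0.0, 0.75, 1.5, 3.0):
        rows = simulate(n=5000, gamma=gamma, delta=120.0)
        gap = naive_gap(rows)
        # bias = naive - true, with true tau = -200
        print(f"gamma={gamma}: naive={gap:.1f}, bias={gap + 200.0:.1f}, "
              f"overlap={overlap_share(rows):.2f}")
```

As γ grows, the naive gap should move further from −200 and the share of units with a moderate propensity score should fall, mirroring the two densities pulling apart.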

Estimator Simulator — four routes, one truth

Generate confounded data with a known true ATE (so you can grade each estimator). Compute four methods live: Naive (no adjustment), RA (outcome model), IPW (propensity reweight), and AIPW (doubly robust). When confounding is strong, the three adjusted estimators should cluster near the truth and the naive estimate should drift away.

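n (sample size): Capped at 500 so the "Run 100 sims" button finishes in < 300 ms.
γ (confounder → treatment): Higher γ ⇒ stronger selection bias for naive, more work for adjusted estimators.
δ (confounder → outcome): How much X affects Y. Drives the naive bias, which in a linear outcome model is δ·Cov(X,D)/Var(D).
τ (true ATE): The number every adjusted estimator should recover.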

Adjusted Estimators

Each conditions on X. The doubly robust one (AIPW) is consistent if either the outcome model or the propensity model is right.

RA (outcome model)
IPW (propensity)
AIPW (doubly robust)
true τ

Naive Comparison

Difference of group means. Biased whenever X drives both D and Y.

Ȳ smokers
Ȳ non-smokers
naive gap
bias (naive − true)

What to look for

  • Set γ = 0. All four estimates collapse onto each other and onto the true τ, up to sampling noise. There is nothing to adjust for, so adjusting changes almost nothing.
  • Raise γ. The naive bar pulls away from the truth. RA, IPW, AIPW stay close to τ — that's the post's headline lesson, on simulated data this time.
  • Compare RA and IPW. They model entirely different sides of the data — outcome vs. treatment — and yet land within sampling noise of each other. The agreement is the §11 "convergence at −230 g" story in miniature.
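  • AIPW usually lands close to RA and IPW, though not necessarily between them. It starts from the outcome-model prediction and adds an inverse-probability-weighted correction built from that model's residuals. If one model is misspecified, AIPW is rescued by the other.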
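The four estimates in this tab can be sketched in plain Python. Everything here is an assumption made for illustration rather than the app's implementation: the one-covariate data-generating process, the linear outcome models, the logit propensity model fitted by Newton-Raphson, the use of normalized IPW weights, and all function names.

```python
import math
import random


def simulate(n, gamma, delta, tau=-200.0, seed=0):
    """Hypothetical DGP: X ~ N(0, 1), P(D=1 | X) = logistic(gamma * X),
    Y = 3400 + tau * D - delta * X + N(0, 100^2) noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        d = 1 if rng.random() < 1.0 / (1.0 + math.exp(-gamma * x)) else 0
        data.append((x, d, 3400.0 + tau * d - delta * x + rng.gauss(0.0, 100.0)))
    return data


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def fit_line(points):
    """OLS of y on one x from (x, y) pairs; returns a prediction function."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return lambda x: my + (sxy / sxx) * (x - mx)


def fit_logit(data, steps=25):
    """Logit of D on (1, X) by Newton-Raphson; returns the fitted e_hat(x)."""
    a = b = 0.0
    for _ in range(steps):
        ga = gb = haa = hab = hbb = 0.0
        for x, d, _ in data:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            ga, gb = ga + (d - p), gb + (d - p) * x                # score vector
            haa, hab, hbb = haa + w, hab + w * x, hbb + w * x * x  # information matrix
        det = haa * hbb - hab * hab
        a, b = a + (hbb * ga - hab * gb) / det, b + (haa * gb - hab * ga) / det
    return lambda x: 1.0 / (1.0 + math.exp(-(a + b * x)))


def estimate(data):
    treated = [(x, y) for x, d, y in data if d == 1]
    control = [(x, y) for x, d, y in data if d == 0]
    mu1, mu0, e_hat = fit_line(treated), fit_line(control), fit_logit(data)

    naive = mean(y for _, y in treated) - mean(y for _, y in control)
    ra = mean(mu1(x) - mu0(x) for x, _, _ in data)
    # Normalized IPW: weights are rescaled to sum to one within each arm.
    w1 = [1.0 / e_hat(x) for x, _ in treated]
    w0 = [1.0 / (1.0 - e_hat(x)) for x, _ in control]
    ipw = (sum(w * y for w, (_, y) in zip(w1, treated)) / sum(w1)
           - sum(w * y for w, (_, y) in zip(w0, control)) / sum(w0))
    # AIPW: RA prediction plus a weighted correction from the outcome residuals.
    aipw = mean(
        mu1(x) - mu0(x)
        + d * (y - mu1(x)) / e_hat(x)
        - (1 - d) * (y - mu0(x)) / (1.0 - e_hat(x))
        for x, d, y in data
    )
    return {"naive": naive, "RA": ra, "IPW": ipw, "AIPW": aipw}


if __name__ == "__main__":
    for name, value in estimate(simulate(n=4000, gamma=1.5, delta=120.0)).items():
        print(f"{name:>5}: {value:8.1f}   (true tau = -200)")
```

Both working models are correctly specified for this simulated process, so the sketch shows the agreement case only; deliberately breaking one model (say, dropping X from the outcome regression) is a natural next experiment.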

Bias vs. variance over many simulations

Single runs are noisy. Run the pipeline 100 times with fresh draws (same parameters, different ε) to see whether the naive bias is systematic — and how tight the adjusted estimators are around the truth.
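A repeated-sampling loop of this kind might look like the sketch below. It is an illustration under assumptions, not the app's pipeline: the data-generating process, the default parameter values, and the names `one_run` and `monte_carlo` are hypothetical, and only the naive and RA estimators are tracked to keep it short.

```python
import math
import random
import statistics


def one_run(rng, n=500, gamma=1.5, delta=120.0, tau=-200.0):
    """One simulated sample from a hypothetical DGP; returns (naive, RA) estimates."""
    xs, ds, ys = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        d = 1 if rng.random() < 1.0 / (1.0 + math.exp(-gamma * x)) else 0
        xs.append(x)
        ds.append(d)
        ys.append(3400.0 + tau * d - delta * x + rng.gauss(0.0, 100.0))

    def arm(k):
        return [(x, y) for x, d, y in zip(xs, ds, ys) if d == k]

    def fit(points):
        # Simple OLS of y on x within one treatment arm.
        mx = statistics.fmean(x for x, _ in points)
        my = statistics.fmean(y for _, y in points)
        slope = (sum((x - mx) * (y - my) for x, y in points)
                 / sum((x - mx) ** 2 for x, _ in points))
        return lambda x: my + slope * (x - mx)

    naive = statistics.fmean(y for _, y in arm(1)) - statistics.fmean(y for _, y in arm(0))
    mu1, mu0 = fit(arm(1)), fit(arm(0))
    ra = statistics.fmean(mu1(x) - mu0(x) for x in xs)
    return naive, ra


def monte_carlo(reps=100, seed=1, tau=-200.0):
    """Repeat one_run with fresh draws; summarize bias and spread per estimator."""
    rng = random.Random(seed)
    runs = [one_run(rng, tau=tau) for _ in range(reps)]
    summary = {}
    for name, values in (("naive", [r[0] for r in runs]), ("RA", [r[1] for r in runs])):
        summary[name] = {
            "bias": statistics.fmean(values) - tau,  # systematic error
            "sd": statistics.stdev(values),          # run-to-run noise
        }
    return summary


if __name__ == "__main__":
    for name, stats in monte_carlo().items():
        print(f"{name:>5}: bias={stats['bias']:7.1f}  sd={stats['sd']:5.1f}")
```

A bias that stays large across repetitions while the standard deviation stays modest is the signature of systematic confounding rather than an unlucky draw.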

The post's forest plot — interactively

These numbers come straight from ate_estimates.csv in the post folder — the same numbers used to produce Figure 9 (the post's headline comparison). Toggle ATE vs ATT, drop NNM to focus on the tight cluster, hover any point for SE, CI, and number of covariates used.

What to look for

  • The naive bar is the outlier. At −275 g it sits well below every adjusted estimator. Selection bias of ~45 g is the gap that motivates the whole exercise.
  • Five adjusted estimators cluster between −229 and −240 g. RA, IPW, IPWRA, AIPW, PSM. They use different functional forms and different identification arguments and they still land within ±10 g of each other.
  • NNM is the only outlier among adjusted methods. Its ATE (−210 g) is a touch closer to zero, its CI a touch wider — the non-parametric tradeoff. But its CI overlaps every other estimator's.
  • Toggle to ATT. Four of the five matching/IPW methods give ATTs slightly closer to zero than their ATE. NNM reverses the pattern — its ATT (−238 g) is larger in magnitude than its ATE (−210 g).

Estimand

Methods

Why does the naive estimate overshoot?

Smokers in the Cattaneo sample are younger, less educated, less likely to be married, and less likely to have first-trimester prenatal care than non-smokers. Each of those covariates independently predicts lower birth weight. The naive gap therefore mixes the genuine effect of smoking with the contribution of every X-difference between groups. The six adjusted methods all block the back-door path D ← X → Y in some way. Five of them land within about ±10 g of −230 g; NNM, at −210 g, sits slightly closer to zero.

Why does NNM disagree slightly?

NNM matches each smoker to her closest non-smoker by Mahalanobis distance over the full covariate vector. It is the only method that fits no parametric model, neither an outcome equation nor a propensity score. That non-parametric freedom buys it robustness to functional-form errors but costs it precision: a wider CI and a point estimate (−210 g) that sits about 20 g from the parametric cluster. The CI overlaps every other estimator's, so the disagreement is compatible with sampling noise. NNM's reversed ATE/ATT pattern (ATT larger than ATE in magnitude) is consistent with matching around the treated mothers weighting the data toward the covariate region where smoking does more damage, though with overlapping CIs it should not be over-read.
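The matching step itself can be sketched for the simplest possible case. The assumptions are loud ones: a single covariate (so Mahalanobis distance reduces to absolute distance up to scale), one nearest neighbor with replacement, the ATT only, and a hypothetical simulated data set in place of the Cattaneo sample. It is not how Stata's teffects nnmatch is implemented, and the function names are illustrative.

```python
import bisect
import math
import random


def simulate(n, gamma, delta, tau=-200.0, seed=0):
    """Hypothetical DGP: X ~ N(0, 1), P(D=1 | X) = logistic(gamma * X),
    Y = 3400 + tau * D - delta * X + N(0, 100^2) noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        d = 1 if rng.random() < 1.0 / (1.0 + math.exp(-gamma * x)) else 0
        data.append((x, d, 3400.0 + tau * d - delta * x + rng.gauss(0.0, 100.0)))
    return data


def nnm_att(data):
    """ATT by 1-nearest-neighbor matching with replacement on a single covariate."""
    controls = sorted((x, y) for x, d, y in data if d == 0)
    cx = [x for x, _ in controls]
    gaps = []
    for x, d, y in data:
        if d != 1:
            continue
        # The closest control is one of the two neighbors of x in the sorted list.
        i = bisect.bisect_left(cx, x)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(cx)]
        j = min(candidates, key=lambda k: abs(cx[k] - x))
        gaps.append(y - controls[j][1])  # treated outcome minus matched control outcome
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    data = simulate(n=4000, gamma=1.5, delta=120.0)
    print(f"NNM ATT estimate: {nnm_att(data):.1f} (true effect = -200)")
```

Because no outcome or propensity model is fitted, the only inputs are the distances themselves, which is why the estimate gets noisier wherever close matches are scarce.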

Connecting back to Tab 3

In Tab 3 you watched RA, IPW, and AIPW converge toward a known true effect on simulated data while the naive estimate drifted away. The forest plot above is the same story, but with real data and the truth unknown. The convergence of five methods near −230 g is our strongest evidence that the adjustment for X does not hinge on any one functional form. It cannot tell us whether X is enough: the remaining uncertainty, which the post calls out in §15, is everything that isn't in X.