DiD-101 — Interactive Lab

A pedagogical companion to Introduction to Difference-in-Differences (DiD) in Python

Why Difference-in-Differences?

A school district rolls out an after-school tutoring program in 10 of its 35 high schools. After one year, average GPA in tutored schools jumps from 60.17 to 96.37, a 36.20-point increase. Case closed? Not quite. Over the same period, average GPA in the 25 untreated schools also rose, from 71.22 to 82.10 (a 10.88-point increase). Difference-in-Differences strips away that common trend and reveals the program's causal effect on the tutored schools: an ATT of 36.20 − 10.88 = 25.32 GPA points.

This app lets you turn the dials yourself. In four tabs you will: watch parallel trends collide with a treatment shock; simulate your own 2×2 design and compare the naive before-after estimate to the DiD ATT; explore the post's actual estimates across specifications and SE types; and inspect the event-study coefficients that test the parallel-trends assumption.

Parallel trends — the identifying assumption

DiD works only if treated and control units would have moved on parallel paths in the absence of the treatment. The animation below sweeps a "treatment" knob: when it is zero, both lines move in lockstep. When you flip it on, the treated line jumps. The vertical gap between the treated line and the path it would have followed without treatment is the ATT, which is exactly what DiD recovers.
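
The logic of the animation can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual code: the function name, baseline levels, slope, and treatment timing are all hypothetical values chosen for the example.

```python
import numpy as np

def trajectories(att, slope=2.0, treated_level=60.0, control_level=70.0,
                 n_periods=8, treat_start=4):
    """Return (control, treated, counterfactual) paths under parallel trends.

    All default values are hypothetical and chosen only for illustration.
    """
    t = np.arange(n_periods)
    control = control_level + slope * t
    # The treated group's untreated path runs parallel to the control path.
    counterfactual = treated_level + slope * t
    # The treatment shifts the treated path up from treat_start onward.
    post = (t >= treat_start).astype(float)
    treated = counterfactual + att * post
    return control, treated, counterfactual

# Knob at zero: the gap between the two groups never changes.
control, treated, counterfactual = trajectories(att=0.0)
gap_off = treated - control

# Knob on: the treated path departs from its counterfactual by the ATT.
control, treated, counterfactual = trajectories(att=25.0)
final_gap = treated[-1] - counterfactual[-1]
print(gap_off, final_gap)
```

Note that the two groups start at different levels; only the slopes need to match.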

Tab 2

DiD Simulator

Slide the parallel-trends slope, the true ATT, and noise. Watch the naive before-after estimate diverge from the DiD ATT as the secular trend grows. Run 100 simulations to see whether the naive estimate's bias is systematic or just noise.

Tab 3

Forest Plot

The post's actual estimates, interactively. Toggle between the 2×2 specification comparison and the inference-method (iid/HC1/CRV1/CRV3) comparison. Hover for SEs and CIs.

Tab 4

Event Study

The post's event-study coefficients with 95% CIs for 7 periods relative to treatment (t = −4 to 3, with t = −1 omitted as the reference). Pre-treatment leads near zero are consistent with parallel trends; post-treatment lags trace the dynamic effect.

Glossary (click a card if a term is unfamiliar)

Difference-in-Differences (DiD)
Compare the change in outcomes for the treated group to the change for an untreated control group. The difference of those two differences is the causal estimate. Nets out time-invariant level differences AND shared time trends.
Parallel trends
In the absence of treatment, treated and control would have followed the same trajectory. Differences in levels are fine; differences in slopes invalidate DiD.
ATT
Average Treatment effect on the Treated, E[Y(1)−Y(0) | D=1]. The mean causal effect for units that actually got treatment. DiD identifies the ATT (not the ATE) under parallel trends.
Counterfactual
The hypothetical outcome the treated would have had without treatment. Never observed. DiD constructs it as: treated pre-level + control's secular change.
Two-Way Fixed Effects (TWFE)
The regression implementation of DiD. Includes a fixed effect per unit and per period. The coefficient on the treatment-period interaction is the DiD estimate.
Naive before-after
The biased estimator: treated group's pre-vs-post change, ignoring the control group. Conflates the treatment effect with secular drift.
Event study
A dynamic DiD specification that estimates a separate treatment effect for each period relative to treatment. Pre-treatment leads test parallel trends; post-treatment lags trace dynamic effects.
SUTVA
Stable Unit Treatment Value Assumption: no spillovers between units, and a single version of the treatment. Required for the potential-outcomes framework to make sense.

DiD Simulator — see why DiD beats the naive comparison

We simulate a 2×2 panel: a treated group and a control group, each observed in a pre and post period. You set the true ATT, the secular trend (the common change that happens to both groups), and noise. The app then estimates the program's effect three ways: naive before-after, manual DiD, and TWFE regression. Watch the naive estimate inflate by the secular trend while the DiD estimate stays centered on the true ATT.

  • True ATT: the ground-truth effect of the program. DiD should recover this.
  • Secular trend: the common change in GPA over time (would have happened anyway). The naive estimator wrongly counts this as part of the effect.
  • Noise: an idiosyncratic shock to each school-period observation.
  • Schools per group: the number of schools in each group. Larger groups give tighter estimates.
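
A minimal sketch of what such a simulator might do, assuming normally distributed noise and hypothetical baseline GPA levels (the app's real data-generating process is not shown here, and every function name is made up for this example):

```python
import numpy as np

def simulate_2x2(true_att, trend, noise_sd, n_per_group, rng):
    """Simulate pre/post outcomes for treated and control schools."""
    # Hypothetical baseline levels; DiD allows the groups to differ in level.
    base_treated, base_control = 60.0, 70.0

    def draw(mean):
        return mean + rng.normal(0.0, noise_sd, size=n_per_group)

    return {
        "treated_pre": draw(base_treated),
        "treated_post": draw(base_treated + trend + true_att),
        "control_pre": draw(base_control),
        "control_post": draw(base_control + trend),
    }

def naive_estimate(d):
    """Treated post-mean minus treated pre-mean (ignores the control group)."""
    return d["treated_post"].mean() - d["treated_pre"].mean()

def did_estimate(d):
    """Treated change minus control change."""
    control_change = d["control_post"].mean() - d["control_pre"].mean()
    return naive_estimate(d) - control_change

rng = np.random.default_rng(0)
# Noise switched off so the arithmetic is easy to follow.
data = simulate_2x2(true_att=25.0, trend=10.0, noise_sd=0.0,
                    n_per_group=10, rng=rng)
naive = naive_estimate(data)   # true ATT + trend
did = did_estimate(data)       # true ATT
print(naive, did)
```

With the noise at zero, the naive estimate equals the true ATT plus the trend and the DiD estimate equals the true ATT. With noise switched on, both relationships hold only on average.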

Naive Before-After

Treated group's post-mean minus pre-mean — ignores the control group.

Estimate
SE
Bias (vs true ATT)

DiD (manual 2×2)

Treated change minus control change — strips out the secular trend.

Estimate
SE
Bias (vs true ATT)

What to look for

  • Set the secular trend to 0. The naive estimator and DiD agree — there is no common drift to confuse them.
  • Crank the secular trend up. The naive estimate inflates by the trend amount (exactly in expectation, approximately in any single noisy draw). The DiD estimate stays anchored at the true ATT.
  • Set the true ATT to 0 with a positive secular trend. Naive says the program "worked" when it did nothing. DiD correctly reports zero. This is the §4–5 message of the post: the comparison group is the rescue.
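
The regression route to the same number can be sketched with plain NumPy. This is an illustration under stated assumptions: it uses group and period dummies with an interaction term (the saturated 2×2 regression) rather than a full set of unit fixed effects, and all data-generating values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10  # schools per group (hypothetical)

# Stack the four cells: treated-pre, treated-post, control-pre, control-post.
treated = np.repeat([1.0, 1.0, 0.0, 0.0], n)
post = np.repeat([0.0, 1.0, 0.0, 1.0], n)

# Hypothetical outcome: level gap of 10, trend of 10, true ATT of 25.
y = (60.0 + 10.0 * (1 - treated) + 10.0 * post
     + 25.0 * treated * post + rng.normal(0.0, 2.0, size=4 * n))

# OLS of y on intercept, treated, post, and their interaction.
X = np.column_stack([np.ones_like(y), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_regression = beta[3]  # coefficient on treated x post

def cell_mean(g, p):
    """Mean outcome in the (group, period) cell."""
    return y[(treated == g) & (post == p)].mean()

did_manual = ((cell_mean(1, 1) - cell_mean(1, 0))
              - (cell_mean(0, 1) - cell_mean(0, 0)))
print(did_regression, did_manual)
```

Because the regression is saturated in the four cells, the interaction coefficient reproduces the manual difference of differences up to floating-point error.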

Bias vs. variance over many simulations

Single runs are noisy. Run the pipeline 100 times with fresh draws (same parameters, different noise) to see whether the naive bias is systematic.
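
A repeated-simulation check along these lines might look like the following sketch. The data-generating process (baseline levels, normal noise, independent draws per cell) is an assumption made for this example, not the app's actual pipeline.

```python
import numpy as np

def one_run(true_att, trend, noise_sd, n_per_group, rng):
    """One simulated 2x2 panel; returns (naive, did) estimates."""
    noise = rng.normal(0.0, noise_sd, size=(4, n_per_group))
    t_pre = 60.0 + noise[0]                      # hypothetical baselines
    t_post = 60.0 + trend + true_att + noise[1]
    c_pre = 70.0 + noise[2]
    c_post = 70.0 + trend + noise[3]
    naive = t_post.mean() - t_pre.mean()
    did = naive - (c_post.mean() - c_pre.mean())
    return naive, did

rng = np.random.default_rng(42)
true_att, trend = 25.0, 10.0
runs = np.array([one_run(true_att, trend, 5.0, 10, rng) for _ in range(100)])

# Average error of each estimator across the 100 runs.
naive_bias = runs[:, 0].mean() - true_att   # should sit near the trend
did_bias = runs[:, 1].mean() - true_att     # should sit near zero
print(round(naive_bias, 2), round(did_bias, 2))
```

Averaging over runs separates bias from noise: the naive error stays near the trend no matter how many runs you add, while the DiD error shrinks toward zero. In this setup the DiD estimate is the noisier of the two, since it differences four noisy cell means instead of two.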

The post's estimates — interactively

These numbers come straight from the post's regressions on the Corral & Yang (2024) tutoring panel. Switch between the two outcome groups (the 2×2 specification comparison and the SE-type comparison), and toggle individual methods to see how the point estimate and CI move. Hover any point for its SE and 95% CI.

What to look for

  • In "DiD estimate (2×2)", watch the Naive Before-After bar sit ~11 points above the others. That gap (36.20 − 25.32 = 10.88) is exactly the secular trend. DiD is what removes it.
  • Toggle to "TWFE under 4 SE types": all four point estimates collapse to 25.315. Only the CIs differ. Inference choice matters less than research design when the signal is this strong.
  • CRV3 widens the CI most (Bell-McCaffrey small-sample correction). With only 35 clusters, it is the safer default — but the conclusion is unchanged.
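
To see why the point estimate cannot move when only the SE type changes, here is a minimal NumPy sketch comparing classical (iid) and cluster-robust (CRV1) standard errors on simulated data. The data and the function name are hypothetical, the CRV1 formula uses the common G/(G−1) · (N−1)/(N−K) finite-sample factor, and CRV3 is not implemented here.

```python
import numpy as np

def ols_iid_and_crv1(X, y, clusters):
    """OLS coefficients with iid and CRV1 standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Classical variance: sigma^2 * (X'X)^-1
    V_iid = (resid @ resid) / (n - k) * XtX_inv

    # CRV1 sandwich: sum the scores within each cluster, then outer product.
    groups = np.unique(clusters)
    G = len(groups)
    meat = np.zeros((k, k))
    for g in groups:
        idx = clusters == g
        score = X[idx].T @ resid[idx]
        meat += np.outer(score, score)
    correction = G / (G - 1) * (n - 1) / (n - k)
    V_crv1 = correction * XtX_inv @ meat @ XtX_inv

    return beta, np.sqrt(np.diag(V_iid)), np.sqrt(np.diag(V_crv1))

# Simulated panel: 35 schools (10 treated), 2 periods, school-level shocks.
rng = np.random.default_rng(7)
n_schools, n_treated = 35, 10
school = np.repeat(np.arange(n_schools), 2)
post = np.tile([0.0, 1.0], n_schools)
treated = (school < n_treated).astype(float)
school_effect = rng.normal(0.0, 3.0, n_schools)[school]
y = (60.0 + 10.0 * post + 25.0 * treated * post
     + school_effect + rng.normal(0.0, 2.0, 2 * n_schools))

X = np.column_stack([np.ones_like(y), treated, post, treated * post])
beta, se_iid, se_crv1 = ols_iid_and_crv1(X, y, school)
print(beta[3], se_iid[3], se_crv1[3])
```

The coefficient vector is computed once and shared by both variance estimators, so only the standard errors (and hence the CIs) can differ.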

Why does the Naive estimator inflate so much?

The treated group's GPA rose by 36.20 points over the same period in which control schools' GPA rose by 10.88 points. The naive before-after attributes all of the 36.20 to the program. DiD attributes only the extra 25.32, the part that exceeds what would have happened anyway. Adding a single comparison group therefore avoids a 43% overstatement (36.20 / 25.32 ≈ 1.43).

Connecting back to Tab 2

The naive-vs-DiD gap you just explored on simulated data is exactly the gap that shows up on the real Corral & Yang panel:

  • Treated change: 36.20 GPA points (60.17 → 96.37).
  • Control change (= secular trend): 10.88 GPA points (71.22 → 82.10).
  • DiD ATT: 36.20 − 10.88 = 25.32 GPA points.

The simulator lets you set the trend yourself; the forest plot lets you verify that the post's actual numbers obey the same logic.
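
The arithmetic behind those three bullets, written out as a short Python check. The four group means are the ones quoted in the text; the variable names are just for illustration.

```python
# Group means quoted in the text (GPA points).
treated_pre, treated_post = 60.17, 96.37
control_pre, control_post = 71.22, 82.10

treated_change = treated_post - treated_pre      # naive before-after, about 36.20
control_change = control_post - control_pre      # secular trend, about 10.88
att = treated_change - control_change            # DiD estimate, about 25.32

# Counterfactual for the treated: their pre-level plus the control group's change.
counterfactual_post = treated_pre + control_change   # about 71.05

# How much the naive estimate overstates the DiD estimate, as a share.
overstatement = treated_change / att - 1             # about 0.43

print(f"{treated_change:.2f} {control_change:.2f} {att:.2f} "
      f"{counterfactual_post:.2f} {overstatement:.0%}")
```

Reading the ATT as observed post-level minus counterfactual post-level (96.37 − 71.05) gives the same 25.32.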

Event Study — testing parallel trends and tracing dynamics

The 2×2 design tells you whether the program worked. The event study tells you when the effect kicked in and whether it grew or faded. We estimate a separate treatment effect for each period relative to treatment, with t = −1 as the omitted reference. Pre-treatment coefficients near zero are consistent with parallel trends (they cannot prove it, because the assumption concerns unobserved post-treatment counterfactuals); post-treatment coefficients trace the dynamic effect.

What to look for

  • Pre-treatment leads (t = −4, −3, −2) hover around zero with CIs that comfortably include zero (p = 0.40, 0.47, 0.17). This is consistent with treated and control schools being on parallel trajectories before the program.
  • Post-treatment lags (t = 0, 1, 2, 3) jump to ≈25 immediately at t = 0 and stay there. No fade-out, no delayed onset. The program's effect is both immediate and sustained.
  • The omitted reference is t = −1 (normalised to zero by construction). All other coefficients are differences from that baseline.

Why is t = −1 missing?

Event-study designs always omit one period to avoid perfect collinearity with the unit fixed effects: the full set of relative-time indicators sums to the treated-group indicator, which the unit fixed effects already absorb. The convention is to drop t = −1 (the period just before treatment) and read every other coefficient as a difference from that baseline. In the chart above, t = −1 is fixed at 0 with no CI; it is the anchor.
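
The collinearity can be checked numerically. The sketch below builds a small hypothetical design matrix (6 units, 8 periods, a single treatment date; none of these values come from the post) and compares its rank with and without the t = −1 indicator.

```python
import numpy as np

# Hypothetical balanced panel: 6 units, 8 periods, first 3 units treated.
n_units, n_periods, treat_start = 6, 8, 4
unit = np.repeat(np.arange(n_units), n_periods)
period = np.tile(np.arange(n_periods), n_units)
treated = (unit < 3).astype(float)
rel_time = period - treat_start              # runs from -4 to 3

# Unit fixed effects, plus period fixed effects with the first period dropped.
unit_fe = (unit[:, None] == np.arange(n_units)).astype(float)
period_fe = (period[:, None] == np.arange(1, n_periods)).astype(float)

# One event-time indicator per relative period, switched on for treated units.
rel_values = np.arange(-treat_start, n_periods - treat_start)
event = treated[:, None] * (rel_time[:, None] == rel_values).astype(float)

# Design with ALL event-time indicators vs. with t = -1 omitted.
X_all = np.column_stack([unit_fe, period_fe, event])
X_ref = np.column_stack([unit_fe, period_fe, event[:, rel_values != -1]])

rank_all = np.linalg.matrix_rank(X_all)
rank_ref = np.linalg.matrix_rank(X_ref)
print(X_all.shape[1], rank_all, X_ref.shape[1], rank_ref)
```

With every indicator included, the matrix has one more column than its rank, so the coefficients are not identified. Dropping one indicator restores full column rank; dropping t = −1 in particular is a convention, not a mathematical requirement.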

Parallel-trends sanity check

Strong evidence against parallel trends would look like steadily rising (or falling) pre-treatment coefficients — a "smoking gun" pretrend. Here, the three pre-treatment leads bounce around zero (0.342, −0.322, 0.593) with no monotone pattern. The magnitudes are also tiny relative to the post-treatment jumps (≈25), so even if a small pretrend existed, it could not explain the 25-point gap.
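
Both checks in that paragraph are simple enough to script. The lead coefficients below are the ones quoted in the text; the post-treatment effect of 25 is the approximate value from the text, and the variable names are illustrative.

```python
# Pre-treatment lead coefficients as quoted in the text.
leads = [0.342, -0.322, 0.593]
post_effect = 25.0  # approximate post-treatment jump from the text

# A "smoking gun" pretrend would rise (or fall) at every step.
steps = [b - a for a, b in zip(leads, leads[1:])]
monotone = all(s > 0 for s in steps) or all(s < 0 for s in steps)

# Size of the largest lead relative to the post-treatment effect.
largest_share = max(abs(c) for c in leads) / post_effect

print(monotone, f"{largest_share:.1%}")
```

The leads change direction between steps, and the largest one is roughly 2% of the post-treatment effect. A check like this is a quick screen only; it does not replace looking at the CIs or the raw trajectories.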

In real applications, you should also look at the trajectories themselves (not just the regression coefficients): plot treated and control means by period and check that the lines really do run parallel before treatment. The regression is a test; the plot is the sanity check.