difference-in-differences | Carlos Mendez

Difference-in-Differences with Geocoded Microdata: When Distance Defines Treatment

Mon, 18 May 2026 00:00:00 +0000

1. Overview

What happens to home prices when a registered sex offender moves into a neighborhood — and, just as important, how do we know we measured it right? In a famous 2008 paper, Linden and Rockoff used a clever idea: compare homes very close to the offender’s address with homes a little farther away, before and after arrival. They concluded that prices inside one tenth of a mile dropped by about 7.5 %. But that conclusion rested on a single research design choice — the radius of the “treated” ring — and changing that radius changed the answer.

This tutorial reproduces and extends their analysis using two estimators in increasing order of flexibility. The first is the parametric ring DiD: collapse the data into “inner ring” (treated) and “outer ring” (control), first-difference the outcome, and fit a one-line regression. The second is the nonparametric ring DiD of Butts (2023), which uses the partitioning-based binscatter of Cattaneo, Crump, Farrell, and Feng to estimate a whole treatment-effect curve over distance instead of a single number. We will see that on the Linden-Rockoff data, the parametric ring DiD returns a price drop of −5.78 % at the canonical 0.1-mile cutoff. The nonparametric estimator, by contrast, says homes inside the first 300 feet drop by −20.6 %, and the effect fades to noise beyond ~0.094 mile. Both numbers are correct; they answer slightly different questions.

The post follows the methodology of Butts (2023) and reuses the cleaned Linden-Rockoff data from his replication archive. Where the paper is research-grade and compact, we trade some compactness for pedagogy — the same methods, the same data, but rearranged so a reader who has only seen the textbook 2 × 2 DiD can follow the argument step by step.

Learning objectives. After working through this tutorial you will be able to:

Understand why a point in space can serve as a natural experiment and what the “ring” approach is doing in plain language.
Implement the parametric ring DiD in R as a one-line feols() regression on first-differenced outcomes.
Estimate a treatment-effect curve nonparametrically with binsreg, without committing to a ring cutoff up front.
Assess the fragility of the parametric ring estimator when the inner-ring choice changes, on both simulated and real data.
Compare the parametric headline number with its nonparametric counterpart and articulate why the two can differ by a factor of two.

Key concepts at a glance

The tutorial leans on a small vocabulary repeatedly. The body sections assume you can move between these terms quickly. Each concept below has three parts: a definition, an example, and an analogy. If a later section mentions “ring choice” or “local parallel trends” and the term feels slippery, this is the section to re-read.

1. Ring DiD. A difference-in-differences design where the “treated” and “control” groups are defined by distance to a treatment point, not by policy assignment. Treated units sit inside a small radius around the point; control units sit in a donut just outside that radius.

Example

In the Linden-Rockoff data, an “offender” is the point. “Treated” homes are those sold within 0.1 mile of the offender’s eventual address ($\mathcal{D}_t$ in Butts’s notation); “control” homes are those between 0.1 and 0.3 mile ($\mathcal{D}_c$). The analysis sample inside 1/3 mile has 9,092 transactions; 1,093 of them are in the inner ring.

Analogy

A speaker on a stage is loud nearby and inaudible across the building. To measure how much louder the room got, compare the people sitting in the first five rows (“treated”) with the people in rows six through twenty (“control”) just before and just after the speaker started — not with the people in another building entirely.

2. Parametric ring estimator. A one-line regression of the first-differenced outcome on a “treated ring” indicator. Returns a single number: the average treatment effect inside the chosen inner ring, measured against the chosen outer ring as the counterfactual trend.

Example

In R: feols(delta_log_price ~ inside_0_1_mi | srn_year, cluster = "neighborhood"). On the Linden-Rockoff sample with inner ring (0, 0.1] and outer ring (0.1, 0.3], the coefficient is −0.0595 log-points = −5.78 % with cluster-robust SE 0.0225.

Analogy

It is like answering “how much did the average classroom temperature change when we opened a window” with one number for the rows near the window and one for the back of the room. You get a clean summary — but you have already decided where the “near” zone ends.

3. Nonparametric ring estimator (binsreg). Instead of one inner-ring number, the estimator partitions distance into a sequence of data-driven, quantile-spaced bins and reports a separate $\hat{\tau}$ in each bin. The output is a step function over distance.

Example

On the Linden-Rockoff data, binsreg carves the (0, 0.3] mile sample into 23 quantile-spaced bins. Bin 1 (roughly the first 300 feet) returns $\hat{\tau} = -20.6\%$; bin 2 returns $-15.2\%$; bins 3 through 4 are not significantly different from zero.

Analogy

Instead of asking “is it warmer near the window, yes or no?”, you walk a thermometer from window to wall in equal-population steps and write down the reading at each step. You end with a temperature curve rather than a single label.

4. ATT and the ring choice. The parameter estimated by the ring DiD is the average treatment effect among the treated, $E[\tau(d) \mid d \le \bar{d}]$. Crucially, $\bar{d}$ enters this expression. Change the inner-ring cutoff and you have changed the estimand, not just the precision.

Example

On Linden-Rockoff, the parametric ATT goes from −6.40 % at cutoff 0.05 mi, to −5.45 % at 0.10 mi, to −4.21 % at 0.15 mi: a gap of 2.19 percentage points, about 52 % of the smallest estimate, driven entirely by the researcher’s choice of $\bar{d}$.

Analogy

“What fraction of voters in the city support a policy?” depends on where you draw the city limits. Move the boundary by a few blocks and you can change the answer. The boundary is not nuisance; it is part of the question.

5. Local parallel trends. The identifying assumption for the ring approach: absent treatment, the average change in outcomes would have been the same in the inner and outer ring. Formally (Butts 2023, Assumption 2), $E[\Delta Y_{i}(0) \mid d \le \bar{d}] = E[\Delta Y_{i}(0) \mid d > \bar{d}]$.

Example

For the Linden-Rockoff design to identify the causal effect of arrival, the neighborhood trend in inner-ring prices — absent the offender — must match the trend in outer-ring prices. The nonparametric estimator’s behavior past 0.1 mile (point estimates oscillating around zero) is the closest informal pre-trend test the cross-sectional data admit.

Analogy

Two students sitting in the same lecture hall normally take notes at similar speeds. If one is suddenly handed a coffee, you can compare their notes — as long as nothing else differentially affected the two seats that day. Local parallel trends is the “nothing else” part.

6. Sample-weighted ATT. When summarizing a step function into a single inner-ring scalar, average $\hat{\tau}(d)$ weighted by the number of observations in each bin, not by the number of bins. Two estimators that look similar on the curve can give noticeably different scalars if one bin is very wide and another is very narrow.

Example

A bin-equal-weight average of the first four nonparametric bins yields −11.4 %. Re-weighting by observations inside 0.1 mile (the sample-weighted ATT used in this post) shifts it to −12.4 %. Same data, different summary, and the second significant figure moves.

Analogy

If you average the temperatures of three rooms in a building, the answer depends on whether you weight each room equally or weight by how many people are in each room. A packed lecture hall counts more than an empty closet.
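The difference between the two averaging schemes is easy to see on a toy example. The sketch below uses three made-up bins (the estimates and counts are hypothetical, not taken from the Linden-Rockoff output) and compares the equal-weight mean with the observation-weighted mean.

```r
# Hypothetical bin-level estimates (log points) and bin sample sizes.
tau_hat <- c(-0.20, -0.10, 0.00)
n_obs   <- c(50, 150, 800)

# Equal weight: every bin counts the same, regardless of its size.
att_equal <- mean(tau_hat)

# Sample weight: each bin counts in proportion to its observations.
att_weighted <- weighted.mean(tau_hat, w = n_obs)

print(c(equal = att_equal, weighted = att_weighted))
```

With these made-up numbers the equal-weight mean is −0.10 while the sample-weighted mean is −0.025, because the large, unaffected bin dominates once observations are counted.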

Methodological flow

The diagram below is the roadmap for everything that follows. The script (and the body of this post) starts in the safe world of simulation, where we know the right answer, and only then steps onto Linden and Rockoff’s real-world data, where we do not.

flowchart TD
A["Step 1<br/>Toy ring geometry"] --> B["Step 2<br/>2×2 DiD recap"]
B --> C["Step 3<br/>Simulated DGP<br/>true τ-curve known"]
C --> D["Step 4<br/>Parametric ring DiD<br/>one number per ring"]
C --> E["Step 5<br/>Ring-choice fragility<br/>same data, 3 answers"]
C --> F["Step 6<br/>Nonparametric ring DiD<br/>whole TE curve"]
D --> G["Step 7<br/>Linden-Rockoff data<br/>9,092 home sales"]
E --> G
F --> G
G --> H["Steps 8–10<br/>Bandwidth, parametric,<br/>nonparametric on real data"]
H --> I["Result<br/>−5.78% parametric<br/>−20.6% nonparametric (bin 1)"]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#6a9bcc,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#6a9bcc,stroke:#141413,color:#fff
style E fill:#d97757,stroke:#141413,color:#fff
style F fill:#00d4c8,stroke:#141413,color:#141413
style G fill:#d97757,stroke:#141413,color:#fff
style H fill:#6a9bcc,stroke:#141413,color:#fff
style I fill:#00d4c8,stroke:#141413,color:#141413

The first two steps build the spatial intuition and recall the textbook 2 × 2 DiD so we can re-cast the ring DiD as the same machinery with distance-defined groups. Steps 3–6 use a simulated data-generating process (DGP) where we know the true treatment-effect curve, so the estimators can be judged against ground truth. Steps 7–10 carry the same estimators onto the Linden-Rockoff data and reconcile what the two estimators say about a real neighborhood.

2. Setup and packages

The script uses pacman::p_load() so that any missing package is installed from CRAN on first run. We set a single global seed at the top, so every simulated number in the post is reproducible.

set.seed(42)
if (!require("pacman")) {
  install.packages("pacman", repos = "https://cloud.r-project.org")
}
pacman::p_load(
  tidyverse, fixest, haven, data.table,
  binsreg, KernSmooth, lpridge,
  ggplot2, patchwork, sf, glue, scales, broom
)

The two workhorse packages are fixest for fast fixed-effects regressions (the feols() function) and binsreg for the data-driven binscatter that powers the nonparametric estimator.

The data live in Butts’s replication archive. The script reads them from GitHub raw, with a local-file fallback so the code runs even before this post is pushed:

data_url <- paste0(
  "https://raw.githubusercontent.com/cmg777/",
  "starter-academic-v501/master/content/post/",
  "r_did_ring/linden_rockoff.dta"
)
linden_rockoff <- tryCatch(
  haven::read_dta(data_url),
  error = function(e) haven::read_dta("linden_rockoff.dta")
)

This pattern — try GitHub, fall back to local — means the same script runs in three places without edits: on a fresh clone, in a Quarto notebook, or in a Google Colab session.
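The try-then-fall-back idea generalizes to any pair of sources. The following is a minimal sketch, assuming a hypothetical helper named read_with_fallback() that takes two zero-argument functions; it is not part of the post's script and uses only base R.

```r
# Generic "try the first source, fall back to the second" helper.
# `primary` and `fallback` are zero-argument functions (hypothetical names).
read_with_fallback <- function(primary, fallback) {
  tryCatch(
    primary(),
    error = function(e) {
      message("Primary source failed (", conditionMessage(e), "); using fallback.")
      fallback()
    }
  )
}

# Toy demonstration: the primary source always fails, so the fallback is used.
result <- read_with_fallback(
  primary  = function() stop("network unavailable"),
  fallback = function() data.frame(id = 1:3)
)
print(result)
```

In the setting of this post, primary would wrap the remote haven::read_dta(data_url) call and fallback would wrap the local read.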

3. Step 1 — Picturing the design: who is treated, who is control, who is irrelevant

Before any regression, it helps to see the design on paper. We scatter 2,000 random “homes” inside a 1.5 × 1.5 unit square, drop a treatment point at the center, and color homes by their ring membership: inside the treated disk of radius 0.2, inside the control donut from 0.2 to 0.5, or too far away to enter the comparison.

n_points <- 2000
points <- tibble(
  x = runif(n_points, -0.75, 0.75),
  y = runif(n_points, -0.75, 0.75)
) |>
  mutate(
    dist = sqrt(x^2 + y^2),
    group = case_when(
      dist <= 0.2 ~ "Treated (inner ring)",
      dist <= 0.5 ~ "Control (outer ring)",
      TRUE ~ "Not used"
    )
  )

[Section 1] Toy spatial layout
Total points: 2000
Control (outer ring) Not used Treated (inner ring)
566 1308 126

Toy ring geometry: 126 treated, 566 control, 1,308 dropped out of 2,000 random points.

Out of 2,000 random homes, only 126 (6.3 %) fall inside the treated ring and 566 (28.3 %) fall inside the outer control ring; the remaining 1,308 (65.4 %) are too far away to enter the analysis. This 6 / 28 / 65 split is the price of the ring approach: identification rests on a small treated group, a moderate control group, and a large number of “irrelevant” observations whose only role here is to remind us that distance, not policy assignment, defines who is in and who is out. With smaller samples this can hurt; with the Linden-Rockoff data set (170,239 home sales, of which 9,092 are within 1/3 mile of some offender), the inner ring still has about 1,100 transactions and the design is feasible.
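The split can also be anticipated from geometry alone, since points are uniform on the square and each group's expected share is its area divided by the square's area. A short check, using only the dimensions stated above:

```r
side      <- 1.5   # side length of the square
r_treated <- 0.2   # radius of the treated disk
r_control <- 0.5   # outer radius of the control donut
n_points  <- 2000

# Expected share of each group = its area / area of the square.
area_square   <- side^2
share_treated <- pi * r_treated^2 / area_square
share_control <- pi * (r_control^2 - r_treated^2) / area_square
share_unused  <- 1 - share_treated - share_control

expected <- n_points * c(
  treated = share_treated,
  control = share_control,
  unused  = share_unused
)
print(round(expected, 1))
```

The expected counts are roughly 112, 586, and 1,302, so the realized 126 / 566 / 1,308 split differs from the geometric expectation only by sampling noise.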

4. Step 2 — A quick refresher: the 2 × 2 DiD in 4 cells

Every ring DiD is built on the same 2 × 2 difference-in-differences logic you have probably seen for a textbook policy reform. The estimand is the average treatment effect among the treated:

$$\tau = E[\Delta Y \mid \text{treated}] - E[\Delta Y \mid \text{control}].$$

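In words, this says: the average change in outcome for the treated group, minus the average change in outcome for the control group, which is a difference of differences. Mapped to code, $\Delta Y$ is delta_y (the first-differenced outcome) and “treated” is a 0/1 indicator. There are two algebraically equivalent ways to estimate $\tau$. The first regresses the first-differenced outcome on the treatment indicator, $\Delta Y_i = \beta_0 + \tau \cdot D_i + u_i$. The second keeps both periods and adds unit and period fixed effects: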

$$Y_{it} = \alpha_i + \gamma_t + \tau \cdot D_i \cdot P_t + \varepsilon_{it}.$$

This two-way fixed-effects (TWFE) form says: each unit $i$ has its own price level $\alpha_i$, each period $t$ has its own trend $\gamma_t$, and $\tau$ captures the extra movement experienced by treated units in the post period. The TWFE coefficient on the interaction $D_i \cdot P_t$ is the same number you would get by regressing $\Delta Y$ on $D$ alone on a first-differenced panel. Section 2 of the script verifies this on a 500-unit panel with a true effect of 0.30:

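panel <- tibble(
  i     = rep(1:500, each = 2),
  t     = rep(c(0, 1), 500),
  treat = rep(rbinom(500, 1, 0.5), each = 2),
  y     = rnorm(1000) + 0.3 * (treat * (t == 1))
)
# First differences: one row per unit, delta_y = y(post) - y(pre)
panel_fd <- panel |>
  group_by(i, treat) |>
  summarise(delta_y = y[t == 1] - y[t == 0], .groups = "drop")
fd <- feols(delta_y ~ treat, data = panel_fd)
# Two-way FE on the stacked panel; the interaction is the DiD term
twfe <- feols(y ~ I(treat * (t == 1)) | i + t, data = panel)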

[Section 2] Classical 2x2 DiD (true effect = 0.3)
(a) first-differences coefficient: 0.31 (SE 0.026)
(b) two-way FE coefficient : 0.31 (SE 0.026)

Estimator	Estimate	SE	True effect
First-differences (`feols(delta_y ~ treat)`)	0.3097	0.0258	0.30
Two-way FE (`feols(y ~ treat:post | i + t)`)	0.3097	0.0258	0.30

The two estimators return numerically identical point estimates (0.3097 to four decimals) and SEs (0.0258), both within one SE of the true 0.30. The equivalence is algebraic, not approximate, and it is the reason the ring DiD can be written as a one-line regression on first-differenced outcomes (next section). Everything that follows is “2 × 2 DiD, but the groups are defined by distance instead of by policy assignment.”
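The equivalence can be checked without any extra packages. The sketch below uses base R's lm() with explicit unit dummies as a stand-in for feols(), on a smaller simulated panel; the variable names and the DGP (unit effects plus a common period shift of 0.5) are assumptions made for illustration.

```r
set.seed(1)
n     <- 200
treat <- rbinom(n, 1, 0.5)
alpha <- rnorm(n)                                 # unit fixed effects
y0    <- alpha + rnorm(n)                         # pre-period outcome
y1    <- alpha + 0.5 + 0.3 * treat + rnorm(n)     # post-period outcome, true effect 0.3

# (a) First differences: regress the change in y on the treatment dummy.
fd_fit <- lm(I(y1 - y0) ~ treat)

# (b) Two-way fixed effects on the stacked panel, with unit dummies.
panel <- data.frame(
  i     = factor(rep(seq_len(n), times = 2)),
  post  = rep(c(0, 1), each = n),
  treat = rep(treat, times = 2),
  y     = c(y0, y1)
)
panel$did <- panel$treat * panel$post
twfe_fit  <- lm(y ~ did + post + i, data = panel)

tau_fd   <- unname(coef(fd_fit)["treat"])
tau_twfe <- unname(coef(twfe_fit)["did"])
print(c(first_diff = tau_fd, twfe = tau_twfe))
```

On a balanced two-period panel the two coefficients agree up to floating-point error, whatever the seed.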

5. Step 3 — A simulated world where we know the right answer

To judge the estimators fairly, we first build a world where the truth is known. We draw 10,000 units, give each a distance $d$ uniform on $[0, 1.5]$ miles, and define the true treatment-effect curve as a smooth exponential that vanishes exactly at 0.75 mile:

$$\tau(d) = 1.5 \cdot \exp(-2.3 \cdot d) \cdot \mathbf{1}\{d \le 0.75\}.$$

In words, this says: the treatment effect is largest right at the offender ($\approx +1.5$ at $d = 0$), decays smoothly with distance to about 0.27 at $d = 0.75$, and is exactly zero beyond 0.75 mile. The number 0.75 is what Butts calls $d_t$, the maximum distance at which treatment effects are felt. The population average of $\tau(d)$ over the affected region $[0, 0.75]$ is the integral of $\tau(d)$ divided by 0.75, which evaluates to about 0.715. The script reports 0.726 for the average true effect among the simulated units with $d \le 0.75$, a value that differs slightly from the integral, as a finite-sample average would. That 0.726 is the benchmark every estimator below has to recover.
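The curve and its average over the affected region are easy to code up directly. The sketch below defines the curve as an R function (the name tau_true is an assumption) and computes the population average two ways, numerically and in closed form. These are population quantities; an average taken over a finite set of simulated distances will typically differ from them in the second or third decimal.

```r
# True treatment-effect curve: exponential decay, cut to zero past 0.75 mile.
tau_true <- function(d) 1.5 * exp(-2.3 * d) * (d <= 0.75)

d_t <- 0.75

# Numerical average over [0, d_t].
avg_numeric <- integrate(tau_true, lower = 0, upper = d_t)$value / d_t

# Closed form: (1.5 / 2.3) * (1 - exp(-2.3 * d_t)) / d_t.
avg_closed <- (1.5 / 2.3) * (1 - exp(-2.3 * d_t)) / d_t

print(c(numeric = avg_numeric, closed_form = avg_closed))
print(tau_true(c(0, 0.75, 0.76)))   # value at the address, at the edge, just past it
```

Both routes give about 0.715, and the last line shows the jump from roughly 0.27 at the edge to exactly zero just beyond it.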

True treatment-effect curve $\tau(d) = 1.5 \cdot \exp(-2.3 \cdot d)$, zero past 0.75 mile; mean over the affected region equals 0.726.

[Section 3] Simulated DGP for the parametric ring estimator
n units: 10000
Average true TE among d <= 0.75 mi: 0.726

The orange curve in the figure is $\tau(d)$, and the grey baseline is the counterfactual trend (zero everywhere in this simulation). Pedagogically, this is the cleanest case: the treatment effect is monotonically decreasing in distance, strictly positive out to $d_t = 0.75$, and exactly zero beyond. A real-world spatial treatment will rarely have such a clean shape, but the point is to ask: do our estimators recover this benchmark when the answer is known?

6. Step 4 — The parametric ring estimator on simulated data

The parametric ring DiD is a one-line feols() call on first-differenced outcomes (or, equivalently, the TWFE form). Given a correct inner-ring choice — inner $= (0, 0.75]$, outer $= (0.75, 1.5]$ — the estimator should average the true $\tau(d)$ across the inner ring and return 0.726.

The body of parametric_ring_panel() (and its Linden-Rockoff sibling parametric_ring_lr(), plus the nonparametric helper nonparametric_ring_cs() used later) lives in analysis.R; each is a thin wrapper around a single feols() or binsreg::binsreg() call. The snippets below show the call signature, not the helper body.

ring_dgp <- ring_data |>
  mutate(treat_ring = as.integer(dist <= 0.75)) |>
  feols(delta_y ~ treat_ring, cluster = "neighborhood", data = _)

 Parametric ring DiD (rings = 0, 0.75, 1.5):
tau_hat = 0.726 SE = 0.005 truth = 0.726

Parametric ring DiD at the correct cutoff recovers the truth: $\hat{\tau} = 0.726$, 95 % CI $[0.716, 0.736]$.

Bin	Distance interval (mi)	τ̂	SE	95% CI
1	(0, 0.75]	0.726	0.005	[0.716, 0.736]
2	(0.75, 1.5]	0.000	0.000	[0.000, 0.000]

Given the correct ring choice, the parametric estimator recovers the true average treatment effect to three decimal places: $\hat{\tau} = 0.726$, $\mathrm{SE} = 0.005$, with a 95 % CI of $[0.716, 0.736]$ centered exactly on the truth. The outer-ring coefficient is normalized to zero by construction, because the outer ring is what the estimator defines as the counterfactual trend. This is the strongest possible internal validity check: when the inner ring is set to the exact distance at which treatment effects vanish, the parametric ring DiD is unbiased. The catch is that we know 0.75 only because we wrote the DGP ourselves. In a real application, $d_t$ is the very thing we are trying to learn.

7. Step 5 — Why ring choice is part of the question

Hold the data, the seed, and the regression fixed, and re-run the same parametric estimator with three different inner-ring cutoffs: $\bar{d} = 0.30$ (too narrow), $\bar{d} = 0.75$ (correct), and $\bar{d} = 1.20$ (too wide).

choices <- tibble(
  cut_inner = c(0.30, 0.75, 1.20),
  label = c("Too narrow", "Correct", "Too wide")
)
ringchoice <- choices |>
  rowwise() |>
  mutate(fit = list(parametric_ring_panel(ring_data, cut_inner)))

[Section 4] Ring-choice sensitivity on simulated data
# A tibble: 3 × 5
choice tau_hat se ci_lower ci_upper
1 Correct: (0, 0.75] 0.726 0.00512 0.716 0.736
2 Too narrow: (0, 0.30] 0.913 0.00598 0.902 0.925
3 Too wide: (0, 1.20] 0.456 0.0102 0.436 0.476

Same data, three ring choices: 0.913 (too narrow), 0.726 (correct), 0.456 (too wide). In both bad cases the 95 % CI excludes the truth.

Choice	τ̂	SE	95% CI	Direction of bias
Correct: (0, 0.75]	0.726	0.005	[0.716, 0.736]	none — recovers the truth
Too narrow: (0, 0.30]	0.913	0.006	[0.902, 0.925]	upward: averages the steepest part of $\tau(d)$
Too wide: (0, 1.20]	0.456	0.010	[0.436, 0.476]	toward zero: absorbs unaffected units

Same data, three answers. With a too-narrow inner ring the estimator returns 0.913 — a +25.7 % upward bias, because we are averaging only the steepest part of the $\tau(d)$ curve and missing the slower decay. With a too-wide inner ring the estimator returns 0.456 — a −37.1 % attenuation, because we are absorbing many units with literally zero treatment effect into the “treated” group and diluting the average. Neither number is sampling noise: both 95 % CIs strictly exclude the truth (0.726). The lesson the simulated experiment teaches before we even touch Linden-Rockoff is that ring choice is part of the estimand, not just a precision lever. Pick a different ring, and the parametric estimator literally answers a different causal question. This is why we need a second estimator.
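The mechanics behind the three answers can be reproduced in a few lines of base R. For a cutoff $\bar{d}$, the ring DiD on first differences is a difference in means, which converges to $E[\tau(d) \mid d \le \bar{d}] - E[\tau(d) \mid d > \bar{d}]$, so both the inner average and any treated units left in the outer ring move the estimate. The sketch below is an independent re-simulation, not the post's script: the noise level (sd 0.25), the zero counterfactual trend, and the helper name ring_did() are assumptions, so the numbers will be close to, but not identical to, those in the table.

```r
set.seed(42)
n <- 10000
d <- runif(n, 0, 1.5)
tau_true <- function(d) 1.5 * exp(-2.3 * d) * (d <= 0.75)

# Assumed DGP: no counterfactual trend, Gaussian noise with sd 0.25.
delta_y <- tau_true(d) + rnorm(n, sd = 0.25)

# Hypothetical helper: mean change inside the cutoff minus mean change outside it.
ring_did <- function(cut_inner) {
  inner <- d <= cut_inner
  mean(delta_y[inner]) - mean(delta_y[!inner])
}

cutoffs   <- c(narrow = 0.30, correct = 0.75, wide = 1.20)
estimates <- sapply(cutoffs, ring_did)
print(round(estimates, 3))
```

The ordering narrow > correct > wide appears for the same reasons as in the text: the narrow ring averages the steep part of the curve, while the wide ring mixes in units with zero effect.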

8. Step 6 — Letting the data choose: the nonparametric estimator

Where the parametric estimator gives one number, Butts’s nonparametric estimator gives a whole step function. The idea, formalized in Cattaneo, Crump, Farrell, and Feng (2024), is to partition the support of distance into $L$ quantile-spaced bins, fit a flat constant inside each bin, and difference each bin’s average from the average of the last (presumed-untreated) bin. The number of bins $L$ is chosen by the data via a mean-squared-error criterion in binsreg.

np_sim <- binsreg::binsreg(
  y = ring_data$delta_y,
  x = ring_data$dist,
  randcut = NULL,
  cb = c(3, 3),
  noplot = TRUE
)

The nonparametric estimator recovers the whole TE curve from data alone — 53 quantile-spaced bins, no cutoff committed up front; left-most bin $\hat{\tau} = 1.461$ vs truth 1.5.

[Section 5] Nonparametric ring estimator on simulated DGP
Number of distance bins: 53
TE estimate in left-most bin: 1.461

On the simulated DGP with $n = 10{,}000$ units, binsreg chooses 53 quantile-spaced bins. The left-most bin (about $[0, 0.025]$ mi) returns $\hat{\tau} = 1.461$ — within one SE of the truth at $d = 0$, which is 1.5. Successive bins step monotonically downward as we move outward, eventually crossing zero around 0.75 mile where the true $\tau(d)$ vanishes. We never had to commit to a ring cutoff up front; the data revealed the shape of the curve. The price is that we now have 53 noisy bin estimates instead of one tidy headline, and CIs widen as the bins get narrower in the tails. But the methodological payoff is exactly the rebuttal to Step 5: when the data are rich enough, the answer to “which ring should I pick?” is “you don’t have to.”
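To make the binning logic concrete, here is a hand-rolled stand-in for the estimator in base R. It is a simplification, not binsreg: the number of bins is fixed at 20 by hand rather than chosen by a data-driven criterion, no confidence bands are computed, and the DGP is an assumed re-simulation (noise sd 0.25). It only illustrates the "bin means minus the last bin's mean" step.

```r
set.seed(42)
n <- 10000
d <- runif(n, 0, 1.5)
tau_true <- function(d) 1.5 * exp(-2.3 * d) * (d <= 0.75)
delta_y  <- tau_true(d) + rnorm(n, sd = 0.25)   # assumed noise level

# Quantile-spaced bins: each bin holds roughly the same number of units.
n_bins <- 20
breaks <- quantile(d, probs = seq(0, 1, length.out = n_bins + 1))
bin    <- cut(d, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Flat constant per bin, then difference from the last (presumed-untreated) bin.
bin_means <- tapply(delta_y, bin, mean)
tau_hat   <- as.numeric(bin_means - bin_means[n_bins])

# Bin midpoints for plotting or inspection.
bin_mid <- (head(breaks, -1) + tail(breaks, -1)) / 2
print(round(data.frame(mid = as.numeric(bin_mid), tau_hat = tau_hat), 3))
```

The printed step function starts well above 1 in the first bin, declines with distance, and sits near zero in the bins past 0.75 mile, without any cutoff having been supplied.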

9. Step 7 — Linden and Rockoff: a real neighborhood, a real arrival

We now leave the safe world of simulation and walk the same estimators onto Linden and Rockoff’s data: 170,239 home transactions in North Carolina, geocoded relative to the eventual addresses of registered sex offenders. The analysis sample is the 9,092 sales within 1/3 mile of an offender’s address. Each transaction records the log sale price, the distance to the offender, and whether the sale closed before or after the offender’s arrival.

linden_rockoff <- haven::read_dta("linden_rockoff.dta") |>
  filter(offender == 1) |>
  mutate(
    dist_mi = dist / 5280,  # original distance in feet
    inner   = as.integer(dist_mi <= 0.1),
    post    = as.integer(t_to_arrival > 0)
  )

[Section 6.2] Linden-Rockoff data
Rows: 170239 Cols: 51
Analysis sample (offender == 1): 9092
Mean log price: 11.73
Distance summary (miles): min 0.009 median 0.224 max 0.333

Ring	Pre-arrival	Post-arrival	Total
Inner (≤ 0.1 mi)	499	594	1,093
Outer (0.1 – 0.3 mi)	3,998	4,001	7,999
Total	4,497	4,595	9,092

The 2 × 2 cell counts above are the entire foundation of the analysis. Only 1,093 sales (12 %) fall in the inner treated ring at or under 0.1 mile, split nearly evenly between pre- and post-arrival (499 vs 594). The outer control ring carries 7,999 sales (88 %), also nearly balanced across the cutoff date. Median distance is 0.224 mile and the support runs from 0.009 mile (essentially adjacent to the offender’s address) to 0.333 mile (the outer boundary). The treated cells are small but not tiny; this is what makes the nonparametric estimator viable even on a single neighborhood’s worth of data.

Linden-Rockoff raw price gradient: a \$20–25K gap inside 0.1 mile, closing monotonically with distance.

Before any estimator runs, the raw price gradient already tells the story. Inside 0.1 mile of the offender’s eventual address, the pre-arrival kernel-smoothed average home price stays near \$145–\$150K out to the treated-ring boundary. The post-arrival smoother dips to roughly \$122K at $d \approx 0.01$ mi and climbs back to about \$140K by 0.1 mile, a visible gap of \$20–25K at the offender’s address that closes monotonically with distance. Outside 0.1 mile the two curves overlap. The descriptive plot is the visual argument that motivates the entire ring DiD design: the pre curve is what inner-ring sales “would have looked like” absent the offender; the post curve is what they actually look like; the area between them inside 0.1 mile is the treatment effect. The plot also justifies the choice of ~0.1 mile as the conventional treated radius — it is the eyeball point where the two curves reconverge.

10. Step 8 — Bandwidth fragility: why eyeballing the cutoff is risky

The raw-gradient plot above used one specific bandwidth choice (0.075 mile). What happens if we move it?

The snippet below is illustrative — dist, price_pre, price_post, and grid are placeholder names for the distance vector, the pre- and post-arrival prices, and the evaluation grid; analysis.R defines them concretely.

bws <- c(0.025, 0.075, 0.125)
smooth_panels <- bws |>
  map_dfr(function(b) {
    # lpepa() takes the bandwidth as `bandwidth` and returns the fit in `$est`
    pre  <- lpridge::lpepa(dist, price_pre,  bandwidth = b, x.out = grid)
    post <- lpridge::lpepa(dist, price_post, bandwidth = b, x.out = grid)
    tibble(dist = grid, pre = pre$est, post = post$est, bw = b)
  })

Same data, three smoothing bandwidths — implied treated radius shifts from ~0.10 mi (bw 0.025) to ~0.20 mi (bw 0.125).

At bandwidth 0.025 mi (very local), the post curve dips sharply below the pre curve only inside about 0.10 mile and recovers fast — you might read off a treated radius of 0.10 by eye. At bandwidth 0.075 mi (the default used above), the gap extends out to about 0.15 mile before closing. At bandwidth 0.125 mi (heavy smoothing), the curves diverge gently across the entire panel out to 0.30 mile, suggesting a treated radius of about 0.20 mile. Same data, three smoothers, three different visual answers about how far the treatment effect extends. This is the bandwidth-version of the ring-choice fragility lesson from Step 5, now staring at us in real-world data. The figure is the empirical case for not picking a ring cutoff by inspection of a smoothed gradient — and the motivation for the more principled methods that follow.
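The same phenomenon shows up on synthetic data where the true radius is known. The sketch below is an illustration under stated assumptions: it swaps in base R's stats::ksmooth() (a Nadaraya-Watson kernel smoother) for lpridge::lpepa(), simulates a sharp −0.2 effect that ends exactly at 0.1 mile, and reads off an "apparent radius" as the farthest grid point where the smoothed gap is still below −0.02. The threshold, the helper name apparent_radius(), and the simulated data are all hypothetical.

```r
set.seed(42)
n <- 5000
dist_sim <- runif(n, 0, 0.3)

# Assumed truth: a sharp -0.2 effect inside 0.1 mile, zero beyond.
gap_sim <- ifelse(dist_sim < 0.1, -0.2, 0) + rnorm(n, sd = 0.05)
grid    <- seq(0.01, 0.29, by = 0.01)

# Farthest grid distance at which the smoothed gap is still below the threshold.
apparent_radius <- function(bw, threshold = -0.02) {
  fit <- ksmooth(dist_sim, gap_sim, kernel = "normal",
                 bandwidth = bw, x.points = grid)
  max(fit$x[fit$y < threshold])
}

radii <- sapply(c(bw_0.025 = 0.025, bw_0.075 = 0.075, bw_0.125 = 0.125),
                apparent_radius)
print(radii)
```

Even though the true radius is 0.1 mile in every run, the apparent radius grows with the bandwidth, because heavier smoothing smears the inner-ring effect outward.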

11. Step 9 — Parametric ring DiD on Linden-Rockoff (and the ring-choice wobble)

We now run the parametric estimator on the real data at the canonical inner-ring cutoff of 0.1 mile.

lr_default <- feols(
  delta_log_price ~ close_post_move | srn_year,
  cluster = "neighborhood",
  data = linden_rockoff
)

[Section 6.5] Parametric ring DiD on Linden-Rockoff
close_post_move coefficient: -0.0595 SE = 0.0225
Interpreted as a percent change: -5.78%

Parametric ring DiD on Linden-Rockoff at the canonical 0.1 mi: ATT = −5.78 %, 95 % CI $[-0.104,\, -0.015]$ in log points, n = 9,029.

Inner ring	Outer ring	ATT (log)	ATT (%)	SE	95% CI (log)	N
(0, 0.1]	(0.1, 0.3]	−0.0595	−5.78 %	0.0225	[−0.104, −0.015]	9,029

At the canonical 0.1-mile inner ring (matching Linden and Rockoff’s original choice and Butts’s replication setup), the parametric ring DiD delivers a −0.0595 log-point coefficient on close_post_move, with cluster-robust SE 0.0225 (clustered at the neighborhood level) and a 95 % CI of $[-0.104,\, -0.015]$ log points (roughly −9.8 % to −1.5 % once exponentiated) that strictly excludes zero. Here “cluster-robust” means the standard-error formula allows residuals to be correlated within neighborhoods rather than assuming every transaction is statistically independent; cluster-robust SEs are usually a little larger than SEs that assume independent errors and are the right choice when nearby homes plausibly share unobserved local shocks. In percent terms, this is an average price drop of 5.78 % for homes inside 0.1 mile of an offender’s address after the offender arrives. Butts (2023, p. 5) reports this magnitude as “homes between 0 and 0.1 miles decline in value by about 7.5%”; our −5.78 % is about 1.7 percentage points smaller in magnitude than his approximate number, but his figure sits comfortably within our cluster-robust CI, and the gap is inside the spread we will see across reasonable ring choices in the next paragraph. The qualitative answer agrees with the published paper, and the headline magnitude is statistically indistinguishable from it.
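The conversion from log points to percent is $100 \cdot (\exp(b) - 1)$, and the same transformation applies to the confidence-interval endpoints. The sketch below applies it to the coefficient and SE printed above; it assumes a normal-approximation interval (critical value about 1.96), which may differ slightly from the small-sample correction a regression package applies. Note that exponentiating the lower endpoint gives a somewhat smaller magnitude than reading log points × 100 as a percent.

```r
b  <- -0.0595   # coefficient in log points (from the output above)
se <- 0.0225    # cluster-robust standard error

# Log points -> percent change.
log_to_pct <- function(x) 100 * (exp(x) - 1)

# Normal-approximation 95% CI in log points (an assumption; see lead-in).
ci_log <- b + c(-1, 1) * qnorm(0.975) * se

pct    <- log_to_pct(b)
ci_pct <- log_to_pct(ci_log)

print(round(c(estimate = pct, lower = ci_pct[1], upper = ci_pct[2]), 2))
```

The point estimate comes out at −5.78 %, with an exponentiated interval of roughly −9.8 % to −1.5 %.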

Now we redraw the inner-ring cutoff at 0.05, 0.10, and 0.15 mile, holding the outer ring fixed at 0.3 mile, to test how much that headline depends on the cutoff choice.

ringchoice_lr <- tibble(cut_inner = c(0.05, 0.10, 0.15)) |>
  rowwise() |>
  mutate(fit = list(parametric_ring_lr(linden_rockoff, cut_inner)))

[Section 6.6] Ring-choice sensitivity (Linden-Rockoff)
cut_inner att_log att_pct se ci_lower ci_upper n
1 0.05 -0.0661 -6.40 0.0383 -0.141 0.00888 7534
2 0.1 -0.0560 -5.45 0.0239 -0.103 -0.00919 7534
3 0.15 -0.0431 -4.21 0.0180 -0.0784 -0.00768 7534

Three inner-ring cutoffs on the same data: ATT moves from −6.40 % (0.05 mi) to −4.21 % (0.15 mi), a gap equal to about 52 % of the smallest estimate, driven entirely by the cutoff choice.

Inner-ring cutoff	ATT (log)	ATT (%)	SE	95% CI (log)	N
0.05 mi	−0.0661	−6.40 %	0.0383	[−0.141, +0.009]	7,534
0.10 mi	−0.0560	−5.45 %	0.0239	[−0.103, −0.009]	7,534
0.15 mi	−0.0431	−4.21 %	0.0180	[−0.078, −0.008]	7,534

The headline number wobbles from −4.21 % (cutoff 0.15) to −6.40 % (cutoff 0.05), a gap of 2.19 percentage points, which is about 52 % of the smallest estimate and about 40 % of the central one. The sign is stable across choices. The estimates at 0.10 and 0.15 mile are statistically distinguishable from zero at conventional levels, while the 0.05-mile estimate is borderline (its CI just includes zero). But the magnitude moves enough that a reader who only ever sees one of these three numbers gets a noticeably different impression of the policy-relevant effect. This is the same fragility lesson the simulated DGP taught us in Step 5, now reproduced on real data. As Butts (2023, p. 5) puts it: “the choice of 0.1 miles is an untestable assumption.” The parametric ring DiD is a perfectly fine estimator, conditional on a researcher choice that has no obvious right answer.
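How large the spread looks depends on the baseline it is divided by. A short calculation on the three percent estimates from the table makes the two natural baselines explicit:

```r
# ATT in percent at the three inner-ring cutoffs (from the table above).
att_pct <- c(`0.05` = -6.40, `0.10` = -5.45, `0.15` = -4.21)

# Gap between the largest and smallest magnitudes, in percentage points.
gap <- max(att_pct) - min(att_pct)

# Relative spread against two different baselines.
rel_to_smallest <- gap / min(abs(att_pct))          # baseline: 4.21
rel_to_central  <- gap / abs(att_pct[["0.10"]])     # baseline: 5.45

print(round(c(gap = gap,
              pct_of_smallest = 100 * rel_to_smallest,
              pct_of_central  = 100 * rel_to_central), 1))
```

The 2.19-point gap is about 52 % of the smallest estimate and about 40 % of the estimate at the canonical 0.10-mile cutoff.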

12. Step 10 — The nonparametric estimator on Linden-Rockoff

The nonparametric ring DiD frees us from the cutoff. We hand binsreg the first-differenced log-price outcome and distance to the offender, and let the algorithm decide how to partition the (0, 0.3]-mile support.

np_lr <- nonparametric_ring_cs(
  data = linden_rockoff,
  outcome = "delta_log_price",
  dist = "dist_mi",
  cb = c(3, 3)
)

[Section 6.7] Nonparametric ring on Linden-Rockoff
Number of distance bins: 23
Estimated TE averaged inside d <= 0.1 mi: -0.132 (-12.4%)

Nonparametric ring DiD on Linden-Rockoff: 23 bins, two closest bins at −20.6 % and −15.2 %; curve crosses zero at $d \approx 0.094$ mi.

Bin	Distance interval (mi)	τ̂ (log)	τ̂ (%)	SE	95% CI (log)
1	[0.011, 0.053]	−0.231	−20.6 %	0.056	[−0.340, −0.121]
2	[0.054, 0.076]	−0.165	−15.2 %	0.045	[−0.254, −0.077]
3	[0.077, 0.094]	−0.030	−2.9 %	0.048	[−0.124, +0.064]
4	[0.095, 0.110]	+0.006	+0.6 %	0.047	[−0.087, +0.099]
5	[0.111, 0.127]	−0.013	−1.3 %	0.048	[−0.108, +0.081]
6	[0.127, 0.140]	−0.100	−9.5 %	0.048	[−0.194, −0.006]
…	… (23 bins total)

binsreg partitions the Linden-Rockoff analysis sample into 23 quantile-spaced bins. The two closest bins, covering homes within roughly the first 400 feet of the offender’s address, show steep price declines: bin 1 at −20.6 % with 95 % CI $[-0.340,\, -0.121]$ in log points, and bin 2 at −15.2 % with CI $[-0.254,\, -0.077]$. By bin 3 (about 0.08 mile) the point estimate has collapsed to −2.9 % with a CI that includes zero, and bin 4 (about 0.10 mile) is essentially zero (+0.6 %). Butts (2023, p. 6) describes this exact pattern: “homes in the two closest rings i.e. within a few hundred feet, are most affected by sex-offender arrival with an estimated decline of home value of around 20%.” Our bin-1 estimate of −20.6 % lands on his “around 20 %” claim almost exactly.

Averaged across observations inside 0.1 mile (sample-weighted, so that bins with more transactions count more), the nonparametric ATT is −0.132 log-points = −12.4 % — about 2.1× the parametric estimate of −5.78 % at the same boundary. The reconciliation is not mysterious. The parametric estimator forces a single coefficient across the entire (0, 0.1] inner ring. That single coefficient averages over a very strong effect right at the offender’s address (bin 1 at −20.6 %) and a near-zero effect at the ring’s outer edge (bin 4 at +0.6 %). When we let the curve flex, we recover the concentration of the effect in the closest few hundred feet that the parametric average hides. The two estimators are not in disagreement; they answer slightly different questions, and the gap between them is itself informative.

A final detail worth noticing: the nonparametric curve crosses zero between bins 3 and 4, at $d \approx 0.094$ mi, strikingly close to the 0.1-mile cutoff that Linden and Rockoff chose by eyeballing the smoothed gradient. The data-driven estimator validates their cutoff as an output of the analysis, not as an input to it. Butts (2023, p. 6) makes the same point: “After 0.1 miles, the estimated treatment effect curve becomes centered at zero consistently.”

13. Discussion

So: what happens to home prices when a registered sex offender moves into a neighborhood, and how do we know we measured it right? The substantive answer, on Linden and Rockoff’s North Carolina data, is that homes within a few hundred feet of the offender’s eventual address drop by about 20 % after arrival, and the effect fades to noise beyond roughly 0.1 mile. A reader who is told only the parametric ring DiD — “prices inside 0.1 mile drop by about 6 %” — gets a correct but attenuated picture, because the parametric estimator averages a steep close-in effect with a near-zero outer-ring effect. A reader who is told only the leftmost nonparametric bin — “prices inside 300 feet drop by 20 %” — gets a correct but localized picture that does not describe the average inner-ring home. Both numbers belong in the conversation, and both come out of the same data.

The methodological lesson is that the parametric ring estimator’s headline number is conditional on the ring choice. On the real data, that choice can move the magnitude from −4.2 % to −6.4 % — a 52 % relative spread driven entirely by the researcher’s pick of $\bar{d}$. The nonparametric estimator avoids the choice by letting binsreg partition the data, and it has the further advantage of revealing the shape of the treatment-effect curve — not just its average. In the Linden-Rockoff case, that shape is exactly what one would expect from a hyper-local externality: very strong at zero distance, fading quickly, indistinguishable from zero past about 0.1 mile. This pattern is the empirical case in favor of the data-driven approach, and it is why a reader who has only ever seen the parametric ring DiD should add the nonparametric tool to their kit.
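For readers who want the arithmetic behind the 52 % figure: it is the gap between the two magnitudes measured relative to the smaller one, using the two-decimal estimates reported for the 0.05-mile and 0.15-mile cutoffs (−6.40 % and −4.21 %).

$$\frac{6.40 - 4.21}{4.21} = \frac{2.19}{4.21} \approx 0.52$$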

Two identification caveats are worth flagging before any of this is taken too literally. First, the design rests on local parallel trends: absent the offender, the average price change inside 0.1 mile would have matched the average price change in the 0.1–0.3 mile band. There is no formal pre-trends test in this cross-sectional setting, but the nonparametric estimator’s behavior past 0.1 mile (point estimates oscillating around zero, with CIs that include zero) is suggestive evidence that the assumption is not wildly violated. Second, the design implicitly assumes no anticipation: home buyers do not price the offender’s arrival into transactions before the arrival becomes public. With a cross-section, this assumption is also untestable, and any anticipation effects would attenuate the post-arrival drop. Both caveats are present in Butts (2023) and in Linden and Rockoff (2008); the estimators here cannot resolve them.

14. Summary and takeaways

1. Headline number depends on the estimator, not just the data. On the same 9,092 sales, the parametric ring DiD at 0.1 mile returns −5.78 %; the leftmost nonparametric bin returns −20.6 %; the sample-weighted nonparametric ATT inside 0.1 mile is −12.4 %. All three describe the same dataset; they answer slightly different questions about “the effect of an offender arriving.”

2. Ring choice is part of the estimand. Moving the inner-ring cutoff from 0.05 to 0.15 mile changes the parametric ATT from −6.40 % to −4.21 % — a 52 % relative spread that has nothing to do with statistical noise. A parametric ring DiD without a sensitivity analysis is reporting one corner of an answer surface and calling it the answer.

3. The data-driven approach validates and refines the classical setup. The nonparametric estimator does not contradict Linden and Rockoff’s 0.1-mile cutoff — it corroborates it, because the treatment-effect curve crosses zero at about $d \approx 0.094$ mile. The data-driven approach disciplines the cutoff instead of guessing it, and in this case it endorses the original authors' eyeballed choice.

4. The simulation should always come first. Steps 3–6 used a known DGP to confirm that the parametric ring estimator is unbiased when the cutoff is right and biased otherwise, and that the nonparametric ring estimator recovers the shape of the true τ-curve. Without the simulation, the −20.6 % bin-1 estimate on the real data would look implausible. With the simulation, we understand why the parametric ring estimator must be attenuated whenever the true effect is concentrated near the treatment point.

15. Exercises

Sensitivity to the outer ring. Re-run the parametric ring DiD on Linden-Rockoff with the outer ring fixed at 0.25 mile and 0.40 mile (instead of 0.30), keeping the inner ring at 0.10 mile. How much does the headline ATT move? Does the sign survive?
Placebo offender. Pick a random non-offender address in the data and treat it as if an offender had arrived at that location. Run the parametric ring DiD as usual. The placebo coefficient should be near zero and statistically indistinguishable from zero. What does it tell you when it is not?
Bin-equal vs sample-weighted ATT. Compute the inner-0.1-mile nonparametric ATT two ways: (i) as a simple mean of $\hat{\tau}_j$ over bins inside 0.1 mile (bin-equal weight), and (ii) as the sample-weighted average used in this post. Which weighting is more defensible if you want to communicate the “average effect on the average treated home” rather than the “average effect on the average bin”?

References

Linden, Leigh, and Jonah E. Rockoff (2008). Estimates of the Impact of Crime Risk on Property Values from Megan’s Laws. American Economic Review 98(3), 1103–1127.
Butts, Kyle (2023). JUE Insight: Difference-in-Differences with Geocoded Microdata. Journal of Urban Economics 133, 103493.
Cattaneo, Matias D., Richard K. Crump, Max H. Farrell, and Yingjie Feng (2024). On Binscatter. American Economic Review 114(5), 1488–1514.
Bergé, Laurent (2018). Efficient estimation of maximum likelihood models with multiple fixed-effects: the R package FENmlm. (fixest package documentation.)
Cattaneo, Matias D., Richard K. Crump, Max H. Farrell, and Yingjie Feng (2024). binsreg: Binscatter Estimation and Inference. CRAN R package.

Difference-in-Differences for Regional Data: Did Medicaid Expansion Reduce Mortality?

Sun, 17 May 2026 00:00:00 +0000

1. Overview

Did the Affordable Care Act’s Medicaid expansion reduce adult mortality? Between 2014 and 2019, twenty-nine states (plus DC) opened Medicaid eligibility to low-income adults who had previously been uncovered; the remaining states did not. That staggered roll-out is a natural experiment, and Difference-in-Differences (DiD) is the standard tool for turning it into a causal estimate of how the program affected the death rate of working-age adults. The empirical question matters: roughly twenty million people gained insurance under the expansion, and a reduction of even a few deaths per 100,000 adults would translate into thousands of lives saved each year.

The challenge is that the unit of analysis here is the county, not the individual — and U.S. counties differ in size by three orders of magnitude. Los Angeles County has more adults than Wyoming, Vermont, and Alaska combined. When you compute an average treatment effect, you must decide whether each county should count equally (an unweighted average across counties), or whether each adult should count equally (an average weighted by county population). This is not just a precision choice. Weighting changes the target parameter. The unweighted answer estimates the effect on the typical treated county; the weighted answer estimates the effect on the typical treated adult. When treatment effects vary across counties of different sizes, those two parameters can disagree — sometimes dramatically.
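A toy calculation makes the distinction concrete. The three counties, their sizes, and their treatment effects below are invented for illustration only; the point is that the same effects produce different averages depending on who gets one vote.

```r
# HYPOTHETICAL toy numbers: three treated counties with different sizes
# and different treatment effects (deaths per 100,000 adults).
toy <- data.frame(
  county = c("small_A", "small_B", "large_C"),
  adults = c(10000, 20000, 970000),
  effect = c(2.0, 1.0, -3.0)
)

# Each county counts once: effect on the typical treated county.
att_county <- mean(toy$effect)

# Each adult counts once: effect on the typical treated adult.
att_adult <- weighted.mean(toy$effect, w = toy$adults)

cat(sprintf("County-weighted ATT: %+.2f\n", att_county))
cat(sprintf("Adult-weighted ATT:  %+.2f\n", att_adult))
```

In this made-up example the county-weighted average is zero while the adult-weighted average is clearly negative, because one large county holds almost all of the adults.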

This tutorial is inspired by the empirical example from Baker, Callaway, Cunningham, Goodman-Bacon and Sant’Anna’s (2025) Difference-in-Differences Designs: A Practitioner’s Guide (arXiv:2503.13323). We walk through eight stages of the modern DiD pipeline using R, and at every stage we compute the answer twice, once unweighted and once weighted by county adult population in 2013. The headline finding previews what is coming: in the simplest possible four-cell 2x2 calculation, the unweighted DiD is $+0.12$ deaths per 100,000 (suggesting Medicaid did nothing, or even raised mortality slightly), while the population-weighted DiD is $-2.56$ deaths per 100,000 (suggesting it saved lives). The remainder of the post examines whether that sign reversal survives covariate adjustment, staggered cohorts, and an HonestDiD sensitivity analysis. Spoiler: it largely does, and the punchline is that the two estimands are not in competition. They answer different policy questions.

Learning objectives. After working through this tutorial you will be able to:

Understand the parallel-trends assumption and why it is the only identifying restriction needed for a 2x2 DiD with two cohorts and two periods.
Estimate the 2x2 cell-means DiD, three equivalent TWFE specifications, and the full Callaway-Sant’Anna $\text{ATT}(g, t)$ design in R using fixest and the did package.
Adjust for covariates via outcome regression (OR), inverse propensity weighting (IPW), and the Sant’Anna-Zhao doubly robust DiD (DRDID).
Compare unweighted and population-weighted estimands at every stage, and read the gap between them as a difference in target parameter, not in precision.
Assess robustness to violations of parallel trends using the Rambachan-Roth HonestDiD package, and identify the smallest pre-trend violation that would overturn the conclusion.

Key concepts at a glance

The post leans on a small vocabulary repeatedly. The rest of the tutorial assumes you can move between these terms quickly. Each concept below has three parts. The definition is always visible. The example and analogy sit behind clickable cards: open them when you need them, leave them collapsed for a quick scan. If a later section mentions “parallel trends” or “M-bar” and the term feels slippery, this is the section to re-read.

1. Parallel-trends assumption. Counterfactually, treated and control groups would have moved together. If Medicaid expansion had not happened, the mortality trend in expansion counties would have matched the trend in never-expansion counties.

Example

Between 2013 and 2014, never-expansion counties saw mortality rise by $9.15$ deaths per 100,000 (unweighted) or $6.30$ (weighted). The parallel-trends assumption says expansion counties would have seen the same change had they not expanded. The 2x2 DiD measures the actual deviation from that counterfactual trend: $+0.12$ unweighted, $-2.56$ weighted.

Analogy

Two identical twins grow up in different households. We assume their height curves would have stayed in sync had nothing changed. Then one twin starts a growth-hormone treatment. The height gap that opens up after the treatment, minus any gap that was already there, is the treatment effect. Parallel trends says the gap would have stayed constant absent the intervention.

2. 2x2 DiD $\text{ATT}(2014) = (\bar{Y}_{T, \text{post}} - \bar{Y}_{T, \text{pre}}) - (\bar{Y}_{C, \text{post}} - \bar{Y}_{C, \text{pre}})$. The treated group’s change minus the control group’s change. Two groups, two periods, four means — no regression required.

Example

Treated cell means: $419.23$ (2013) and $428.50$ (2014); control cell means: $474.00$ (2013) and $483.15$ (2014). The treated trend is $+9.27$; the control trend is $+9.15$; the 2x2 DiD is the difference, $+0.12$. (All values are unweighted; population-weighted versions appear in table_2x2_means.csv.)

Analogy

Two restaurants raise prices, but only one adds a delivery service. We compare the change in revenue at the delivery restaurant to the change at the non-delivery restaurant. The price increase affects both equally; the delivery effect is the extra change at the treated restaurant.

3. Estimand: ATT under weighting. The Average Treatment effect on the Treated, evaluated under a specific weighting scheme. Equal weights give the ATT for the typical treated county; population weights give the ATT for the typical treated adult.

Example

In our 2x2, the equal-weight ATT is $+0.12$ deaths per 100,000 (an estimate averaged across the 978 expansion counties as units). The population-weight ATT is $-2.56$ (an estimate averaged across the 84 million adults living in those counties). Both are causal parameters; they just describe different averaging targets.

Analogy

If you survey “the average classroom” you ask each classroom one question. If you survey “the average student” you give each student one vote. A classroom of 30 students counts thirty times as much as a classroom of one in the second average, but exactly the same in the first. Same data, different question.

4. Staggered adoption $G_i \in \{2014, 2015, 2016, 2019, \infty\}$. Different units start treatment in different years. There is no single “post” period for the whole sample; each cohort has its own clock.

Example

In this study, $978$ counties expanded in 2014, $171$ in 2015, $93$ in 2016, and $140$ in 2019. A further $1{,}222$ counties never expanded ($G_i = \infty$). The Callaway-Sant’Anna design estimates a separate $\text{ATT}(g, t)$ for each cohort-year cell, then aggregates them.

Analogy

Four cohorts of swimmers enter a relay race at staggered start times, plus a fifth cohort that never swims. We measure each cohort’s improvement from start to finish separately, then average. We never make a swimmer who is mid-race serve as the “control” for a swimmer who hasn’t started yet — a mistake that two-way fixed effects can quietly make.

5. Doubly-robust DiD (DRDID). A 2x2 estimator that uses both an outcome model (control-group regression) and a propensity-score model (treatment-group balancing weights). It is consistent if either model is correctly specified.

Example

For the 2014 cohort, our population-weighted estimates are: outcome regression (OR) $-3.46$, inverse propensity weighting (IPW) $-3.84$, doubly robust (DRDID) $-3.76$. Here DRDID sits between OR and IPW, which is unsurprising for an estimator that combines both working models; if both models were correctly specified, all three estimators would target the same ATT and differ only by sampling noise.

Analogy

Belt-and-suspenders insurance. The belt (outcome model) holds your pants up if it works; the suspenders (propensity model) hold your pants up if they work. As long as at least one is functional, you stay decent. DRDID is the same idea applied to causal identification.

6. HonestDiD sensitivity with parameter $\bar{M}$. A robustness analysis that asks “how big could a post-period parallel-trends violation be — relative to the largest violation seen in the pre-period — before the conclusion changes?” Smaller $\bar{M}$ is a stricter assumption; larger $\bar{M}$ is more permissive.

Example

At $\bar{M} = 0$ (exact parallel trends), the unweighted dynamic ATT bound is $[+2.01, +14.09]$ — entirely positive — while the weighted bound is $[-6.07, +6.07]$ — straddling zero. By $\bar{M} = 0.25$ both bounds cross zero; by $\bar{M} = 1$ they saturate at the package’s grid limits of $\pm 66.7$.

Analogy

A stress test. We do not believe parallel trends holds exactly. We ask: “If next year’s deviation is at most as large as the biggest deviation we observed in the past, would our answer change?” The smallest violation that flips the conclusion is the breakdown value.

Methodological roadmap

This tutorial walks through eight estimation stages. Each stage estimates the same ATT but under a slightly more general design; the figure below shows how the stages compose into a single pipeline. The thread that runs through all eight stages is the unweighted-vs-weighted contrast: at every step, we compute both versions side-by-side.

graph LR
A[Raw data:<br/>2604 counties<br/>x 11 years] --> B[2x2 cell means<br/>headline sign reversal]
B --> C[2x2 TWFE<br/>three specs, two weights]
C --> D[Covariate balance<br/>+ propensity scores]
D --> E[OR / IPW / DRDID<br/>covariate-adjusted 2x2]
E --> F[2xT event study<br/>2014 cohort dynamics]
F --> G[GxT staggered design<br/>all 4 cohorts pooled]
G --> H[HonestDiD<br/>parallel-trends sensitivity]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#d97757,stroke:#141413,color:#fff
style D fill:#1a3a8a,stroke:#141413,color:#fff
style E fill:#1a3a8a,stroke:#141413,color:#fff
style F fill:#00d4c8,stroke:#141413,color:#141413
style G fill:#00d4c8,stroke:#141413,color:#141413
style H fill:#141413,stroke:#6a9bcc,color:#fff

The first two stages (orange) are the simplest possible DiD; they recover the headline result with arithmetic plus a regression. The next two (navy) introduce covariate adjustment, useful when treated and control groups differ on observables. The 2xT and GxT stages (teal) extend the design to multiple post-periods and multiple cohorts. The final stage (black) asks the only question a 2x2 cannot answer: how robust is the answer to violations of the assumption that buys identification in the first place?

2. Setup and imports

The R session needs nine packages: tidyverse for data manipulation and ggplot2, fixest for fast fixed-effects regression, did for the Callaway-Sant’Anna group-time estimator, DRDID for the doubly-robust DiD engine, HonestDiD for the Rambachan-Roth sensitivity analysis, broom for tidy regression output, scales for percentage labels, here for project-anchored paths, and pacman to handle installation if any of those are missing. We also fix the random seed and set the bootstrap iteration count to 2,000; the manuscript’s reference scripts use 25,000, but 2,000 is enough for tutorial-grade results.

set.seed(42)
if (!require("pacman")) install.packages("pacman", repos = "https://cloud.r-project.org")
pacman::p_load(
  tidyverse,  # data manipulation + ggplot
  fixest,     # fast fixed-effects regression (`feols`, `feglm`)
  did,        # Callaway & Sant'Anna group-time ATT(g,t) estimator
  DRDID,      # the doubly-robust DiD engine used inside `did`
  HonestDiD,  # Rambachan-Roth sensitivity analysis
  broom,      # tidy regression output
  scales,     # nice axis labels
  here        # filepath helper (project root anchored)
)
BITERS <- 2000

Three of these packages deserve a brief introduction. The did package implements Callaway and Sant’Anna’s (2021) group-time ATT estimator via att_gt() and aggregates the resulting cells via aggte(). The DRDID package is the doubly-robust DiD engine that lives underneath did’s est_method = "dr" option. The HonestDiD package implements the Rambachan-Roth (2023) sensitivity analysis for parallel-trends violations. All three are CRAN-published.

A dark-themed ggplot palette is registered once at the top of the script so every figure inherits it; this keeps the eight figures visually consistent without per-plot styling.

BG_DARK <- "#0f1729"
GRID_DARK <- "#1f2b5e"
TEXT_LIGHT <- "#c8d0e0"
TEXT_WHITE <- "#e8ecf2"
BLUE <- "#6a9bcc" # unweighted series
ORANGE <- "#d97757" # population-weighted series
TEAL <- "#00d4c8" # highlights
theme_dark_dampoostle <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.background = element_rect(fill = BG_DARK, color = NA),
      panel.background = element_rect(fill = BG_DARK, color = NA),
      panel.grid.major = element_line(color = GRID_DARK, linewidth = 0.35),
      axis.text = element_text(color = TEXT_LIGHT),
      axis.title = element_text(color = TEXT_WHITE),
      legend.position = "bottom"
    )
}
theme_set(theme_dark_dampoostle())

The two weighting regimes are color-coded throughout: steel blue (#6a9bcc) for unweighted, warm orange (#d97757) for population-weighted. That convention makes every comparison figure visually self-documenting. Every figure in this post uses the dark-navy theme above; if your browser is in light mode and the figures look unexpectedly dark, that is by design rather than a rendering bug.

3. Data: CDC mortality + ACA expansion timing

The source data are CDC county-level mortality rates (deaths per 100,000 adults aged 20–64) merged with state-level Medicaid-expansion timing. We follow the manuscript’s inclusion criteria: drop the five jurisdictions that expanded before 2014 (DC, DE, MA, NY, VT) because they cannot serve cleanly as either treated or control in a 2014-centered design; require full mortality coverage 2009–2019; and require full covariate coverage in 2013 and 2014.

covs <- c("perc_female", "perc_white", "perc_hispanic",
"unemp_rate", "poverty_rate", "median_income")
DATA_URL <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_did2/reference/data/county_mortality_data.csv"
df_raw <- read_csv(DATA_URL, show_col_types = FALSE, na = c("", "NA"))
df_prep <- df_raw %>%
mutate(
state_abb = str_sub(county, nchar(county) - 1, nchar(county)),
perc_white = population_20_64_white / population_20_64 * 100,
perc_hispanic = population_20_64_hispanic / population_20_64 * 100,
perc_female = population_20_64_female / population_20_64 * 100,
unemp_rate = unemp_rate * 100,
median_income = median_income / 1000,
yaca = suppressWarnings(as.numeric(yaca))
) %>%
filter(!(state_abb %in% c("DC", "DE", "MA", "NY", "VT"))) %>%
select(state_abb, county, county_code, year, population_20_64, yaca,
crude_rate_20_64, all_of(covs)) %>%
drop_na(!yaca) %>%
group_by(county_code) %>%
filter(sum(year %in% c(2013, 2014)) == 2) %>%
filter(sum(!is.na(crude_rate_20_64)) == 11) %>%
ungroup() %>%
group_by(county_code) %>%
mutate(set_wt = population_20_64[which(year == 2013)]) %>%
ungroup() %>%
mutate(
treat_year = if_else(!is.na(yaca) & yaca <= 2019, yaca, 0),
Treat_2014 = if_else(!is.na(yaca) & yaca == 2014, 1L, 0L),
Post = if_else(year >= 2014, 1L, 0L)
)

The cleaning produces a balanced panel of 2,604 counties across 11 years (28,644 county-year rows). The treat_year column follows the did package convention: it holds the actual expansion year for treated counties and a literal $0$ for never-treated counties. The set_wt column is each county’s 2013 adult population, held constant across all 11 years so that weighting does not conflate population growth with mortality change. After the cleaning, the breakdown of cohorts is:

Loaded 31843 rows x 22 cols from county_mortality_data.csv
After cleaning: 2604 counties x 11 years = 28644 county-year rows
Treatment cohorts (treat_year):
treat_year n_counties
1 0 1222
2 2014 978
3 2015 171
4 2016 93
5 2019 140

The five cohorts hide an asymmetry that is the seed of everything that follows. Built on county counts, the never-expansion cohort makes up 46.9% of the sample and the 2014 cohort 37.6%. Built on 2013 adult population, the never-expansion cohort makes up only 38.2% while the 2014 cohort makes up 49.5%. Switching weighting regimes silently moves the 2014 cohort up by about 12 percentage points and the never-expansion cohort down by about 9. Among the three smaller cohorts, 2016 and 2019 also shrink under weighting (from 3.6% and 5.4% of counties to 2.0% and 3.4% of adults), while 2015 edges up from 6.6% of counties to 7.0% of adults.

treat_year	n_counties	n_states	pop_adult (2013)	share_counties	share_pop
0 (never)	1,222	17	65,171,521	46.9%	38.2%
2014	978	22	84,421,489	37.6%	49.5%
2015	171	3	11,906,556	6.6%	7.0%
2016	93	2	3,329,529	3.6%	2.0%
2019	140	2	5,811,224	5.4%	3.4%

The gap between county shares and population shares for the two largest cohorts (about 12 percentage points for the 2014 cohort and about 9 for the never-expansion cohort) is the proximate cause of the sign reversal that the next section produces. When you switch from equal weighting to population weighting, you are quietly rebalancing the comparison toward larger, more urban counties, which sit disproportionately in expansion states: the average 2014-cohort county has about 86,000 adults, against about 53,000 for the average never-expansion county. That is not a precision change; it is a different comparison.
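The shares in the table, and the size of the swing between the two weighting regimes, can be recomputed from the cohort counts and populations alone. A minimal sketch with the table's values hard-coded; the data frame name `cohorts` and the column `swing_pp` are introduced here for illustration.

```r
# Cohort counts and 2013 adult populations copied from the table above.
cohorts <- data.frame(
  treat_year = c("never", "2014", "2015", "2016", "2019"),
  n_counties = c(1222, 978, 171, 93, 140),
  pop_adult  = c(65171521, 84421489, 11906556, 3329529, 5811224)
)

cohorts$share_counties <- 100 * cohorts$n_counties / sum(cohorts$n_counties)
cohorts$share_pop      <- 100 * cohorts$pop_adult  / sum(cohorts$pop_adult)

# Positive swing: the cohort gains mass when moving to population weights.
cohorts$swing_pp <- cohorts$share_pop - cohorts$share_counties

print(transform(cohorts,
                share_counties = round(share_counties, 1),
                share_pop      = round(share_pop, 1),
                swing_pp       = round(swing_pp, 1)))
```

The `swing_pp` column shows which cohorts gain and which lose mass when each adult, rather than each county, gets one vote.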

4. The headline 2x2 DiD — four cell means

The simplest possible DiD uses only four numbers: mean mortality in (Expansion, Never-Expansion) $\times$ (2013, 2014). The treatment effect is the treated group’s pre-to-post change minus the control group’s pre-to-post change. We do this twice, once with equal weights and once with population weights, using a small helper that takes the weighting column as an argument.

short_data <- df_prep %>%
filter(year %in% c(2013, 2014),
(treat_year == 2014) | (treat_year == 0)) %>%
mutate(D = Treat_2014)
cell_means <- function(d, wt = NULL) {
  if (is.null(wt)) {
    d %>% group_by(D, year) %>%
      summarise(y = mean(crude_rate_20_64), .groups = "drop")
  } else {
    d %>% group_by(D, year) %>%
      summarise(y = weighted.mean(crude_rate_20_64, w = .data[[wt]]),
                .groups = "drop")
  }
}
cells_unw <- cell_means(short_data)
cells_wt <- cell_means(short_data, wt = "set_wt")
att_2x2 <- function(cells) {
  T_pre <- cells$y[cells$D == 1 & cells$year == 2013]
  T_post <- cells$y[cells$D == 1 & cells$year == 2014]
  C_pre <- cells$y[cells$D == 0 & cells$year == 2013]
  C_post <- cells$y[cells$D == 0 & cells$year == 2014]
  list(T_pre = T_pre, T_post = T_post, C_pre = C_pre, C_post = C_post,
       trend_T = T_post - T_pre, trend_C = C_post - C_pre,
       att = (T_post - T_pre) - (C_post - C_pre))
}
e_unw <- att_2x2(cells_unw)
e_wt <- att_2x2(cells_wt)
cat(sprintf("Unweighted 2x2 ATT(2014) = %.3f\n", e_unw$att))
cat(sprintf("Weighted 2x2 ATT(2014) = %.3f\n", e_wt$att))

Unweighted 2x2 ATT(2014) = 0.122
Weighted 2x2 ATT(2014) = -2.563

The estimand here is the average treatment effect on the treated (ATT) for the 2014 expansion cohort, evaluated either under equal weights across counties or under population weights. Formally:

$$\text{ATT}_\omega(2014) = \Big( \mathbb{E}_\omega[Y_{i, 2014} \mid D_i = 1] - \mathbb{E}_\omega[Y_{i, 2013} \mid D_i = 1] \Big) - \Big( \mathbb{E}_\omega[Y_{i, 2014} \mid D_i = 0] - \mathbb{E}_\omega[Y_{i, 2013} \mid D_i = 0] \Big)$$

In words, this is the treated group’s change in mean mortality from 2013 to 2014, minus the control group’s change over the same period — where both means are computed under weighting scheme $\omega$. The weight $\omega$ is either equal across counties or proportional to 2013 adult population. The subscript on the expectation $\mathbb{E}_\omega$ is what carries the weighting choice through the definition of the parameter itself; the manuscript discusses this at lines 169–170. In our code, the four conditional means map to T_pre, T_post, C_pre, C_post; the ATT is (T_post - T_pre) - (C_post - C_pre).

The full four-cell table makes the arithmetic transparent:

row	unw_T	unw_C	unw_gap	wt_T	wt_C	wt_gap
2013 (pre)	419.23	474.00	$-54.77$	322.72	376.40	$-53.68$
2014 (post)	428.50	483.15	$-54.65$	326.46	382.70	$-56.25$
Trend (post-pre)	$+9.27$	$+9.15$	$+0.12$	$+3.74$	$+6.30$	$-2.56$
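Plugging the four cell means from the table into the definition of $\text{ATT}_\omega(2014)$ gives the two headline numbers directly. With equal weights across counties:

$$\text{ATT}_\omega(2014) = (428.50 - 419.23) - (483.15 - 474.00) = 9.27 - 9.15 = +0.12$$

With weights proportional to 2013 adult population:

$$\text{ATT}_\omega(2014) = (326.46 - 322.72) - (382.70 - 376.40) = 3.74 - 6.30 = -2.56$$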

The cell-means visualization (Figure 1) plots the four points and connects them by group; the gap between the two slopes is the DiD.

fig1_df <- bind_rows(
cells_unw %>% mutate(weighting = "Unweighted"),
cells_wt %>% mutate(weighting = "Population-weighted")
) %>%
mutate(group = if_else(D == 1, "2014 Expansion counties",
"Never-expansion counties"),
weighting = factor(weighting,
levels = c("Unweighted", "Population-weighted")))
p1 <- ggplot(fig1_df, aes(x = year, y = y, color = group, group = group)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3.2) +
scale_color_manual(values = c("2014 Expansion counties" = ORANGE,
"Never-expansion counties" = BLUE)) +
facet_wrap(~ weighting) +
labs(title = "The 2x2 DiD flips sign when you use population weights",
subtitle = "Mortality (per 100,000 adults aged 20-64): 2014 expanders vs never-expanders",
x = NULL, y = "Mortality rate")
ggsave("r_did2_01_headline_2x2.png", p1, width = 10, height = 5.5,
dpi = 300, bg = BG_DARK)

The numbers carry the headline. Under equal weighting, 2014-expansion counties saw mortality rise by $9.27$ deaths per 100,000 between 2013 and 2014; never-expansion counties rose by $9.15$. The treated trend is essentially indistinguishable from the control trend, and the DiD is $+0.122$. Under population weighting, expansion counties rose by only $3.74$ deaths per 100,000 while never-expansion counties rose by $6.30$ — a divergence that yields a DiD of $-2.563$. Crucially, the pre-period gap between treated and control means is essentially identical across the two weightings ($-54.77$ unweighted, $-53.68$ weighted): the reversal is driven entirely by which counties dominate the 2014 averages.

This precisely reproduces the manuscript’s flagship example (line 215, Table tab:two_by_two_ex), which reports $+0.1$ deaths per 100,000 unweighted and $-2.6$ weighted: “Without weighting … 0.1 deaths per 100,000 … In contrast, the DiD result using population weights suggests that Medicaid expansion caused a reduction of 2.6 deaths per 100,000 for the average adult in expansion states.” The ATT is a weighted average treatment effect on the treated; choosing the weight is choosing the question.

5. The same 2x2, written as a regression

Most applied researchers reach for a regression, not cell means. The manuscript’s algebraic Result 1 (line 234) states that on a balanced 2x2 panel, three apparently different regression specifications all recover exactly the same DiD coefficient. Demonstrating that equivalence removes mystery from “Two-Way Fixed Effects” (TWFE) and makes the only substantive choice in the 2x2 case — the weighting — visible.

The three specifications are: (a) a levels regression with treatment, post, and their interaction; (b) a two-way fixed-effects regression with county and year fixed effects; and (c) a long-difference regression of the 2014-minus-2013 outcome change on the treatment indicator. We run each twice (unweighted and weighted) for a total of six fits, all using fixest::feols() with county-clustered standard errors.

short_long_diff <- short_data %>%
group_by(county_code) %>%
summarise(set_wt = mean(set_wt),
diff = crude_rate_20_64[which(year == 2014)] -
crude_rate_20_64[which(year == 2013)],
D = mean(D),
.groups = "drop")
twfe_levels_unw <- feols(crude_rate_20_64 ~ D * Post,
data = short_data, cluster = ~county_code)
twfe_fe_unw <- feols(crude_rate_20_64 ~ D:Post | county_code + year,
data = short_data, cluster = ~county_code)
twfe_long_unw <- feols(diff ~ D,
data = short_long_diff, cluster = ~county_code)
twfe_levels_wt <- feols(crude_rate_20_64 ~ D * Post,
data = short_data, weights = ~set_wt,
cluster = ~county_code)
twfe_fe_wt <- feols(crude_rate_20_64 ~ D:Post | county_code + year,
data = short_data, weights = ~set_wt,
cluster = ~county_code)
twfe_long_wt <- feols(diff ~ D,
data = short_long_diff, weights = ~set_wt,
cluster = ~county_code)

The levels specification recovers the DiD as the coefficient on D:Post; the FE specification absorbs the main effects through fixed effects and identifies the DiD off the same interaction; the long-difference specification collapses each county to one row and identifies the DiD as the coefficient on $D$. All three are algebraically equivalent on a balanced 2x2 panel:

$$Y_{i, t} = \beta_0 + \beta_1 \mathbf{1}\{D_i = 1\} + \beta_2 \mathbf{1}\{t = 2014\} + \beta^{2 \times 2} \big( \mathbf{1}\{D_i = 1\} \times \mathbf{1}\{t = 2014\} \big) + \varepsilon_{i, t}$$

In words, this regression says that mortality $Y_{i, t}$ for county $i$ in year $t$ depends on a baseline level $\beta_0$, a treatment-group shift $\beta_1$, a post-period shift $\beta_2$, and a treatment-and-post interaction $\beta^{2 \times 2}$ that captures the differential change for treated counties after the policy. In our code, $Y_{i, t}$ is crude_rate_20_64, $D_i$ is the D indicator, $t = 2014$ activates the Post dummy, and $\beta^{2 \times 2}$ is the D:Post coefficient that extract_did() pulls out of each model. The manuscript’s eqn:twfe_2_by_2 at line 217 states this specification; the algebraic result we are about to demonstrate is that the same coefficient $\beta^{2 \times 2}$ is also recovered (numerically, not just in expectation) when one drops $\beta_1$ and $\beta_2$ in favor of unit and time fixed effects, or when one collapses the panel to long differences.

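# Pull the DiD coefficient (D:Post, or D in the long-difference model) with a 95% CI
extract_did <- function(m, label, weighting) {
  co <- coef(m)
  ses <- se(m)  # fixest::se(); named `ses` so it does not shadow the function
  did_name <- if ("D:Post" %in% names(co)) "D:Post" else "D"
  tibble(spec = label, weighting = weighting,
         est = unname(co[did_name]),
         se = unname(ses[did_name]),
         lo95 = est - 1.96 * se, hi95 = est + 1.96 * se)
}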
twfe_tbl <- bind_rows(
extract_did(twfe_levels_unw, "Levels (D:Post)", "Unweighted"),
extract_did(twfe_fe_unw, "Two-way FE (D:Post)", "Unweighted"),
extract_did(twfe_long_unw, "Long difference", "Unweighted"),
extract_did(twfe_levels_wt, "Levels (D:Post)", "Population-weighted"),
extract_did(twfe_fe_wt, "Two-way FE (D:Post)", "Population-weighted"),
extract_did(twfe_long_wt, "Long difference", "Population-weighted")
)
print(twfe_tbl)

2x2 TWFE estimates:
spec weighting est se lo95 hi95
1 Levels (D:Post) Unweighted 0.122 3.75 -7.23 7.47
2 Two-way FE (D:Post) Unweighted 0.122 3.75 -7.22 7.47
3 Long difference Unweighted 0.122 3.75 -7.22 7.47
4 Levels (D:Post) Population-weighted -2.56 1.49 -5.48 0.358
5 Two-way FE (D:Post) Population-weighted -2.56 1.49 -5.48 0.357
6 Long difference Population-weighted -2.56 1.49 -5.48 0.357

The point estimates are numerically identical within each weighting regime: $0.122$ unweighted and $-2.563$ weighted, agreeing to three decimals across all three specifications. This is the manuscript’s algebraic Result 1 in action: “the estimate of $\beta^{2 \times 2}$ is numerically the same if the regression instead contains fixed effects for each unit (columns 2 and 5) or if one regresses outcome changes on a constant and the treatment group dummy” (line 234).

The standard errors are also indistinguishable across specifications — $3.75$ unweighted, $1.49$ weighted — but they differ sharply across weightings: the weighted SE is roughly $2.5\times$ tighter than the unweighted SE. The 95% confidence interval for the weighted estimate, $[-5.48, +0.36]$, narrowly fails to exclude zero; the unweighted CI, $[-7.23, +7.47]$, is far from rejecting the null.
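The interval arithmetic behind those statements is short enough to check by hand. A small sketch, with the estimates and standard errors typed in from the printed table (so they carry its rounding), and a normal critical value of 1.96 as in extract_did():

```r
# Estimates and standard errors copied from the 2x2 TWFE table above
est <- c(unweighted = 0.122, weighted = -2.563)
se  <- c(unweighted = 3.75,  weighted = 1.49)

z  <- est / se                                    # z-statistics
ci <- rbind(lo95 = est - 1.96 * se,
            hi95 = est + 1.96 * se)               # 95% confidence intervals
se_ratio <- se[["unweighted"]] / se[["weighted"]] # how much tighter weighting is

print(round(ci, 2))
print(round(z, 2))
print(round(se_ratio, 2))
```

Both z-statistics are below 1.96 in absolute value, which is the same statement as both intervals covering zero.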

The forest plot makes the point visually: within a weighting, the three rows are essentially superimposed; across weightings, the two color groups are clearly separated.

p2 <- ggplot(twfe_tbl, aes(x = est, y = spec, color = weighting)) +
geom_vline(xintercept = 0, color = TEXT_LIGHT, linetype = "dashed") +
geom_errorbar(aes(xmin = lo95, xmax = hi95), width = 0.18, linewidth = 0.9,
orientation = "y", position = position_dodge(width = 0.55)) +
geom_point(size = 3.4, position = position_dodge(width = 0.55)) +
scale_color_manual(values = c("Unweighted" = BLUE,
"Population-weighted" = ORANGE)) +
labs(title = "Three TWFE specifications, two weighting choices",
x = "DiD coefficient (deaths per 100,000)", y = NULL)
ggsave("r_did2_02_twfe_2x2.png", p2, width = 10, height = 5.5,
dpi = 300, bg = BG_DARK)

The lesson from this stage is structural. For the 2x2 design on a balanced panel, there is no methodological choice between Levels, TWFE, and Long Difference: they are the same estimator written three ways. The only substantive choice is whether to weight. Every later stage of the pipeline carries that lesson forward.

6. Covariate balance and propensity scores

Parallel trends is easier to defend when treated and control groups look similar at baseline. If 2014-expansion counties were already very different from never-expansion counties in 2013, the assumption that they would have shared the same counterfactual trend becomes harder to justify. We assess balance using two complementary tools: the normalized difference for each covariate, and the propensity score (the predicted probability of treatment given covariates).

The normalized difference is defined as

$$\text{Norm. Diff}_{\omega, X} = \frac{\bar{X}_{\omega, T} - \bar{X}_{\omega, C}}{\sqrt{(S_{\omega, T}^2 + S_{\omega, C}^2) / 2}}$$

In words, this is the difference of treated and control means divided by a pooled standard deviation (the square root of the average of the two within-group variances), all computed under weighting scheme $\omega$. The denominator scales the gap by the units' typical spread, so the metric is comparable across covariates with very different ranges (percentages versus dollars versus rates). The rule of thumb the manuscript adopts (line 275, following Imbens and Rubin 2015): values in excess of $0.25$ in absolute value indicate “potentially problematic imbalance.” In our code, $\bar{X}_{\omega, T}$ is mean_T (weighted or unweighted), $\bar{X}_{\omega, C}$ is mean_C, and the denominator combines var_T and var_C (with wtd_var() substituting for var() under weighting).
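A toy calculation makes the scaling property concrete. This is a minimal unweighted sketch; the helper name norm_diff() and the two three-element vectors are made up for illustration.

```r
# Normalized difference: mean gap divided by the square root of the average variance
norm_diff <- function(x_t, x_c) {
  (mean(x_t) - mean(x_c)) / sqrt((var(x_t) + var(x_c)) / 2)
}

x_t <- c(1, 2, 3)   # "treated": mean 2, variance 1
x_c <- c(2, 4, 6)   # "control": mean 4, variance 4

nd        <- norm_diff(x_t, x_c)                # (2 - 4) / sqrt(2.5)
nd_scaled <- norm_diff(1000 * x_t, 1000 * x_c)  # same covariate in different units

print(c(original = nd, rescaled = nd_scaled))
```

Rescaling the covariate (say, dollars instead of thousands of dollars) leaves the normalized difference unchanged, which is why one threshold can be applied to every row of the balance table.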

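# Weighted variance with a frequency-weight style denominator, sum(w) - 1
wtd_var <- function(x, w) {
  ok <- !is.na(x + w)  # drop observations with a missing value or weight
  x <- x[ok]
  w <- w[ok]
  xbar <- weighted.mean(x, w)
  sum(w * (x - xbar)^2) / (sum(w) - 1)
}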
balance_unw <- short_data %>% filter(year == 2013) %>%
pivot_longer(all_of(covs), names_to = "variable", values_to = "value") %>%
group_by(variable, D) %>%
summarise(mean = mean(value), var = var(value), .groups = "drop") %>%
pivot_wider(names_from = D, values_from = c(mean, var)) %>%
mutate(weighting = "Unweighted",
norm_diff = (mean_1 - mean_0) / sqrt((var_1 + var_0) / 2))
balance_wt <- short_data %>% filter(year == 2013) %>%
pivot_longer(all_of(covs), names_to = "variable", values_to = "value") %>%
group_by(variable, D) %>%
summarise(mean = weighted.mean(value, set_wt),
var = wtd_var(value, set_wt), .groups = "drop") %>%
pivot_wider(names_from = D, values_from = c(mean, var)) %>%
mutate(weighting = "Population-weighted",
norm_diff = (mean_1 - mean_0) / sqrt((var_1 + var_0) / 2))

The full balance table (table_covariate_balance.csv) reports the six covariates under each weighting. The point estimates of the means and their normalized differences are:

weighting	variable	mean_C (never)	mean_T (2014)	norm_diff
Unweighted	median_income	43.04	47.97	$+0.427$
Unweighted	perc_female	49.43	49.33	$-0.034$
Unweighted	perc_hispanic	9.64	8.23	$-0.105$
Unweighted	perc_white	81.64	90.48	$+0.586$
Unweighted	poverty_rate	19.28	16.53	$-0.423$
Unweighted	unemp_rate	7.61	8.01	$+0.157$
Population-weighted	median_income	49.31	57.86	$+0.685$
Population-weighted	perc_female	50.48	50.07	$-0.238$
Population-weighted	perc_hispanic	17.01	18.86	$+0.107$
Population-weighted	perc_white	77.91	79.54	$+0.115$
Population-weighted	poverty_rate	17.24	15.29	$-0.375$
Population-weighted	unemp_rate	7.00	8.01	$+0.503$

Six of twelve cells exceed the $\pm 0.25$ threshold in absolute value: under equal weighting, expansion counties are notably whiter (perc_white = $+0.586$), richer (median_income = $+0.427$), and less impoverished (poverty_rate = $-0.423$) than never-expansion counties; under population weighting, the gap shifts toward unemployment and income (unemp_rate = $+0.503$, median_income = $+0.685$). The manuscript flags this pattern at line 279 (Table tab:cov_balance): “Expansion counties in 2013 were whiter and had a higher unemployment rate despite lower poverty and higher median income.” Imbalance does not invalidate parallel trends, but it makes the unconditional parallel-trends assumption harder to swallow on its own — which motivates the covariate-adjusted estimators in Section 7.
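The count can be reproduced directly from the table. A small sketch, with the normalized differences typed in by hand from the rows above (the data frame name balance is just for illustration):

```r
# Normalized differences copied from the balance table above
balance <- data.frame(
  weighting = rep(c("Unweighted", "Population-weighted"), each = 6),
  variable  = rep(c("median_income", "perc_female", "perc_hispanic",
                    "perc_white", "poverty_rate", "unemp_rate"), times = 2),
  norm_diff = c( 0.427, -0.034, -0.105, 0.586, -0.423, 0.157,
                 0.685, -0.238,  0.107, 0.115, -0.375, 0.503)
)

# Flag cells beyond the 0.25 rule of thumb
balance$flag <- abs(balance$norm_diff) > 0.25
flagged <- balance[balance$flag, c("weighting", "variable", "norm_diff")]

print(flagged)
print(table(balance$weighting, balance$flag))
```

Each weighting contributes three flagged covariates, but only median_income and poverty_rate are flagged under both.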

The propensity score summarizes all six covariates in a single number: the predicted probability of being a 2014-expansion county given the 2013 covariates. We fit a logit for $P(D = 1 \mid X)$ under each weighting using fixest::feglm().

ps_form <- as.formula(paste("D ~", paste(covs, collapse = " + ")))
ps_unw <- feglm(ps_form, data = short_data %>% filter(year == 2013),
family = "binomial", vcov = "hetero")
ps_wt <- feglm(ps_form, data = short_data %>% filter(year == 2013),
family = "binomial", vcov = "hetero", weights = ~set_wt)

The propensity-score logit estimates (table_propensity_models.csv) corroborate the normalized-difference picture: every covariate except poverty_rate (unweighted) is significant at the 5% level, and the unemployment-rate coefficient under weighting is striking ($+0.680$, $p = 1.2 \times 10^{-15}$). To assess overlap — whether treated and control units occupy the same propensity-score region, a precondition for credible IPW — we plot the density of predicted probabilities by group, faceted by weighting.

ps_plot_df <- bind_rows(
short_data %>% filter(year == 2013) %>%
mutate(p = predict(ps_unw, ., type = "response"),
wt_use = 1, weighting = "Unweighted"),
short_data %>% filter(year == 2013) %>%
mutate(p = predict(ps_wt, ., type = "response"),
wt_use = set_wt, weighting = "Population-weighted")
) %>%
mutate(group = if_else(D == 1, "Expansion", "Non-expansion"),
weighting = factor(weighting,
levels = c("Unweighted", "Population-weighted")))
p3 <- ggplot(ps_plot_df, aes(x = p, fill = group, weight = wt_use)) +
geom_density(alpha = 0.55, color = NA, adjust = 1.2) +
scale_fill_manual(values = c("Expansion" = ORANGE,
"Non-expansion" = BLUE)) +
facet_wrap(~ weighting) +
labs(title = "Propensity-score overlap, by weighting",
x = "Estimated propensity score", y = "Density")
ggsave("r_did2_03_propensity.png", p3, width = 10, height = 5.5,
dpi = 300, bg = BG_DARK)

Under equal weighting the two density curves overlap substantially, with treated and control units occupying similar regions of propensity-score space. Under population weighting the picture is markedly worse: treated counties pile up near a propensity of $0.85$ while non-expansion counties spread bimodally across the full range. Weighting amplifies imbalance because very large counties in California (an expansion state) and Texas (a non-expansion state) pull the conditional means apart. This is the reason covariate adjustment becomes more consequential under population weighting, and why the next section computes three different covariate-adjusted estimators rather than picking one.
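As a numeric complement to the density plot, overlap can be summarized by the range of estimated propensity scores that both groups share. A minimal sketch on simulated data, assuming a single hypothetical covariate x and base R's glm() in place of feglm():

```r
# Simulated treatment assignment driven by one covariate (illustrative only)
set.seed(2)
n <- 1000
x <- rnorm(n)
D <- rbinom(n, 1, plogis(0.8 * x))

# Logit propensity score and fitted probabilities
ps_fit <- glm(D ~ x, family = binomial())
p_hat  <- predict(ps_fit, type = "response")

# Region of common support: from the larger group minimum to the smaller group maximum
grp_min <- tapply(p_hat, D, min)
grp_max <- tapply(p_hat, D, max)
common  <- c(lower = max(grp_min), upper = min(grp_max))

# Share of all units whose score falls inside the common region
share_in_common <- mean(p_hat >= common[["lower"]] & p_hat <= common[["upper"]])

print(round(common, 3))
print(round(share_in_common, 3))
```

A share close to one is reassuring. A low share, or a common region that excludes most treated units, is the numeric counterpart of densities that barely touch.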

7. Covariate-adjusted 2x2 — OR, IPW, and DRDID

The Section 4 cell-means estimate assumed unconditional parallel trends: treated and control counties would have moved together absent expansion, full stop. The imbalance documented in Section 6 makes that assumption brittle. The fix is conditional parallel trends: treated and control counties with similar covariate values would have moved together. Three estimators implement this fix, each leaning on a different model:

Outcome regression (OR). Fit a model for $Y_{i, t}(0)$ on the control group as a function of covariates; predict counterfactuals for the treated group; subtract from observed outcomes.
Inverse propensity weighting (IPW). Reweight the control group so its covariate distribution matches the treated group’s; compute the DiD on the reweighted sample.
Doubly robust DiD (DRDID). Combine OR and IPW in a way that yields a consistent estimate if either model is correctly specified.
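The first two estimators can be written out by hand in a few lines. The sketch below uses simulated long differences in which the untreated trend depends on a covariate, so unconditional parallel trends fails by construction. The data-generating process, the true effect tau = -3, and all object names are assumptions for illustration, and this is not the code path att_gt() runs internally.

```r
# Simulated long differences: the untreated trend depends on x, and so does treatment
set.seed(3)
n   <- 5000
x   <- rnorm(n)
D   <- rbinom(n, 1, plogis(-0.5 + x))
tau <- -3
dy  <- 1 + 2 * x + tau * D + rnorm(n)   # dy plays the role of Y_2014 - Y_2013

# Unadjusted DiD: biased because treated units have higher x
naive <- mean(dy[D == 1]) - mean(dy[D == 0])

# Outcome regression: model the untreated change on controls, predict for treated
or_fit <- lm(dy ~ x, subset = D == 0)
mu0    <- predict(or_fit, newdata = data.frame(x = x))
att_or <- mean(dy[D == 1] - mu0[D == 1])

# Inverse propensity weighting: reweight controls by the odds p / (1 - p)
p       <- fitted(glm(D ~ x, family = binomial()))
odds    <- p / (1 - p)
att_ipw <- mean(dy[D == 1]) - weighted.mean(dy[D == 0], odds[D == 0])

print(round(c(naive = naive, OR = att_or, IPW = att_ipw), 2))
```

Both adjusted estimates land near -3 while the unadjusted one does not, because here both working models are correctly specified. The doubly robust estimator matters when only one of them is.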

All three are implemented in the did package via the est_method argument to att_gt(). We wrap them in a single helper that toggles weighting on or off.

data_cs_2x2 <- short_data %>%
mutate(treat_year_cs = if_else(D == 1, 2014, 0),
id_num = as.numeric(county_code)) %>%
select(id_num, year, crude_rate_20_64, treat_year_cs, set_wt, all_of(covs))
xformla <- as.formula(paste("~", paste(covs, collapse = " + ")))
cs_one <- function(method, weighted) {
  # weightsname = NULL (the att_gt() default) gives the unweighted estimate
  res <- did::att_gt(yname = "crude_rate_20_64", tname = "year",
                     idname = "id_num", gname = "treat_year_cs",
                     xformla = xformla, data = data_cs_2x2, panel = TRUE,
                     control_group = "nevertreated",
                     base_period = "universal",
                     bstrap = TRUE, est_method = method, biters = BITERS,
                     weightsname = if (weighted) "set_wt" else NULL)
  # Aggregate the post-treatment ATT(g, t) cells into one overall ATT
  agg <- suppressMessages(aggte(res, type = "simple", na.rm = TRUE))
  tibble(method = method,
         weighting = if (weighted) "Population-weighted" else "Unweighted",
         est = agg$overall.att, se = agg$overall.se)
}
cs_2x2_tbl <- bind_rows(
cs_one("reg", FALSE), cs_one("reg", TRUE),
cs_one("ipw", FALSE), cs_one("ipw", TRUE),
cs_one("dr", FALSE), cs_one("dr", TRUE)
)

The doubly robust DRDID estimator (Sant’Anna and Zhao 2020) takes the form

$$\widehat{\text{ATT}}_{\text{DR}} = \frac{1}{n} \sum_{i = 1}^{n} \Big( \hat{w}_{D = 1}(D_i) - \hat{w}_{D = 0}(D_i, X_i) \Big) \Big( \Delta Y_{i} - \hat{\mu}_{\Delta, D = 0}(X_i) \Big)$$

In words, each county contributes a weighted residual: the weight $\hat{w}_{D = 1} - \hat{w}_{D = 0}$ depends on its treatment status and (through the propensity score) its covariates, while the residual $\Delta Y_i - \hat{\mu}_{\Delta, D = 0}(X_i)$ measures how much that county’s 2013-to-2014 change differed from what the outcome regression predicted for an untreated unit with the same covariates. In our code, $\Delta Y_i$ is the long-difference outcome (crude_rate_20_64 in 2014 minus 2013), $\hat{\mu}_{\Delta, D = 0}(X_i)$ comes from the OR step under the hood of att_gt(est_method = "dr"), and the propensity weights $\hat{w}$ come from a logit on the same covariates as the one in Section 6, which att_gt() re-estimates internally rather than reusing ps_unw or ps_wt. The “double” in doubly robust is that the estimator stays consistent if either the OR or the propensity model is correctly specified; it does not require both. The manuscript states the formula at line 446 (eqn:ATT_DR_estimator).
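The formula can be coded almost symbol for symbol. The sketch below assumes the normalized weights $\hat{w}_{D=1} = D / \bar{D}$ and $\hat{w}_{D=0} \propto \hat{p}(X)(1 - D) / (1 - \hat{p}(X))$, rescaled to average one, runs on simulated data with a true effect of -3, and deliberately breaks one working model at a time. The helper dr_att() and all other names are hypothetical, and this is a hand-rolled illustration, not the implementation inside the did package.

```r
# Simulated long differences with a covariate-driven trend (illustrative only)
set.seed(4)
n   <- 5000
x   <- rnorm(n)
D   <- rbinom(n, 1, plogis(-0.5 + x))
tau <- -3
dy  <- 1 + 2 * x + tau * D + rnorm(n)

# Doubly robust ATT: weighted average of outcome-regression residuals
dr_att <- function(dy, D, p, mu0) {
  w1 <- D / mean(D)                 # weight on treated units
  w0 <- p * (1 - D) / (1 - p)       # odds weight on control units
  w0 <- w0 / mean(w0)               # normalize so the weights average to one
  mean((w1 - w0) * (dy - mu0))
}

# Correctly specified working models
p_ok   <- fitted(glm(D ~ x, family = binomial()))
mu0_ok <- predict(lm(dy ~ x, subset = D == 0), newdata = data.frame(x = x))

# Deliberately misspecified versions that ignore x
p_bad   <- rep(mean(D), n)
mu0_bad <- rep(mean(dy[D == 0]), n)

att_both   <- dr_att(dy, D, p_ok,  mu0_ok)   # both models right
att_bad_or <- dr_att(dy, D, p_ok,  mu0_bad)  # outcome model wrong
att_bad_ps <- dr_att(dy, D, p_bad, mu0_ok)   # propensity model wrong

print(round(c(both = att_both, bad_or = att_bad_or, bad_ps = att_bad_ps), 2))
```

All three versions recover an estimate close to -3. Breaking both models at once, dr_att(dy, D, p_bad, mu0_bad), collapses to the unadjusted difference in mean changes and loses that protection.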

2x2 covariate-adjusted estimates:
method weighting est se method_label lo95 hi95
1 reg Unweighted -1.62 4.66 Outcome regression (OR) -10.7 7.51
2 reg Population-weighted -3.46 2.29 Outcome regression (OR) -7.95 1.03
3 ipw Unweighted -0.859 4.84 Inverse propensity weigh… -10.3 8.62
4 ipw Population-weighted -3.84 3.19 Inverse propensity weigh… -10.1 2.42
5 dr Unweighted -1.23 5.05 Doubly robust (DRDID) -11.1 8.68
6 dr Population-weighted -3.76 3.29 Doubly robust (DRDID) -10.2 2.69

method	weighting	est	se	95% CI
Outcome regression (OR)	Unweighted	$-1.615$	4.66	$[-10.74, +7.51]$
Outcome regression (OR)	Population-weighted	$-3.459$	2.29	$[-7.95, +1.03]$
Inverse propensity weighting (IPW)	Unweighted	$-0.859$	4.84	$[-10.34, +8.62]$
Inverse propensity weighting (IPW)	Population-weighted	$-3.842$	3.19	$[-10.10, +2.42]$
Doubly robust (DRDID)	Unweighted	$-1.226$	5.05	$[-11.13, +8.68]$
Doubly robust (DRDID)	Population-weighted	$-3.756$	3.29	$[-10.20, +2.69]$

The forest plot, combining the three covariate-adjusted estimators with the no-covariates TWFE long-difference baseline from Section 5, makes the comparison visual.

forest_df <- bind_rows(
twfe_tbl %>% filter(spec == "Long difference") %>%
transmute(method_label = "TWFE long diff (no covs)",
weighting, est, se, lo95, hi95),
cs_2x2_tbl %>%
mutate(method_label = recode(method,
reg = "Outcome regression (OR)",
ipw = "Inverse propensity weighting (IPW)",
dr = "Doubly robust (DRDID)"),
lo95 = est - 1.96 * se, hi95 = est + 1.96 * se) %>%
select(method_label, weighting, est, se, lo95, hi95)
)
p4 <- ggplot(forest_df, aes(x = est, y = method_label, color = weighting)) +
geom_vline(xintercept = 0, color = TEXT_LIGHT, linetype = "dashed") +
geom_errorbar(aes(xmin = lo95, xmax = hi95), width = 0.2, linewidth = 0.9,
orientation = "y", position = position_dodge(width = 0.55)) +
geom_point(size = 3.3, position = position_dodge(width = 0.55)) +
scale_color_manual(values = c("Unweighted" = BLUE,
"Population-weighted" = ORANGE)) +
labs(title = "Covariate-adjusted 2x2 estimates",
x = "ATT(2014) (deaths per 100,000)", y = NULL)
ggsave("r_did2_04_drdid_forest.png", p4, width = 11, height = 5.5,
dpi = 300, bg = BG_DARK)

Covariate adjustment moves the unweighted point estimate from $+0.122$ (cell means) down to $-1.226$ (DRDID), and shifts the weighted estimate from $-2.563$ to $-3.756$. The unweighted-to-weighted gap remains roughly $2.5$ deaths per 100,000 — larger than the gap between estimators within each weighting (which is at most $0.8$ deaths per 100,000). The manuscript notes (line 425) that “the weighted IPW estimate is almost twice as large as the RA [regression-adjustment] estimate, despite neither being statistically significant”; we see a similar but smaller divergence ($-3.84$ vs $-3.46$, a $1.1\times$ ratio), with DRDID landing between them. Crucially, none of the six 95% confidence intervals excludes zero. Covariate adjustment matters for interpretation (the unweighted point estimate is now a small negative rather than a small positive), but it does not buy statistical significance — and the weighting choice still dwarfs the methodological choice.

A note on causal language: covariate adjustment is being deployed here for confounding control under observational identification. This is not a randomized experiment with precision-improving covariates; the covariates change the identifying assumption (from unconditional to conditional parallel trends), while the target parameter remains the ATT for expansion counties. The manuscript discusses this at lines 264–449.

8. The 2xT event study — 2014 expanders vs never-expanders

The 2x2 design throws away nine of our eleven years. A dynamic event study estimates an ATT for every year relative to expansion, treating $e = -1$ (the year before treatment) as the omitted baseline. The leads ($e \leq -2$) double as a placebo test for parallel trends: if the assumption holds, they should hover around zero. The lags ($e \geq 0$) trace out how the effect evolves over time. We restrict the panel to 2014-expanders and never-treated counties (still no staggered cohorts yet) and use did::att_gt() with est_method = "dr", then aggregate to event time via aggte(type = "dynamic").

data_2xt <- df_prep %>%
filter(treat_year %in% c(0, 2014)) %>%
mutate(id_num = as.numeric(county_code)) %>%
select(id_num, year, crude_rate_20_64, treat_year, set_wt, all_of(covs))
att_2xt_unw <- att_gt(yname = "crude_rate_20_64", tname = "year",
idname = "id_num", gname = "treat_year",
xformla = xformla, data = data_2xt, panel = TRUE,
control_group = "nevertreated",
base_period = "universal",
bstrap = TRUE, est_method = "dr", biters = BITERS)
att_2xt_wt <- att_gt(yname = "crude_rate_20_64", tname = "year",
idname = "id_num", gname = "treat_year",
xformla = xformla, data = data_2xt, panel = TRUE,
control_group = "nevertreated",
base_period = "universal",
bstrap = TRUE, est_method = "dr",
weightsname = "set_wt", biters = BITERS)
es_2xt_unw <- aggte(att_2xt_unw, type = "dynamic", na.rm = TRUE)
es_2xt_wt <- aggte(att_2xt_wt, type = "dynamic", na.rm = TRUE)
event_2xt_tbl <- bind_rows(
tibble(e = es_2xt_unw$egt, est = es_2xt_unw$att.egt,
se = es_2xt_unw$se.egt, weighting = "Unweighted"),
tibble(e = es_2xt_wt$egt, est = es_2xt_wt$att.egt,
se = es_2xt_wt$se.egt, weighting = "Population-weighted")
)

2xT event study (ATT(e)):
e est se weighting lo95 hi95
1 -5 8.48 4.35 Unweighted -0.0363 17.0
2 -4 1.69 4.18 Unweighted -6.51 9.88
3 -3 3.84 4.30 Unweighted -4.59 12.3
4 -2 7.33 5.32 Unweighted -3.10 17.8
5 -1 0 NA Unweighted NA NA
6 0 -1.23 5.00 Unweighted -11.0 8.58
7 1 5.36 4.90 Unweighted -4.24 15.0
8 2 12.2 4.76 Unweighted 2.90 21.6
9 3 13.5 5.19 Unweighted 3.38 23.7
10 4 9.69 5.65 Unweighted -1.38 20.8

The full 22-row panel covers $e = -5$ to $+5$ for each weighting:

e	est (unw)	se (unw)	95% CI (unw)	est (wt)	se (wt)	95% CI (wt)
$-5$	$+8.48$	4.35	$[-0.04, +17.00]$	$+1.75$	3.33	$[-4.78, +8.27]$
$-4$	$+1.69$	4.18	$[-6.51, +9.88]$	$+0.34$	3.28	$[-6.09, +6.77]$
$-3$	$+3.84$	4.30	$[-4.59, +12.26]$	$+2.87$	2.92	$[-2.84, +8.59]$
$-2$	$+7.33$	5.32	$[-3.10, +17.76]$	$+1.51$	4.51	$[-7.33, +10.35]$
$-1$	$0$	–	(reference)	$0$	–	(reference)
$0$	$-1.23$	5.00	$[-11.03, +8.58]$	$-3.76$	3.14	$[-9.91, +2.40]$
$+1$	$+5.36$	4.90	$[-4.24, +14.96]$	$-1.31$	4.78	$[-10.68, +8.05]$
$+2$	$+12.24$	4.76	$[+2.90, +21.57]$	$+3.28$	4.02	$[-4.60, +11.16]$
$+3$	$+13.54$	5.19	$[+3.38, +23.71]$	$-4.71$	5.41	$[-15.31, +5.89]$
$+4$	$+9.69$	5.65	$[-1.38, +20.76]$	$-0.08$	5.29	$[-10.46, +10.29]$
$+5$	$+16.96$	5.17	$[+6.83, +27.09]$	$+2.48$	5.73	$[-8.75, +13.70]$

p5 <- ggplot(event_2xt_tbl, aes(x = e, y = est,
color = weighting, fill = weighting)) +
geom_hline(yintercept = 0, color = TEXT_LIGHT, linetype = "dashed") +
geom_vline(xintercept = -0.5, color = ORANGE, linetype = "dotted") +
geom_ribbon(aes(ymin = est - 1.96 * se, ymax = est + 1.96 * se),
alpha = 0.18, color = NA, na.rm = TRUE) +
geom_line(linewidth = 1.1) +
geom_point(size = 2.6) +
scale_color_manual(values = c("Unweighted" = BLUE,
"Population-weighted" = ORANGE),
aesthetics = c("color", "fill")) +
labs(title = "Event study: 2014 expanders vs never-expanders",
x = "Years since Medicaid expansion (e)",
y = "ATT(e) (deaths per 100,000)")
ggsave("r_did2_05_event_2xT.png", p5, width = 11, height = 5.5,
dpi = 300, bg = BG_DARK)

The leads tell a more nuanced parallel-trends story than the 2x2 could. The unweighted leads at $e = -5$ and $e = -2$ are $+8.48$ and $+7.33$ — both visibly above zero, with the $e = -5$ CI narrowly straddling zero ($[-0.04, +17.00]$). The weighted leads are markedly flatter, ranging from $+0.34$ to $+2.87$ across the same window. After expansion, the trajectories diverge sharply: unweighted ATT(e) climbs from $-1.23$ at $e = 0$ to $+16.96$ at $e = 5$ — a 95% CI of $[+6.83, +27.09]$ that excludes zero — while weighted ATT(e) wanders between $-4.71$ and $+3.28$ with every CI overlapping zero. The dynamic-aggregated ATT averaged over $e \geq 0$ is $+9.43$ unweighted versus $-0.68$ weighted, a 10-death gap that is wider than the 2x2’s $2.7$-death gap.
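The two aggregate numbers follow from the table by simple averaging. A quick check, with the post-period ATT(e) values typed in from the rows above and assuming the overall dynamic ATT is the unweighted mean of ATT(e) over $e \geq 0$:

```r
# Post-period ATT(e), e = 0..5, copied from the 2xT event-study table
post_unw <- c(-1.23,  5.36, 12.24, 13.54,  9.69, 16.96)
post_wt  <- c(-3.76, -1.31,  3.28, -4.71, -0.08,  2.48)

agg_unw <- mean(post_unw)      # dynamic-aggregated ATT, unweighted
agg_wt  <- mean(post_wt)       # dynamic-aggregated ATT, population-weighted
gap     <- agg_unw - agg_wt    # weighting gap in deaths per 100,000

print(round(c(unweighted = agg_unw, weighted = agg_wt, gap = gap), 2))
```

Up to the rounding in the table, this reproduces the $+9.43$, $-0.68$, and roughly 10-death gap quoted above.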

The manuscript’s fig:2XT_ES (line 535) reports the population-weighted version and concludes “the point estimates do not suggest large mortality effects from Medicaid expansion among expansion counties.” That conclusion follows from the weighted view; the unweighted view tells a strikingly different story. The 2xT design’s identifying assumption — parallel trends in every post-period, manuscript Assumption ass:parallel-trends-ES at line 518 — looks more credible under population weighting in this application, both because the pre-period leads are flatter and because the implied trend across the post-period is more stable.

9. The full GxT staggered design — all four cohorts

The 2xT design used only the 2014 cohort. To use all the variation in expansion timing, we need the Callaway-Sant’Anna $\text{ATT}(g, t)$ framework. Define $G_i$ as the year unit $i$ first expanded (or $\infty$ for never-expanders). The group-time ATT is

$$\text{ATT}(g, t) = \mathbb{E}_\omega \big[ Y_{i, t}(g) - Y_{i, t}(\infty) \mid G_i = g \big]$$

In words, this is the average treatment effect of starting treatment in year $g$ (relative to never starting) at calendar time $t$, restricted to units whose actual treatment year is $g$. The estimand exists separately for every cohort-year cell; aggregation comes later. The identifying assumption is parallel trends with respect to the never-treated group: $\mathbb{E}_\omega[Y_{i, t}(\infty) - Y_{i, t - 1}(\infty) \mid G_i = g] = \mathbb{E}_\omega[Y_{i, t}(\infty) - Y_{i, t - 1}(\infty) \mid G_i = \infty]$, for every cohort $g$ and every period $t$ (manuscript Assumption ass:gt-parallel-trends-never, line 642).
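Under that assumption, plus no anticipation (an assumption not spelled out above: before year $g$, cohort $g$'s observed outcomes equal their never-treated potential outcomes), summing the period-by-period condition from $g$ to $t$ gives a difference-in-differences expression for each post-treatment cell. A sketch of the unconditional version, without covariates:

$$\text{ATT}(g, t) = \mathbb{E}_\omega \big[ Y_{i, t} - Y_{i, g - 1} \mid G_i = g \big] - \mathbb{E}_\omega \big[ Y_{i, t} - Y_{i, g - 1} \mid G_i = \infty \big], \qquad t \geq g$$

Each cell is therefore its own 2x2 comparison: the change since the last pre-treatment year $g - 1$ for cohort $g$, minus the same change for never-expanders. The covariate-adjusted estimates in the code replace these raw mean changes with their doubly robust counterparts.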

data_gxt <- df_prep %>%
mutate(id_num = as.numeric(county_code)) %>%
select(id_num, year, crude_rate_20_64, treat_year, set_wt, all_of(covs))
att_gxt_unw <- att_gt(yname = "crude_rate_20_64", tname = "year",
idname = "id_num", gname = "treat_year",
xformla = xformla, data = data_gxt, panel = TRUE,
control_group = "nevertreated",
base_period = "universal",
bstrap = TRUE, est_method = "dr", biters = BITERS)
att_gxt_wt <- att_gt(yname = "crude_rate_20_64", tname = "year",
idname = "id_num", gname = "treat_year",
xformla = xformla, data = data_gxt, panel = TRUE,
control_group = "nevertreated",
base_period = "universal",
bstrap = TRUE, est_method = "dr",
weightsname = "set_wt", biters = BITERS)

The raw output contains an $\text{ATT}(g, t)$ for each of $4 \times 11 = 44$ cohort-year cells, times two weightings — 88 values in total, stored in table_attgt_gxt.csv. To extract a comprehensible summary, we aggregate two ways. First, by cohort: average each cohort’s post-treatment ATT(g, t) values into one ATT per cohort. Second, by event time: pool across cohorts and produce one ATT(e) per event time, the same shape as the 2xT event study but using every cohort’s variation.

9a. By-cohort ATT(g)

agg_grp_unw <- aggte(att_gxt_unw, type = "group", na.rm = TRUE)
agg_grp_wt <- aggte(att_gxt_wt, type = "group", na.rm = TRUE)
grp_tbl <- bind_rows(
tibble(group = agg_grp_unw$egt, est = agg_grp_unw$att.egt,
se = agg_grp_unw$se.egt, weighting = "Unweighted"),
tibble(group = agg_grp_wt$egt, est = agg_grp_wt$att.egt,
se = agg_grp_wt$se.egt, weighting = "Population-weighted")
)
print(grp_tbl)

Group-specific ATT(g) (averaged over post periods):
group est se weighting lo95 hi95
1 2014 9.43 3.84 Unweighted 1.90 17.0
2 2015 4.94 5.90 Unweighted -6.61 16.5
3 2016 -17.3 11.0 Unweighted -38.9 4.24
4 2019 3.48 8.85 Unweighted -13.9 20.8
5 2014 -0.684 3.78 Population-weighted -8.09 6.73
6 2015 10.0 2.92 Population-weighted 4.31 15.8
7 2016 -12.6 6.18 Population-weighted -24.7 -0.451
8 2019 3.31 4.46 Population-weighted -5.44 12.1

cohort g	est (unw)	95% CI (unw)	est (wt)	95% CI (wt)
2014	$+9.43$	$[+1.90, +16.96]$	$-0.68$	$[-8.09, +6.73]$
2015	$+4.94$	$[-6.61, +16.50]$	$+10.04$	$[+4.31, +15.77]$
2016	$-17.31$	$[-38.85, +4.24]$	$-12.57$	$[-24.68, -0.45]$
2019	$+3.48$	$[-13.88, +20.83]$	$+3.31$	$[-5.44, +12.06]$

p6 <- ggplot(grp_tbl %>%
mutate(lo95 = est - 1.96 * se, hi95 = est + 1.96 * se),
aes(x = factor(group), y = est, fill = weighting)) +
geom_hline(yintercept = 0, color = TEXT_LIGHT, linetype = "dashed") +
geom_col(position = position_dodge(width = 0.7), width = 0.6, alpha = 0.9) +
geom_errorbar(aes(ymin = lo95, ymax = hi95),
position = position_dodge(width = 0.7), width = 0.18,
color = TEXT_WHITE) +
scale_fill_manual(values = c("Unweighted" = BLUE,
"Population-weighted" = ORANGE)) +
labs(title = "By-cohort ATT(g), Callaway-Sant'Anna staggered design",
x = "Expansion cohort (year)", y = "ATT(g) (deaths per 100,000)")
ggsave("r_did2_06_attgt_groups.png", p6, width = 10, height = 5.5,
dpi = 300, bg = BG_DARK)

The four cohorts show four distinct patterns. The 2014 cohort flips sign with weighting — unweighted $+9.43$ (95% CI $[+1.90, +16.96]$, significant) versus weighted $-0.68$ (95% CI $[-8.09, +6.73]$, not significant). The manuscript’s verdict on this cohort (line 723) is “Medicaid did not lead to significant changes in adult mortality rates,” which agrees with the weighted result and disagrees with the unweighted one. The 2015 cohort agrees in sign across weights but grows under weighting — $+4.94 \to +10.04$, the latter significant ($[+4.31, +15.77]$). The 2016 cohort agrees in sign under both weightings and is the only cohort whose weighted CI excludes zero in the negative direction ($-12.57$, CI $[-24.68, -0.45]$), but it is based on only 93 counties carrying 2% of the panel’s adult population. The 2019 cohort has only one post-period of data and unsurprisingly produces a noisy estimate ($+3.48 \pm 8.85$ unweighted, $+3.31 \pm 4.46$ weighted) with wide CIs in both weights.

The manuscript explicitly cautions (line 725) that “the 2015, 2016, and 2019 expansion groups are relatively small … analyzing these groups separately may be ‘too noisy.’” That caveat is doing real work here: the 2016 cohort’s negative weighted estimate is the only thing keeping the cohort-level story from being a flat “no mortality reduction,” and the significant positive weighted estimate for 2015 also comes from one of these small groups.

9b. Dynamic event-study aggregation

Aggregating the same $\text{ATT}(g, t)$ cells across cohorts (rather than across time within a cohort) produces an event-study analog to Section 8, but now pooled across all four expansion cohorts. Event time $e$ ranges from $-10$ (the small 2019 cohort has the longest pre-history) to $+5$.
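Which cohorts can contribute to each event time is pure bookkeeping, and it matters for reading the far leads. A small sketch in base R, where the calendar years 2009 to 2019 are inferred from the eleven-year panel and the event-time range reported above:

```r
cohorts <- c(2014, 2015, 2016, 2019)   # expansion years g
years   <- 2009:2019                   # calendar years t (inferred panel span)

cells   <- expand.grid(g = cohorts, t = years)
cells$e <- cells$t - cells$g           # event time e = t - g

n_cells <- nrow(cells)                 # cohort-year cells per weighting
e_range <- range(cells$e)              # span of event time

# Cohorts observed at each event time
contrib <- sapply(split(cells$g, cells$e),
                  function(g) paste(sort(g), collapse = ", "))
print(contrib)
```

Event times $-10$ through $-8$ are populated by the 2019 cohort alone and $e = 5$ by the 2014 cohort alone, so the two ends of the pooled event study are really single-cohort estimates.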

es_gxt_unw <- aggte(att_gxt_unw, type = "dynamic", na.rm = TRUE)
es_gxt_wt <- aggte(att_gxt_wt, type = "dynamic", na.rm = TRUE)
event_gxt_tbl <- bind_rows(
tibble(e = es_gxt_unw$egt, est = es_gxt_unw$att.egt,
se = es_gxt_unw$se.egt, weighting = "Unweighted"),
tibble(e = es_gxt_wt$egt, est = es_gxt_wt$att.egt,
se = es_gxt_wt$se.egt, weighting = "Population-weighted")
)

GxT dynamic event study:
e est se weighting lo95 hi95
1 -10 -23.5 10.1 Unweighted -43.4 -3.65
2 -9 -25.1 9.47 Unweighted -43.7 -6.55
3 -8 -12.8 10.6 Unweighted -33.5 7.92
4 -7 -0.341 8.25 Unweighted -16.5 15.8
5 -6 -1.27 7.96 Unweighted -16.9 14.3
6 -5 6.13 3.56 Unweighted -0.836 13.1
7 -4 2.01 3.27 Unweighted -4.39 8.41
8 -3 4.04 3.34 Unweighted -2.51 10.6
9 -2 5.62 3.84 Unweighted -1.91 13.1
10 -1 0 NA Unweighted NA NA

A condensed table of the GxT event-study ATT(e):

e	est (unw)	se (unw)	est (wt)	se (wt)
$-10$	$-23.54$	10.15	$-15.35$	8.28
$-9$	$-25.11$	9.47	$-25.79$	8.19
$-8$	$-12.81$	10.58	$-17.26$	8.33
$-7$	$-0.34$	8.25	$-3.60$	6.78
$-6$	$-1.27$	7.96	$+2.87$	7.34
$-5$	$+6.13$	3.56	$+0.75$	2.93
$-4$	$+2.01$	3.27	$+1.01$	2.74
$-3$	$+4.04$	3.34	$+2.82$	2.52
$-2$	$+5.62$	3.84	$+1.92$	3.73
$-1$	$0$	–	$0$	–
$0$	$-0.45$	3.72	$-2.65$	2.62
$+1$	$+3.91$	3.97	$+0.23$	3.89
$+2$	$+8.60$	3.85	$+4.49$	3.68
$+3$	$+9.20$	4.20	$-3.74$	4.75
$+4$	$+9.28$	4.89	$+0.79$	4.70
$+5$	$+16.96$	5.31	$+2.48$	5.88

p7 <- ggplot(event_gxt_tbl, aes(x = e, y = est,
color = weighting, fill = weighting)) +
geom_hline(yintercept = 0, color = TEXT_LIGHT, linetype = "dashed") +
geom_vline(xintercept = -0.5, color = ORANGE, linetype = "dotted") +
geom_ribbon(aes(ymin = est - 1.96 * se, ymax = est + 1.96 * se),
alpha = 0.18, color = NA, na.rm = TRUE) +
geom_line(linewidth = 1.1) +
geom_point(size = 2.6) +
scale_color_manual(values = c("Unweighted" = BLUE,
"Population-weighted" = ORANGE),
aesthetics = c("color", "fill")) +
labs(title = "GxT event study: all expansion cohorts pooled",
x = "Years since each cohort's expansion (e)",
y = "ATT(e) (deaths per 100,000)")
ggsave("r_did2_07_event_gxt.png", p7, width = 11, height = 5.5,
dpi = 300, bg = BG_DARK)

The pre-period leads at $e = -10$ and $e = -9$ are dramatically negative under both weightings (between $-15.35$ and $-25.79$ deaths per 100,000; three of the four 95% CIs exclude zero, the exception being the weighted lead at $e = -10$), and the leads at $e = -8$ are still sizable ($-12.81$ unweighted, $-17.26$ weighted). These are driven entirely by the small 2019 cohort, the only cohort with a pre-history long enough to produce data at $e \leq -8$. From $e = -7$ onward the leads settle near zero with 95% confidence intervals that all cover it, restoring approximate parallel trends across the bulk of the comparison window.

Post-treatment, the unweighted ATT(e) climbs from $-0.45$ at $e = 0$ to $+16.96$ at $e = 5$, reaching the same endpoint as the 2xT estimate (only the 2014 cohort contributes at $e = 5$) along a smoother path. The weighted ATT(e) stays much flatter, oscillating within $[-3.74, +4.49]$. The dynamic-aggregated ATT averaged over $e \geq 0$ is $+7.917$ unweighted versus $+0.266$ weighted: pooling across all four cohorts shrinks the weighting gap (from $10.1$ in the 2xT to $7.7$ in the GxT) and moves the weighted point estimate from slightly negative ($-0.68$) to slightly positive, although both are statistically indistinguishable from zero. Under weighting, the 2014 cohort’s $-0.68$ is more than offset by the later cohorts; the cohort-by-cohort variation we saw in Figure 6 is the source of the pooled GxT’s small positive sign.

10. HonestDiD sensitivity to parallel-trends violations

Every previous section has assumed parallel trends. The HonestDiD framework of Rambachan and Roth (2023) asks the question every honest analyst should care about: how badly can parallel trends be wrong before our conclusion is overturned? The “relative magnitudes” version of the procedure, $\Delta^{RM}$, bounds the largest period-to-period change in the post-period violation by a multiple $\bar{M}$ of the largest period-to-period change observed in the pre-period. $\bar{M} = 0$ assumes exact parallel trends after the reference period; $\bar{M} = 0.5$ allows each post-period change to be up to half as large as the worst pre-period change; $\bar{M} = 1$ allows it to be just as large; $\bar{M} = 2$ allows it to be twice as large.
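Written in this post's event-time index, with the parallel-trends violation $\delta_e$ normalized to zero at the reference period $e = -1$, the restriction can be sketched as follows. This is a paraphrase of the Rambachan and Roth definition with the index shifted to match our reference period, so treat the exact indexing as an assumption:

$$\Delta^{RM}(\bar{M}) = \Big\{ \delta : \; |\delta_{e + 1} - \delta_{e}| \leq \bar{M} \cdot \max_{s < -1} |\delta_{s + 1} - \delta_{s}| \;\; \text{for all } e \geq -1 \Big\}$$

Because the bound scales with the largest pre-period jump, noisy or erratic leads translate directly into wide robust intervals, even for small values of $\bar{M}$.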

The mechanics involve mapping a did::aggte() object into HonestDiD’s expected input format (a coefficient vector $\hat{\beta}$, its variance-covariance matrix $V$, and a linear combination vector $l$ that selects the aggregate of interest). We wrap that translation in a helper function, honest_did(), that takes the AGGTEobj returned by aggte(type = "dynamic").

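honest_did <- function(es, type = "relative_magnitude",
                       gridPoints = 100, ...) {
  # `type` is not used below; this helper always runs the relative-magnitudes procedure
  # Variance-covariance matrix of the event-study coefficients from the influence functions
  inf <- es$inf.function$dynamic.inf.func.e
  n <- nrow(inf)
  V <- t(inf) %*% inf / n / n
  # Drop the reference period e = -1, whose coefficient is fixed at zero
  ref <- -1
  idx <- which(es$egt == ref)
  V <- V[-idx, -idx]
  beta <- es$att.egt[-idx]
  egt2 <- es$egt[-idx]
  npre <- sum(egt2 < ref)
  npost <- length(beta) - npre
  # Target parameter: equal-weighted average of the post-period ATT(e)
  l_vec <- matrix(rep(1 / npost, npost))
  # Conventional CI, for comparison with the robust intervals
  orig <- HonestDiD::constructOriginalCS(betahat = beta, sigma = V,
                                         numPrePeriods = npre,
                                         numPostPeriods = npost,
                                         l_vec = l_vec)
  # Robust CIs over a grid of Mbar values
  rob <- HonestDiD::createSensitivityResults_relativeMagnitudes(
    betahat = beta, sigma = V,
    numPrePeriods = npre, numPostPeriods = npost,
    l_vec = l_vec, gridPoints = gridPoints,
    Mbarvec = c(0, 0.25, 0.5, 0.75, 1, 1.5, 2), ...)
  list(robust = rob, orig = orig)
}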
hd_unw <- honest_did(es_gxt_unw, type = "relative_magnitude")
hd_wt <- honest_did(es_gxt_wt, type = "relative_magnitude")

HonestDiD relative-magnitudes sensitivity:
lb ub method Delta Mbar weighting
1 2.01 14.1 C-LF DeltaRM 0 Unweighted
2 -16.8 32.9 C-LF DeltaRM 0.25 Unweighted
3 -40.9 57.0 C-LF DeltaRM 0.5 Unweighted
4 -63.7 66.4 C-LF DeltaRM 0.75 Unweighted
5 -66.4 66.4 C-LF DeltaRM 1 Unweighted
8 -6.07 6.07 C-LF DeltaRM 0 Population-weighted
9 -22.2 22.2 C-LF DeltaRM 0.25 Population-weighted
10 -42.5 43.8 C-LF DeltaRM 0.5 Population-weighted
15 1.41 14.4 Original <NA> NA Unweighted
16 -6.27 6.80 Original <NA> NA Population-weighted
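
The helper’s two translation steps (building the variance-covariance matrix from influence functions, and averaging the post-period coefficients with an equal-weight vector) can be sketched on toy numbers. Everything below is hypothetical: the matrix `inf_toy`, the event times, and the coefficients are invented to illustrate the algebra and do not come from the Medicaid data.

```r
set.seed(1)

# Hypothetical event times after the reference period e = -1 has been dropped
egt_toy  <- c(-3, -2, 0, 1, 2)
beta_toy <- c(0.2, -0.1, 1.0, 2.0, 3.0)   # made-up ATT(e) estimates

# Hypothetical influence functions: one row per unit, one column per event time
n <- 400
inf_toy <- matrix(rnorm(n * length(egt_toy)), nrow = n)

# Same formula as in honest_did(): V = IF' IF / n^2
V_toy <- t(inf_toy) %*% inf_toy / n / n

# Count pre- and post-periods relative to the reference period
npre  <- sum(egt_toy < -1)
npost <- length(beta_toy) - npre

# Equal weights select the simple average of the post-period coefficients
l_vec <- matrix(rep(1 / npost, npost))
post  <- (npre + 1):length(beta_toy)

theta_hat <- drop(t(l_vec) %*% beta_toy[post])
theta_se  <- sqrt(drop(t(l_vec) %*% V_toy[post, post] %*% l_vec))

cat("npre =", npre, " npost =", npost, "\n")
cat("average post-period ATT =", theta_hat, "\n")
cat("standard error of the average =", round(theta_se, 4), "\n")
```

With three post-period coefficients of 1, 2, and 3, the equal-weight vector returns their mean; swapping `l_vec` for a basis vector would instead pick out a single event time.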

The full bound table across the $\bar{M}$ grid:

weighting	$\bar{M}$	lb	ub	method
Unweighted	original	$+1.41$	$+14.43$	Original CI
Unweighted	0.00	$+2.01$	$+14.09$	$\Delta^{RM}$
Unweighted	0.25	$-16.77$	$+32.87$	$\Delta^{RM}$
Unweighted	0.50	$-40.92$	$+57.02$	$\Delta^{RM}$
Unweighted	0.75	$-63.73$	$+66.42$	$\Delta^{RM}$
Unweighted	1.00	$-66.42$	$+66.42$	saturated
Unweighted	1.50	$-66.42$	$+66.42$	saturated
Unweighted	2.00	$-66.42$	$+66.42$	saturated
Population-weighted	original	$-6.27$	$+6.80$	Original CI
Population-weighted	0.00	$-6.07$	$+6.07$	$\Delta^{RM}$
Population-weighted	0.25	$-22.24$	$+22.24$	$\Delta^{RM}$
Population-weighted	0.50	$-42.46$	$+43.81$	$\Delta^{RM}$
Population-weighted	0.75	$-64.02$	$+64.02$	$\Delta^{RM}$
Population-weighted	1.00	$-66.72$	$+66.72$	saturated
Population-weighted	1.50	$-66.72$	$+66.72$	saturated
Population-weighted	2.00	$-66.72$	$+66.72$	saturated

p8 <- ggplot(hd_tbl %>% filter(!is.na(Mbar)),
             aes(x = Mbar, y = (lb + ub) / 2)) +
  geom_hline(yintercept = 0, color = TEXT_LIGHT, linetype = "dashed") +
  geom_ribbon(aes(ymin = lb, ymax = ub, fill = weighting), alpha = 0.4) +
  geom_line(aes(color = weighting), linewidth = 1.1) +
  scale_color_manual(values = c("Unweighted" = BLUE,
                                "Population-weighted" = ORANGE),
                     aesthetics = c("color", "fill")) +
  facet_wrap(~ weighting) +
  labs(title = "HonestDiD: how robust is the post-period ATT to pre-trend violations?",
       x = expression(bar(M)),
       y = "ATT bound (deaths per 100,000)",
       caption = "The saturation near +/- 66 is the HonestDiD grid limit, not a feature of the data.")
ggsave("r_did2_08_honestdid.png", p8, width = 11, height = 5.5,
       dpi = 300, bg = BG_DARK)

At $\bar{M} = 0$ (exact parallel trends) the unweighted bound on the dynamic ATT is $[+2.01, +14.09]$, entirely positive, suggesting Medicaid expansion raised mortality. (Recall the unweighted GxT dynamic aggregate was $+7.92$.) The weighted bound at $\bar{M} = 0$ is $[-6.07, +6.07]$, straddling zero with no clear sign. By $\bar{M} = 0.25$ both bounds include zero: the unweighted bound is $[-16.77, +32.87]$ and the weighted is $[-22.24, +22.24]$. By $\bar{M} = 0.5$ the unweighted bound is $[-40.92, +57.02]$ and the weighted is $[-42.46, +43.81]$, and by $\bar{M} = 1$ both saturate at the HonestDiD package’s default grid range ($\pm 66.4$ unweighted, $\pm 66.7$ weighted). The saturation is a feature of the grid, not the data, and is annotated in the figure caption.

The breakdown value $\bar{M}^*$, the smallest violation that overturns the conclusion, is informative. For the unweighted result, the (positive-sign) conclusion breaks at some $\bar{M}$ between $0$ and $0.25$, that is, within the first quarter-multiple of the worst pre-trend. For the weighted result, there is no sign conclusion at $\bar{M} = 0$ to break in the first place; the bound already includes zero. The manuscript’s verdict at line 556 applies symmetrically here: “Rambachan-Roth’s method underscores how little information the pre-trend estimates convey … the identified set spans implausibly large effects in both directions.” Neither weighting supports a sign conclusion once modest parallel-trends violations are allowed ($\bar{M} = 0.25$ is enough to lose any sign), which reinforces the manuscript’s caution that this empirical case should be read as pedagogical rather than as a definitive estimate of Medicaid’s mortality effect (manuscript line 134).
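
The grid-based reading of the breakdown value can be reproduced directly from the bound table. The sketch below retypes the lb and ub columns from the table in this section and reports, for each weighting, the smallest grid value of $\bar{M}$ whose bound contains zero. The data-frame name `hd_grid` is ours, and the result only brackets the breakdown value from above, because the grid has no points between $0$ and $0.25$.

```r
# Bounds retyped from the full bound table (deaths per 100,000)
mbar <- c(0, 0.25, 0.5, 0.75, 1, 1.5, 2)
hd_grid <- data.frame(
  weighting = rep(c("Unweighted", "Population-weighted"), each = length(mbar)),
  Mbar = rep(mbar, times = 2),
  lb = c(  2.01, -16.77, -40.92, -63.73, -66.42, -66.42, -66.42,
          -6.07, -22.24, -42.46, -64.02, -66.72, -66.72, -66.72),
  ub = c( 14.09,  32.87,  57.02,  66.42,  66.42,  66.42,  66.42,
           6.07,  22.24,  43.81,  64.02,  66.72,  66.72,  66.72)
)

# A bound "loses the sign" once it contains zero
hd_grid$covers_zero <- hd_grid$lb <= 0 & hd_grid$ub >= 0

# Smallest grid point at which each weighting's bound contains zero
first_cover <- sapply(split(hd_grid, hd_grid$weighting), function(d) {
  min(d$Mbar[d$covers_zero])
})
print(first_cover)
```

For the unweighted series the first covering grid point is $0.25$, so the breakdown value lies somewhere in $(0, 0.25]$; for the population-weighted series it is $0$, meaning there was no sign conclusion to break.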

11. Headline summary

A compact comparison of the five stages where we computed a single overall ATT, twice each:

stage	unweighted	weighted	weighting gap
2x2 cell-means ATT(2014)	$+0.122$	$-2.563$	$2.685$
2x2 TWFE long-difference	$+0.122$	$-2.563$	$2.685$
2x2 DRDID (Callaway-Sant’Anna)	$-1.226$	$-3.756$	$2.530$
2xT dynamic ATT (avg $e \geq 0$)	$+9.428$	$-0.684$	$10.112$
GxT dynamic ATT (avg $e \geq 0$)	$+7.917$	$+0.266$	$7.651$

The “weighting gap” column makes the central pedagogical point explicit: across all five estimation stages, switching from equal weights to population weights moves the point estimate by between $2.5$ and $10.1$ deaths per 100,000. The gap is largest when staggered cohort heterogeneity is in play (2xT and GxT, where the within-cohort treatment effects can disagree across cohorts of very different sizes) and smallest when the four-cell 2x2 design forces a single ATT(2014) (where there is only one cohort and one comparison). Within the 2x2 design, the choice among estimators moves the point estimate by about $1.2$ to $1.3$ deaths per 100,000 under either weighting, which is less than the choice between weightings.
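
The gap column is plain arithmetic on the two estimate columns, and recomputing it is a quick consistency check. The sketch below retypes the five rows of the summary table; the data-frame name `stages` and the added `sign_flip` column are ours.

```r
# Point estimates retyped from the headline summary table (deaths per 100,000)
stages <- data.frame(
  stage = c("2x2 cell-means ATT(2014)", "2x2 TWFE long-difference",
            "2x2 DRDID", "2xT dynamic ATT", "GxT dynamic ATT"),
  unweighted = c(0.122, 0.122, -1.226, 9.428, 7.917),
  weighted   = c(-2.563, -2.563, -3.756, -0.684, 0.266)
)

# Weighting gap: unweighted minus weighted estimate
stages$gap <- round(stages$unweighted - stages$weighted, 3)

# Does switching weights change the sign of the point estimate?
stages$sign_flip <- sign(stages$unweighted) != sign(stages$weighted)

print(stages)
print(range(stages$gap))
```

The recomputed gaps match the table, and the `sign_flip` column shows that the sign reverses in the cell-means, TWFE, and 2xT rows but not in the DRDID or GxT rows.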

12. Discussion: what did Medicaid expansion do to mortality?

Return to the question that opened the post: did the ACA Medicaid expansion reduce adult mortality? The cleanest empirical answer this analysis can give is “the data are not powerful enough to settle the question, and the answer depends on which estimand you have in mind.” For the typical treated adult (population-weighted ATT), the GxT dynamic aggregate is $+0.27$ deaths per 100,000 with the 2014-cohort component at $-0.68$ and the cell-means 2x2 at $-2.56$; the point estimates range from a small negative to a small positive, and no 95% confidence interval at any stage excludes zero by a comfortable margin. For the typical treated county (unweighted ATT), the GxT dynamic aggregate is $+7.92$ deaths per 100,000, with the 2xT post-treatment trajectory reaching $+16.96$ by year $+5$ with a CI that does exclude zero ($[+6.83, +27.09]$); HonestDiD shows that conclusion holds at $\bar{M} = 0$ but collapses by $\bar{M} = 0.25$.

Why did weighting change the answer? The mechanical reason is the asymmetry documented in Section 3: the never-expansion cohort is 47% of counties but only 38% of adults, while the 2014 expansion cohort is 38% of counties but 50% of adults. Equal weighting overweights small, rural never-expansion counties (e.g., counties in Texas and Florida that are demographically different from the typical American adult) and overweights small, rural 2014-expansion counties. Population weighting shifts the comparison toward larger, more urban counties on both sides. When treatment effects are heterogeneous — as they almost certainly are, since “Medicaid expansion” interacts with each state’s existing healthcare infrastructure and pre-expansion eligibility rules — those two weighting choices produce different averages of different functions of the same data. They are answers to different causal questions, not better and worse answers to the same question. The manuscript states this directly at lines 169–170: “If interest lies in the average treatment effect of Medicaid on mortality in the average treated county … the relevant target parameter is an equally weighted average … If, on the other hand, the parameter of interest is the average treatment effect of Medicaid on mortality in the county in which the average treated adult lives, then population weights are appropriate. When treatment effect heterogeneity is related to the weights, weighted and unweighted target parameters differ meaningfully.”
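
A toy calculation shows how heterogeneous effects alone can produce a sign reversal between the two estimands. The four counties, their adult populations, and their treatment effects below are invented for illustration and bear no relation to the Medicaid estimates.

```r
# Hypothetical counties: three small ones with a positive effect,
# one large one with a negative effect (all numbers invented)
toy <- data.frame(
  county = c("A", "B", "C", "D"),
  adults = c(10, 10, 10, 170),   # thousands of adults, hypothetical
  effect = c(2, 2, 2, -3)        # county-level treatment effect, hypothetical
)

# Equal weights: the effect in the average treated county
att_county <- mean(toy$effect)

# Population weights: the effect in the county where the average treated adult lives
att_adult <- weighted.mean(toy$effect, toy$adults)

cat("equal-weighted ATT:     ", att_county, "\n")
cat("population-weighted ATT:", att_adult, "\n")
```

The same four effects average to a positive number when each county counts once and to a negative number when each adult counts once, which is the mechanism behind the reversal discussed above, not evidence about its size.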

What should a policymaker take away? In a setting like Medicaid expansion, where the policy choice is “should we cover adults?”, the population-weighted estimand is the more decision-relevant target. It answers “what was the expected effect on the typical newly covered adult?” — and that estimate is small and statistically indistinguishable from zero (weighted DRDID $= -3.76 \pm 3.29$, weighted GxT dynamic $= +0.27$). The unweighted estimate, by contrast, answers “what was the average effect on the typical treated county-as-a-unit?”, which is a useful object for understanding heterogeneity but not the primary policy parameter when the policy is denominated in people. A federal cost-benefit assessment would weight by people. A study of which county types saw the largest local effects would not weight at all. Both are legitimate; the report-writer should be explicit about which one is on offer.

13. Summary and next steps

Takeaways:

The 2x2 sign reversal is real and reproduces the manuscript’s flagship example. Unweighted ATT(2014) $= +0.122$ deaths per 100,000; weighted ATT(2014) $= -2.563$ (manuscript line 215). The pre-period gap is essentially identical in both weightings ($-54.77$ vs $-53.68$), confirming the reversal is driven entirely by which counties dominate the post-period averages — a feature of weighted estimands, not a bug.
Covariate adjustment closes part of the gap but does not eliminate it. DRDID under each weighting is $-1.226$ unweighted and $-3.756$ weighted; the within-weighting estimator spread (OR, IPW, DRDID) is at most $0.8$ deaths per 100,000, while the across-weighting gap remains $2.5$ deaths per 100,000. Methodology and target parameter are orthogonal axes of choice, and the second dominates the first.
Power is the binding constraint, not method. None of the six 2x2 covariate-adjusted 95% confidence intervals excludes zero. The 2xT unweighted post-period at $e = 5$ does ($[+6.83, +27.09]$), but in the opposite-of-expected direction. The weighted estimates are smaller in magnitude than the unweighted ones and never reach statistical significance.
HonestDiD breakdown values are uncomfortably low. The unweighted positive-sign conclusion at $\bar{M} = 0$ collapses by $\bar{M} = 0.25$; the weighted bound straddles zero already at $\bar{M} = 0$. Both bounds saturate at the HonestDiD package’s grid limit ($\pm 66.4$ unweighted, $\pm 66.7$ weighted) by $\bar{M} = 1$. We learn very little from the pre-trends in this application.
Staggered cohort heterogeneity matters more than the 2x2 lets on. The 2014 cohort flips sign with weighting ($+9.43 \to -0.68$); the 2016 cohort produces a large negative effect that is significant under weighting but is based on only 93 counties; the 2015 cohort grows from $+4.94$ to $+10.04$ under weighting. The GxT dynamic aggregate ($+7.92$ unweighted, $+0.27$ weighted) hides this cohort-level variation by averaging across it.

Limitations. The bootstrap iteration count was held at $\text{BITERS} = 2{,}000$ for tutorial speed; the manuscript’s reference scripts use $25{,}000$, which would reduce simulation noise in the third significant figure of every confidence-interval bound. The mortality outcome is the CDC crude death rate, not age-adjusted; an age-adjusted rate would address compositional differences across cohorts more cleanly but requires restricted data. The manuscript itself flags the case as pedagogical (line 134): “The results are pedagogical in spirit and do not represent the best possible estimates of Medicaid’s effect on adult mortality.”

Next steps. The natural extensions are (1) synthetic-control estimates on the 2016 and 2019 cohorts to see whether their large weighted negatives survive a different counterfactual construction; (2) placebo tests on the 2007–2013 pre-period (with a sham 2010 expansion date) to check whether the post-2014 ATT estimates exceed what one would see by chance; and (3) an age-adjusted version of the same pipeline using CDC’s standard-population weighting — which is another weighting choice that changes the estimand and would interact with the population weights examined here.

14. Exercises

The script reproduces faithfully, the post’s headline numbers carry through to four decimals, and the data are publicly available. Here are three self-study challenges that build directly on the materials:

Switch the control group. The Callaway-Sant’Anna att_gt() calls use control_group = "nevertreated". Re-run the GxT design with control_group = "notyettreated" (which adds not-yet-treated cohorts to the never-treated counties as comparison units). How does the dynamic event-study aggregate change? Where in the cohort structure does the comparison-group choice bite hardest?
Substitute a different outcome. The data include other mortality categories (cardiovascular, drug-related, etc., depending on what the CDC file contains in your version). Replace crude_rate_20_64 with a more narrowly defined cause of death and rerun the GxT design. Does the sign reversal still appear in the 2x2? Are the breakdown $\bar{M}^*$ values larger or smaller?
Try the smoothness sensitivity instead. The honest_did() helper accepts type = "smoothness", which bounds how much the slope of the parallel-trends violation can change between consecutive periods by a constant $M$, rather than bounding violations by multiples of the worst pre-period violation. Compare the bound widths at small $M$ values for both weightings. Which restriction is the data more informative about?
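
As a starting point for the first exercise, here is a minimal sketch of the control-group comparison on simulated data. The panel, its column names (`id`, `year`, `g`, `y`), and the helper `fit_es()` are all hypothetical; the tutorial’s actual data frame and its att_gt() arguments are not reproduced here, so the real column names would need to be swapped in. The bootstrap is switched off only to keep the sketch fast.

```r
set.seed(42)

# Simulated staggered panel; every name and number below is hypothetical
n_units <- 300
years   <- 2010:2019
cohort  <- sample(c(0, 2014, 2015, 2016), n_units, replace = TRUE)  # 0 = never treated
unit_fe <- rnorm(n_units)

panel <- expand.grid(id = seq_len(n_units), year = years)
panel$g <- cohort[panel$id]
panel$treated <- panel$g > 0 & panel$year >= panel$g
panel$y <- unit_fe[panel$id] + 0.1 * (panel$year - 2010) +
  1.0 * panel$treated + rnorm(nrow(panel))

# Fit the same design under one comparison-group choice, then aggregate by event time
fit_es <- function(control_group) {
  gt <- did::att_gt(yname = "y", tname = "year", idname = "id", gname = "g",
                    data = panel, control_group = control_group,
                    base_period = "universal", bstrap = FALSE, cband = FALSE)
  did::aggte(gt, type = "dynamic")
}

# Run only if the did package is installed
if (requireNamespace("did", quietly = TRUE)) {
  es_never  <- fit_es("nevertreated")
  es_notyet <- fit_es("notyettreated")
  print(data.frame(e = es_never$egt,  nevertreated  = es_never$att.egt))
  print(data.frame(e = es_notyet$egt, notyettreated = es_notyet$att.egt))
}
```

Comparing the two printed event-study vectors side by side is one way to see where the comparison-group choice matters; on the real data the differences should concentrate where not-yet-treated cohorts add the most comparison units.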

15. References
