sdid | Carlos Mendez

Staggered Synthetic Difference-in-Differences (SDID) in Stata: Gender Quotas and Women in Parliament

Sun, 07 Jun 2026 00:00:00 +0000

Abstract

Most real-world policies are not adopted on a single clock: parliamentary gender quotas, minimum-wage laws, and carbon taxes arrive in different units in different years. Under this staggered-adoption design, naive two-way fixed-effects difference-in-differences quietly breaks, because it uses already-treated units as controls and can place negative weights on some effects. This tutorial extends synthetic difference-in-differences (SDID) to staggered adoption and applies it in Stata to a question in political economy: do parliamentary gender quotas raise the share of women in national parliaments? It uses the quota_example dataset distributed with the sdid package (Bhalotra, Clarke, Gomes & Venkataramani, 2023), a balanced panel of 119 countries observed annually from 1990 to 2015 (3,094 observations), in which 9 countries adopt a quota across 7 cohorts (2000, 2002, 2003, 2005, 2010, 2012, 2013) and 110 remain never-treated. The method estimates a separate, clean SDID per cohort against the never-treated donor pool, then aggregates the cohort effects into the overall ATT with non-negative treated-period-share weights, complemented by the sdid_event event study and bootstrap, jackknife, and placebo inference. The overall ATT is +8.03 percentage points (SE 3.74, p = 0.032), robust to a log-GDP control (8.05 optimized, 8.06 projected), but the cohort effects swing from −3.5 to +21.8 points. Flat pre-adoption placebos support parallel synthetic trends, and the dynamic effects appear immediately and persist for over a decade. The lesson is that a single headline number summarizes real heterogeneity, and that transparent, non-negative cohort weighting is essential when treatment timing is staggered.

1. Overview

In a previous tutorial, one unit — California — adopted one policy — Proposition 99 — in one year — 1989. That block design is the textbook setting for synthetic difference-in-differences (SDID). But most real policies do not arrive on a single clock. Parliamentary gender quotas, minimum-wage laws, carbon taxes, and clean-air regulations are adopted by different units in different years. This is the staggered adoption design, and it is where naive panel methods quietly break.

This tutorial extends SDID to staggered adoption and applies it in Stata to a real question in political economy: do parliamentary gender quotas raise the share of women in national parliaments? We use the quota_example dataset that ships with the sdid package — 119 countries observed annually from 1990 to 2015, in which 9 countries adopt a gender quota across 7 different cohorts (2000, 2002, 2003, 2005, 2010, 2012, and 2013).

The headline is a story about heterogeneity. The overall effect of quotas is about +8 percentage points of women in parliament, but the cohort-by-cohort effects swing from −3.5 to +21.8 points. A single number hides that range — and, as we will see, the naive two-way fixed-effects regression that most people reach for first can hide even more.

Why does staggered timing break the naive regression?

The workhorse for panel policy evaluation is the two-way fixed-effects (TWFE) regression — unit dummies, time dummies, and a treatment dummy. With one adoption date it estimates a clean difference-in-differences. With staggered timing and heterogeneous effects, the same regression implicitly uses already-treated units as controls for later adopters (“forbidden comparisons”). The result is a variance-weighted average of every 2×2 comparison in the panel, and some of those weights can be negative — so the estimate can even take the wrong sign (Goodman-Bacon, 2021; de Chaisemartin & D’Haultfœuille, 2020). Staggered SDID sidesteps this by estimating a separate, clean SDID effect for each adoption cohort and aggregating with transparent, non-negative weights.
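For reference, the naive TWFE regression described above could be written in Stata roughly as follows. This is only a sketch of the benchmark being criticized, not part of the tutorial's estimation, and its coefficient is not reported anywhere in this post. The variable `cid` is a hypothetical numeric country identifier created here because `xtset` needs a numeric panel variable.

```stata
* Naive two-way fixed-effects DiD: the benchmark that staggered SDID replaces
webuse set www.damianclarke.net/stata/
webuse quota_example, clear

* Hypothetical numeric panel id (egen group() accepts string or numeric input)
egen cid = group(country)
xtset cid year

* Unit FE via the fe option, year FE via i.year, one pooled treatment dummy.
* With staggered adoption, the coefficient on quota mixes clean comparisons
* with "forbidden" ones that use already-treated countries as controls.
xtreg womparl quota i.year, fe vce(cluster cid)
```

Comparing this coefficient with the SDID estimate is a useful diagnostic, but any gap between them reflects both the forbidden comparisons and the different weighting, so it should be read with care.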

graph TD
subgraph "Block design — predecessor (Prop 99)"
B1["California<br/>adopts 1989"] --> BATT["one ATT"]
B2["other states<br/>never treated"] --> BATT
end
subgraph "Staggered design — this post (gender quotas)"
S1["cohort 2000"] --> SATT["aggregate ATT"]
S2["cohort 2002"] --> SATT
S3["cohorts 2003 to 2013"] --> SATT
SC["110 never-treated<br/>controls"] -.donor pool.-> SATT
end
style B1 fill:#d97757,stroke:#141413,color:#fff
style B2 fill:#6a9bcc,stroke:#141413,color:#fff
style BATT fill:#00d4c8,stroke:#141413,color:#141413
style S1 fill:#d97757,stroke:#141413,color:#fff
style S2 fill:#d97757,stroke:#141413,color:#fff
style S3 fill:#d97757,stroke:#141413,color:#fff
style SC fill:#6a9bcc,stroke:#141413,color:#fff
style SATT fill:#00d4c8,stroke:#141413,color:#141413

1.1 Learning objectives

By the end of this tutorial you will be able to:

Explain why staggered adoption breaks naive TWFE difference-in-differences, and how per-cohort SDID avoids the forbidden-comparison problem.
Derive the SDID estimator from first principles — unit weights $\omega$, time weights $\lambda$, and the weighted two-way fixed-effects objective — and the rule that aggregates cohort-specific effects $\hat{\tau}_a$ into one overall ATT.
Estimate the effect of gender quotas with sdid on a staggered panel, add a covariate two different ways (optimized vs projected), and choose among bootstrap, jackknife, and placebo inference.
Read an SDID event-study plot produced by sdid_event, distinguishing pre-trend placebo coefficients from post-period dynamic effects.

2. Key concepts at a glance

Each entry below gives a plain-language definition, a concrete example from this quota study, and an everyday analogy.

1. ATT (average treatment effect on the treated) — the question we actually answer.

Definition. The effect of adopting a quota on the women-in-parliament share, in the countries that adopted one, averaged over their post-adoption years. It is not the effect a quota would have everywhere — only where one was actually tried.

Example. Our headline ATT is +8.0 percentage points: across the nine adopting countries, quotas raised women’s parliamentary share by about eight points relative to their no-quota counterfactual.

Analogy. Like asking “how much did the patients who took the drug improve?” — not “how much would everyone improve?” You measure only the units that were actually treated.

2. Synthetic control — a made-to-order comparison country.

Definition. A weighted blend of never-treated “donor” countries, built so its pre-adoption path mimics the treated cohort. It stands in for the unobservable counterfactual: what the cohort’s outcome would have been without a quota.

Example. The 2002 cohort’s synthetic control mixes dozens of donors (Belgium, Paraguay, Cuba, …) so that, before 2002, the blend tracks the cohort’s trend — then keeps going as the cohort would have without the law.

Analogy. A stunt double cast to match the lead actor’s build and movement — close enough that, in the shots you cannot film the star, the double stands in convincingly.

3. Unit weights (ω) — how much each donor counts.

Definition. Non-negative weights, one per donor country, summing to one, that build the synthetic control. Each cohort gets its own ω.

Example. In the 2000 cohort, 80 donors receive nonzero weight — Argentina ≈ 0.061, Guatemala ≈ 0.057, Austria ≈ 0.045 — a diffuse blend rather than one or two stand-ins.

Analogy. A recipe calling for many ingredients in small, precise amounts: no single one dominates, so the dish survives a bad batch of any one ingredient.

4. Time weights (λ) — which "before" years matter.

Definition. Non-negative weights on the pre-adoption years, summing to one, that decide which pre-periods define the baseline. They up-weight the years most like the post-period.

Example. For the 2002 cohort, λ concentrates on the late 1990s and 2001 rather than spreading evenly across 1990–2001 — the recent past is the relevant baseline.

Analogy. Forecasting tomorrow’s weather, you trust last week far more than the same date five years ago. Time weights formalize “recent and similar counts more.”

5. Adoption cohort (a) — units that switch on together.

Definition. The set of countries that first adopt a quota in the same calendar year. Staggered SDID runs one self-contained SDID per cohort, always against the never-treated controls.

Example. There are seven cohorts — 2000, 2002, 2003, 2005, 2010, 2012, 2013 — with two countries each in 2002 and 2003, and one in the rest.

Analogy. School graduating classes: the “class of 2002” and the “class of 2010” share a start date and are analyzed as groups, even though all attend the same school.

6. Staggered adoption & the forbidden comparison — why the naive regression breaks.

Definition. Staggered adoption means units are treated at different times. The hazard: a two-way fixed-effects regression can use already-treated units as controls for later adopters — a “forbidden comparison” that places negative weights on some effects and can flip the sign.

Example. When the 2012 cohort adopts, a naive TWFE quietly treats the 2002 cohort — already treated, already changed — as part of its control group. Staggered SDID never does this: each cohort is compared only to the 110 never-treated countries.

Analogy. Timing a late runner against runners who already crossed the line and slowed to a walk — your “control” is contaminated because it has already run the race.

7. Event time (relative period) — every cohort on its own clock.

Definition. Time measured relative to each cohort’s own adoption year (… −2, −1, 0, +1 …), so cohorts that adopted in different calendar years can be lined up and averaged.

Example. Event time 0 is the year 2000 for the first cohort but 2013 for the last; re-centring lets us ask “what happens three years after a quota?” across all cohorts at once.

Analogy. Comparing marathon runners by their own start gun, not the wall clock: a runner who started at 9:05 and one who started at 9:20 are both “at mile 10” measured from their own start.

8. ATT aggregation — from many cohort effects to one number.

Definition. The overall ATT is a weighted average of the cohort effects, each weighted by its share of treated unit-by-post-period observations — earlier, longer-exposed, larger cohorts count more.

Example. The seven cohort effects span −3.5 to +21.8; weighted by treated country-years they average to +8.0 (the plain unweighted mean would be ≈ 7.1).

Analogy. A course grade that weights the final exam more than a pop quiz: the cohorts you observe for longer carry more of the final mark.

9. Pre-trend placebo test — the assumption you can see.

Definition. Event-study coefficients for the pre-adoption periods. If treated and synthetic-control countries moved in parallel before treatment, these sit near zero — a falsification check.

Example. For the 2002 cohort, all twelve pre-period placebos fall in [−0.2, +0.8] points — flat, so we cannot reject parallel synthetic trends.

Analogy. Checking a scale by weighing nothing first: if it does not read zero when empty, you distrust every later reading. Flat placebos are that “reads zero when empty” check.

10. Bootstrap, jackknife, placebo — three rulers for uncertainty.

Definition. Three ways to attach a standard error to the ATT. With many treated units all three are available; they share one point estimate but report different spread.

Example. On the two-cohort subsample the ATT is 10.3 for all three, but the SE is 4.7 (bootstrap), 6.0 (jackknife, most conservative), and 2.3 (placebo, tightest).

Analogy. Measuring a table with a tape, a folding ruler, and a laser: they agree on the length but disagree on the error bars — the cautious carpenter reports the widest.

3. The data: gender quotas across 119 countries

We use quota_example.dta, the balanced panel from Bhalotra, Clarke, Gomes & Venkataramani (2023) distributed with the sdid package. The outcome is the percentage of seats held by women in the national parliament; the treatment is the adoption of a reserved-seat gender quota; the covariate is log GDP per capita.

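ssc install sdid, replace                  // estimator used throughout this tutorial
webuse set www.damianclarke.net/stata/
webuse quota_example, clear
label variable quota "Parliamentary gender quota"
* country appears to be stored as a string (see the codebook below), and xtset
* rejects string variables, so declare the panel with a numeric id instead
egen cid = group(country)
xtset cid year
codebook country year quota womparl lngdp, compact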

Variable Obs Unique Mean Min Max Label
----------------------------------------------------------------------------
country 3094 119 . . . Country
year 3094 26 2002.5 1990 2015 Year
quota 3094 2 .0303814 0 1 =1 if country has a gender quota
womparl 3094 449 14.96531 0 63.8 Women in parliament
lngdp 2990 2956 9.154291 5.8701 11.61789 log(GDP)
----------------------------------------------------------------------------

The panel is balanced: 119 countries times 26 years equals 3,094 observations, with no gaps in the outcome or treatment (lngdp has 104 missing values, which will matter only when we add the covariate). The treatment indicator quota equals one for just 3% of observations, a reminder that treated country-years are scarce. Crucially, quota is absorbing — once a country adopts a quota it stays treated — which SDID requires.
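The two structural claims in this paragraph (a balanced panel and an absorbing treatment) can be checked directly rather than taken on trust. The following is a minimal sketch; it assumes the dataset loads as shown above and uses only built-in commands. Each `assert` stops with an error if the condition fails for any row.

```stata
webuse set www.damianclarke.net/stata/
webuse quota_example, clear

* Balanced: every country should contribute exactly 26 yearly rows
bysort country (year): assert _N == 26

* No duplicate country-year rows
isid country year

* Absorbing: within a country, quota never switches from 1 back to 0
bysort country (year): assert quota >= quota[_n-1] if _n > 1
```

If all three checks pass silently, the panel has the structure SDID expects.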

Variable	Role	Symbol	Description
`country`	unit	$i$	119 countries (9 ever-treated, 110 never-treated)
`year`	time	$t$	1990–2015 (26 years)
`womparl`	outcome	$Y_{it}$	% women in the national parliament
`quota`	treatment	$W_{it}$	1 once a country has a quota, 0 before / never
`lngdp`	covariate	$X_{it}$	log GDP per capita

The estimand. Our target is the average treatment effect on the treated (ATT): the effect of adopting a quota on the women-in-parliament share in the countries that adopted one, averaged over their post-adoption years. Formally,

$$ \tau = \frac{1}{N_{tr}\, T_{post}} \sum_{i:\, W_i = 1}\ \sum_{t > T_{pre}} \left[\, Y_{it}(1) - Y_{it}(0) \,\right] $$

In words: for every treated country and every post-adoption year, take the gap between the share of women with a quota, $Y_{it}(1)$, and the share that would have occurred without one, $Y_{it}(0)$ — then average. The first term is observed; the second is the counterfactual that the synthetic control must impute, because we never see a quota-adopting country in the parallel world where it abstained.

An observational, not experimental, setting. Quotas are not randomly assigned. Countries that adopt them early may differ systematically — they may be wealthier, more democratic, or already on a rising trajectory of women’s representation. That is exactly why we need a method that builds a credible counterfactual from comparison countries rather than assuming a simple before/after change would have held. Identification rests on assumptions we will keep visible: that treated and synthetic-control countries share a common (synthetic) trend absent treatment, no anticipation of the quota, no spillovers across countries, and that adoption timing is not itself driven by the outcome’s future path.

3.1 The staggered structure

Before modelling, let us see the timing directly. The adoption year is the first year a country is treated; we tabulate the cohorts.

bysort country (year): egen firsttreat = min(cond(quota==1, year, .))
preserve
keep country firsttreat
duplicates drop
tab firsttreat, missing
restore

 firsttreat | Freq. Percent Cum.
------------+-----------------------------------
2000 | 1 0.84 0.84
2002 | 2 1.68 2.52
2003 | 2 1.68 4.20
2005 | 1 0.84 5.04
2010 | 1 0.84 5.88
2012 | 1 0.84 6.72
2013 | 1 0.84 7.56
. | 110 92.44 100.00
------------+-----------------------------------
Total | 119 100.00

Nine countries adopt a quota, spread across seven cohorts; the 2002 and 2003 cohorts contain two countries each, the rest one. The remaining 110 countries are never treated — they form the donor pool from which every cohort’s synthetic control is built. This staircase of adoption dates is the defining feature of a staggered design, and the reason a single “post” dummy is too blunt.

4. Exploratory analysis with `panelview`

A staggered design is best understood by looking at it. The panelview command (Xu & Hua) draws two pictures we need: a heatmap of who is treated when, and the raw outcome trajectories colored by treatment status.

ssc install panelview, replace
panelview womparl quota, i(country) t(year) type(treat) bytiming
panelview womparl quota, i(country) t(year) type(outcome)

The treatment heatmap (type(treat), sorted with bytiming) makes the staggered structure unmistakable: the dark treated cells appear in the top-right corner as a staircase, each step a different cohort switching on between 2000 and 2013, against a sea of never-treated controls. This is the visual opposite of a block design, where every treated cell would switch on in the same column.

The outcome plot (type(outcome)) overlays all 119 women-in-parliament series, with the 9 treated countries in orange. Several treated countries start near the bottom of the distribution and climb steeply after their adoption year — a hint of a positive effect — but the climbs begin at different times, and a few treated countries barely move. No single “treated average” line could summarize this; we need cohort-specific counterfactuals.

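gen byte evertreat = !missing(firsttreat)   // 1 for the nine eventual adopters, 0 otherwise
preserve                                    // collapse replaces the data in memory
collapse (mean) womparl, by(evertreat year)
* ... reshape and plot ever- vs never-adopting means, then restore ...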

Collapsing to group means tells a cautionary tale. The ever-adopting countries (orange) start the 1990s below the never-adopting countries (about 4% vs 10% women in parliament) and end above them by 2015 (about 23% vs 22%). A naive eyeball difference-in-differences on these two lines would be badly confounded: the groups began at different levels and the “treated” line aggregates countries that switched on in seven different years. The raw means motivate the machinery to come — we must compare each cohort to a tailored synthetic control, not to the grand average.

5. Synthetic difference-in-differences from first principles

Before tackling staggered timing, fix ideas with a single cohort. SDID (Arkhangelsky et al., 2021) is a weighted two-way fixed-effects regression. It chooses an ATT, a constant, unit fixed effects, and time fixed effects to minimize a weighted sum of squared residuals:

$$ \left(\hat{\tau}, \hat{\mu}, \hat{\alpha}, \hat{\beta}\right) = \arg\min_{\tau,\mu,\alpha,\beta} \sum_{i=1}^{N} \sum_{t=1}^{T} \left(Y_{it} - \mu - \alpha_i - \beta_t - W_{it}\,\tau\right)^{2}\, \hat{\omega}_i\, \hat{\lambda}_t $$

In words: run a difference-in-differences regression, but weight each observation by a unit weight $\hat{\omega}_i$ times a time weight $\hat{\lambda}_t$. Here $\alpha_i$ is a country fixed effect, $\beta_t$ a year fixed effect, $W_{it}$ the treatment dummy, and $\tau$ the ATT we want. Set all weights equal and you recover ordinary DiD; the weights are what make SDID special. They are not free parameters — each solves its own optimization.
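The "weighted regression" reading of the objective can be made concrete on simulated data. The sketch below is a toy, not the sdid implementation: the panel is simulated, the true effect is set to 2, and the weights `omega` and `lambda` are made-up numbers chosen only to be non-negative and to sum to one within controls and within pre-periods. In real SDID they come from the two optimizations described next.

```stata
clear
set seed 12345
set obs 6                                   // 6 units
gen id = _n
expand 8                                    // 8 periods per unit
bysort id: gen t = _n
gen treated = id > 4                        // units 5 and 6 are treated
gen post    = t > 5                         // periods 6 to 8 are post-treatment
gen W       = treated * post
gen y = id + 0.5*t + 2*W + rnormal(0, 0.1)  // true effect is 2

* Uniform weights: ordinary two-way fixed-effects DiD
regress y W i.id i.t
scalar tau_did = _b[W]

* Hypothetical non-uniform weights (illustrative values only)
gen omega  = cond(treated, 1/2, cond(id <= 2, 0.4, 0.1))   // controls sum to 1
gen lambda = cond(post, 1/3, cond(t >= 4, 0.35, 0.1))      // pre-periods sum to 1
gen wgt    = omega * lambda

* Same regression, each observation weighted by omega_i * lambda_t
regress y W i.id i.t [aweight = wgt]
scalar tau_weighted = _b[W]

display "DiD tau = " %5.3f tau_did "   weighted tau = " %5.3f tau_weighted
```

Both coefficients should land near the true value of 2 here because the simulated data satisfy parallel trends exactly; the weights matter when they do not.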

The unit weights are chosen so that a weighted blend of control countries tracks the treated cohort across the pre-period:

$$ \hat{\omega} = \arg\min_{\omega_0,\, \omega \ge 0} \sum_{t=1}^{T_{pre}} \left(\omega_0 + \sum_{i=1}^{N_{co}} \omega_i\, Y_{it} - \frac{1}{N_{tr}} \sum_{i=N_{co}+1}^{N} Y_{it}\right)^{2} + \zeta^{2}\, T_{pre}\, \lVert \omega \rVert^{2} $$

The bracketed term asks the synthetic control $\sum_i \omega_i Y_{it}$ (plus an intercept $\omega_0$) to match the treated average in every pre-adoption year. The intercept $\omega_0$ is the SDID twist: it lets the synthetic match the treated trend without matching its level, because any constant level gap is later absorbed by the unit fixed effect $\alpha_i$. The final term is a ridge penalty with regularization strength $\zeta$; it spreads weight across many donors instead of concentrating it on a few, which stabilizes the estimate. (Synthetic control, by contrast, drops $\omega_0$ and the penalty and must match the level too.)

The time weights are the mirror image — they pick the pre-period years that best predict each control country’s post-period average:

$$ \hat{\lambda} = \arg\min_{\lambda_0,\, \lambda \ge 0} \sum_{i=1}^{N_{co}} \left(\lambda_0 + \sum_{t=1}^{T_{pre}} \lambda_t\, Y_{it} - \frac{1}{T_{post}} \sum_{t=T_{pre}+1}^{T} Y_{it}\right)^{2} + \zeta_{\lambda}^{2}\, N_{co}\, \lVert \lambda \rVert^{2} $$

Years that look most like the post-period get the most weight, so the “before” comparison is built from the most relevant history rather than a flat average over possibly-irrelevant early years. The two weighting schemes together are what distinguish SDID from its cousins, as the table summarizes.

Method	Unit weights $\omega$	Time weights $\lambda$	Unit FE $\alpha_i$	Must match
DiD	uniform	uniform	yes	trend on all controls
Synthetic control	optimized	uniform	no	level and trend
SDID	optimized	optimized	yes	trend (level gap allowed)

6. The staggered extension: per-cohort effects and their aggregation

Staggered SDID is a disarmingly simple idea: do the single-cohort analysis once per adoption cohort, then average. For each cohort $a$, take only that cohort’s treated countries plus the pure never-treated controls, and solve the SDID problem above on that sub-panel to obtain its own $\hat{\omega}_a$, $\hat{\lambda}_a$, and cohort effect $\hat{\tau}_a$. Because each cohort is compared only to never-treated controls, an already-treated unit is never used as a control for a later adopter, which is precisely the contamination that breaks naive TWFE.
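The per-cohort logic can be spelled out as an explicit loop. This is a sketch of the idea rather than a description of what sdid does internally. It assumes that `vce(noinference)` (point estimate only) and the stored result `e(ATT)` behave as described in the package help; `firsttreat` is the adoption-year variable constructed in this tutorial.

```stata
webuse set www.damianclarke.net/stata/
webuse quota_example, clear
bysort country (year): egen firsttreat = min(cond(quota==1, year, .))

* One self-contained SDID per adoption cohort, always against never-treated controls
levelsof firsttreat, local(cohorts)         // missing (never-treated) is skipped
foreach a of local cohorts {
    preserve
    keep if firsttreat == `a' | missing(firsttreat)
    quietly sdid womparl country year quota, vce(noinference)
    display "cohort `a': tau = " %7.3f e(ATT)
    restore
}
```

If the package estimates each cohort in the same way, these values should line up with the rows of e(tau) from the single staggered call, though that equivalence is not verified here.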

graph LR
POOL["110 never-treated<br/>controls (donor pool)"]
C1["Cohort 2000<br/>+ controls"]
C2["Cohort 2002<br/>+ controls"]
CD["Cohorts 2003…2013<br/>+ controls"]
T1["SDID &rarr; &tau;<sub>2000</sub> = 8.4"]
T2["SDID &rarr; &tau;<sub>2002</sub> = 7.0"]
TD["SDID &rarr; &tau;<sub>a</sub><br/>(&minus;3.5 … +21.8)"]
ATT["Aggregate ATT = 8.0<br/>weighted by treated periods"]
POOL --> C1 --> T1 --> ATT
POOL --> C2 --> T2 --> ATT
POOL --> CD --> TD --> ATT
style POOL fill:#6a9bcc,stroke:#141413,color:#fff
style C1 fill:#d97757,stroke:#141413,color:#fff
style C2 fill:#d97757,stroke:#141413,color:#fff
style CD fill:#d97757,stroke:#141413,color:#fff
style T1 fill:#1f2b5e,stroke:#6a9bcc,color:#fff
style T2 fill:#1f2b5e,stroke:#6a9bcc,color:#fff
style TD fill:#1f2b5e,stroke:#6a9bcc,color:#fff
style ATT fill:#00d4c8,stroke:#141413,color:#141413

The overall ATT aggregates the cohort effects with non-negative weights equal to each cohort’s share of treated unit-by-post-period observations:

$$ \widehat{ATT} = \sum_{a \in \mathcal{A}} \frac{N_{tr}^{a}\, T_{post}^{a}}{\sum_{b \in \mathcal{A}} N_{tr}^{b}\, T_{post}^{b}}\ \hat{\tau}_a $$

In words: a cohort counts in proportion to how many treated country-years it contributes. The 2000 cohort, treated for 16 years (2000–2015), carries more weight than the 2013 cohort, treated for only 3. This is the staggered generalization of single-cohort SDID, and — unlike TWFE — every weight is positive and interpretable. (When each cohort has one treated unit, this reduces to the post-period share $T_{post}^{a}/T_{post}$ from Clarke et al., 2024.)

7. Estimation in Stata

One command does the whole staggered procedure. We request bootstrap inference and a fixed seed for reproducibility.

sdid womparl country year quota, vce(bootstrap) seed(1213)
matrix list e(tau)

Synthetic Difference-in-Differences Estimator
-----------------------------------------------------------------------------
womparl | ATT Std. Err. t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------
quota | 8.03410 3.74040 2.15 0.032 0.70305 15.36516
-----------------------------------------------------------------------------

The overall ATT is +8.03 percentage points (SE 3.74, $t=2.15$, $p=0.032$), with a 95% confidence interval of [0.70, 15.37] that excludes zero. Substantively: adopting a parliamentary gender quota raises the share of women in parliament by about eight percentage points in the adopting countries — a large effect against a sample mean of 15%, and statistically distinguishable from no effect at the 5% level.
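The test statistic, p-value, and interval in the output table follow from the ATT and its standard error under a normal approximation. The short check below uses only the two numbers printed above; the scalar names are arbitrary.

```stata
* Values copied from the sdid output table
scalar att = 8.03410
scalar se  = 3.74040

scalar t_stat = att / se
scalar p_val  = 2 * (1 - normal(abs(t_stat)))     // two-sided, normal reference
scalar ci_lo  = att - invnormal(0.975) * se
scalar ci_hi  = att + invnormal(0.975) * se

display "t = " %4.2f t_stat "   p = " %5.3f p_val
display "95% CI = [" %5.2f ci_lo ", " %5.2f ci_hi "]"
```

The results reproduce the reported t of 2.15, p of 0.032, and interval of [0.70, 15.37] to the printed precision.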

The single number, though, is the average of a very heterogeneous set of cohort effects, returned in e(tau):

T[7,3]
Tau Std.Err. Time
r1 8.3888685 .68278345 2000
r2 6.9677465 .64102999 2002
r3 13.952256 9.1289943 2003
r4 -3.4505431 .75603453 2005
r5 2.7490355 .44799502 2010
r6 21.762716 .91589982 2012
r7 -.82032354 .83151601 2013

The cohort effects span an enormous range: from −3.5 points (2005 cohort) to +21.8 points (2012 cohort), with the 2003 cohort essentially uninformative (SE 9.13, a confidence interval that runs from about −4 to +32). The teal line marks the aggregate ATT of 8.0. Notice that this aggregate is not the simple average of the seven cohort effects, which would be about 7.1. It is the treated-period-weighted average from the aggregation formula, which up-weights the earlier, longer-exposed 2000, 2002, and 2003 cohorts (together 70 of the 94 treated country-years). The lesson of the figure is that “+8 points on average” is a summary of real heterogeneity, not a universal constant; some quotas were transformative, others did nothing measurable.
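The gap between the weighted and unweighted averages can be reproduced by hand from the e(tau) values and the cohort sizes tabulated earlier. The sketch below hard-codes those published numbers; the variable names are arbitrary. Note that it clears the data in memory, so run it in a separate session or wrap it in preserve and restore.

```stata
clear
* cohort year, treated countries, and tau copied from the output in this tutorial
input int cohort byte ntr double tau
2000 1  8.3888685
2002 2  6.9677465
2003 2 13.952256
2005 1 -3.4505431
2010 1  2.7490355
2012 1 21.762716
2013 1 -0.82032354
end

gen tpost = 2015 - cohort + 1               // post-adoption years through 2015
gen cy    = ntr * tpost                     // treated country-years per cohort
egen total_cy = total(cy)
gen weight  = cy / total_cy                 // non-negative, sums to one
gen contrib = weight * tau

quietly summarize contrib
scalar att_weighted = r(sum)
quietly summarize tau
scalar att_simple = r(mean)

list cohort ntr tpost weight, noobs
display "weighted ATT = " %6.3f att_weighted "   simple mean = " %6.3f att_simple
```

The weighted figure matches the reported 8.03, while the simple mean comes out near 7.08, which is the distinction the text draws.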

To see the synthetic-control machinery underneath one cohort, the figure below plots the 2002 cohort against its synthetic control. Because SDID matches the pre-period trend and lets the unit fixed effect absorb the level gap, we anchor the synthetic to the treated cohort by its $\lambda$-weighted pre-period gap so the two align before adoption.

The treated 2002 cohort (orange) and its anchored synthetic control (blue dashed) track each other closely before 2002 — the synthetic was built precisely to do so — and then diverge: the treated cohort climbs to roughly 15% women in parliament while the synthetic counterfactual reaches only about 9–10%. That post-2002 gap is the cohort effect, about +7 points, matching $\hat{\tau}_{2002}=6.97$ from e(tau).

Which pre-period years anchor that comparison? The time weights $\hat{\lambda}_t$ for the 2002 cohort do not spread evenly over 1990–2001 — they concentrate on the years just before adoption.

The bars show SDID’s baseline for the 2002 cohort leaning on the late 1990s and 2001 — the pre-adoption years whose level most resembles the post-adoption period — rather than weighting all twelve pre-years equally as a plain difference-in-differences would. This is the time-weighting half of SDID at work: it builds the “before” from the most relevant history, which is also the baseline the event study below measures against.
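The weights behind these two figures can be inspected after estimation. The sketch below assumes, per the package help as I understand it, that sdid stores the time weights in e(lambda) and the unit weights in e(omega), and that `vce(noinference)` skips the variance step. It restricts the data to the 2002 cohort so that each matrix refers to a single cohort; the exact layout of the matrices is not asserted here.

```stata
webuse set www.damianclarke.net/stata/
webuse quota_example, clear
bysort country (year): egen firsttreat = min(cond(quota==1, year, .))

* Keep the 2002 cohort and the never-treated donor pool
keep if firsttreat == 2002 | missing(firsttreat)

* Point estimate only; the weights are stored alongside it
sdid womparl country year quota, vce(noinference)

matrix list e(lambda)       // time weights on the pre-adoption years
matrix list e(omega)        // unit weights on the donor countries
```

Because this call omits the covariate, the weights need not coincide exactly with those behind any figure that was drawn from a covariate-adjusted model.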

8. Adding a covariate: optimized vs projected

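Does the quota effect simply reflect economic development, with richer countries electing more women whether or not they have a quota? We can condition on log GDP per capita. The sdid command offers two routes. Because SDID needs a balanced panel with no missing values, we first drop the country-years with missing lngdp (the 104 missing values equal 4 × 26, which is consistent with four countries lacking GDP data entirely; it is worth confirming that the panel is still balanced after the drop).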

drop if missing(lngdp)
sdid womparl country year quota, vce(bootstrap) seed(2022) covariates(lngdp, optimized)
sdid womparl country year quota, vce(bootstrap) seed(1213) covariates(lngdp, projected)

SDID + lngdp (optimized) ATT = 8.0515 SE = 3.0466
SDID + lngdp (projected) ATT = 8.0593 SE = 3.1191

The two methods differ in how they estimate the covariate’s coefficient. The optimized method (Arkhangelsky et al., 2021) folds the covariate adjustment into the SDID optimization itself, estimating it jointly with the weights; it is flexible but computationally heavy. The projected method (Kranz, 2022) instead regresses the outcome on the covariate among the untreated observations first, then runs SDID on the residuals, which is much faster and numerically more stable. Reassuringly, here they agree to within a hundredth of a point (8.05 and 8.06) and are essentially unchanged from the no-covariate estimate of 8.03. Controlling for income does not explain away the quota effect; the result is robust to the most obvious confounder.
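The projected route can be approximated by hand in two steps, which makes the description above concrete. This is a rough sketch under stated assumptions, not the package's code: it assumes the first-stage regression includes country and year fixed effects and is fit on observations with quota equal to 0, and it uses a hypothetical numeric id `cid` and a hypothetical adjusted outcome `womparl_adj`. Whether it reproduces 8.06 exactly is not verified here.

```stata
webuse set www.damianclarke.net/stata/
webuse quota_example, clear
drop if missing(lngdp)
egen cid = group(country)                   // hypothetical numeric country id

* Step 1: covariate coefficient from untreated observations only,
*         with country fixed effects absorbed and year dummies included
areg womparl lngdp i.year if quota == 0, absorb(cid)

* Step 2: strip the covariate's contribution from every observation
gen double womparl_adj = womparl - _b[lngdp] * lngdp

* Step 3: plain SDID on the adjusted outcome (point estimate only)
sdid womparl_adj country year quota, vce(noinference)
display "hand-rolled projected ATT = " %7.4f e(ATT)
```

The design point is that the covariate coefficient is fixed before any weights are chosen, which is why this route is faster than optimizing both jointly.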

9. The event study with `sdid_event`

A single ATT — even per cohort — cannot tell us when the effect appears, or whether treated and control countries were already diverging before the quota. For that we need an event study: the treatment effect traced out by years relative to adoption. The modern sdid_event command (Ciccia, Clarke & Pailañir, 2024) computes exactly this for SDID, including pre-period placebo estimates that serve as a parallel-trends test.

The dynamic effect at event time $\ell$ is the treated-minus-synthetic gap in that period, net of the same gap at baseline, where — characteristically for SDID — the baseline is the $\lambda$-weighted pre-period average rather than a single “year −1”:

$$ \delta_{\ell} = \left(\bar{Y}_{\ell}^{\,tr} - \bar{Y}_{\ell}^{\,co}\right) - \left(\bar{Y}_{base}^{\,tr} - \bar{Y}_{base}^{\,co}\right), \qquad \bar{Y}_{base}^{\,g} = \sum_{t=1}^{T_{pre}} \hat{\lambda}_t\, \bar{Y}_t^{\,g} $$

sdid_event handles the full staggered panel directly, returning a cohort-aggregated ATT plus dynamic effects. To read the dynamics transparently we focus the plot on the 2002 cohort — the package authors’ own worked example — which gives a clean event-time axis; the full-panel call confirms the same aggregated ATT (≈ 8.06).

ssc install sdid_event, replace
* full staggered panel: aggregated ATT + cohort-aggregated dynamic effects
sdid_event womparl country year quota, vce(bootstrap) brep(100) effects(8) placebo(5) covariates(lngdp)
* clean event study on the 2002 cohort, with all placebos
keep if firsttreat==2002 | missing(firsttreat)   // firsttreat is the adoption year built in section 3.1
sdid_event womparl country year quota, vce(placebo) brep(100) placebo(all) covariates(lngdp)

 | Estimate SE LB CI UB CI Switchers
-------------+------------------------------------------------------
ATT | 6.853472 3.372744 .2428928 13.46405 2
Effect_1 | 4.086404 1.191517 1.75103 6.421778 2
Effect_2 | 9.164442 1.522799 6.179756 12.14913 2
Effect_3 | 7.938504 2.182572 3.660663 12.21635 2
... |
Placebo_1 | -.218417 .470226 -1.14006 .703227 2
Placebo_2 | .242148 .884557 -1.491584 1.975880 2
... |

This plot rewards careful reading, and there are three things to look for.

First, the baseline is $\lambda$-weighted, not “the year before.” Unlike a textbook event study that normalizes to $t=-1$, SDID measures everything against the optimally weighted pre-period average. That is why the zero line is a weighted baseline; do not read it as the single pre-adoption year.

Second, the points to the left of zero are placebo tests. Every pre-adoption coefficient (Placebo_1 through Placebo_12, event times −1 to −12) sits within a whisker of zero, ranging only from about −0.2 to +0.8. Because the treated cohort and its synthetic control moved in parallel before 2002, we cannot reject that the parallel-(synthetic-)trends assumption holds. This makes the identifying assumption visible, and here it survives the test.

Third, the points to the right of zero are the dynamic ATT. The effect appears immediately at adoption (Effect_1 = +4.1 points at event time 0), roughly doubles within a year or two (Effect_2 = +9.2), and then settles in the +6 to +9 range for over a decade. Quotas do not just shift the level once; they sustain a higher share of women in parliament. Aggregated by the same treated-period logic as before, these dynamic effects reproduce the cohort’s overall ATT of about +7 points — but the plot shows the shape the single number conceals.

10. Inference: bootstrap, jackknife, and placebo

With one treated unit (California), the previous tutorial could only use placebo/permutation inference. With nine treated units here, all three of sdid’s variance estimators are on the table. To keep the comparison clean — jackknife needs more than one treated unit per adoption period — we follow Clarke et al. (2024) and restrict to the two-country 2002 and 2003 cohorts by dropping the five single-country cohorts.

graph TD
Q1{"How many<br/>treated units?"}
Q1 -->|"One (e.g. California)"| PL1["Placebo only<br/>jackknife undefined"]
Q1 -->|"Many (e.g. 9 quota adopters)"| Q2{"More controls than treated?<br/>no singleton cohorts?"}
Q2 -->|"Yes"| ALL["All three available"]
Q2 -->|"Singleton cohorts"| PL2["Placebo / bootstrap<br/>jackknife drops out"]
ALL --> BOOT["bootstrap<br/>SE 4.7 (default)"]
ALL --> JACK["jackknife<br/>SE 6.0 (most conservative)"]
ALL --> PLAC["placebo<br/>SE 2.3 (homoskedastic)"]
style Q1 fill:#141413,stroke:#6a9bcc,color:#fff
style Q2 fill:#141413,stroke:#6a9bcc,color:#fff
style PL1 fill:#d97757,stroke:#141413,color:#fff
style PL2 fill:#d97757,stroke:#141413,color:#fff
style ALL fill:#00d4c8,stroke:#141413,color:#141413
style BOOT fill:#6a9bcc,stroke:#141413,color:#fff
style JACK fill:#6a9bcc,stroke:#141413,color:#fff
style PLAC fill:#6a9bcc,stroke:#141413,color:#fff

drop if inlist(country,"Algeria","Kenya","Samoa","Swaziland","Tanzania")
sdid womparl country year quota, vce(bootstrap) seed(1213)
sdid womparl country year quota, vce(placebo) seed(1213)
sdid womparl country year quota, vce(jackknife)

method         att       se      ci_l      ci_u
bootstrap   10.33066   4.7291    1.0618   19.5995
placebo     10.33066   2.3404    5.7436   14.9178
jackknife   10.33066   6.0056   -1.4401   22.1014

The point estimate is identical across all three methods (10.33 points on this subsample) because the inference procedure changes only the standard error, never the estimate. But the standard errors differ by a factor of about 2.6 between the extremes: jackknife is the most conservative (SE 6.01, with a confidence interval that crosses zero), placebo is the tightest (SE 2.34) but rests on a homoskedasticity assumption and requires more controls than treated units, and bootstrap sits in between (SE 4.73) and is the default. The practical takeaway: with only a handful of treated units, report the bootstrap as your headline but cross-check it, because a result that is “significant” under placebo but not under jackknife deserves caution. (The subsample ATT of 10.3 is larger than the full-sample 8.0 because dropping the five single-country cohorts discards the negative 2005 and 2013 effects.)
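The intervals in the table follow from the normal approximation, estimate ± z × SE. The check below is a minimal sketch that hard-codes the ATT and standard errors printed above and rebuilds each 95% interval, together with the ratio of the largest to the smallest standard error. The scalar names are illustrative and are not objects returned by sdid.

```stata
* Values copied from the results table above; scalar names are illustrative
scalar att     = 10.33066
scalar se_boot = 4.7291
scalar se_plac = 2.3404
scalar se_jack = 6.0056
scalar z       = invnormal(0.975)      // about 1.96 for a 95% interval

* Normal-approximation interval: att +/- z * se
foreach m in boot plac jack {
    scalar lo_`m' = att - z * se_`m'
    scalar hi_`m' = att + z * se_`m'
    display "`m': [" %8.4f lo_`m' ", " %8.4f hi_`m' "]"
}

* Spread between the most and least conservative standard errors
scalar se_ratio = se_jack / se_plac
display "jackknife SE / placebo SE = " %4.2f se_ratio
```

Only the jackknife interval has a negative lower bound, which is why it is the one method here that cannot rule out a zero effect.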

11. Robustness and discussion

Three caveats keep the result honest. Effect concentration: the +8 aggregate leans heavily on a few cohorts — the 2012 cohort alone contributes a +21.8 effect, and the early 2000/2002/2003 cohorts carry most of the aggregation weight. Drop the 2012 cohort and the average falls noticeably. Fragile counterfactuals: with only 110 controls and as few as one treated country per cohort, some synthetic controls are imprecise — the 2003 cohort’s standard error of 9.13 is the tell. Identifying assumptions: SDID still requires no anticipation, an absorbing treatment, no cross-country spillovers, and that quota timing is not itself a response to the outcome’s trajectory; the flat event-study placebos support, but cannot prove, the parallel-trends part. Finally, quota_example is a teaching subset of Bhalotra et al. (2023); these numbers illustrate the method, not a final verdict on quota policy.

12. Summary and key takeaways

Method. Staggered SDID estimates a separate, clean synthetic difference-in-differences for each adoption cohort — comparing it only to never-treated controls — and aggregates the cohort effects $\hat{\tau}_a$ with non-negative, treated-period-share weights. This avoids the negative-weighting trap that contaminates naive two-way fixed-effects DiD under staggered timing.
Result. Gender quotas raise the share of women in parliament by an overall ATT of +8.0 percentage points (SE 3.74, $p=0.032$), robust to a log-GDP control (8.05 optimized, 8.06 projected). Cohort effects range widely, from −3.5 to +21.8 points — heterogeneity the single number hides.
Event study. The sdid_event plot shows pre-adoption placebo coefficients near zero (parallel synthetic trends) and post-adoption effects that appear immediately and persist for over a decade — the dynamics behind the average.
Inference. With multiple treated units, sdid offers bootstrap, jackknife, and placebo inference, though jackknife requires more than one treated unit per cohort. On the two-cohort illustration all three share one point estimate (10.3) but report standard errors of 4.7, 6.0, and 2.3, respectively. Jackknife is the most conservative.
Bridge. The block design (Proposition 99, the previous tutorial) and the staggered design here are two faces of one estimator — the staggered version is just single-cohort SDID, done once per cohort and averaged.
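The aggregation step in the Method point can be written compactly. The expression below is a sketch of the treated-period-share weighting described in this tutorial; the symbols $w_a$, $N_a$, and $T_a$ are shorthand introduced here (cohort weight, number of treated countries in cohort $a$, and number of post-adoption years for cohort $a$), not notation taken from the sdid documentation.

$$ \widehat{ATT} = \sum_{a} w_a\, \hat{\tau}_a, \qquad w_a = \frac{N_a\, T_a}{\sum_{b} N_b\, T_b}, \qquad w_a \ge 0, \quad \sum_{a} w_a = 1 $$

For example, assuming the adoption year itself counts as treated and the panel ends in 2015, the two-country 2002 cohort contributes $2 \times 14 = 28$ treated country-years, while a single-country 2013 cohort contributes $1 \times 3 = 3$. Early, larger cohorts therefore dominate the average.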

13. Exercises

Re-aggregate by hand. Pull e(tau) and each cohort’s treated unit-count and post-period length. Verify that the treated-period-weighted average of the seven $\hat{\tau}_a$ reproduces the overall ATT of 8.03, and show that it differs from the unweighted mean (≈ 7.0). Which cohorts move the aggregate the most?
Inference sensitivity. Re-run the full nine-country sample with vce(bootstrap) and then vce(placebo) at reps(500). How much do the standard error and confidence interval move, and which would you report given only nine treated units?
Drop the outlier cohort. Re-estimate the overall ATT excluding the 2012 cohort (the +21.8 outlier). How far does the aggregate fall, and what does that tell you about how concentrated the average effect is?

14. References

Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic Difference-in-Differences. American Economic Review, 111(12), 4088–4118.
Clarke, D., Pailañir, D., Athey, S., & Imbens, G. (2024). On Synthetic Difference-in-Differences and Related Estimation Methods in Stata. The Stata Journal, 24(4). Package: ssc install sdid.
Ciccia, D. (2024). A Short Note on Event-Study Synthetic Difference-in-Differences Estimators. Package: ssc install sdid_event.
Bhalotra, S., Clarke, D., Gomes, J. F., & Venkataramani, A. (2023). Maternal Mortality and Women’s Political Power. Journal of the European Economic Association. (Source of the quota_example data.)
Goodman-Bacon, A. (2021). Difference-in-Differences with Variation in Treatment Timing. Journal of Econometrics, 225(2), 254–277.
de Chaisemartin, C., & D’Haultfœuille, X. (2020). Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects. American Economic Review, 110(9), 2964–2996.
Xu, Y., & Hua, L. panelView: Visualizing Panel Data. Package: ssc install panelview.

Related tutorials on this site: Synthetic Difference-in-Differences (the block design) · Difference-in-Differences.

15. Acknowledgments

This tutorial uses the sdid command (Clarke, Pailañir, Athey & Imbens), the sdid_event command (Ciccia, Clarke & Pailañir), and panelview (Xu & Hua). The data, quota_example, is distributed with sdid and draws on Bhalotra, Clarke, Gomes & Venkataramani (2023). All estimates were produced by the companion analysis.do and verified against Clarke et al. (2024). AI tools (Claude Code) assisted with drafting and figure preparation; all code was executed and every number checked by the author.


Synthetic Difference-in-Differences (SDID) in Stata: Re-evaluating California's Proposition 99

Sun, 07 Jun 2026 00:00:00 +0000

Abstract

Comparative case studies—where a single large unit adopts a policy and the analyst must recover its causal effect without ever observing the untreated counterfactual—are a recurring challenge in policy evaluation, and the canonical example is California’s Proposition 99, the 1988 ballot measure that raised the cigarette excise tax by 25 cents a pack and funded an anti-smoking campaign. This tutorial introduces and derives synthetic difference-in-differences (SDID) and applies it to re-evaluate Proposition 99, contrasting it with classic difference-in-differences (DiD) and synthetic control (SC). The data are the canonical strongly balanced panel distributed with the sdid package (originally from Abadie, Diamond, and Hainmueller 2010): 39 US states observed annually from 1970 to 2000—1,209 observations, of which only 12 are treated—with annual cigarette sales in packs per capita as the sole outcome and California as the single treated unit from 1989. The methods estimate the average treatment effect on the treated by writing DiD, SC, and SDID as one weighted two-way fixed-effects regression in Stata, using the sdid command (Clarke et al. 2024) and cross-checking SC against synth2. All three estimators agree the policy reduced smoking but disagree on magnitude: the 2×2 DiD gives −27.35 packs per capita, synthetic control −19.48 (pre-period RMSE 1.66, R² 0.98, leaning on Utah, Montana, and Nevada), and SDID −15.60—roughly a 20% reduction—with SDID’s time weights concentrated entirely on 1986–1988. With one treated unit, placebo inference is the only valid procedure: the placebo standard error is 9.88 (95% CI [−35.0, 3.8], including zero) while the permutation test ranks California’s effect extreme (p = 0.026). 
The implication is that a single sdid command unifies all three estimators, and SDID is the preferred single number because, by allowing a constant level gap and up-weighting the informative late-1980s years, it relies least on the exact parallel-trends assumption the others lean on hardest.

1. Overview

In November 1988 California voters passed Proposition 99, which raised the cigarette excise tax by 25 cents a pack and funded a large anti-smoking campaign. Did it actually reduce smoking? This is the textbook question of comparative case study research: a single, large unit (California) adopts a policy, and we want the causal effect even though we can never observe the California that did not pass Proposition 99.

This tutorial builds up to synthetic difference-in-differences (SDID), the estimator of Arkhangelsky, Athey, Hirshberg, Imbens, and Wager (2021), and applies it with the sdid command of Clarke, Pailañir, Athey, and Imbens (2024). SDID is best understood as the marriage of two older ideas:

Difference-in-differences (DiD) — compare California’s before/after change to the before/after change of all control states.
Synthetic control (SC) — build a “synthetic California” as a weighted average of control states that tracks California before the policy.

SDID keeps the best of both: like SC it chooses unit weights so the comparison group resembles California, and like DiD it allows a constant level gap between California and its comparison group (a unit fixed effect). It then adds one more ingredient SC lacks — time weights that emphasize the pre-policy years most predictive of the post-policy period.

A second theme runs through the whole tutorial, and it is worth stating up front. As Clarke et al. (2024) put it, along with SDID, the sdid command implements standard synthetic control and difference-in-differences in an identical framework, allowing estimation, inference, and graphical output in a computationally efficient way. We will show this concretely: the same command, changing only one option, reproduces the raw difference-in-differences and the classic synthetic control — and we cross-check the latter against the dedicated synth2 command.

Learning objectives

By the end you will be able to:

Derive the SDID estimator as a weighted two-way fixed-effects regression and read its unit-weight and time-weight optimization problems.
Distinguish SDID from the original DiD and SC — conceptually (which weights, which fixed effects) and quantitatively (on the same data).
Estimate the effect of Proposition 99 with sdid, and reproduce DiD and SC from the very same command.
Compare the SDID synthetic against a classical synthetic control fit with synth2.
Conduct valid inference when there is a single treated unit, using placebo (permutation) methods — and recognize when other procedures (bootstrap, jackknife) do and do not apply.

What we are estimating

Throughout, the estimand is the average treatment effect on the treated (ATT) — the effect of Proposition 99 on California, over the post-1988 period:

$$ \tau = \frac{1}{N_{tr}\, T_{post}} \sum_{i:\, W_i = 1}\ \sum_{t > T_{pre}} \left[\, Y_{it}(1) - Y_{it}(0) \,\right] $$

In words: average, over treated units and post-treatment years, the difference between the outcome with the policy, $Y_{it}(1)$, and the outcome that would have occurred without it, $Y_{it}(0)$. Here there is exactly one treated unit ($N_{tr} = 1$, California), and $Y_{it}(0)$ is never observed after 1988 — every method in this tutorial is a different way of imputing that missing counterfactual. Because California was not randomly assigned to treatment, this is an observational design: identification rests on assumptions (a stable comparison group, no large contemporaneous shocks unique to California) rather than on randomization.
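Specializing the estimand to this application makes it concrete. With one treated unit and twelve post-treatment years (1989 to 2000), the double sum collapses to a simple average over years. The subscript $CA$ is shorthand introduced here for California.

$$ \tau = \frac{1}{12} \sum_{t=1989}^{2000} \left[\, Y_{CA,t}(1) - Y_{CA,t}(0) \,\right] $$

The first term inside the brackets is observed every year. The second is the counterfactual that each estimator below has to impute.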

Key concepts at a glance

Counterfactual — what California's smoking would have been without Proposition 99.

Every estimator here is a recipe for the dashed line “California if the policy had never passed.” DiD, SC, and SDID disagree only about how to build it.

Unit weights (ω) — how much each control state counts toward the synthetic California.

DiD gives every control the same weight ($1/N_{co}$). SC and SDID instead pick weights so the weighted controls reproduce California’s pre-policy outcome path. SC concentrates weight on a handful of states; SDID spreads it more widely.

Time weights (λ) — how much each pre-policy year counts.

This is SDID’s signature. Rather than treat every pre-1989 year equally, SDID up-weights the pre-period years that best predict the post-period — here, 1986–1988. SC and DiD have no time weights.

Unit fixed effects (α) — a constant level gap between California and its synthetic comparison.

DiD and SDID include them, so the comparison group only needs to move in parallel with California, not sit at the same level. Classic SC omits them and instead tries to match California’s level outright.

Placebo inference — how we get a standard error with only one treated unit.

We pretend, one at a time, that a control state was “treated,” re-estimate the effect, and build the distribution of these placebo effects. If California’s real effect is extreme relative to that distribution, it is unlikely to be noise.

2. The Proposition 99 case study

We use the canonical dataset distributed with the sdid package (originally from Abadie, Diamond, and Hainmueller 2010, and used by Arkhangelsky et al. 2021). It is a strongly balanced panel: 39 US states observed annually from 1970 to 2000, with one outcome — annual cigarette sales in packs per capita. California is the single treated unit; the policy bites from 1989 onward. The remaining 38 states (which did not pass comparable large-scale tobacco programs in this window) form the donor pool.

Variable	Role	Description
`state`	unit id	39 US states (California + 38 controls)
`year`	time id	1970–2000 (19 pre-, 12 post-treatment years)
`packspercapita`	outcome $Y_{it}$	annual cigarette pack sales per capita
`treated`	treatment $W_{it}$	1 for California in 1989–2000, else 0

One feature matters for a fair comparison: this panel contains only the outcome — no income, price, or demographic covariates. That is deliberate here. It means synthetic control and SDID see exactly the same information (California’s and the donors’ pre-period smoking paths), so any difference in their answers comes from the estimator, not from a different set of predictors.

graph LR
POOL["<b>Donor pool</b><br/>38 control states<br/>Utah, Nevada, Montana, …"]
CA["<b>California</b><br/>treated 1989"]
SYN["<b>Synthetic California</b><br/>counterfactual Y(0)"]
POOL -->|weighted average ω| SYN
CA -->|compare after 1989| SYN
style POOL fill:#6a9bcc,stroke:#141413,color:#fff
style CA fill:#d97757,stroke:#141413,color:#fff
style SYN fill:#00d4c8,stroke:#141413,color:#141413

Let us first look at the data with no model at all — California against the simple average of the 38 control states.

use prop99_example.dta, clear
describe
encode state, gen(id)
xtset id year

Contains data from prop99_example.dta
Observations: 1,209
Variables: 4
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
state str14 %14s State
year int %8.0g Year
packspercapita float %9.0g PacksPerCapita
treated byte %8.0g
-------------------------------------------------------------------------------
Panel variable: id (strongly balanced)
Time variable: year, 1970 to 2000
Delta: 1 unit

The panel is strongly balanced (no gaps), which every method below requires. The figure compares California to the raw control average.

California (orange) already smoked less than the average control state and was declining throughout the 1980s. After 1989 the gap widens visibly. But two problems jump out: California sits on a different level than the average donor, and it was already on a different trend before 1989. A credible estimate must deal with both — the job of the three estimators below.

3. Three estimators, one equation

The cleanest way to see how DiD, SC, and SDID relate is to write them all as the same weighted two-way fixed-effects (TWFE) regression and change only the weights. This is the unifying view of Arkhangelsky et al. (2021).

Synthetic difference-in-differences

SDID solves a weighted TWFE regression:

$$ \left(\hat{\tau}^{sdid}, \hat{\mu}, \hat{\alpha}, \hat{\beta}\right) = \underset{\tau,\mu,\alpha,\beta}{\arg\min} \sum_{i=1}^{N} \sum_{t=1}^{T} \left(Y_{it} - \mu - \alpha_i - \beta_t - W_{it}\,\tau\right)^{2} \hat{\omega}_i^{sdid}\ \hat{\lambda}_t^{sdid} $$

Reading the symbols against the Stata variables: $Y_{it}$ is packspercapita; $W_{it}$ is treated; $\alpha_i$ is a state fixed effect (one per state); $\beta_t$ is a year fixed effect (one per year); and $\tau$ is the ATT we want. The two extra terms are the difference from ordinary regression: $\hat{\omega}_i^{sdid}$ is a unit weight (how much state $i$ counts) and $\hat{\lambda}_t^{sdid}$ is a time weight (how much year $t$ counts). Set those weights to special values and you recover the older estimators.

The original difference-in-differences

DiD is the special case with no weighting — every unit and every year counts equally:

$$ \left(\hat{\tau}^{did}, \hat{\mu}, \hat{\alpha}, \hat{\beta}\right) = \underset{\tau,\mu,\alpha,\beta}{\arg\min} \sum_{i=1}^{N} \sum_{t=1}^{T} \left(Y_{it} - \mu - \alpha_i - \beta_t - W_{it}\,\tau\right)^{2} $$

This is just two-way fixed-effects regression. Its credibility hinges entirely on parallel trends: the assumption that, absent the policy, California would have moved in lockstep with the average control state. The raw-trends figure already makes that assumption look shaky.

The original synthetic control

SC keeps unit weights but drops the time weights and the unit fixed effects $\alpha_i$:

$$ \left(\hat{\tau}^{sc}, \hat{\mu}, \hat{\beta}\right) = \underset{\tau,\mu,\beta}{\arg\min} \sum_{i=1}^{N} \sum_{t=1}^{T} \left(Y_{it} - \mu - \beta_t - W_{it}\,\tau\right)^{2} \hat{\omega}_i^{sc} $$

Without $\alpha_i$, SC cannot absorb a level gap, so it must build a synthetic California that matches California’s pre-period outcomes in both level and trend. That is a demanding requirement — and the reason SC sometimes cannot find a good fit.

graph TD
OBJ["<b>One weighted two-way<br/>fixed-effects regression</b><br/><i>min Σ (Y − μ − α − β − Wτ)² · ω · λ</i>"]
OBJ --> DID["<b>DiD</b><br/>ω uniform, λ uniform<br/>α included<br/><i>parallel trends on all controls</i>"]
OBJ --> SC["<b>Synthetic control</b><br/>ω optimized, no λ<br/><b>no</b> unit FE α<br/><i>match level AND trend</i>"]
OBJ --> SDID["<b>SDID</b><br/>ω optimized + λ optimized<br/>α included<br/><i>match trend, allow level gap</i>"]
style OBJ fill:#141413,stroke:#6a9bcc,color:#fff
style DID fill:#d97757,stroke:#141413,color:#fff
style SC fill:#6a9bcc,stroke:#141413,color:#fff
style SDID fill:#00d4c8,stroke:#141413,color:#141413

How the weights are chosen

The unit weights make the weighted controls track California’s pre-period path, with a small ridge penalty for stability:

$$ \hat{\omega}^{sdid} = \underset{\omega \in \Omega}{\arg\min} \sum_{t=1}^{T_{pre}} \left(\omega_0 + \sum_{i=1}^{N_{co}} \omega_i\, Y_{it} - \frac{1}{N_{tr}} \sum_{i=N_{co}+1}^{N} Y_{it}\right)^{2} + \zeta^{2}\, T_{pre}\, \lVert \omega \rVert_2^{2} $$

In words: choose nonnegative weights summing to one (the set $\Omega$) so the weighted control outcome, plus an intercept $\omega_0$, comes as close as possible to the treated outcome in every pre-treatment year. The intercept $\omega_0$ is what lets SDID match California’s trend without matching its level. The penalty $\zeta^{2} T_{pre} \lVert \omega \rVert_2^2$ discourages putting all weight on one or two donors; Arkhangelsky et al. set $\zeta = (N_{tr} T_{post})^{1/4}\, \hat{\sigma}$, with $\hat{\sigma}$ the standard deviation of first-differenced control outcomes.
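To get a feel for the size of the penalty in this application, the multiplier on $\hat{\sigma}$ can be computed directly from the design. The snippet below is a small sketch under the formula quoted above, using $N_{tr} = 1$ and $T_{post} = 12$. The scalar names are illustrative.

```stata
* Design constants for Proposition 99; scalar names are illustrative
scalar N_tr   = 1     // treated units (California)
scalar T_post = 12    // post-treatment years, 1989-2000

* zeta = (N_tr * T_post)^(1/4) * sigma_hat, so the multiplier on sigma_hat is:
scalar zeta_mult = (N_tr * T_post)^(1/4)
display "zeta = " %5.3f zeta_mult " x sigma_hat"
```

The multiplier is about 1.86, so the strength of the ridge penalty is driven mainly by $\hat{\sigma}$, the year-to-year volatility of the control states' sales.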

The time weights are the mirror image — they find pre-period years whose weighted average lines up with the post-period:

$$ \hat{\lambda}^{sdid} = \underset{\lambda \in \Lambda}{\arg\min} \sum_{i=1}^{N_{co}} \left(\lambda_0 + \sum_{t=1}^{T_{pre}} \lambda_t\, Y_{it} - \frac{1}{T_{post}} \sum_{t=T_{pre}+1}^{T} Y_{it}\right)^{2} + \zeta_{\lambda}^{2}\, N_{co}\, \lVert \lambda \rVert_2^{2} $$

This says: find pre-period year weights so the weighted pre-period control outcome matches each control’s post-period average. Years that look most like the post-period get the most weight. We will see SDID place essentially all pre-period weight on 1986–1988.

	Unit weights ω	Time weights λ	Unit FE α	Must match
DiD	uniform	uniform	yes	parallel trends vs. all controls
SC	optimized	none	no	California’s level and trend
SDID	optimized	optimized	yes	California’s trend (level gap allowed)

4. Loading the data

We already loaded and xtset the panel above. The sdid command takes the data in long form and needs four arguments — outcome, unit, time, and a 0/1 treatment indicator — so no further reshaping is required. The synth2 command additionally needs a numeric panel id and xtset, which we created with encode.

summarize packspercapita
tab treated

 Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
packsperca~a | 1,209 122.6493 35.04942 40.7 296.2
treated | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,197 99.01 99.01
1 | 12 0.99 100.00
------------+-----------------------------------

Only 12 of 1,209 observations are treated — California in its 12 post-1988 years. This extreme imbalance (one treated unit) is the defining feature of a comparative case study and, as we will see in Section 9, dictates how inference must be done.

5. A first look: the original difference-in-differences

The simplest credible estimate is a 2×2 difference-in-differences: compare California’s change from before to after 1989 with the control states’ change over the same window. The “difference in differences” removes anything common to all states (the nationwide decline in smoking) and anything fixed about California (its lower baseline level).

gen byte cal = state=="California"
gen byte post = year>=1989
reg packspercapita i.cal##i.post

------------------------------------------------------------------------------
packsperca~a | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
1.cal | -14.359 6.788699 -2.12 0.035 -27.67799 -1.040019
1.post | -28.51142 1.747208 -16.32 0.000 -31.93932 -25.08351
|
cal#post |
1 1 | -27.34911 10.91131 -2.51 0.012 -48.75638 -5.941839
|
_cons | 130.5695 1.087062 120.11 0.000 128.4368 132.7023
------------------------------------------------------------------------------

The interaction cal#post = −27.35 is the DiD estimate: relative to the control states, California’s smoking fell by about 27 packs per capita after Proposition 99. We can read the four group means straight off the table: control states averaged 130.57 packs before and 102.06 after (a drop of 28.5), while California went from 116.21 to 60.35 (a drop of 55.86). The difference of those drops, $-55.86 - (-28.51) = -27.35$, is the DiD.
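The arithmetic behind those four group means can be replayed from the coefficients alone. The sketch below hard-codes the estimates from the regression table (so it runs without the data) and rebuilds each cell mean and the difference-in-differences. The scalar names are illustrative.

```stata
* Coefficients copied from the regression table above; scalar names are illustrative
scalar b_cons = 130.5695    // controls, pre-1989 mean
scalar b_cal  = -14.359     // California minus controls, pre-1989
scalar b_post = -28.51142   // controls, post minus pre
scalar b_int  = -27.34911   // cal#post interaction (the DiD)

* Four cell means implied by the saturated 2x2 model
scalar ctrl_pre  = b_cons
scalar ctrl_post = b_cons + b_post
scalar cal_pre   = b_cons + b_cal
scalar cal_post  = b_cons + b_cal + b_post + b_int

* Difference of the two before/after changes
scalar did = (cal_post - cal_pre) - (ctrl_post - ctrl_pre)

display "controls:   " %7.2f ctrl_pre " -> " %7.2f ctrl_post
display "California: " %7.2f cal_pre  " -> " %7.2f cal_post
display "DiD:        " %7.2f did
```

Because the model is saturated (two groups by two periods, four coefficients), the fitted cell means equal the raw group means, and the interaction coefficient is exactly the difference of the two changes.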

But this number trusts the parallel-trends assumption against the simple average of 38 very different states — and the raw-trends figure showed California was already drifting away from that average before 1989. If California was on a steeper downward path for reasons unrelated to the policy, DiD will overstate the effect. This is the weakness synthetic methods are designed to fix.

6. The original synthetic control with `synth2`

Synthetic control replaces the simple average of controls with a weighted average chosen to track California before 1989. We fit it with synth2 (Yan and Chen 2023), a modern wrapper around Abadie’s synth that adds placebo tests and visualization. Because our panel has only the outcome, we match on the full pre-period path — each pre-1989 year of packspercapita enters as its own predictor. This is the fair, like-for-like analog to what SDID uses.

* California is id 3 after encode (alphabetical)
local preds
forvalues y = 1970/1988 {
    local preds "`preds' packspercapita(`y')"
}
synth2 packspercapita `preds', trunit(3) trperiod(1989) figure

 Number of Control Units = 38 Root Mean Squared Error = 1.65640
Number of Covariates = 19 R-squared = 0.97699
Optimal Unit Weights:
---------------------------
Unit | U.weight
--------------+------------
Utah | 0.3940
Montana | 0.2320
Nevada | 0.2050
Connecticut | 0.1090
NewHampshire | 0.0450
Colorado | 0.0150
---------------------------
Note: The average treatment effect over the posttreatment period is -19.4814.

The pre-period fit is excellent: a root mean squared prediction error of 1.66 packs and an $R^2$ of 0.98, meaning synthetic California reproduces real California almost exactly before 1989. The synthetic is built from just six donors, dominated by Utah (0.39), Montana (0.23), and Nevada (0.21), states that smoked like California before the program. The estimated effect averages −19.48 packs per capita over 1989–2000, smaller than the naive DiD’s −27.35: once we compare California to states that actually looked like it, part of the apparent drop turns out to be an artifact of the wrong comparison group, not the policy.

The fit before 1989 is the whole credibility argument for synthetic control: if the synthetic matches California for nineteen years and then diverges exactly when the policy starts, the divergence is plausibly the policy. The next figure shows the same thing as a single gap series.

The gap is essentially flat and near zero through 1988 — the pre-period fit is good — and then opens up after the policy, reaching roughly −27 packs by 2000. Averaged over the post-period, that is the −19.5 headline. The growing gap is consistent with a program whose effect compounds as the tax and campaign change long-run behavior.

7. Synthetic difference-in-differences with `sdid`

Now SDID. The syntax mirrors the data structure — outcome, unit, time, treatment — and one option, vce(), selects the inference method. We start with vce(noinference) to focus on the point estimate and the diagnostic graph.

sdid packspercapita state year treated, method(sdid) vce(noinference) graph

Synthetic Difference-in-Differences Estimator
-----------------------------------------------------------------------------
packsperca~a | ATT Std. Err. t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------
treated | -15.60383 . . . . .
-----------------------------------------------------------------------------

The SDID estimate is −15.60 packs per capita — smaller again than both DiD (−27.35) and SC (−19.48). Relative to the level SDID implies California would have smoked, this is roughly a 20% reduction, and it is the number reported in Arkhangelsky et al. (2021). Why is it smaller than SC’s? Because SDID does two things SC does not: it allows a constant level gap (so it is not forced to fit California’s level, only its trend), and it down-weights pre-period years that look nothing like the late 1980s. Both make the comparison more conservative.
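One way to read the "roughly 20%" figure is sketched below. It assumes the benchmark is California's average counterfactual level over 1989–2000, built as the observed post-period mean (60.35, from the group means in Section 5) minus the estimated ATT. That construction and the scalar names are assumptions of this sketch, not output of sdid.

```stata
* Inputs copied from earlier output; scalar names are illustrative
scalar att_sdid    = -15.60383   // SDID ATT from the table above
scalar ca_post_obs = 60.35       // California's observed post-1988 mean (Section 5)

* Counterfactual level = observed level minus the estimated effect
scalar ca_post_cf = ca_post_obs - att_sdid
scalar pct_change = 100 * att_sdid / ca_post_cf

display "counterfactual mean: " %6.2f ca_post_cf
display "percent change:      " %6.1f pct_change
```

Under this reading the implied counterfactual is about 76 packs per capita, and the ATT amounts to a reduction of about 20.5%.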

The graph option produces SDID’s signature diagnostic.

Two things are worth noticing. First, the synthetic “Control” line sits above California throughout — SDID does not try to close that level gap, because the unit fixed effect absorbs it. What SDID cares about is whether the two lines stay parallel before 1989 (they do) and then diverge after (they do). Second, the green shaded ribbon shows the time weights $\hat{\lambda}_t$ — and they are not uniform.

This is SDID’s distinctive move. Of the nineteen pre-policy years, it places all pre-period weight on 1986–1988 — the years most similar to the post-1989 period — and zero on 1970–1985. Intuitively, smoking behavior and its determinants in 1972 tell us little about the counterfactual for 1995; the late 1980s tell us much more. DiD and SC, by contrast, treat 1972 and 1988 as equally informative. We can confirm which states and years carry weight by asking sdid to return them:

sdid packspercapita state year treated, vce(noinference) returnweights mattitles

The returned unit weights $\hat{\omega}_i$ are diffuse compared with synthetic control’s: the largest are Nevada (0.12), New Hampshire (0.11), Connecticut (0.08), Delaware (0.07), and Colorado (0.06), with positive weight spread across roughly twenty states. Where synth2 leaned on six donors, SDID’s ridge penalty spreads the weight — trading a little pre-period fit for a more stable, less idiosyncratic comparison group. Both methods nonetheless agree on the kind of state that resembles California: Nevada, Utah, Montana, Connecticut, and Colorado appear prominently in both.

8. One command, three estimators

Here is the practical payoff emphasized by Clarke et al. (2024): the sdid command implements all three estimators in an identical framework. You do not switch packages or rewrite your model — you change the single option method(). Estimation, inference (vce()), and the diagnostic graph all work the same way for each.

sdid packspercapita state year treated, method(did) vce(noinference) graph
sdid packspercapita state year treated, method(sc) vce(noinference) graph
sdid packspercapita state year treated, method(sdid) vce(noinference) graph

DiD (sdid framework) = -27.34911
SC (sdid framework) = -19.61966
SDID = -15.60383

This is a strong internal consistency check. The framework’s method(did) returns −27.349 — identical, to the decimal, to the raw 2×2 interaction we computed by hand with reg in Section 5. And method(sc) returns −19.620, essentially the same as the −19.481 from the standalone synth2 command (the tiny gap reflects different regularization: sdid matches the full pre-period path with a ridge penalty, while synth2 optimizes Abadie’s predictor-weighting V-matrix). In other words, the unified command reproduces the two classic estimators we obtained by entirely separate routes — which is exactly the claim that they are special cases of one weighted regression. And because the optimal weights are computed once and reused across vce() options, doing so is computationally cheap.

The same graph option yields each method’s diagnostic, so they can be read side by side.

Stacking all four counterfactuals on one chart makes the ranking transparent. To put them on a common scale, the SDID counterfactual is anchored to California by its $\lambda$-weighted pre-period gap (recall SDID identifies effects only up to a constant level, which the unit fixed effect absorbs).

The story is consistent across methods — Proposition 99 reduced smoking — but the magnitude depends on how the counterfactual is built. The naive DiD is the most extreme because it compares California to a control average that was already on a different trajectory. Synthetic control fixes the comparison group and shrinks the estimate to about −19.5. SDID, by additionally allowing a level gap and weighting the informative late-1980s years, is the most conservative at −15.6. Reasonable methods bracket the truth; SDID’s contribution is to be robust to the assumption — exact parallel trends — that the others lean on hardest.

Collecting every estimate in one place:

| Method | Command | ATT (packs per capita) |
| --- | --- | --- |
| Raw 2×2 DiD | `reg y i.cal##i.post` | −27.35 |
| DiD (unified) | `sdid …, method(did)` | −27.35 |
| Synthetic control | `synth2 …` | −19.48 |
| SC (unified) | `sdid …, method(sc)` | −19.62 |
| SDID | `sdid …, method(sdid)` | −15.60 |

9. Inference: how sure are we?

A point estimate is not enough; we need a standard error. SDID’s variance feeds a familiar normal-approximation confidence interval:

$$ \hat{\tau}^{sdid} \pm z_{\alpha/2} \sqrt{\hat{V}_{\tau}} $$

Arkhangelsky et al. (2021) offer three ways to estimate $\hat{V}_{\tau}$: a bootstrap, a jackknife, and a placebo (permutation) procedure. The choice is not free here — it is forced by our design. With a single treated unit:

The jackknife is literally undefined. It works by deleting one unit at a time and re-estimating; when it deletes California, there is no treated unit left, so the treated-removed estimate does not exist.
The bootstrap relies on resampling many treated units; its asymptotics require the number of treated units to grow. With one treated unit it is unreliable.
The placebo procedure is the one valid option. It keeps the controls, repeatedly assigns the treatment structure to a control state as a fake “placebo” treatment, re-estimates the effect, and uses the spread of those placebo estimates as the variance.

graph TD
Q{"How many<br/>treated units?"}
Q -->|"One — e.g. California"| PL["<b>Placebo / permutation</b><br/><i>the valid choice here</i>"]
Q -->|"Many — e.g. staggered adoption"| BJ["Bootstrap or jackknife<br/><i>asymptotics in number of treated units</i>"]
PL --> THIS["this tutorial<br/>vce(placebo)"]
BJ --> OOS["out of scope<br/>(needs another design)"]
style Q fill:#141413,stroke:#6a9bcc,color:#fff
style PL fill:#00d4c8,stroke:#141413,color:#141413
style THIS fill:#6a9bcc,stroke:#141413,color:#fff
style BJ fill:#6a9bcc,stroke:#141413,color:#fff
style OOS fill:#d97757,stroke:#141413,color:#fff

So we run placebo inference, the appropriate choice for a comparative case study.

sdid packspercapita state year treated, vce(placebo) seed(1213)

Synthetic Difference-in-Differences Estimator
-----------------------------------------------------------------------------
packsperca~a |     ATT     Std. Err.     t      P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
     treated | -15.60383    9.87941    -1.58    0.114   -34.96712     3.75946
-----------------------------------------------------------------------------
95% CIs and p-values are based on large-sample approximations.

The placebo standard error is 9.88, giving a 95% interval of roughly [−35.0, 3.8]. Notice this interval includes zero: by the normal-approximation criterion, we cannot reject “no effect” at the 5% level ($p = 0.114$). With a single treated unit and a noisy donor pool, the SDID interval is genuinely wide — honest about how hard it is to be certain from one case.
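As a check, the reported interval and test statistic can be reproduced by hand from the point estimate and the placebo standard error, taking $z_{0.025} \approx 1.960$:

$$ -15.604 \pm 1.960 \times 9.879 = -15.604 \pm 19.363 \;\Rightarrow\; [-34.967,\ 3.759], \qquad t = \frac{-15.604}{9.879} \approx -1.58 $$

Both figures agree with the output table, which confirms that the printed interval is nothing more than the normal-approximation formula applied to the placebo variance.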

But the normal approximation is not the only — or the sharpest — way to use the placebo distribution. We can also run an explicit permutation test: assign the placebo treatment to each control state in turn, collect the placebo effects, and ask how California’s real estimate ranks against them.

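* assign each control as a placebo-treated unit and collect the placebo ATTs
drop if state=="California"
levelsof state, local(ctrls)
tempname memhold
tempfile placebos
postfile `memhold' str32 state double att using `placebos'
foreach s of local ctrls {
    preserve
    * fake treatment: control state `s' is "treated" from 1989 onward
    gen byte ptreat = (state=="`s'") & (year>=1989)
    quietly sdid packspercapita state year ptreat, vce(noinference)
    post `memhold' ("`s'") (e(ATT))
    restore
}
postclose `memhold'
use `placebos', clear
* how many placebo effects are at least as large in magnitude as California's?
count if abs(att) >= 15.60383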

The placebo effects for control states cluster around zero — reassuring, since those states passed no comparable policy — while California’s −15.6 lands far in the left tail. Only 1 of 38 control states produced a placebo effect as large in magnitude as California’s, a permutation p-value of 0.026. So the two inferential lenses tell complementary stories: the rank-based permutation test says California’s drop is very unlikely to be noise (significant at 5%), while the conservative normal-approximation interval reminds us that, with a single treated unit, the precision of the magnitude is limited. Reporting both is the honest summary.
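The permutation p-value quoted above is the share of placebo runs whose effect is at least as extreme as the real estimate. Written out with the counts reported in the text (the symbols $N_{co}$ for the number of control states and $\hat{\tau}_j^{placebo}$ for the placebo estimate of control state $j$ are introduced here for illustration only):

$$ p_{perm} = \frac{\#\{\, j : |\hat{\tau}_j^{placebo}| \ge |\hat{\tau}^{sdid}| \,\}}{N_{co}} = \frac{1}{38} \approx 0.026 $$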

Other inference designs (out of scope)

It would be wrong to conclude that bootstrap and jackknife are “bad” — they are simply built for a different design. They come into their own when there are many treated units, especially under staggered adoption, where units adopt the policy at different times. In that setting the ATT is an average of adoption-cohort-specific effects,

$$ \widehat{ATT} = \sum_{a \in A} \frac{T_{post}^{a}}{T_{post}}\ \hat{\tau}_a^{sdid} $$

and with many treated units the asymptotic arguments behind the bootstrap and jackknife hold. The sdid command supports all of this — vce(bootstrap), vce(jackknife), covariate adjustment, and staggered timing — but those tools require a genuinely different research design (multiple treated units adopting at multiple times) than California’s single 1989 intervention. We deliberately keep this tutorial to the block design with one treated unit, where the placebo procedure is the right and sufficient tool. The staggered case, with its own estimation and inference, is a natural next tutorial.
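To make the weighting in the aggregation formula above concrete, consider a purely hypothetical example with two adoption cohorts: cohort $a_1$ contributes 10 treated post-period observations with $\hat{\tau}_{a_1}^{sdid} = 4$, and cohort $a_2$ contributes 30 with $\hat{\tau}_{a_2}^{sdid} = 8$, so $T_{post} = 40$. These numbers are invented for illustration and do not come from any dataset used here.

$$ \widehat{ATT} = \frac{10}{40} \times 4 + \frac{30}{40} \times 8 = 1 + 6 = 7 $$

The weights are non-negative and sum to one, so the overall ATT always lies between the smallest and largest cohort effect, and cohorts with more treated post-period observations pull it harder.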

10. Robustness and discussion

What should we take away about Proposition 99? Three independent constructions of the counterfactual — DiD, synthetic control, and SDID — all agree the policy reduced smoking, with estimates from −15.6 to −27.3 packs per capita. The disagreement is informative rather than alarming: it maps directly onto how much each method trusts the comparison group.

DiD (−27.35) trusts that California would have moved parallel to the average of 38 heterogeneous states. The pre-period figure shows that average was already diverging from California, so DiD likely overstates the effect.
Synthetic control (−19.48) fixes the comparison group to states that actually resembled California (Utah, Montana, Nevada). Its pre-period fit is excellent (RMSE 1.66), which is the evidence for its credibility.
SDID (−15.60) additionally allows a constant level gap and concentrates on the informative late-1980s years. It is the most robust to a violation of exact parallel trends, and the most conservative.

The honest range, then, is something like “Proposition 99 cut cigarette consumption by roughly 16–20 packs per capita per year, plausibly larger by the end of the 1990s,” with SDID the preferred single number because it leans least on the assumption most likely to fail.

A few caveats apply to all three estimates. With one treated unit, statistical power is inherently limited — the SDID confidence interval includes zero even though the permutation test is significant, and no method can fully escape that. The placebo variance assumes homoskedasticity across units (the placebo treatments are drawn only from controls). And like every comparative case study, identification assumes no other large shock hit California alone in 1989 and no spillovers to the donor states (if Californians bought cigarettes across state lines, neighboring donors are contaminated). These are assumptions to argue substantively, not settle statistically.

11. Summary and key takeaways

Method. SDID is one weighted two-way fixed-effects regression. DiD is the special case with uniform weights; synthetic control is the special case with unit weights but no time weights and no unit fixed effect. SDID uses both unit and time weights and keeps the unit fixed effect, so it matches California’s pre-period trend while allowing a constant level gap.
Data. On the Proposition 99 panel, the estimates are DiD −27.35, synthetic control −19.48, and SDID −15.60 packs per capita — the same direction, with magnitude shrinking as the comparison group becomes more credible. SDID’s time weights land entirely on 1986–1988.
One framework. The single sdid command reproduced the hand-computed 2×2 DiD exactly (−27.35) and the standalone synth2 synthetic control closely (−19.62 vs −19.48), confirming that all three are special cases of one estimator and can be run, with inference and graphs, from one command.
Inference. With a single treated unit, placebo is the valid procedure: jackknife is undefined and the bootstrap is unreliable. The placebo SE is 9.88 (95% CI [−35.0, 3.8], which includes zero), while the permutation test gives p = 0.026. Report both.
Limitation and next step. One treated unit means limited power. The natural extension is staggered adoption with many treated units, where vce(bootstrap) and vce(jackknife) become appropriate and covariates can be added — a different design, and a good follow-up tutorial.

12. Exercises

Weights side by side. Re-run sdid …, method(sc) vce(noinference) returnweights and compare its unit weights to the synth2 donor weights from Section 6. Which states appear in both? Why does the sdid version spread weight more widely? (Hint: the ridge penalty $\zeta$.)
Placebo stability. Re-estimate sdid …, vce(placebo) seed(1213) with a different seed() and with more replications via reps(). How much does the standard error move? What does that tell you about reading a single placebo SE to three decimal places?
Time weights matter. Inspect e(lambda) after the SDID run and confirm the weight on 1986–1988. Then think through: if you forced uniform time weights (as DiD and SC do), would you expect the estimate to move toward or away from the DiD number? Check your intuition by comparing method(sdid) with method(sc).

References

Arkhangelsky, D., Athey, S., Hsiao, D. A., Imbens, G. W., and Wager, S. (2021). Synthetic Difference-in-Differences. American Economic Review 111(12): 4088–4118.
Clarke, D., Pailañir, D., Athey, S., and Imbens, G. (2024). On Synthetic Difference-in-Differences and Related Estimation Methods in Stata. The Stata Journal (st0757). The sdid command.
Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program. Journal of the American Statistical Association 105(490): 493–505.
Abadie, A., and Gardeazabal, J. (2003). The Economic Costs of Conflict: A Case Study of the Basque Country. American Economic Review 93(1): 113–132.
Yan, G., and Chen, Q. (2023). synth2: Synthetic Control Method with Placebo Tests, Robustness Test and Visualization. The Stata Journal 23(3): 597–624. The synth2 command.

Acknowledgments

The analysis uses the sdid (Clarke, Pailañir, Athey, and Imbens) and synth2 (Yan and Chen) Stata packages and the Proposition 99 dataset distributed with sdid. AI tools (Claude Code, with NotebookLM for the audio summary) assisted in drafting and exposition; all code was executed and all numbers verified by the author, who is responsible for any remaining errors.
