<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>heterogeneous treatment effects | Carlos Mendez</title><link>https://carlos-mendez.org/tag/heterogeneous-treatment-effects/</link><atom:link href="https://carlos-mendez.org/tag/heterogeneous-treatment-effects/index.xml" rel="self" type="application/rss+xml"/><description>heterogeneous treatment effects</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Thu, 07 May 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>heterogeneous treatment effects</title><link>https://carlos-mendez.org/tag/heterogeneous-treatment-effects/</link></image><item><title>Causal Machine Learning and the Resource Curse with Python EconML</title><link>https://carlos-mendez.org/post/python_econml/</link><pubDate>Thu, 07 May 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_econml/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Can natural resource wealth be both a blessing and a curse? And can local institutions determine which way it goes? In this tutorial, we use &lt;strong>EconML&amp;rsquo;s &lt;code>CausalForestDML&lt;/code>&lt;/strong> to estimate &lt;strong>heterogeneous causal effects&lt;/strong> of mining and mineral prices on economic development &amp;mdash; and test whether institutional quality moderates those effects differently for mining versus price shocks.&lt;/p>
&lt;p>We use &lt;strong>simulated data with known ground-truth parameters&lt;/strong> so we can verify that the method recovers the correct answers. The simulated dataset mirrors the structure of Hodler, Lechner &amp;amp; Raschky (2023), who studied 3,800 Sub-Saharan African districts using a Modified Causal Forest. This tutorial focuses on the &lt;strong>DML methodology&lt;/strong>: how the Double Machine Learning framework separates nuisance estimation from causal effect estimation to produce valid, efficient heterogeneous treatment effect estimates.&lt;/p>
&lt;p>For the &lt;strong>economic narrative&lt;/strong> and a companion implementation in Stata 19, see &lt;a href="https://carlos-mendez.org/post/stata_cate2/">Causal Machine Learning and the Resource Curse with Stata 19&lt;/a>.&lt;/p>
&lt;h3 id="learning-objectives">Learning objectives&lt;/h3>
&lt;p>By the end of this tutorial, you will be able to:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Understand&lt;/strong> the Double Machine Learning (DML) framework and the residualization argument that makes it work&lt;/li>
&lt;li>&lt;strong>Distinguish&lt;/strong> heterogeneity features (X) from nuisance controls (W) in &lt;code>CausalForestDML&lt;/code>&lt;/li>
&lt;li>&lt;strong>Configure&lt;/strong> &lt;code>CausalForestDML&lt;/code> for discrete multi-valued treatments with panel data&lt;/li>
&lt;li>&lt;strong>Estimate&lt;/strong> Average Treatment Effects (ATEs) and Group Average Treatment Effects (GATEs), and read the Bootstrap-of-Little-Bags standard errors EconML reports&lt;/li>
&lt;li>&lt;strong>Interpret&lt;/strong> GATE patterns to identify which variables moderate treatment effects&lt;/li>
&lt;li>&lt;strong>Use&lt;/strong> EconML-specific tools like &lt;code>SingleTreeCateInterpreter&lt;/code> for data-driven subgroup discovery&lt;/li>
&lt;li>&lt;strong>Evaluate&lt;/strong> estimated effects against known ground-truth parameters and explain any remaining gap&lt;/li>
&lt;/ol>
&lt;h3 id="key-concepts-at-a-glance">Key concepts at a glance&lt;/h3>
&lt;p>The post leans on a small vocabulary repeatedly. The rest of the tutorial assumes you can move between these terms quickly. Each concept below has three parts. The &lt;strong>definition&lt;/strong> is always visible. The &lt;strong>example&lt;/strong> and &lt;strong>analogy&lt;/strong> sit behind clickable cards: open them when you need them, leave them collapsed for a quick scan. If a later section mentions &amp;ldquo;honest splitting&amp;rdquo; or &amp;ldquo;Neyman orthogonality&amp;rdquo; and the term feels slippery, this is the section to re-read.&lt;/p>
&lt;p>&lt;strong>1. Potential outcomes&lt;/strong> $Y_i(t)$.
The outcome unit $i$ &lt;strong>would&lt;/strong> take under treatment value $t$. Each unit has one potential outcome per treatment level. We observe only one of them: the one matching the treatment actually received. The rest are &lt;em>counterfactual&lt;/em>. They live in worlds we never see.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Take district 47 in 2008. Four potential NTL outcomes exist for it: $Y_{47,2008}(0)$, $Y_{47,2008}(1)$, $Y_{47,2008}(2)$, and $Y_{47,2008}(3)$. They correspond to no mining, low prices, medium prices, and high prices. Only one is in the dataset. It is the one matching whatever treatment that district-year actually had. The other three are forever invisible.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Every life decision is a fork in the road. You took one fork. The parallel-universe versions of yourself took the other forks. Their lives are real conceptual objects. You just cannot directly observe them. Causal inference reconstructs those parallel universes. It does so by looking at people who &lt;em>did&lt;/em> take the other forks.&lt;/p>
&lt;/details>
&lt;/div>
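&lt;p>As a minimal sketch of this missing-data view, we can simulate all four potential outcomes for a handful of units and then keep only the one matching the assigned treatment. The numbers below are purely illustrative, not drawn from the tutorial&amp;rsquo;s dataset.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
n = 5
# All four hypothetical potential outcomes per unit (illustrative values)
y_potential = pd.DataFrame({t: rng.normal(loc=0.1 * t, scale=0.05, size=n)
                            for t in range(4)})
t_assigned = rng.integers(0, 4, size=n)   # treatment actually received
y_observed = y_potential.to_numpy()[np.arange(n), t_assigned]
# Only one column per row is ever observed; the rest stay counterfactual
print(pd.DataFrame({'treatment': t_assigned, 'y_observed': y_observed}))
&lt;/code>&lt;/pre>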
&lt;p>&lt;strong>2. CATE&lt;/strong> &amp;mdash; Conditional Average Treatment Effect, $\tau(\mathbf{x})$.
The average treatment effect for units with covariate profile $\mathbf{x}$. The CATE is a &lt;strong>function&lt;/strong> of $\mathbf{x}$, not a single number. Where the CATE bends with $\mathbf{x}$, the treatment helps some units more than others.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Take a well-governed district profile in our data: &lt;code>exec_constraints = 6&lt;/code>, &lt;code>quality_of_govt = 0.7&lt;/code>, and so on. For that $\mathbf{x}$ the CATE is $\tau(\mathbf{x}) \approx 0.26$. Mining lifts log-NTL by about 0.26 for that profile. Now move to the weakest-institutions case: &lt;code>exec_constraints = 1&lt;/code>. The same function gives only $\tau(\mathbf{x}) \approx 0.18$. The CATE is what makes this comparison possible.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A drug&amp;rsquo;s &amp;ldquo;average effect&amp;rdquo; might be a 5-point reduction in blood pressure. But a doctor cares about a specific patient. Maybe a 65-year-old male with diabetes. The CATE &lt;em>is&lt;/em> that personalized effect. It takes a patient profile in. It returns the expected effect for someone like them.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>3. GATE&lt;/strong> &amp;mdash; Group Average Treatment Effect.
The CATE averaged over a &lt;em>pre-specified&lt;/em> subgroup. The subgroup is defined by some variable $Z$. GATEs test targeted moderation hypotheses. A typical question: &amp;ldquo;does institutional quality moderate the effect of mining?&amp;rdquo;&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Sort districts by &lt;code>exec_constraints&lt;/code> (1&amp;ndash;6). Average the per-observation CATEs inside each level. At level 1 we get $\widehat{\mathrm{GATE}} \approx 0.18$. The number climbs to $\approx 0.26$ at level 6. That climb is the moderation pattern Finding 3 reports. It is exactly what the GATE plots in this post visualize.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A nationwide marketing campaign might lift sales by 5% on average. Before scaling it up, the company asks a simple question: did it work better in cities than in rural towns? The GATE answers exactly that. It reports the campaign&amp;rsquo;s effect &lt;em>inside&lt;/em> each store type. It surfaces heterogeneity that the headline ATE hides.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>4. ATE&lt;/strong> &amp;mdash; Average Treatment Effect.
The CATE averaged over the entire sample, $E[\tau(\mathbf{X})]$. The headline policy number. It answers a single question: if we turned the treatment on for everyone, what average effect would we see?&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Take our 3,000 district-years. The estimated ATE for the 1-vs-0 contrast (mining at low prices vs. no mining) is $\widehat{\mathrm{ATE}} = 0.240$. On average, mining-at-low-prices raises log-NTL by 0.24. In unlogged NTL, that is about a 27% bump.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>&amp;ldquo;This drug lowers cholesterol by 12 points on average.&amp;rdquo; That is an ATE statement. A single number, suitable for a press release. It says nothing about whether the drug works better in some patients than others. That question belongs to GATEs and CATEs.&lt;/p>
&lt;/details>
&lt;/div>
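&lt;p>The log-to-percent conversion in the example is just $e^{0.24} - 1$. A one-line check of that arithmetic:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
ate_log = 0.240                    # estimated ATE on log-NTL (1-vs-0 contrast)
pct_change = np.exp(ate_log) - 1   # implied change in unlogged NTL
print(f'{pct_change:.1%}')         # about a 27% increase
&lt;/code>&lt;/pre>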
&lt;p>&lt;strong>5. Nuisance functions&lt;/strong> $g_0, m_0$.
These are two conditional means: $g_0(\mathbf{x}, \mathbf{w}) = E[Y \mid \mathbf{X}, \mathbf{W}]$ and $m_0(\mathbf{x}, \mathbf{w}) = E[T \mid \mathbf{X}, \mathbf{W}]$. We call them &lt;em>nuisance&lt;/em> because we do not care about their values. We estimate them for one reason only. That reason is to strip out the part of $Y$ and $T$ that is predictable from $(\mathbf{X}, \mathbf{W})$. What remains is the variation that identifies the causal effect.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>$\hat g_0$ is a Gradient Boosting regressor. It predicts a district&amp;rsquo;s log-NTL from elevation, ruggedness, ethnic fractionalization, country, year, and so on. It &lt;em>ignores&lt;/em> mining status. $\hat m_0$ is a Gradient Boosting classifier. It predicts the probability of each treatment level from the same covariates. Both predictions matter only as inputs to the residualization step.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Astronomers photograph faint galaxies in two steps. First, they take a &amp;ldquo;dark frame&amp;rdquo; with the lens cap on. The dark frame records sensor noise. Then they subtract it from the real exposure. Nobody hangs the dark frame on their wall. It exists only to be subtracted. $g_0$ and $m_0$ are dark frames for confounding. Their job is to be subtracted out. That is what lets the real causal signal show through.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>6. Cross-fitting&lt;/strong> (sometimes &amp;ldquo;sample-splitting&amp;rdquo; or &amp;ldquo;out-of-fold prediction&amp;rdquo;).
Estimate the nuisance functions on one fold of the data. Apply them to a held-out fold. Rotate so that every observation is residualized using nuisance models that did not see it. Without this rotation, in-sample residuals come out systematically too small. That bias propagates straight into the second stage.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Setting &lt;code>cv=5&lt;/code> in &lt;code>CausalForestDML&lt;/code> splits the 3,000 observations into five folds of 600. The forest fits $\hat g_0$ and $\hat m_0$ on folds 1&amp;ndash;4. It then residualizes fold 5 using those fitted models. The procedure rotates four more times. The end result: each district-year is residualized by nuisance models trained on a strictly disjoint sample.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Suppose you give a class the same problems for practice and for the final exam. Students who memorized the practice will ace the final. The score reflects memorization, not learning. Hiding the final-exam questions until grading time fixes the problem. Cross-fitting does the same trick. It hides each observation from the very nuisance model that will eventually residualize it.&lt;/p>
&lt;/details>
&lt;/div>
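&lt;p>The rotation can be sketched with scikit-learn&amp;rsquo;s &lt;code>cross_val_predict&lt;/code>, which returns exactly these out-of-fold predictions. The data below are simulated stand-ins, not the tutorial&amp;rsquo;s dataset.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict
rng = np.random.default_rng(42)
n = 600
XW = rng.normal(size=(n, 4))                  # stand-in covariates
y = XW[:, 0] ** 2 + rng.normal(scale=0.5, size=n)
# Out-of-fold prediction: each y_hat[i] comes from a model fitted
# on the four folds that exclude observation i
y_hat = cross_val_predict(GradientBoostingRegressor(random_state=0),
                          XW, y, cv=5)
y_resid = y - y_hat                           # cross-fitted residuals
print(y_resid[:3])
&lt;/code>&lt;/pre>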
&lt;p>&lt;strong>7. Honest splitting&lt;/strong> (a property of an &lt;em>honest causal forest&lt;/em>).
A causal tree uses one random subsample to &lt;em>choose&lt;/em> its split structure: which variable, which threshold. It uses a &lt;em>separate&lt;/em> random subsample to &lt;em>estimate&lt;/em> the treatment-effect value in each leaf. The split-chooser and the leaf-estimator never share data. This separation is what licenses valid confidence intervals from the forest.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Consider a single tree inside the forest. With &lt;code>honest=True&lt;/code>, half of its bootstrap sample picks the splits. Maybe the choice is &amp;ldquo;split first on &lt;code>distance_capital&lt;/code>, then on &lt;code>exec_constraints&lt;/code>&amp;rdquo;. The other half computes the average CATE in each resulting leaf. Those leaf-level numbers are unbiased. The reason: the splits were chosen without seeing them.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A jury that hears the evidence should not also write the verdict template. If the same people pick the conclusion language &lt;em>and&lt;/em> hear the case, the verdict reflects their pre-baked preferences. It would not reflect the evidence alone. Splitting the two roles is a basic guard against motivated reasoning. Honesty does the same job inside one tree. Split-choosers and leaf-estimators are different &amp;ldquo;people&amp;rdquo;. The leaf values cannot be tailored to the splits that produced them.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>8. Neyman orthogonality.&lt;/strong>
A property of the DML estimating equation $\psi(W; \tau, \eta)$. Here $\eta = (g_0, m_0)$ collects the nuisance functions. The property is $\left.\partial_\eta E[\psi]\right|_{\eta=\eta_0} = 0$. In words: at the truth, the expected estimating equation is &lt;em>flat&lt;/em> in the nuisance functions. Small errors in $\hat g_0$ and $\hat m_0$ enter the second-stage estimator only at second order.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Suppose $\hat g_0$ misses the true $g_0$ by 10% on average. A naive plug-in two-stage procedure inherits roughly that 10% error in the causal estimate. With Neyman orthogonality, the picture changes. The same 10% nuisance error contributes only on the order of $(0.10)^2 = 0.01$ to the causal estimate. That is one percentage point &amp;mdash; an order of magnitude less than the input. This is why a Gradient Boosting first stage works. It converges at a slower-than-parametric rate. Even so, the second-stage estimate of $\tau$ remains $\sqrt{n}$-consistent and asymptotically normal.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Picture a self-righting boat. You can lean over the rail. You can slosh the cargo. You can even slip on the deck. The hull pulls itself upright every time. Stability is built into its geometry, not into never being disturbed. Neyman orthogonality is the hull design. It lets DML stay upright when the nuisance estimates wobble.&lt;/p>
&lt;/details>
&lt;/div>
&lt;h2 id="the-dml-causal-forest">The DML Causal Forest&lt;/h2>
&lt;h3 id="potential-outcomes-and-the-cate">Potential outcomes and the CATE&lt;/h3>
&lt;p>Causal inference rests on the &lt;strong>potential-outcomes&lt;/strong> framework (Rubin, 1974; Imbens &amp;amp; Rubin, 2015). For each unit $i$ and each treatment value $t$, we imagine an outcome $Y_i(t)$ that would be realized if $i$ received treatment $t$. The catch is the &lt;strong>fundamental problem of causal inference&lt;/strong>: only the potential outcome corresponding to the treatment unit $i$ actually receives is observable. All other potential outcomes for that unit are counterfactual &amp;mdash; they live in a world we never see. Causal inference is therefore an exercise in &lt;em>imputation&lt;/em>: using the observed outcomes of comparable units to stand in for the missing counterfactuals.&lt;/p>
&lt;p>The &lt;strong>Conditional Average Treatment Effect&lt;/strong> (CATE) for a unit with covariates $\mathbf{x}$ is&lt;/p>
&lt;p>$$\tau(\mathbf{x}) = E\{Y_i(1) - Y_i(0) \mid \mathbf{X}_i = \mathbf{x}\}.$$&lt;/p>
&lt;p>In words: among units who look like $\mathbf{x}$, what is the average gap between the treated and untreated potential outcomes? When the function $\tau(\cdot)$ is constant across $\mathbf{x}$, every type of unit responds the same way and a single ATE summarizes everything. When $\tau(\cdot)$ bends with $\mathbf{x}$, we have &lt;strong>treatment effect heterogeneity&lt;/strong> &amp;mdash; mining might raise nighttime lights in well-governed districts and barely move them elsewhere. Estimating that bend, not just its average, is the whole point of a causal forest.&lt;/p>
&lt;h3 id="the-partially-linear-model-with-heterogeneous-effects">The partially linear model with heterogeneous effects&lt;/h3>
&lt;p>EconML&amp;rsquo;s &lt;code>CausalForestDML&lt;/code> works inside the &lt;strong>partially linear model&lt;/strong> of Robinson (1988), extended by Chernozhukov et al. (2018) to allow heterogeneous effects:&lt;/p>
&lt;p>$$Y_i = \tau(\mathbf{X}_i)\, T_i + g_0(\mathbf{X}_i, \mathbf{W}_i) + \varepsilon_i, \qquad E[\varepsilon_i \mid \mathbf{X}_i, \mathbf{W}_i] = 0.$$&lt;/p>
&lt;p>$$T_i = m_0(\mathbf{X}_i, \mathbf{W}_i) + v_i, \qquad E[v_i \mid \mathbf{X}_i, \mathbf{W}_i] = 0.$$&lt;/p>
&lt;p>The &lt;strong>outcome equation&lt;/strong> says that $Y_i$ depends on the treatment $T_i$ multiplied by a &lt;em>unit-specific&lt;/em> effect $\tau(\mathbf{X}_i)$, plus an arbitrary, possibly nonlinear function $g_0$ of the controls, plus mean-zero noise. The &amp;ldquo;partially linear&amp;rdquo; name comes from $T$ entering linearly (multiplied by $\tau$) while $g_0$ is allowed to be any flexible function.&lt;/p>
&lt;p>The &lt;strong>treatment equation&lt;/strong> writes $T_i$ as the conditional-mean treatment $m_0(\mathbf{X}_i, \mathbf{W}_i)$ plus a residual $v_i$. For a continuous treatment, $m_0$ is a regression. For our four-level treatment, $m_0$ is a multi-class classifier &amp;mdash; specifically, a &lt;code>GradientBoostingClassifier&lt;/code> &amp;mdash; and &amp;ldquo;$T - m_0$&amp;rdquo; is shorthand for the residual of treatment around its conditional probabilities.&lt;/p>
&lt;p>The functions $g_0$ and $m_0$ are called &lt;strong>nuisance functions&lt;/strong> because we do not care about their values. We estimate them only to &lt;em>remove&lt;/em> the part of $Y$ and $T$ that is predictable from $(\mathbf{X}, \mathbf{W})$, leaving behind the variation that identifies the causal effect.&lt;/p>
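&lt;p>For the discrete treatment, &amp;ldquo;$T - m_0$&amp;rdquo; can be made concrete: one-hot-encode the treatment and subtract the predicted class probabilities, giving a vector-valued residual. The sketch below uses simulated, confounded assignments; the classifier settings are illustrative, not the tutorial&amp;rsquo;s configuration.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
rng = np.random.default_rng(1)
n = 500
XW = rng.normal(size=(n, 3))          # stand-in covariates
# Assignment probabilities depend on covariates, so treatment is confounded
logits = np.stack([np.zeros(n), XW[:, 0], 0.5 * XW[:, 1], XW[:, 2]], axis=1)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
T = np.array([rng.choice(4, p=p) for p in probs])
T[:4] = np.arange(4)                  # ensure every level appears
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(XW, T)
p_hat = clf.predict_proba(XW)         # estimated m_0: P(T = t | X, W)
T_resid = np.eye(4)[T] - p_hat        # one residual column per treatment level
print(T_resid.shape)
&lt;/code>&lt;/pre>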
&lt;h4 id="why-two-stages-the-residualization-argument">Why two stages? The residualization argument&lt;/h4>
&lt;p>Subtract $E[Y_i \mid \mathbf{X}, \mathbf{W}] = \tau(\mathbf{X}_i) \, m_0(\mathbf{X}_i, \mathbf{W}_i) + g_0(\mathbf{X}_i, \mathbf{W}_i)$ from the outcome equation. The $g_0$ terms cancel, and a line of algebra leaves a regression of residuals on residuals. Define the residualized outcome and treatment as&lt;/p>
&lt;p>$$\tilde Y_i = Y_i - E[Y_i \mid \mathbf{X}, \mathbf{W}], \qquad \tilde T_i = T_i - m_0(\mathbf{X}_i, \mathbf{W}_i).$$&lt;/p>
&lt;p>Plugging these residuals into the partially linear model yields:&lt;/p>
&lt;p>$$\tilde Y_i = \tau(\mathbf{X}_i) \cdot \tilde T_i + \varepsilon_i.$$&lt;/p>
&lt;p>So if we (a) estimate $g_0$ and $m_0$ in a &lt;em>first stage&lt;/em> with any flexible learner, (b) residualize both $Y$ and $T$, and (c) regress $\tilde Y$ on $\tilde T$ with covariate-dependent slope, that slope at point $\mathbf{x}$ recovers $\tau(\mathbf{x})$. This is exactly the &lt;strong>Frisch&amp;ndash;Waugh&amp;ndash;Lovell&lt;/strong> logic &amp;mdash; if you have not seen FWL before, the &lt;a href="https://carlos-mendez.org/post/python_fwl/">tutorial on the Frisch&amp;ndash;Waugh&amp;ndash;Lovell theorem&lt;/a> walks through the linear case in detail.&lt;/p>
&lt;p>The causal forest is the second-stage learner that estimates this covariate-dependent slope from $(\tilde T, \tilde Y, \mathbf{X})$, splitting on $\mathbf{X}$ to find regions where the local slope is approximately constant.&lt;/p>
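&lt;p>A stripped-down version of steps (a)&amp;ndash;(c), with a single constant effect $\tau = 0.5$ and a nonlinear confounder, shows residualization undoing the confounding. The data are simulated; a causal forest would replace the final least-squares slope with a covariate-dependent one.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict
rng = np.random.default_rng(7)
n, tau = 8000, 0.5
W = rng.normal(size=(n, 2))
T = W[:, 0] ** 2 + rng.normal(size=n)             # treatment driven by W
Y = tau * T + 2.0 * W[:, 0] ** 2 + W[:, 1] + rng.normal(size=n)
# Steps (a)-(b): cross-fitted nuisance predictions and residuals
Y_res = Y - cross_val_predict(GradientBoostingRegressor(random_state=0), W, Y, cv=5)
T_res = T - cross_val_predict(GradientBoostingRegressor(random_state=0), W, T, cv=5)
# Step (c): residual-on-residual slope approximately recovers tau
tau_hat = (T_res @ Y_res) / (T_res @ T_res)
naive = (T @ Y) / (T @ T)   # slope without residualization: badly biased
print(round(tau_hat, 2), round(naive, 2))
&lt;/code>&lt;/pre>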
&lt;h3 id="neyman-orthogonality-why-first-stage-errors-barely-matter">Neyman orthogonality: why first-stage errors barely matter&lt;/h3>
&lt;p>Think of residualization like noise-canceling headphones: the first stage removes the &amp;ldquo;background noise&amp;rdquo; of confounders from both the outcome and the treatment, so the causal forest only hears the &amp;ldquo;signal&amp;rdquo; of the treatment effect.&lt;/p>
&lt;p>The formal version of that intuition is &lt;strong>Neyman orthogonality&lt;/strong>. The DML estimating equation $\psi(W; \tau, \eta)$ &amp;mdash; where $\eta = (g_0, m_0)$ collects the nuisance functions &amp;mdash; satisfies&lt;/p>
&lt;p>$$\left.\frac{\partial}{\partial \eta} E[\psi(W; \tau, \eta)] \right|_{\eta = \eta_0} = 0.$$&lt;/p>
&lt;p>In words: at the truth, the expected estimating equation is &lt;em>flat&lt;/em> in the nuisance functions. Small errors in $\hat g_0$ and $\hat m_0$ enter the second-stage estimator only through second-order terms. The practical consequence is striking: even if Gradient Boosting estimates $g_0$ and $m_0$ at the slow rate $O(n^{-1/4})$, much slower than the parametric $\sqrt{n}$ rate, the resulting estimate of $\tau$ is still $\sqrt{n}$-consistent and asymptotically normal (Chernozhukov et al., 2018, §2.2). A naive plug-in two-stage procedure &amp;mdash; one that does not use the orthogonal moment &amp;mdash; inherits the slower nuisance rate and loses valid inference.&lt;/p>
&lt;h3 id="three-levels-of-effects">Three levels of effects&lt;/h3>
&lt;p>The causal forest produces per-observation CATE estimates, which aggregate to three levels with different uses:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Level&lt;/th>
&lt;th>Notation&lt;/th>
&lt;th>What it measures&lt;/th>
&lt;th>When to report&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>CATE&lt;/strong>&lt;/td>
&lt;td>$\tau(\mathbf{x})$&lt;/td>
&lt;td>Effect for a unit with covariates $\mathbf{x}$&lt;/td>
&lt;td>Exploratory: feed into a decision tree or partial-dependence plot to see how effects vary.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>GATE&lt;/strong>&lt;/td>
&lt;td>$E[\tau(\mathbf{X}) \mid Z = z]$&lt;/td>
&lt;td>Average CATE in a pre-specified subgroup defined by a variable $Z$&lt;/td>
&lt;td>Theory-driven: testing whether a &lt;em>named&lt;/em> covariate (e.g., institutional quality) moderates the effect.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>ATE&lt;/strong>&lt;/td>
&lt;td>$E[\tau(\mathbf{X})]$&lt;/td>
&lt;td>Overall average across all units&lt;/td>
&lt;td>Policy: the headline number for &amp;ldquo;what happens on average if we turn the treatment on?&amp;rdquo;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
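&lt;p>Once the forest has produced per-observation CATEs, the ATE and GATEs are plain averages over them. A sketch with made-up CATEs and a made-up grouping variable (in the tutorial, the forest&amp;rsquo;s per-observation effects play the role of &lt;code>cate&lt;/code> and &lt;code>exec_constraints&lt;/code> plays the role of $Z$):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
rng = np.random.default_rng(3)
n = 3000
exec_constraints = rng.integers(1, 7, size=n)            # grouping variable Z
# Made-up CATEs that rise with institutional quality, plus noise
cate = 0.16 + 0.016 * exec_constraints + rng.normal(scale=0.02, size=n)
ate = cate.mean()                                        # E[tau(X)]
gate = pd.Series(cate).groupby(exec_constraints).mean()  # E[tau(X) | Z = z]
print(f'ATE = {ate:.3f}')
print(gate.round(3))
&lt;/code>&lt;/pre>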
&lt;h3 id="dml-pipeline">DML pipeline&lt;/h3>
&lt;pre>&lt;code class="language-mermaid">flowchart LR
A[&amp;quot;&amp;lt;b&amp;gt;Panel Data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;3,000 obs&amp;quot;]:::data
B[&amp;quot;&amp;lt;b&amp;gt;First Stage&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;GBM nuisance&amp;lt;br/&amp;gt;models&amp;quot;]:::first
C[&amp;quot;&amp;lt;b&amp;gt;Residualize&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Y - E[Y | X,W]&amp;lt;br/&amp;gt;T - E[T | X,W]&amp;quot;]:::resid
D[&amp;quot;&amp;lt;b&amp;gt;Causal Forest&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;500 honest trees&amp;quot;]:::forest
E[&amp;quot;&amp;lt;b&amp;gt;CATEs&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Per-observation&amp;lt;br/&amp;gt;effects&amp;quot;]:::cate
A --&amp;gt; B --&amp;gt; C --&amp;gt; D --&amp;gt; E
classDef data fill:#6a9bcc,stroke:#141413,color:#fff
classDef first fill:#d97757,stroke:#141413,color:#fff
classDef resid fill:#00d4c8,stroke:#141413,color:#141413
classDef forest fill:#141413,stroke:#d97757,color:#fff
classDef cate fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h2 id="setup-and-configuration">Setup and configuration&lt;/h2>
&lt;p>We use &lt;code>CausalForestDML&lt;/code> from EconML with Gradient Boosting nuisance models. The ground-truth parameters are defined inline so the tutorial is fully self-contained.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from econml.dml import CausalForestDML
from sklearn.ensemble import (GradientBoostingRegressor,
GradientBoostingClassifier)
# Ground-truth ATEs from the data-generating process
TRUE_ATES = {
'1-0': 0.250, # Mining effect
'2-0': 0.300, # Mining + medium price
'3-0': 0.550, # Mining + high price
'2-1': 0.050, # Medium price premium (small)
'3-1': 0.300, # High price premium (large)
'3-2': 0.250, # High vs medium step
}
&lt;/code>&lt;/pre>
&lt;h2 id="load-the-simulated-data">Load the simulated data&lt;/h2>
&lt;p>The dataset simulates 300 districts across 8 countries observed over 10 years (2003&amp;ndash;2012), following the structure of Hodler, Lechner &amp;amp; Raschky (2023). Treatment has four levels: no mining (0), mining at low prices (1), medium prices (2), and high prices (3).&lt;/p>
&lt;pre>&lt;code class="language-python">DATA_URL = (&amp;quot;https://github.com/cmg777/starter-academic-v501&amp;quot;
&amp;quot;/raw/master/content/post/python_EconML/sim_resource_curse.csv&amp;quot;)
df = pd.read_csv(DATA_URL)
print(f&amp;quot;Dataset: {len(df):,} observations&amp;quot;)
print(f&amp;quot;Districts: {df['district_id'].nunique()}, &amp;quot;
f&amp;quot;Countries: {df['country_id'].nunique()}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dataset: 3,000 observations
Districts: 300, Countries: 8
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 3,000 district-year observations with a &lt;strong>heavily imbalanced&lt;/strong> treatment: 85% of observations are untreated (no mining), while each of the three mining groups comprises only 5% of the data. This imbalance makes causal inference challenging &amp;mdash; the causal forest must learn from relatively few treated observations.&lt;/p>
&lt;h2 id="descriptive-statistics">Descriptive statistics&lt;/h2>
&lt;h3 id="treatment-distribution">Treatment distribution&lt;/h3>
&lt;pre>&lt;code class="language-python">labels = {0: 'No mining', 1: 'Low prices',
2: 'Med prices', 3: 'High prices'}
for t, n in df['treatment'].value_counts().sort_index().items():
print(f&amp;quot; {t} ({labels[t]}): {n:,} ({n/len(df):.1%})&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> 0 (No mining): 2,550 (85.0%)
1 (Low prices): 150 (5.0%)
2 (Med prices): 150 (5.0%)
3 (High prices): 150 (5.0%)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="python_econml_treatment_dist.png" alt="Treatment distribution across the four groups">
&lt;em>Treatment distribution across the four groups. The 85/5/5/5 imbalance makes causal inference challenging.&lt;/em>&lt;/p>
&lt;p>The 85/5/5/5 split means the causal forest has 2,550 control observations but only 150 per treatment level. For within-mining comparisons (e.g., 3-1), only 300 observations contribute, making standard errors larger for price-effect estimates.&lt;/p>
&lt;h3 id="outcomes-by-treatment-group">Outcomes by treatment group&lt;/h3>
&lt;pre>&lt;code class="language-python">for t in sorted(df['treatment'].unique()):
mask = df['treatment'] == t
m_ntl = df.loc[mask, 'ntl_log'].mean()
m_conf = df.loc[mask, 'conflict'].mean()
print(f&amp;quot; {t} ({labels[t]}): NTL={m_ntl:.3f} Conflict={m_conf:.1%}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> 0 (No mining): NTL=-1.137 Conflict=10.7%
1 (Low prices): NTL=-1.028 Conflict=18.0%
2 (Med prices): NTL=-0.930 Conflict=18.0%
3 (High prices): NTL=-0.615 Conflict=28.0%
&lt;/code>&lt;/pre>
&lt;p>The raw means show a clear gradient: higher treatment levels are associated with higher NTL and higher conflict rates. But these raw comparisons are &lt;strong>biased&lt;/strong> because mining districts differ systematically from non-mining districts in geography, institutions, and economic development.&lt;/p>
&lt;h2 id="naive-comparison-why-we-need-causal-ml">Naive comparison: why we need causal ML&lt;/h2>
&lt;pre>&lt;code class="language-python">for comp in ['1-0', '2-1', '3-1']:
a, b = int(comp[0]), int(comp[2])
naive = df.loc[df['treatment']==a, 'ntl_log'].mean() - \
df.loc[df['treatment']==b, 'ntl_log'].mean()
truth = TRUE_ATES[comp]
print(f&amp;quot; {comp}: Naive={naive:.3f} Truth={truth:.3f} Bias={naive-truth:+.3f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> 1-0: Naive=0.109 Truth=0.250 Bias=-0.141
2-1: Naive=0.098 Truth=0.050 Bias=+0.048
3-1: Naive=0.413 Truth=0.300 Bias=+0.113
&lt;/code>&lt;/pre>
&lt;p>The naive 1-0 estimate of &lt;strong>0.109&lt;/strong> is severely biased downward from the true effect of &lt;strong>0.250&lt;/strong> &amp;mdash; a 56% underestimate. This happens because mining districts tend to have worse geographic and institutional characteristics that independently reduce development. The DML Causal Forest removes this &lt;strong>selection bias&lt;/strong> by residualizing both the outcome and the treatment against observed confounders before estimating the causal effect.&lt;/p>
&lt;h2 id="econml-estimation">EconML estimation&lt;/h2>
&lt;h3 id="configuration">Configuration&lt;/h3>
&lt;p>We separate covariates into two groups with distinct roles in the DML framework:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>X features&lt;/strong> (10 variables): Enter the causal forest and can drive treatment effect heterogeneity. These include &lt;code>exec_constraints&lt;/code>, &lt;code>quality_of_govt&lt;/code>, &lt;code>gdp_pc&lt;/code>, &lt;code>elevation&lt;/code>, &lt;code>temperature&lt;/code>, &lt;code>ruggedness&lt;/code>, &lt;code>distance_capital&lt;/code>, &lt;code>agri_suitability&lt;/code>, &lt;code>population&lt;/code>, and &lt;code>ethnic_frac&lt;/code>.&lt;/li>
&lt;li>&lt;strong>W controls&lt;/strong> (2 variables): Used only in the first-stage nuisance models (&lt;code>country_id&lt;/code>, &lt;code>year&lt;/code>). These absorb country and time fixed effects but do not enter the causal forest.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python">X_COLS = ['exec_constraints', 'quality_of_govt', 'gdp_pc',
'elevation', 'temperature', 'ruggedness',
'distance_capital', 'agri_suitability', 'population',
'ethnic_frac']
W_COLS = ['country_id', 'year']
&lt;/code>&lt;/pre>
&lt;h3 id="fitting-the-model">Fitting the model&lt;/h3>
&lt;pre>&lt;code class="language-python">Y = df['ntl_log'].values
T = df['treatment'].values
X = df[X_COLS].values
W = df[W_COLS].values
est_ntl = CausalForestDML(
model_y=GradientBoostingRegressor(n_estimators=200, max_depth=4,
random_state=42),
model_t=GradientBoostingClassifier(n_estimators=200, max_depth=4,
random_state=42),
discrete_treatment=True,
categories=[0, 1, 2, 3],
n_estimators=500,
min_samples_leaf=10,
honest=True, # Separate split/estimation samples
inference=True, # BLB confidence intervals
cv=5, # 5-fold cross-fitting
n_jobs=1,
random_state=42,
)
est_ntl.fit(Y, T, X=X, W=W, groups=df['district_id'].values)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> NTL: fitted in ~25s
&lt;/code>&lt;/pre>
&lt;p>Several configuration choices deserve explanation.&lt;/p>
&lt;p>&lt;strong>Honest trees&lt;/strong> (&lt;code>honest=True&lt;/code>) split the data inside each tree into two halves. One half is used to &lt;em>choose&lt;/em> the splits &amp;mdash; which variable, which threshold &amp;mdash; and the other half is used to &lt;em>estimate&lt;/em> the leaf means. A standard regression tree uses the same observations for both jobs, which lets the tree pick splits that artificially separate noisy observations and then quote the resulting separation back as if it were signal. The &amp;ldquo;exam writer / exam taker&amp;rdquo; analogy: honesty stops the tree from setting questions it has already memorized the answers to. Operationally, honesty is what licenses asymptotically valid confidence intervals &amp;mdash; without it, the leaf estimates are tighter than they should be and &lt;code>inference=True&lt;/code>&amp;rsquo;s reported standard errors would be misleadingly small. Wager &amp;amp; Athey (2018) formalize the result and prove $\sqrt{n}$-asymptotic normality for honest causal forests.&lt;/p>
&lt;p>&lt;strong>Cross-fitting&lt;/strong> (&lt;code>cv=5&lt;/code>) addresses a different overfitting risk. When the same data are used to estimate the nuisance functions $\hat g_0, \hat m_0$ and to apply them as residualizers, in-sample residuals are &lt;em>too small&lt;/em> on average and bias the second stage. Cross-fitting splits the data into 5 folds, fits the nuisance models on 4 of them, applies the fitted models to the held-out fold, and rotates. Each observation is residualized using nuisance estimates that did not see it.&lt;/p>
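&lt;p>The mechanics can be sketched with scikit-learn's &lt;code>cross_val_predict&lt;/code>, which returns exactly the out-of-fold predictions that cross-fitting requires. The DGP below is invented for illustration, with a true constant effect of 0.25:&lt;/p>

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))
T = 0.5 * X[:, 0] + rng.normal(size=n)        # treatment depends on X[:, 0]
Y = 0.25 * T + X[:, 0] + rng.normal(size=n)   # true tau = 0.25

# Each row is scored by models fitted on the other 4 folds, so no
# observation is residualized by nuisance estimates that saw it.
g_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)
m_hat = cross_val_predict(GradientBoostingRegressor(), X, T, cv=5)

Y_res, T_res = Y - g_hat, T - m_hat
tau_hat = (T_res @ Y_res) / (T_res @ T_res)   # final-stage OLS slope
```

&lt;p>Replacing the out-of-fold predictions with in-sample ones shrinks the residuals and biases &lt;code>tau_hat&lt;/code> &amp;mdash; the overfitting risk cross-fitting is designed to remove.&lt;/p>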
&lt;p>&lt;strong>GroupKFold via &lt;code>groups=district_id&lt;/code>.&lt;/strong> Our panel observes each district across multiple years. Plain $K$-fold would scatter rows from the same district across folds, so the nuisance models would peek at most of a district&amp;rsquo;s rows when predicting one held-out year &amp;mdash; leakage that artificially shrinks first-stage residuals. Passing &lt;code>groups=df['district_id'].values&lt;/code> to &lt;code>fit()&lt;/code> triggers &lt;code>GroupKFold&lt;/code>, which keeps every district inside one fold.&lt;/p>
&lt;p>A common confusion: GroupKFold is &lt;strong>not&lt;/strong> the same as clustered standard errors. It blocks within-district leakage in cross-fitting; it does not adjust the second-stage variance for within-district correlation in the residuals. The standard errors EconML reports are forest-level Bootstrap-of-Little-Bags SEs that treat observations as independent. With panel data, true clustered SEs would typically be larger. We flag this as a limitation again in the Discussion section.&lt;/p>
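&lt;p>The difference is easy to verify on a toy panel (hypothetical IDs, not the tutorial's data): &lt;code>GroupKFold&lt;/code> never lets a district straddle the train/test boundary, while a shuffled &lt;code>KFold&lt;/code> routinely does:&lt;/p>

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Toy panel: 20 districts observed for 5 years each.
districts = np.repeat(np.arange(20), 5)
X_toy = np.arange(100).reshape(-1, 1)

# GroupKFold: every district's rows live in exactly one fold.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X_toy, groups=districts):
    assert set(districts[train_idx]).isdisjoint(districts[test_idx])

# Shuffled KFold: some districts appear on both sides of the split.
train_idx, test_idx = next(iter(KFold(n_splits=5, shuffle=True,
                                      random_state=0).split(X_toy)))
leaked = set(districts[train_idx]) & set(districts[test_idx])
```

&lt;p>With the shuffled split, &lt;code>leaked&lt;/code> is non-empty: the nuisance models would see most of a leaked district's years while predicting the rest.&lt;/p>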
&lt;h3 id="identification-the-conditional-independence-assumption">Identification: the Conditional Independence Assumption&lt;/h3>
&lt;p>The causal forest leans on the &lt;strong>Conditional Independence Assumption&lt;/strong> (CIA), also called &lt;em>unconfoundedness&lt;/em> or &lt;em>selection on observables&lt;/em>: after conditioning on the observed covariates $(X, W)$, treatment assignment is as good as random, in the sense that&lt;/p>
&lt;p>$$\{Y_i(0), Y_i(1), Y_i(2), Y_i(3)\} \perp T_i \mid (\mathbf{X}_i, \mathbf{W}_i).$$&lt;/p>
&lt;p>In plain English: once we know a district&amp;rsquo;s geography, institutions, demographics, country, and year, knowing whether mining is active there tells us nothing more about what its potential nighttime-lights outcomes would be. Because we built the simulated data ourselves, the CIA holds by construction &amp;mdash; every confounder we created is in $(X, W)$.&lt;/p>
&lt;p>In real data, the CIA is &lt;em>untestable&lt;/em> and easy to violate. Two concrete violation channels for this application:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Mineral surveys.&lt;/strong> Mining companies often arrive in a district &lt;em>because&lt;/em> a geological survey flagged the geology as promising. The same survey may also predict future infrastructure investment unrelated to mining. If those surveys are not in $(X, W)$, both treatment and the potential outcome are correlated with an unobserved confounder.&lt;/li>
&lt;li>&lt;strong>Political connections.&lt;/strong> Districts whose elites are aligned with the central government may both attract mining concessions &lt;em>and&lt;/em> receive non-mining infrastructure (roads, electrification). An analyst without a measure of political alignment would mis-attribute the infrastructure effect to mining.&lt;/li>
&lt;/ul>
&lt;p>Hodler, Lechner &amp;amp; Raschky (2023) defend the CIA in their setting by including a rich set of geological, geographic, and institutional controls; the methodology in this tutorial is no stronger than that defense.&lt;/p>
&lt;h2 id="average-treatment-effects">Average Treatment Effects&lt;/h2>
&lt;p>EconML&amp;rsquo;s &lt;code>ate_inference()&lt;/code> returns the average causal effect for a chosen pair of treatment levels, together with a standard error and a confidence interval.&lt;/p>
&lt;p>The standard error here is the SE of the &lt;em>forest-level&lt;/em> ATE point estimate, not the SE of any one unit&amp;rsquo;s CATE. It comes from the &lt;strong>Bootstrap of Little Bags&lt;/strong> (BLB), a sub-bootstrap procedure (Athey, Tibshirani &amp;amp; Wager, 2019, §4) tailored to forests. Rather than refit hundreds of full forests &amp;mdash; which would cost $O(B \cdot \text{forest})$ &amp;mdash; BLB partitions the existing forest&amp;rsquo;s trees into &amp;ldquo;bags&amp;rdquo;, computes bag-level estimates, and uses the variance across bags as an estimate of the sampling variance of the full-forest ATE. The trick exploits the conditional independence of trees grown on different sub-samples; it returns valid asymptotic confidence intervals at a fraction of the cost of the obvious resampling scheme. EconML enables BLB whenever you pass &lt;code>inference=True&lt;/code> to the constructor.&lt;/p>
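&lt;p>The bag aggregation itself is simple arithmetic. The toy below fakes per-tree ATE estimates with independent noise &amp;mdash; a deliberate simplification, since in a real forest trees within a bag share a subsample, which is precisely the dependence structure BLB exploits &amp;mdash; just to show how bag-level means turn into a variance estimate:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Fake per-tree ATE estimates: 500 trees partitioned into 50 bags of 10.
n_trees, n_bags = 500, 50
tree_ates = 0.25 + rng.normal(scale=0.3, size=n_trees)

# Full-forest point estimate: average over all trees.
ate_hat = tree_ates.mean()

# BLB-style step: compute bag-level means, then use their dispersion
# as an estimate of the sampling variance of the full-forest average.
bag_means = tree_ates.reshape(n_bags, -1).mean(axis=1)
se_hat = bag_means.std(ddof=1) / np.sqrt(n_bags)
```

&lt;p>No forest is refit: the only inputs are quantities the existing trees already produced, which is where the computational savings come from.&lt;/p>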
&lt;p>We report 90% intervals (&lt;code>alpha=0.1&lt;/code>) by default &amp;mdash; the convention used in Athey, Tibshirani &amp;amp; Wager (2019) and Hodler, Lechner &amp;amp; Raschky (2023). The substantive conclusions are unchanged at 95%, but the wider intervals make the price-effect comparisons (which have low power because only 150 observations per treatment level contribute) look more uncertain than the asymmetric pattern actually warrants.&lt;/p>
&lt;p>We compute all six pairwise treatment contrasts:&lt;/p>
&lt;pre>&lt;code class="language-python">comparisons = [
('1-0', 0, 1), ('2-0', 0, 2), ('3-0', 0, 3),
('2-1', 1, 2), ('3-1', 1, 3), ('3-2', 2, 3),
]
for comp_label, t0, t1 in comparisons:
res = est_ntl.ate_inference(X, T0=t0, T1=t1)
lo, hi = res.conf_int_mean(alpha=0.1)
print(f&amp;quot; {comp_label}: ATE={res.mean_point:.4f} &amp;quot;
f&amp;quot;SE={res.stderr_mean:.4f} 90%CI=[{lo:.3f}, {hi:.3f}]&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> 1-0: ATE=0.2398 SE=0.0701 90%CI=[0.124, 0.355]
2-0: ATE=0.2684 SE=0.0791 90%CI=[0.138, 0.399]
3-0: ATE=0.4598 SE=0.0811 90%CI=[0.326, 0.593]
2-1: ATE=0.0286 SE=0.1008 90%CI=[-0.137, 0.194]
3-1: ATE=0.2200 SE=0.1013 90%CI=[0.053, 0.387]
3-2: ATE=0.1914 SE=0.1093 90%CI=[0.012, 0.371]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Finding 1: Mining raises economic activity, after controlling for confounding.&lt;/strong> All three mining-vs-no-mining contrasts (1-0, 2-0, 3-0) are positive, with point estimates well separated from zero relative to their standard errors. The basic mining effect 1-0 is &lt;strong>0.240&lt;/strong> (SE = 0.070, 90% CI = [0.124, 0.355]) &amp;mdash; comfortably above zero and within sampling error of the ground-truth 0.250. The naive difference-in-means for the same contrast was 0.109; the DML forest has eliminated nearly all of that confounding bias. Because the outcome is log nighttime lights, an effect of 0.24 corresponds to roughly a 27% increase in unlogged NTL ($e^{0.24} - 1 \approx 0.27$).&lt;/p>
&lt;p>&lt;strong>Finding 2: The price gradient is non-linear.&lt;/strong> Comparing medium prices to low prices (2-1) returns an ATE of &lt;strong>0.029&lt;/strong> with an SE of 0.101 &amp;mdash; the 90% interval [-0.137, 0.194] easily contains zero. Medium prices, in this DGP, add nothing detectable beyond the basic mining effect. The high-vs-low contrast (3-1), by comparison, is &lt;strong>0.220&lt;/strong> (SE = 0.101) and significant at the 5% level, with a 90% interval that excludes zero. The high-vs-medium step (3-2) is &lt;strong>0.191&lt;/strong> and significant at 10%. The forest has recovered the qualitative shape of the true price-response curve &amp;mdash; flat at low-to-medium prices, jumping at high prices &amp;mdash; without being told to look for a non-linearity. This is the kind of finding causal ML buys you: shape discovery without functional-form pre-specification.&lt;/p>
&lt;h2 id="treatment-effect-heterogeneity-gates">Treatment effect heterogeneity (GATEs)&lt;/h2>
&lt;h3 id="computing-gates-from-per-observation-cates">Computing GATEs from per-observation CATEs&lt;/h3>
&lt;p>EconML returns per-observation CATEs through &lt;code>effect_inference()&lt;/code>. To form a GATE we average those CATEs within a chosen subgroup, and to form a standard error we propagate the per-observation BLB standard errors. Doing this by hand is more illuminating than a one-line API call &amp;mdash; it makes the relationship between CATE-level heterogeneity and group-level effects visible.&lt;/p>
&lt;pre>&lt;code class="language-python">def compute_gate(est, df, z_var, t0, t1):
inf = est.effect_inference(X, T0=t0, T1=t1)
ite, ite_se = inf.point_estimate, inf.stderr
for z in sorted(df[z_var].unique()):
mask = df[z_var].values == z
gate = ite[mask].mean()
# Propagate BLB standard errors (see derivation below)
gate_se = np.sqrt(np.mean(ite_se[mask]**2) / mask.sum())
&lt;/code>&lt;/pre>
&lt;p>For a subgroup $g$ of size $n_g$, the GATE estimator is the simple average of the per-observation CATE estimates,&lt;/p>
&lt;p>$$\widehat{\mathrm{GATE}}_g = \frac{1}{n_g} \sum_{i \in g} \widehat\tau(\mathbf{X}_i).$$&lt;/p>
&lt;p>If we treat the $\widehat\tau(\mathbf{X}_i)$ as approximately uncorrelated within the group &amp;mdash; a working assumption, since EconML&amp;rsquo;s BLB does not return their full covariance matrix &amp;mdash; the variance of their average is&lt;/p>
&lt;p>$$\mathrm{Var}\left(\widehat{\mathrm{GATE}}_g\right) \approx \frac{1}{n_g^2} \sum_{i \in g} \mathrm{Var}\left(\widehat\tau(\mathbf{X}_i)\right) = \frac{1}{n_g} \cdot \overline{\mathrm{SE}_i^2}.$$&lt;/p>
&lt;p>Taking the square root gives the formula in the code: &lt;code>sqrt(mean(se_i^2) / n_g)&lt;/code>. The CIs we report are point $\pm 1.645 \cdot \widehat{\mathrm{SE}}$ for a 90% level. Two caveats are worth flagging up front: (i) the within-group independence assumption probably understates the SE in panel data where the same district appears multiple times in the same group, and (ii) this SE captures estimation uncertainty in the CATE function only, not sampling variability of the subgroup composition. As with the ATE, the headline qualitative pattern survives at 95% intervals.&lt;/p>
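&lt;p>A quick Monte Carlo check of the propagation formula, under the same working assumption of independent per-observation errors (all numbers invented for illustration):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(3)
n_g = 400
se_i = rng.uniform(0.05, 0.15, size=n_g)   # per-observation BLB SEs
tau_i = rng.normal(0.24, 0.02, size=n_g)   # fixed "true" CATEs in the group

# Monte Carlo: redraw independent estimation errors and average each time.
draws = tau_i + se_i * rng.normal(size=(20000, n_g))
gate_draws = draws.mean(axis=1)

mc_se = gate_draws.std(ddof=1)                    # simulated SE of the GATE
formula_se = np.sqrt(np.mean(se_i**2) / n_g)      # sqrt(mean(se_i^2) / n_g)
```

&lt;p>The two numbers agree to within Monte Carlo error; introducing positive correlation between the errors would push &lt;code>mc_se&lt;/code> above &lt;code>formula_se&lt;/code>, which is the direction of caveat (i).&lt;/p>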
&lt;h3 id="gates-by-executive-constraints">GATEs by Executive Constraints&lt;/h3>
&lt;p>The mining effect (1-0) should vary with institutional quality, while the price effect (3-1) should be flat:&lt;/p>
&lt;p>&lt;img src="python_econml_gate_ntl_1v0_exec.png" alt="GATEs for NTL mining effect (1-0) by Executive Constraints">
&lt;em>GATEs for the mining effect (1-0) by executive constraints. The upward slope shows that stronger institutions amplify the economic benefits of mining.&lt;/em>&lt;/p>
&lt;p>&lt;img src="python_econml_gate_ntl_3v1_exec.png" alt="GATEs for NTL price effect (3-1) by Executive Constraints">
&lt;em>GATEs for the price effect (3-1) by executive constraints. The flat pattern confirms that institutions do not moderate price effects.&lt;/em>&lt;/p>
&lt;pre>&lt;code class="language-text"> 1-0 (Mining vs No Mining):
Exec. Constr. GATE 90% CI N
----------------------------------------------------
1 0.175 [0.168, 0.182] 300
2 0.255 [0.249, 0.262] 330
3 0.240 [0.236, 0.244] 720
4 0.242 [0.238, 0.246] 780
5 0.243 [0.237, 0.250] 420
6 0.264 [0.259, 0.269] 450
Range: 0.089
3-1 (High vs Low Prices):
Exec. Constr. GATE 90% CI N
----------------------------------------------------
1 0.242 [0.232, 0.252] 300
2 0.197 [0.187, 0.206] 330
3 0.217 [0.211, 0.224] 720
4 0.227 [0.221, 0.233] 780
5 0.224 [0.216, 0.231] 420
6 0.211 [0.204, 0.219] 450
Range: 0.045
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Finding 3: Institutions moderate the mining margin but not the price margin.&lt;/strong> The mining-effect GATEs (1-0) span a range of &lt;strong>0.089&lt;/strong> across executive-constraint levels, climbing roughly monotonically from 0.175 at the weakest institutions to 0.264 at the strongest. Read substantively: the weakest institutions cut the development gain from mining by roughly a third (0.175 vs. 0.264). The price-effect GATEs (3-1) span only &lt;strong>0.045&lt;/strong> and show no monotone pattern &amp;mdash; a non-finding that is itself the finding. The GATE plot effectively flat-lines because the price step is, by construction, uniform across institutional environments in the DGP.&lt;/p>
&lt;p>This asymmetry &amp;mdash; institutions shaping the mining-vs-no-mining margin but not the price margin &amp;mdash; is the structural prediction of the institutions-and-resources literature (Mehlum, Moene &amp;amp; Torvik, 2006) and the empirical pattern Hodler, Lechner &amp;amp; Raschky (2023) document for Sub-Saharan African districts. A causal forest does not assume the asymmetry; it discovers it. That is the distinguishing payoff of letting the slope $\tau(\mathbf{x})$ be a flexible function rather than fixing it parametrically (e.g., a single $\tau \times \mathrm{exec\_constraints}$ interaction term).&lt;/p>
&lt;h3 id="gates-by-quality-of-government">GATEs by Quality of Government&lt;/h3>
&lt;p>The same pattern appears when we use a continuous institutional measure:&lt;/p>
&lt;p>&lt;img src="python_econml_gate_ntl_1v0_qog.png" alt="GATEs for NTL mining effect (1-0) by Quality of Government">
&lt;em>GATEs for the mining effect (1-0) by quality of government. The positive relationship cross-validates the executive constraints finding.&lt;/em>&lt;/p>
&lt;p>&lt;img src="python_econml_gate_ntl_3v1_qog.png" alt="GATEs for NTL price effect (3-1) by Quality of Government">
&lt;em>GATEs for the price effect (3-1) by quality of government. The flat pattern is consistent across institutional measures.&lt;/em>&lt;/p>
&lt;p>The mining effect (1-0) shows a positive relationship with quality of government, while the price effect (3-1) remains approximately flat across the institutional quality distribution. This cross-validates Finding 3 using a different institutional measure.&lt;/p>
&lt;h2 id="variable-importance">Variable importance&lt;/h2>
&lt;p>EconML reports &lt;code>feature_importances_&lt;/code> for the causal forest &amp;mdash; the normalized contribution of each $X$-variable to treatment-effect &lt;em>heterogeneity&lt;/em> across all splits in all trees:&lt;/p>
&lt;pre>&lt;code class="language-python">importances = est_ntl.feature_importances_
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> distance_capital 0.171
ethnic_frac 0.142
ruggedness 0.135
population 0.126
agri_suitability 0.120
elevation 0.120
temperature 0.120
gdp_pc 0.034
quality_of_govt 0.018
exec_constraints 0.014
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="python_econml_var_importance.png" alt="Feature importance for treatment effect heterogeneity">
&lt;em>Feature importance for treatment effect heterogeneity. Geographic variables dominate splitting frequency, but the GATE plots show that institutional variables are the true moderators in the DGP.&lt;/em>&lt;/p>
&lt;p>This ranking looks paradoxical: the GATE plots above just demonstrated that &lt;code>exec_constraints&lt;/code> is what bends the mining effect, yet &lt;code>exec_constraints&lt;/code> is dead last by importance. The resolution is that &lt;strong>feature importance and moderation are different objects&lt;/strong>.&lt;/p>
&lt;p>A variable $X_j$ is a &lt;strong>moderator&lt;/strong> of the treatment effect if changing it changes the effect:&lt;/p>
&lt;p>$$\frac{\partial \tau(\mathbf{x})}{\partial x_j} \neq 0.$$&lt;/p>
&lt;p>A variable&amp;rsquo;s &lt;strong>forest importance&lt;/strong>, by contrast, is the variance-reduction-weighted frequency with which it is selected as a split variable. The two diverge in a predictable way:&lt;/p>
&lt;ul>
&lt;li>&lt;em>Continuous variables&lt;/em> (e.g., &lt;code>distance_capital&lt;/code>, &lt;code>ethnic_frac&lt;/code>) admit many candidate split thresholds and tend to be picked frequently for fine-grained slicing, even when each individual split contributes only a tiny amount to actual heterogeneity.&lt;/li>
&lt;li>&lt;em>Coarse discrete variables&lt;/em> like &lt;code>exec_constraints&lt;/code> (6 levels) have at most 5 candidate splits. Even when one of those splits captures the dominant moderation pattern, the variable accumulates a smaller total importance than a continuous neighbor that splits 50 times.&lt;/li>
&lt;/ul>
&lt;p>Read importances as a &lt;strong>screening&lt;/strong> signal &amp;mdash; a &amp;ldquo;where might heterogeneity be hiding?&amp;rdquo; first pass. Confirm or reject moderation with a hypothesis-driven GATE, a partial-dependence plot of $\tau(\mathbf{x})$, or the CATE Interpreter described next. The GATE analysis above is what nails the institutional-moderation finding; the importance ranking is what would have made you suspicious enough to draw the GATE plot in the first place.&lt;/p>
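&lt;p>The cardinality bias is easy to reproduce with a plain scikit-learn forest on pure noise (a generic illustration, not the tutorial's causal forest): neither feature has any relationship to the outcome, yet impurity-based importance piles up on the continuous one because it offers far more candidate splits:&lt;/p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 2000
x_cont = rng.normal(size=n)            # continuous: ~n candidate thresholds
x_coarse = rng.integers(0, 2, size=n)  # binary: a single candidate split
y = rng.normal(size=n)                 # pure noise: neither feature matters

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(np.column_stack([x_cont, x_coarse]), y)
imp_cont, imp_coarse = rf.feature_importances_
```

&lt;p>Even with zero true signal, &lt;code>imp_cont&lt;/code> dominates &lt;code>imp_coarse&lt;/code> &amp;mdash; exactly the mechanism that leaves &lt;code>exec_constraints&lt;/code> at the bottom of the ranking above despite being the true moderator.&lt;/p>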
&lt;h2 id="cate-interpreter">CATE Interpreter&lt;/h2>
&lt;p>EconML&amp;rsquo;s &lt;code>SingleTreeCateInterpreter&lt;/code> fits a &lt;em>shallow&lt;/em> decision tree to the estimated CATEs themselves &amp;mdash; the tree&amp;rsquo;s outcome is the model&amp;rsquo;s prediction $\widehat\tau(\mathbf{X}_i)$, not the original $Y_i$. By splitting on $\mathbf{X}$, the tree finds the covariates and thresholds that best separate units with different treatment effects, returning a small set of subgroups summarized by their average $\widehat\tau$. It is a &lt;em>summary&lt;/em> of the forest&amp;rsquo;s heterogeneity surface, not a re-estimation of treatment effects.&lt;/p>
&lt;pre>&lt;code class="language-python">from econml.cate_interpreter import SingleTreeCateInterpreter
intrp = SingleTreeCateInterpreter(max_depth=2, min_samples_leaf=100)
intrp.interpret(est_ntl, X)
intrp.plot(feature_names=X_COLS)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="python_econml_cate_tree.png" alt="Decision tree summarizing CATE heterogeneity for the mining effect">
&lt;em>Depth-2 decision tree summarizing CATE heterogeneity for the mining effect (1-0). Each leaf reports the mean estimated CATE for the subgroup defined by the splits above it.&lt;/em>&lt;/p>
&lt;p>Two design choices control how interpretable the output is. &lt;strong>Tree depth&lt;/strong> trades off detail against communicability: depth 2 produces at most four leaves and a story you can tell out loud; depth 4 or more reveals interaction structure but rarely fits in a paper figure. &lt;strong>Minimum leaf size&lt;/strong> (&lt;code>min_samples_leaf=100&lt;/code>) prevents the tree from carving out tiny, noisy subgroups whose CATE estimates are statistically unreliable. We pull both into the named module constants &lt;code>CATE_TREE_DEPTH&lt;/code> and &lt;code>CATE_TREE_MIN_LEAF&lt;/code> in &lt;code>script.py&lt;/code> so the choice is one place to change rather than scattered magic numbers.&lt;/p>
&lt;p>The CATE Interpreter is a complement to, not a substitute for, the GATE analysis. &lt;strong>GATEs are hypothesis-driven&lt;/strong>: you pre-specify the moderating variable (here, &lt;code>exec_constraints&lt;/code>) and test how the effect varies across its values. &lt;strong>The CATE Interpreter is exploratory&lt;/strong>: it asks &amp;ldquo;of all the covariates, which ones &amp;mdash; at which thresholds &amp;mdash; best separate high-effect from low-effect units?&amp;rdquo; Running both is good practice. If the tree&amp;rsquo;s top split corresponds to a pre-specified moderator, your theory is reinforced; if the tree finds a different split, you have learned something the theory did not predict and have a candidate for follow-up GATE plots.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;h3 id="limitations">Limitations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>No clustered standard errors.&lt;/strong> &lt;em>Clustered SEs&lt;/em> allow the residual variance to differ across clusters (here, districts) and absorb arbitrary within-cluster correlation. EconML&amp;rsquo;s &lt;code>inference=True&lt;/code> reports forest-level Bootstrap-of-Little-Bags SEs that treat observations as independent. With panel data &amp;mdash; the same district appearing in multiple years &amp;mdash; the BLB SEs are likely too small. We use &lt;code>GroupKFold&lt;/code> by district to prevent first-stage data leakage, but that is a different problem from second-stage variance estimation. The &lt;a href="https://carlos-mendez.org/post/stata_cate2/">companion Stata tutorial&lt;/a> uses Stata 19&amp;rsquo;s &lt;code>cate&lt;/code> command, which supports &lt;code>vce(cluster district_id)&lt;/code> directly.&lt;/li>
&lt;li>&lt;strong>Contemporaneous outcomes.&lt;/strong> Hodler, Lechner &amp;amp; Raschky (2023) use treatment at time $t$ and outcome at $t+1$, which rules out reverse causality from outcome to treatment within the same year. Our simulated data uses contemporaneous treatment and outcomes; in real applications, lagging the outcome is cheap insurance.&lt;/li>
&lt;li>&lt;strong>Simplified covariate set.&lt;/strong> The real analysis uses 60+ covariates spanning geology, geography, demography, institutions, and pre-treatment outcomes; we use 12. The simulated DGP guarantees that the CIA holds because we control for every confounder we built in. Real-world identification is only as strong as the controls support, and &amp;ldquo;we used a causal forest&amp;rdquo; does not relax the CIA.&lt;/li>
&lt;/ul>
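&lt;p>On the contemporaneous-outcomes point: in pandas, the lagged-outcome design is a one-liner, provided the shift is taken within district rather than across district boundaries. The tiny frame below is hypothetical and reuses the tutorial's column names:&lt;/p>

```python
import pandas as pd

# Hypothetical panel with the tutorial's column names.
df = pd.DataFrame({
    'district_id': [1, 1, 1, 2, 2, 2],
    'year':        [2000, 2001, 2002, 2000, 2001, 2002],
    'ntl_log':     [0.1, 0.2, 0.3, 1.0, 1.1, 1.2],
}).sort_values(['district_id', 'year'])

# Pair outcome at t+1 with treatment/covariates at t.
df['ntl_log_lead'] = df.groupby('district_id')['ntl_log'].shift(-1)
df_est = df.dropna(subset=['ntl_log_lead'])  # last year of each district drops
```

&lt;p>The &lt;code>groupby&lt;/code> ensures district 1's last year never borrows district 2's first year as its "future" outcome.&lt;/p>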
&lt;h3 id="assumptions">Assumptions&lt;/h3>
&lt;p>The CATE estimates rely on the &lt;strong>Conditional Independence Assumption&lt;/strong>: treatment is independent of potential outcomes given $(X, W)$. The CIA is untestable from data alone &amp;mdash; it asserts something about the &lt;em>unobserved&lt;/em> potential outcomes. In observational work, the standard defense is a combination of (i) institutional knowledge of the treatment-assignment process, (ii) a rich, theory-motivated set of covariates, and (iii) sensitivity analyses (e.g., Rosenbaum bounds, $E$-values) that ask how strong an unobserved confounder would have to be to overturn the conclusion. None of these is a substitute for randomization. In the simulated data here, we know the CIA holds because we built it that way.&lt;/p>
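&lt;p>As a flavor of what such a sensitivity analysis reports, here is the $E$-value formula of VanderWeele &amp;amp; Ding (2017) for a risk ratio. Mapping our log-points effect onto a risk ratio is loose, so treat the computed number as purely illustrative:&lt;/p>

```python
import math

def e_value(rr):
    # E-value for an observed risk ratio rr >= 1: the minimum strength of
    # association an unmeasured confounder would need with both treatment
    # and outcome to fully explain the estimate away.
    return rr + math.sqrt(rr * (rr - 1.0))

# Illustration only: treating the ~27% NTL increase as if it were a risk ratio.
rr_illustrative = 1.27
```

&lt;p>An $E$-value near 1 means a weak unobserved confounder could overturn the finding; larger values mean the mineral-survey or political-connection stories above would have to be quantitatively strong to do so.&lt;/p>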
&lt;h2 id="summary-and-next-steps">Summary and next steps&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>EconML&amp;rsquo;s &lt;code>CausalForestDML&lt;/code> recovered all three ground-truth findings.&lt;/strong> The ATE for the basic mining effect (1-0 = 0.240) is within sampling error of the true value 0.250 and removes nearly all of the 0.141 confounding bias visible in the naive estimator. Price effects come out non-linear (2-1 = 0.029, n.s.; 3-1 = 0.220, significant at 5%; 3-2 = 0.191, significant at 10%) without any pre-specified non-linearity. GATE patterns reveal that institutions moderate the mining effect (range = 0.089 across executive-constraint levels) but not the price effect (range = 0.045, no monotone pattern).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The DML two-stage residualization argument is what makes the causal forest valid in observational settings.&lt;/strong> Substituting the treatment equation into the outcome equation reduces causal estimation to a regression of $\tilde Y$ on $\tilde T$, where the residualizers $\hat g_0$ and $\hat m_0$ can be any flexible learner. Neyman orthogonality means errors in the residualizers enter only at second order, so $\sqrt n$-consistent estimates of $\tau$ are recoverable even with $O(n^{-1/4})$ first-stage rates.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Feature importance is a screening tool, not a moderation test.&lt;/strong> Continuous variables accumulate importance because they offer many split points, even when they do not bend the treatment effect. The GATE plot of $\tau$ against the suspected moderator is the right tool for confirming moderation; importance is the right tool for identifying candidates worth plotting.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The CATE Interpreter is the exploratory dual of GATEs.&lt;/strong> A shallow decision tree on the predicted CATEs surfaces data-driven subgroups, complementing the hypothesis-driven GATE analysis. Use both: GATEs test theory, the interpreter audits theory.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>For the economic story behind these findings and a parallel implementation using Stata 19&amp;rsquo;s built-in &lt;code>cate&lt;/code> command, see the companion tutorial: &lt;a href="https://carlos-mendez.org/post/stata_cate2/">Causal Machine Learning and the Resource Curse with Stata 19&lt;/a>.&lt;/p>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Replace the nuisance models.&lt;/strong> Swap &lt;code>GradientBoostingRegressor&lt;/code> with &lt;code>RandomForestRegressor(n_estimators=200)&lt;/code>. Do the ATE and GATE estimates change? Why or why not (think about Neyman orthogonality)?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Vary the number of trees.&lt;/strong> Try &lt;code>n_estimators=100&lt;/code> vs &lt;code>n_estimators=1000&lt;/code>. How do the standard errors and GATE patterns change?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Test the GroupKFold assumption.&lt;/strong> Remove &lt;code>groups=df['district_id'].values&lt;/code> from the &lt;code>fit()&lt;/code> call. What happens to the confidence intervals?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Discretize quality of government.&lt;/strong> Create quartiles of &lt;code>quality_of_govt&lt;/code> and compute GATEs on the quartiles instead of raw values. Do the patterns become clearer?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Explore the CATE interpreter depth.&lt;/strong> Increase &lt;code>max_depth&lt;/code> from 2 to 4 in &lt;code>SingleTreeCateInterpreter&lt;/code>. Do the additional splits reveal meaningful subgroups or just noise?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1371/journal.pone.0284968" target="_blank" rel="noopener">Hodler, R., Lechner, M., &amp;amp; Raschky, P.A. (2023). Institutions and the resource curse: New insights from causal machine learning. &lt;em>PLoS ONE&lt;/em>, 18(6), e0284968.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/ectj.12097" target="_blank" rel="noopener">Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., &amp;amp; Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. &lt;em>The Econometrics Journal&lt;/em>, 21(1), C1&amp;ndash;C68.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1214/18-AOS1709" target="_blank" rel="noopener">Athey, S., Tibshirani, J., &amp;amp; Wager, S. (2019). Generalized Random Forests. &lt;em>The Annals of Statistics&lt;/em>, 47(2), 1148&amp;ndash;1178.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.2017.1319839" target="_blank" rel="noopener">Wager, S. &amp;amp; Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. &lt;em>Journal of the American Statistical Association&lt;/em>, 113(523), 1228&amp;ndash;1242.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1912705" target="_blank" rel="noopener">Robinson, P.M. (1988). Root-N-Consistent Semiparametric Regression. &lt;em>Econometrica&lt;/em>, 56(4), 931&amp;ndash;954.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1037/h0037350" target="_blank" rel="noopener">Rubin, D.B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. &lt;em>Journal of Educational Psychology&lt;/em>, 66(5), 688&amp;ndash;701.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1017/CBO9781139025751" target="_blank" rel="noopener">Imbens, G.W. &amp;amp; Rubin, D.B. (2015). &lt;em>Causal Inference for Statistics, Social, and Biomedical Sciences&lt;/em>. Cambridge University Press.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.nber.org/papers/w5398" target="_blank" rel="noopener">Sachs, J.D. &amp;amp; Warner, A.M. (1995). Natural Resource Abundance and Economic Growth. &lt;em>NBER Working Paper&lt;/em> No. 5398.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/j.1468-0297.2006.01045.x" target="_blank" rel="noopener">Mehlum, H., Moene, K., &amp;amp; Torvik, R. (2006). Institutions and the Resource Curse. &lt;em>The Economic Journal&lt;/em>, 116(508), 1&amp;ndash;20.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.pywhy.org/EconML/" target="_blank" rel="noopener">EconML Documentation &amp;mdash; PyWhy&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Causal Machine Learning and the Resource Curse with Stata 19</title><link>https://carlos-mendez.org/post/stata_cate2/</link><pubDate>Wed, 06 May 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_cate2/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>Imagine discovering that the very thing that should make a country rich &amp;mdash; abundant natural resources &amp;mdash; actually makes it poorer. This is the &lt;strong>resource curse&lt;/strong> hypothesis, first documented by Sachs and Warner (1995): countries rich in oil, minerals, or other extractive resources often experience slower growth, weaker institutions, and more conflict than resource-poor nations.&lt;/p>
&lt;p>But the story is more nuanced than &amp;ldquo;resources are bad.&amp;rdquo; Mehlum, Moene, and Torvik (2006) argued that &lt;strong>institutional quality&lt;/strong> determines whether resource wealth becomes a blessing or a curse. Countries with strong rule of law and quality governance channel resource revenues productively, while weak institutions allow rent-seeking and conflict.&lt;/p>
&lt;p>This tutorial is inspired by &lt;a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0284968" target="_blank" rel="noopener">Hodler, Lechner &amp;amp; Raschky (2023)&lt;/a>, who brought &lt;strong>causal machine learning&lt;/strong> to this debate. Using a Modified Causal Forest on sub-national mining districts across Sub-Saharan Africa, they uncovered three key findings:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Mining increases development and conflict&lt;/strong> &amp;mdash; Districts that begin mining experience higher nighttime lights (a proxy for economic activity) and more conflict events.&lt;/li>
&lt;li>&lt;strong>Price effects are non-linear&lt;/strong> &amp;mdash; The effect of mineral prices on outcomes is small at moderate prices but jumps sharply at high prices.&lt;/li>
&lt;li>&lt;strong>Institutions moderate mining but NOT prices&lt;/strong> &amp;mdash; Institutional quality amplifies the development benefits of mining (upward-sloping GATEs), but does &lt;em>not&lt;/em> moderate the effect of global price shocks (flat GATEs).&lt;/li>
&lt;/ol>
&lt;p>This tutorial uses &lt;strong>Stata 19&amp;rsquo;s &lt;code>cate&lt;/code> command&lt;/strong> to replicate all three findings on a simulated dataset with &lt;strong>known ground-truth causal effects&lt;/strong> (3,000 observations = 300 districts $\times$ 10 years). Because the data-generating process is known, we can directly compare our estimates against the true parameter values. The &lt;code>cate&lt;/code> command provides native access to generalized random forests, doubly robust estimation, and formal hypothesis tests &amp;mdash; all without external packages.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Prerequisite.&lt;/strong> This post requires &lt;strong>Stata 19 or later&lt;/strong>. The &lt;code>cate&lt;/code> command does not exist in Stata 18. The companion do-file aborts on startup if it detects an older Stata.&lt;/p>
&lt;/blockquote>
&lt;blockquote>
&lt;p>&lt;strong>Runtime.&lt;/strong> The full analysis takes approximately &lt;strong>20&amp;ndash;30 minutes&lt;/strong> on a modern machine. Each &lt;code>cate&lt;/code> estimation takes 60&amp;ndash;90 seconds with 5-fold cross-fitting.&lt;/p>
&lt;/blockquote>
&lt;p>For a deeper introduction to the CATE framework and the &lt;code>cate&lt;/code> command on a binary-treatment dataset, see the companion tutorial &lt;a href="https://carlos-mendez.org/post/stata_cate/">Conditional Average Treatment Effects (CATE) with Stata 19&lt;/a>.&lt;/p>
&lt;h3 id="11-learning-objectives">1.1 Learning objectives&lt;/h3>
&lt;p>By the end of this tutorial you should be able to:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Understand&lt;/strong> the resource curse hypothesis and why treatment effects may vary with institutional quality.&lt;/li>
&lt;li>&lt;strong>Apply&lt;/strong> Stata 19&amp;rsquo;s &lt;code>cate&lt;/code> command to a multi-valued treatment via binary pairwise comparisons.&lt;/li>
&lt;li>&lt;strong>Distinguish&lt;/strong> the PO and AIPW estimators and recognize when each is preferred.&lt;/li>
&lt;li>&lt;strong>Estimate&lt;/strong> ATEs, GATEs, and IATEs for multiple treatment contrasts.&lt;/li>
&lt;li>&lt;strong>Interpret&lt;/strong> GATE patterns to identify institutional moderation of treatment effects.&lt;/li>
&lt;li>&lt;strong>Diagnose&lt;/strong> treatment-effect heterogeneity with formal hypothesis tests (&lt;code>estat heterogeneity&lt;/code>, &lt;code>estat gatetest&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Visualize&lt;/strong> individualized treatment effects using &lt;code>categraph&lt;/code> postestimation tools.&lt;/li>
&lt;li>&lt;strong>Connect&lt;/strong> statistical results to substantive findings from published research.&lt;/li>
&lt;/ul>
&lt;h3 id="12-analytical-roadmap">1.2 Analytical roadmap&lt;/h3>
&lt;p>The diagram below shows the five stages of this tutorial, from data exploration through advanced diagnostics.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">flowchart LR
A[&amp;quot;&amp;lt;b&amp;gt;Data &amp;amp;&amp;lt;br/&amp;gt;Descriptives&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Sections 3--4&amp;lt;/i&amp;gt;&amp;quot;]:::data
B[&amp;quot;&amp;lt;b&amp;gt;Naive vs&amp;lt;br/&amp;gt;Ground Truth&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 5&amp;lt;/i&amp;gt;&amp;quot;]:::naive
C[&amp;quot;&amp;lt;b&amp;gt;ATE&amp;lt;br/&amp;gt;Estimation&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Sections 6--7&amp;lt;/i&amp;gt;&amp;quot;]:::ate
D[&amp;quot;&amp;lt;b&amp;gt;GATE&amp;lt;br/&amp;gt;Heterogeneity&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 8&amp;lt;/i&amp;gt;&amp;quot;]:::gate
E[&amp;quot;&amp;lt;b&amp;gt;Advanced&amp;lt;br/&amp;gt;Diagnostics&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;Section 9&amp;lt;/i&amp;gt;&amp;quot;]:::diag
A --&amp;gt; B --&amp;gt; C --&amp;gt; D --&amp;gt; E
classDef data fill:#6a9bcc,stroke:#141413,color:#fff
classDef naive fill:#d97757,stroke:#141413,color:#fff
classDef ate fill:#00d4c8,stroke:#141413,color:#141413
classDef gate fill:#d97757,stroke:#141413,color:#fff
classDef diag fill:#141413,stroke:#d97757,color:#fff
&lt;/code>&lt;/pre>
&lt;h3 id="13-key-concepts-at-a-glance">1.3 Key concepts at a glance&lt;/h3>
&lt;p>This post leans repeatedly on a small vocabulary. The rest of the tutorial assumes you can move between these terms quickly. Each concept below has three parts. The &lt;strong>definition&lt;/strong> is always visible. The &lt;strong>example&lt;/strong> and &lt;strong>analogy&lt;/strong> sit behind clickable cards: open them when you need them, leave them collapsed for a quick scan. If a later section mentions &amp;ldquo;PO vs AIPW&amp;rdquo; or &amp;ldquo;honest splitting&amp;rdquo; and the term feels slippery, this is the section to re-read.&lt;/p>
&lt;p>&lt;strong>1. Potential outcomes&lt;/strong> $Y_i(t)$.
The outcome unit $i$ &lt;strong>would&lt;/strong> take under treatment value $t$. Each unit has one potential outcome per treatment level. We observe only one of them: the one matching the treatment actually received. The rest are &lt;em>counterfactual&lt;/em>. They live in worlds we never see.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Take district 47 in 2008. Four potential NTL outcomes exist for it: $Y_{47,2008}(0)$, $Y_{47,2008}(1)$, $Y_{47,2008}(2)$, and $Y_{47,2008}(3)$. They correspond to no mining, low prices, medium prices, and high prices. Only one is in the dataset. It is the one matching whatever &lt;code>treatment&lt;/code> value that district-year actually had. The other three are forever invisible.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Every life decision is a fork in the road. You took one fork. The parallel-universe versions of yourself took the other forks. Their lives are real conceptual objects. You just cannot directly observe them. Causal inference reconstructs those parallel universes. It does so by looking at people who &lt;em>did&lt;/em> take the other forks.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>2. CATE&lt;/strong> &amp;mdash; Conditional Average Treatment Effect, $\tau(\mathbf{x})$.
The average treatment effect for units with covariate profile $\mathbf{x}$. The CATE is a &lt;strong>function&lt;/strong> of $\mathbf{x}$, not a single number. Where the CATE bends with $\mathbf{x}$, the treatment helps some units more than others. Stata&amp;rsquo;s &lt;code>cate&lt;/code> command estimates exactly this function.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Take a well-governed district profile in our data: &lt;code>exec_constraints = 6&lt;/code>, &lt;code>quality_of_govt = 0.7&lt;/code>, and so on. For that profile the CATE is $\tau(\mathbf{x}) \approx 0.26$. Mining lifts log-NTL by about 0.26 for that profile. Now move to the weakest-institutions case: &lt;code>exec_constraints = 1&lt;/code>. The same function gives only $\tau(\mathbf{x}) \approx 0.18$. The CATE is what makes this comparison possible.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A drug&amp;rsquo;s &amp;ldquo;average effect&amp;rdquo; might be a 5-point reduction in blood pressure. But a doctor cares about a specific patient. Maybe a 65-year-old male with diabetes. The CATE &lt;em>is&lt;/em> that personalized effect. It takes a patient profile in. It returns the expected effect for someone like them.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>3. GATE&lt;/strong> &amp;mdash; Group Average Treatment Effect.
The CATE averaged over a &lt;em>pre-specified&lt;/em> subgroup. The subgroup is defined by some variable. GATEs test targeted moderation hypotheses. A typical question: &amp;ldquo;does institutional quality moderate the effect of mining?&amp;rdquo;&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Sort districts by &lt;code>exec_constraints&lt;/code> (1&amp;ndash;6). Average the per-observation CATEs inside each level. At level 1 we get $\widehat{\mathrm{GATE}} \approx 0.18$. The number climbs to $\approx 0.26$ at level 6. That climb is the moderation pattern §8 visualizes. It is exactly what &lt;code>categraph gateplot&lt;/code> reports after the &lt;code>cate&lt;/code> command.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A nationwide marketing campaign might lift sales by 5% on average. Before scaling it up, the company asks a simple question: did it work better in cities than in rural towns? The GATE answers exactly that. It reports the campaign&amp;rsquo;s effect &lt;em>inside&lt;/em> each store type. It surfaces heterogeneity that the headline ATE hides.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>4. ATE&lt;/strong> &amp;mdash; Average Treatment Effect, $E[\tau(\mathbf{X})]$.
The CATE averaged over the entire sample. The headline policy number. It answers a single question: if we turned the treatment on for everyone, what average effect would we see?&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Take our 3,000 district-years. The PO ATE for the 1-vs-0 mining contrast is 0.194 (SE = 0.010). AIPW gives a more conservative 0.149 (SE = 0.011). Both are reported in §7. They are two estimates of the same population-level number.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>&amp;ldquo;This drug lowers cholesterol by 12 points on average.&amp;rdquo; That is an ATE statement. A single number, suitable for a press release. It says nothing about whether the drug works better in some patients than others. That question belongs to GATEs and CATEs.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>5. Nuisance functions&lt;/strong> $g_0, m_0$.
Two conditional means. $g_0(\mathbf{x}, \mathbf{w}) = E[Y \mid \mathbf{X}, \mathbf{W}]$ predicts the outcome from covariates. $m_0(\mathbf{x}, \mathbf{w}) = E[T \mid \mathbf{X}, \mathbf{W}]$ predicts the treatment from covariates. We call them &lt;em>nuisance&lt;/em> because we do not care about their values directly. We estimate them only to strip out the part of $Y$ and $T$ that is predictable from $(\mathbf{X}, \mathbf{W})$. What remains is the variation that identifies the causal effect.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>In this post Stata&amp;rsquo;s &lt;code>cate&lt;/code> fits both $g_0$ and $m_0$ as random forests behind the scenes. We never see them. We never tune them directly. They are intermediate machinery the command consumes and discards on its way to the CATE.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Two surveyors map two different layers of the same terrain. One maps elevation. The other maps soil type. Neither map is the goal. The goal is to subtract them from a third map and see what is left. That residue is what we actually care about.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>6. Cross-fitting and honest splitting&lt;/strong>.
Cross-fitting splits the sample into $K$ folds. Nuisance models are fit on $K-1$ folds and applied to the held-out fold. The roles rotate. No observation is ever scored by a model that saw it during training. Honest splitting goes one step further inside each tree. It uses one subsample to choose where to split. It uses a separate subsample to estimate the leaf values. Both tricks remove the over-fitting bias that would otherwise contaminate the CATE.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Stata&amp;rsquo;s &lt;code>cate&lt;/code> does this internally. We pass &lt;code>xfolds(5)&lt;/code> and the rest is automatic. We never call separate train/test commands. The 5 folds rotate behind the scenes; the user sees only the final estimates.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Two-pass exam grading. One TA writes the rubric without seeing your paper. A different TA applies the rubric without writing it. The separation is what makes the grade defensible. Mixing the two roles is exactly the over-fitting bias these tricks remove.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>7. PO vs AIPW estimators&lt;/strong>.
Two ways to map nuisance estimates to a CATE. &lt;strong>PO&lt;/strong> (Partialing Out) residualizes both $Y$ and $T$ against the covariates, then regresses one residual on the other. Simple, transparent, and stable even when propensity scores approach 0 or 1, since it never inverts them &amp;mdash; but not doubly robust. &lt;strong>AIPW&lt;/strong> (Augmented Inverse-Probability Weighting) reweights observations by inverse propensity and adds a regression correction. More sensitive to extreme propensity scores, but &lt;strong>doubly robust&lt;/strong>: it stays consistent if either $g_0$ or $m_0$ is correctly specified.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>This post fits both. They disagree by about 0.045 on the 1-vs-0 contrast (PO 0.194, AIPW 0.149). That gap is the model-disagreement diagnostic. When PO and AIPW disagree, the overlap is suspect or one of the nuisance models is mis-specified.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Two judges hear the same case. They follow slightly different reasoning paths. When their verdicts agree, you trust the case. When they disagree, you re-read the evidence. The disagreement is the signal, not noise.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>8. Heterogeneity test&lt;/strong>.
A formal test that $\tau(\mathbf{x})$ varies with $\mathbf{x}$. The null hypothesis is constant treatment effects: every unit gets the same effect. Rejection licenses CATE and GATE interpretation. Failing to reject does not mean effects are constant. It means the test could not detect variation at this sample size.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>After &lt;code>cate&lt;/code>, run &lt;code>estat heterogeneity&lt;/code> in §9. It returns a $\chi^2$ statistic and a $p$-value. A small $p$-value is the green light to inspect GATEs and CATEs. A large $p$-value is a caution: the heterogeneity story may not be in the data.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A metal detector for hidden moderation. It does not tell you &lt;em>where&lt;/em> in the field the metal is buried. It only tells you whether to keep digging.&lt;/p>
&lt;/details>
&lt;/div>
&lt;hr>
&lt;h2 id="2-the-cate-framework">2. The CATE framework&lt;/h2>
&lt;h3 id="21-from-ate-to-cate">2.1 From ATE to CATE&lt;/h3>
&lt;p>The Average Treatment Effect (ATE) summarizes causal effects as a single number for the entire population. But when effects are heterogeneous &amp;mdash; varying across subgroups &amp;mdash; the ATE can mask important patterns. The &lt;strong>Conditional Average Treatment Effect (CATE)&lt;/strong> captures this heterogeneity:&lt;/p>
&lt;p>$$\tau(\mathbf{x}) = E\{y_i(1) - y_i(0) \mid \mathbf{x}_i = \mathbf{x}\}$$&lt;/p>
&lt;p>where $y_i(1)$ and $y_i(0)$ are potential outcomes under treatment and control, and $\mathbf{x}$ is a vector of characteristics that may moderate the treatment effect. If $\tau(\mathbf{x})$ is constant across all $\mathbf{x}$, we are back at the ATE. Whenever it varies, the ATE is an average of these subgroup effects weighted by how common each $\mathbf{x}$ is in the data.&lt;/p>
&lt;h3 id="22-the-partial-linear-model">2.2 The partial linear model&lt;/h3>
&lt;p>Stata 19&amp;rsquo;s &lt;code>cate&lt;/code> estimates CATEs within a partial linear framework:&lt;/p>
&lt;p>$$y = d \cdot \tau(\mathbf{x}) + g(\mathbf{x}, \mathbf{w}) + \epsilon, \qquad d = f(\mathbf{x}, \mathbf{w}) + u$$&lt;/p>
&lt;p>where $\tau(\mathbf{x})$ is the heterogeneous treatment effect function, $g(\cdot)$ and $f(\cdot)$ are flexible nuisance functions estimated by machine learning, $\mathbf{x}$ are CATE covariates (potential moderators), and $\mathbf{w}$ are additional controls.&lt;/p>
&lt;p>Think of the nuisance functions as &lt;em>background noise&lt;/em> that must be cleaned away before the treatment effect signal becomes visible. The &lt;code>cate&lt;/code> command uses &lt;strong>cross-fitting&lt;/strong> to prevent the nuisance models from overfitting: data are split into $K$ folds, and each fold&amp;rsquo;s nuisance predictions are made using models trained on the other $K-1$ folds.&lt;/p>
&lt;h3 id="23-two-estimators">2.3 Two estimators&lt;/h3>
&lt;p>Stata 19 provides two estimators for the CATE:&lt;/p>
&lt;p>&lt;strong>Partialing-Out (PO).&lt;/strong> Think of PO like cleaning two messy signals before comparing them. It residualizes both the outcome and treatment against $\mathbf{x}$ and $\mathbf{w}$, then estimates $\tau(\mathbf{x})$ from the residuals using a generalized random forest (Nie &amp;amp; Wager, 2021). PO is robust when propensity scores get close to 0 or 1.&lt;/p>
&lt;p>&lt;strong>Augmented Inverse-Probability Weighting (AIPW).&lt;/strong> AIPW is like having a backup GPS &amp;mdash; if one route fails, the other still gets you there. It constructs doubly robust scores that combine outcome modeling and propensity score weighting. Even if one model is misspecified, the estimator remains consistent (Knaus, 2022; Kennedy, 2023).&lt;/p>
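&lt;p>For intuition, the doubly robust score that AIPW averages can be sketched in its standard binary-treatment form (a textbook sketch with $\hat g$ and $\hat m$ the estimated nuisance functions from Section 1.3, not necessarily Stata&amp;rsquo;s exact internal parameterization):&lt;/p>
&lt;p>$$\psi_i = \hat g(1, \mathbf{x}_i) - \hat g(0, \mathbf{x}_i) + \frac{d_i \{y_i - \hat g(1, \mathbf{x}_i)\}}{\hat m(\mathbf{x}_i)} - \frac{(1 - d_i)\{y_i - \hat g(0, \mathbf{x}_i)\}}{1 - \hat m(\mathbf{x}_i)}$$&lt;/p>
&lt;p>If $\hat g$ is correct, the two correction terms have mean zero; if $\hat m$ is correct, the weighted residuals repair any error in $\hat g$. Either way the score identifies the effect &amp;mdash; that is the double robustness. The denominators also make visible why propensity scores near 0 or 1 destabilize AIPW.&lt;/p>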
&lt;h3 id="24-three-levels-of-treatment-effects">2.4 Three levels of treatment effects&lt;/h3>
&lt;pre>&lt;code class="language-mermaid">flowchart LR
A[&amp;quot;&amp;lt;b&amp;gt;Panel Data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;3,000 obs&amp;lt;br/&amp;gt;300 districts x 10 years&amp;quot;]:::data
A --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;cate po / aipw&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Binary pairwise&amp;lt;br/&amp;gt;comparisons&amp;quot;]:::main
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;IATEs&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Per-observation&amp;lt;br/&amp;gt;effects tau(x_i)&amp;quot;]:::iate
B --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;GATEs&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Group averages&amp;lt;br/&amp;gt;by institutions&amp;quot;]:::gate
B --&amp;gt; E[&amp;quot;&amp;lt;b&amp;gt;ATE&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Overall population&amp;lt;br/&amp;gt;average&amp;quot;]:::ate
C --&amp;gt; F[&amp;quot;categraph histogram&amp;lt;br/&amp;gt;categraph iateplot&amp;quot;]:::post
D --&amp;gt; G[&amp;quot;categraph gateplot&amp;lt;br/&amp;gt;estat gatetest&amp;quot;]:::post
E --&amp;gt; H[&amp;quot;estat heterogeneity&amp;lt;br/&amp;gt;estat ate&amp;quot;]:::post
classDef data fill:#6a9bcc,stroke:#141413,color:#fff
classDef main fill:#141413,stroke:#141413,color:#fff
classDef iate fill:#00d4c8,stroke:#141413,color:#141413
classDef gate fill:#d97757,stroke:#141413,color:#fff
classDef ate fill:#6a9bcc,stroke:#141413,color:#fff
classDef post fill:#f5f5f5,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>IATE&lt;/strong> (Individualized Average Treatment Effects): One effect per observation, $\tau(\mathbf{x}_i)$&lt;/li>
&lt;li>&lt;strong>GATE&lt;/strong> (Group Average Treatment Effects): Average effect within prespecified groups, $\tau(g) = E\{\tau(\mathbf{x}) \mid G = g\}$&lt;/li>
&lt;li>&lt;strong>ATE&lt;/strong>: Overall population average, $\text{ATE} = E\{\tau(\mathbf{x})\}$&lt;/li>
&lt;/ul>
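&lt;p>In Stata terms, all three levels come out of a single &lt;code>cate&lt;/code> fit followed by postestimation commands. The sketch below assumes the &lt;code>treat_1v0&lt;/code> indicator and the &lt;code>$catevars&lt;/code>/&lt;code>$controls&lt;/code> globals constructed in Sections 6&amp;ndash;7, and the &lt;code>exec_con&lt;/code> grouping variable from Section 3:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Sketch: one fit, three levels of effects (details in Sections 6-9)
cate aipw (ntl_log $catevars) (treat_1v0), controls($controls) ///
    group(exec_con) rseed(12345) xfolds(5)
estat ate              // ATE: one population-average number
estat gatetest         // GATE: are the group effects equal?
categraph gateplot     // GATE: effects by exec_con level
categraph histogram    // IATE: distribution of per-observation effects
&lt;/code>&lt;/pre>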
&lt;hr>
&lt;h2 id="3-data-preparation">3. Data preparation&lt;/h2>
&lt;p>We use a simulated dataset of 3,000 observations (300 districts $\times$ 10 years) across 8 fictional countries. The data mirror the structure of Hodler et al. (2023) but with known ground-truth causal effects, enabling direct validation of our estimates.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Import the simulated resource curse dataset
* GitHub: import delimited using &amp;quot;https://github.com/quarcs-lab/data-open/raw/master/stata19/sim_resource_curse.csv&amp;quot;, clear
import delimited using &amp;quot;sim_resource_curse.csv&amp;quot;, clear
* Label variables
label variable district_id &amp;quot;District ID (1-300)&amp;quot;
label variable country_id &amp;quot;Country ID (1-8)&amp;quot;
label variable year &amp;quot;Year (2003-2012)&amp;quot;
label variable treatment &amp;quot;Treatment group (0=none, 1=low, 2=med, 3=high)&amp;quot;
label variable mining &amp;quot;Mining district (binary)&amp;quot;
label variable price_index &amp;quot;Mineral price index&amp;quot;
label variable exec_constraints &amp;quot;Constraints on Executive (1-6)&amp;quot;
label variable quality_of_govt &amp;quot;Quality of Government (0.22-0.70)&amp;quot;
label variable gdp_pc &amp;quot;GDP per capita&amp;quot;
label variable elevation &amp;quot;Elevation (meters)&amp;quot;
label variable temperature &amp;quot;Mean temperature (Celsius)&amp;quot;
label variable ruggedness &amp;quot;Terrain ruggedness&amp;quot;
label variable distance_capital &amp;quot;Distance to capital (meters)&amp;quot;
label variable agri_suitability &amp;quot;Agricultural suitability (0-1)&amp;quot;
label variable population &amp;quot;Population&amp;quot;
label variable ethnic_frac &amp;quot;Ethnic fractionalization (0-1)&amp;quot;
label variable ntl_log &amp;quot;Log nighttime lights&amp;quot;
label variable conflict &amp;quot;Conflict event (binary)&amp;quot;
* Create integer version of exec_constraints for group()
gen int exec_con = round(exec_constraints)
label variable exec_con &amp;quot;Executive Constraints (integer 1-6)&amp;quot;
* Save as .dta
save &amp;quot;sim_resource_curse.dta&amp;quot;, replace
* Report dataset dimensions
describe, short
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Contains data from sim_resource_curse.dta
Observations: 3,000
Variables: 19 6 May 2026
Sorted by:
&lt;/code>&lt;/pre>
&lt;p>The dataset contains &lt;strong>3,000 observations&lt;/strong> organized as a balanced panel: 300 districts observed over 10 years (2003&amp;ndash;2012) in 8 fictional countries. The key variables are:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Type&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;code>treatment&lt;/code>&lt;/td>
&lt;td>Treatment group (0=none, 1=low, 2=med, 3=high price)&lt;/td>
&lt;td>Categorical&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>ntl_log&lt;/code>&lt;/td>
&lt;td>Log nighttime lights (development proxy)&lt;/td>
&lt;td>Continuous outcome&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>conflict&lt;/code>&lt;/td>
&lt;td>Conflict event indicator&lt;/td>
&lt;td>Binary outcome&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>exec_constraints&lt;/code>&lt;/td>
&lt;td>Constraints on executive (1&amp;ndash;6 scale)&lt;/td>
&lt;td>Institutional moderator&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>quality_of_govt&lt;/code>&lt;/td>
&lt;td>Quality of government (0.22&amp;ndash;0.70)&lt;/td>
&lt;td>Institutional moderator&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>gdp_pc&lt;/code>, &lt;code>elevation&lt;/code>, &lt;code>temperature&lt;/code>, &amp;hellip;&lt;/td>
&lt;td>Economic and geographic covariates&lt;/td>
&lt;td>Controls&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="4-descriptive-statistics">4. Descriptive statistics&lt;/h2>
&lt;pre>&lt;code class="language-stata">* Summary statistics for key variables
tabstat ntl_log conflict exec_constraints quality_of_govt gdp_pc ///
elevation temperature ruggedness distance_capital ///
agri_suitability population ethnic_frac, ///
statistics(mean sd min max) columns(statistics) format(%9.3f)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Variable | Mean SD Min Max
-------------+----------------------------------------
ntl_log | -1.096 0.435 -2.503 0.265
conflict | 0.123 0.328 0.000 1.000
exec_const~s | 3.680 1.489 1.000 6.000
quality_of~t | 0.440 0.152 0.220 0.700
gdp_pc | 2198.000 1469.937 500.000 5000.000
elevation | 499.083 302.031 0.000 1357.232
temperature | 23.913 3.920 13.993 35.000
ruggedness | 24.423 17.803 0.000 76.953
distance_c~l | 2.68e+05 1.44e+05 10813.747 4.97e+05
agri_suita~y | 0.395 0.197 0.000 0.983
population | 82028.426 85186.961 4134.682 5.97e+05
ethnic_frac | 0.550 0.202 0.201 0.899
------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* Treatment distribution
tab treatment, missing
* Mining share
count if treatment &amp;gt; 0
* Outcomes by treatment group
table treatment, statistic(mean ntl_log) statistic(mean conflict) ///
statistic(count ntl_log) nformat(%9.3f)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> Treatment |
group |
(0=none, |
1=low, |
2=med, |
3=high) | Freq. Percent Cum.
------------+-----------------------------------
0 | 2,550 85.00 85.00
1 | 150 5.00 90.00
2 | 150 5.00 95.00
3 | 150 5.00 100.00
------------+-----------------------------------
Total | 3,000 100.00
Mining share: 15.0%
--------------------------------------------------------------
| Mean
| ntl_log conflict
-----------------------------------------------+--------------------
Treatment group (0=none, 1=low, 2=med, 3=high) |
0 | -1.137 0.107
1 | -1.028 0.180
2 | -0.930 0.180
3 | -0.615 0.280
Total | -1.096 0.123
--------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Treatment imbalance.&lt;/strong> The treatment distribution is highly imbalanced: approximately &lt;strong>85% of observations&lt;/strong> are in the control group (no mining), while each treated group contains only about &lt;strong>5% of observations&lt;/strong>. This mirrors real-world mining data where few districts have active mines. Stata&amp;rsquo;s &lt;code>cate&lt;/code> handles this via honest random forests with appropriate sample-splitting.&lt;/p>
&lt;/blockquote>
&lt;p>The descriptive statistics reveal important patterns. Mean &lt;code>ntl_log&lt;/code> varies across treatment groups, but these raw differences mix the causal effect with confounding &amp;mdash; mining districts differ systematically from non-mining districts in geography, institutions, and economic conditions. The next section demonstrates this directly.&lt;/p>
&lt;hr>
&lt;h2 id="5-naive-comparison-vs-ground-truth">5. Naive comparison vs ground truth&lt;/h2>
&lt;p>Before applying any causal method, we compute raw mean differences and compare them to the known ground-truth ATEs from the data-generating process.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Naive difference-in-means (biased by confounders)
display as text _newline &amp;quot;=== Naive Difference-in-Means (biased) ===&amp;quot;
display as text &amp;quot;Comparison&amp;quot; _col(20) &amp;quot;NTL diff&amp;quot; _col(35) &amp;quot;Ground Truth&amp;quot;
display as text &amp;quot;{hline 50}&amp;quot;
* 1-0: mining vs no mining
quietly summarize ntl_log if treatment == 1
local m1 = r(mean)
quietly summarize ntl_log if treatment == 0
local m0 = r(mean)
display as result &amp;quot;1 vs 0&amp;quot; _col(20) %7.4f (`m1' - `m0') _col(35) &amp;quot;0.25&amp;quot;
* 3-1: high vs low prices
quietly summarize ntl_log if treatment == 3
local m3 = r(mean)
display as result &amp;quot;3 vs 1&amp;quot; _col(20) %7.4f (`m3' - `m1') _col(35) &amp;quot;0.30&amp;quot;
* 2-1: medium vs low prices
quietly summarize ntl_log if treatment == 2
local m2 = r(mean)
display as result &amp;quot;2 vs 1&amp;quot; _col(20) %7.4f (`m2' - `m1') _col(35) &amp;quot;0.05&amp;quot;
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">=== Naive Difference-in-Means (biased) ===
Comparison NTL diff Ground Truth
--------------------------------------------------
1 vs 0 0.1092 0.25
2 vs 0 0.2077 0.30
3 vs 0 0.5227 0.55
2 vs 1 0.0985 0.05
3 vs 1 0.4135 0.30
3 vs 2 0.3150 0.25
--------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The ground-truth ATEs for all six pairwise comparisons are:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Contrast&lt;/th>
&lt;th style="text-align:center">Ground Truth&lt;/th>
&lt;th>Interpretation&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1-0&lt;/td>
&lt;td style="text-align:center">0.25&lt;/td>
&lt;td>Mining effect at mean institutions&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2-0&lt;/td>
&lt;td style="text-align:center">0.30&lt;/td>
&lt;td>Mining + medium price premium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3-0&lt;/td>
&lt;td style="text-align:center">0.55&lt;/td>
&lt;td>Mining + high price premium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2-1&lt;/td>
&lt;td style="text-align:center">0.05&lt;/td>
&lt;td>Medium price premium (small)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3-1&lt;/td>
&lt;td style="text-align:center">0.30&lt;/td>
&lt;td>High price premium (large)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3-2&lt;/td>
&lt;td style="text-align:center">0.25&lt;/td>
&lt;td>High vs medium step&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The naive comparisons are &lt;strong>biased&lt;/strong> because mining districts differ systematically from non-mining districts in geography, institutions, and economic conditions. Some confounders push the raw difference above the truth, others below. This motivates the use of causal machine learning methods that adjust for these confounders.&lt;/p>
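&lt;p>The direction of the bias follows from the standard decomposition of a raw mean difference (written for a binary contrast, with $d$ the treatment indicator):&lt;/p>
&lt;p>$$E[y \mid d = 1] - E[y \mid d = 0] = \underbrace{E[y(1) - y(0) \mid d = 1]}_{\text{effect on the treated}} + \underbrace{E[y(0) \mid d = 1] - E[y(0) \mid d = 0]}_{\text{selection bias}}$$&lt;/p>
&lt;p>The selection term is whatever baseline difference already separates mining from non-mining districts. It can be positive or negative, which is why some naive contrasts above overshoot the truth while others undershoot it.&lt;/p>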
&lt;hr>
&lt;h2 id="6-estimation-strategy">6. Estimation strategy&lt;/h2>
&lt;h3 id="61-binary-pairwise-comparisons">6.1 Binary pairwise comparisons&lt;/h3>
&lt;p>Stata 19&amp;rsquo;s &lt;code>cate&lt;/code> command requires a &lt;strong>binary treatment variable&lt;/strong>. Since our treatment has 4 levels (0, 1, 2, 3), we run separate estimations for each pairwise comparison, subsetting the data to the two relevant groups each time. This yields 6 binary comparisons that map directly to the three key findings:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
T[&amp;quot;&amp;lt;b&amp;gt;4-Level Treatment&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;0: No mining&amp;lt;br/&amp;gt;1: Low price&amp;lt;br/&amp;gt;2: Medium price&amp;lt;br/&amp;gt;3: High price&amp;quot;]:::data
subgraph F1[&amp;quot;Finding 1: Mining Effect&amp;quot;]
C10[&amp;quot;1 vs 0&amp;quot;]:::f1
C20[&amp;quot;2 vs 0&amp;quot;]:::f1
C30[&amp;quot;3 vs 0&amp;quot;]:::f1
end
subgraph F2[&amp;quot;Finding 2: Price Non-linearity&amp;quot;]
C21[&amp;quot;2 vs 1&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;small ~ 0.05&amp;lt;/i&amp;gt;&amp;quot;]:::f2
C31[&amp;quot;3 vs 1&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;large ~ 0.30&amp;lt;/i&amp;gt;&amp;quot;]:::f2
C32[&amp;quot;3 vs 2&amp;quot;]:::f2
end
T --&amp;gt; F1
T --&amp;gt; F2
classDef data fill:#6a9bcc,stroke:#141413,color:#fff
classDef f1 fill:#00d4c8,stroke:#141413,color:#141413
classDef f2 fill:#d97757,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Contrast&lt;/th>
&lt;th>Comparison&lt;/th>
&lt;th>Finding&lt;/th>
&lt;th style="text-align:center">Ground Truth&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1-0&lt;/td>
&lt;td>Mining (any price) vs No mining&lt;/td>
&lt;td>Finding 1&lt;/td>
&lt;td style="text-align:center">0.25&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2-0&lt;/td>
&lt;td>Mining (medium price) vs No mining&lt;/td>
&lt;td>Finding 1&lt;/td>
&lt;td style="text-align:center">0.30&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3-0&lt;/td>
&lt;td>Mining (high price) vs No mining&lt;/td>
&lt;td>Finding 1&lt;/td>
&lt;td style="text-align:center">0.55&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2-1&lt;/td>
&lt;td>Medium vs Low prices (within mining)&lt;/td>
&lt;td>Finding 2 (small)&lt;/td>
&lt;td style="text-align:center">0.05&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3-1&lt;/td>
&lt;td>High vs Low prices (within mining)&lt;/td>
&lt;td>Finding 2 (large)&lt;/td>
&lt;td style="text-align:center">0.30&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3-2&lt;/td>
&lt;td>High vs Medium prices (within mining)&lt;/td>
&lt;td>Finding 2&lt;/td>
&lt;td style="text-align:center">0.25&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
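&lt;p>Each row of the table above corresponds to one &lt;code>preserve&lt;/code>/&lt;code>keep&lt;/code>/&lt;code>cate&lt;/code> block of the kind shown in Section 7. A hedged sketch of the pattern as a loop (assuming the &lt;code>$catevars&lt;/code> and &lt;code>$controls&lt;/code> globals defined in Section 6.2 below; the companion do-file may instead spell the blocks out one by one):&lt;/p>
&lt;pre>&lt;code class="language-stata">* Sketch: loop over the six pairwise contrasts
foreach pair in &amp;quot;1 0&amp;quot; &amp;quot;2 0&amp;quot; &amp;quot;3 0&amp;quot; &amp;quot;2 1&amp;quot; &amp;quot;3 1&amp;quot; &amp;quot;3 2&amp;quot; {
    tokenize `pair'
    local hi `1'
    local lo `2'
    preserve
    keep if inlist(treatment, `lo', `hi')
    gen byte treat_`hi'v`lo' = (treatment == `hi')
    cate aipw (ntl_log $catevars) (treat_`hi'v`lo'), ///
        controls($controls) rseed(12345) xfolds(5)
    estimates store aipw_ntl_`hi'v`lo'
    restore
}
&lt;/code>&lt;/pre>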
&lt;h3 id="62-variable-specification">6.2 Variable specification&lt;/h3>
&lt;p>We separate variables into two groups following the &lt;code>cate&lt;/code> framework:&lt;/p>
&lt;pre>&lt;code class="language-stata">* CATE variables (x): potential drivers of treatment-effect heterogeneity
global catevars exec_constraints quality_of_govt gdp_pc ///
elevation temperature ruggedness distance_capital ///
agri_suitability population ethnic_frac
* Controls (w): nuisance variables for background adjustment only
global controls i.country_id i.year
&lt;/code>&lt;/pre>
&lt;p>The &lt;strong>catevarlist&lt;/strong> ($\mathbf{x}$) contains the 10 covariates that may drive heterogeneity &amp;mdash; institutional, economic, and geographic variables. The &lt;strong>controls&lt;/strong> ($\mathbf{w}$) contain country and year fixed effects to absorb panel-level confounding without overcomplicating the CATE function.&lt;/p>
&lt;p>We use &lt;code>xfolds(5)&lt;/code> rather than the default 10 to ensure adequate sample sizes per fold, especially for the within-mining comparisons. With &lt;code>rseed(12345)&lt;/code>, all results are reproducible.&lt;/p>
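&lt;p>To see the partialing-out logic of Section 2.3 in miniature, here is an illustrative sketch that swaps the random forests for plain linear regressions (Frisch&amp;ndash;Waugh&amp;ndash;Lovell style; &lt;code>cate&lt;/code> does all of this internally with ML nuisance models and cross-fitting, so this block is for intuition only):&lt;/p>
&lt;pre>&lt;code class="language-stata">* Illustration: partialing out with linear nuisance models
* (cate replaces these regressions with cross-fit random forests)
preserve
keep if inlist(treatment, 0, 1)
gen byte d = (treatment == 1)
* Residualize the outcome and the treatment against x and w
quietly regress ntl_log $catevars $controls
predict double y_res, residuals
quietly regress d $catevars $controls
predict double d_res, residuals
* Residual-on-residual slope: a linear-model analogue of the ATE
regress y_res d_res
restore
&lt;/code>&lt;/pre>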
&lt;blockquote>
&lt;p>&lt;strong>Small sample warning.&lt;/strong> The mining-vs-no-mining comparisons (1-0, 2-0, 3-0) use approximately 2,700 observations &amp;mdash; adequate for the causal forest. However, the &lt;strong>within-mining&lt;/strong> price comparisons (2-1, 3-1, 3-2) use only about 300 observations (two treated groups of ~150 each). With &lt;code>xfolds(5)&lt;/code>, each fold has only ~60 observations. Expect wider confidence intervals for these comparisons.&lt;/p>
&lt;/blockquote>
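&lt;p>To see what the cross-fitting behind &lt;code>xfolds()&lt;/code> does, here is a minimal Python sketch of the partialing-out logic on synthetic data (all numbers invented; simple OLS stands in for the random-forest nuisance models). Each observation&amp;rsquo;s nuisance predictions come from models fit on the &lt;em>other&lt;/em> folds, and the ATE is the regression of outcome residuals on treatment residuals.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(12345)
n, tau = 5000, 0.25                      # true ATE (mirrors the 1-0 ground truth)
w = rng.normal(size=(n, 3))              # confounders (the controls/catevars role)
t = (w @ [0.5, -0.3, 0.2] + rng.normal(size=n) > 0).astype(float)
y = tau * t + w @ [1.0, 0.5, -0.5] + rng.normal(size=n)

idx = rng.permutation(n)
folds = np.array_split(idx, 5)           # the xfolds(5) analogue

def oof_predict(X, target):
    # Out-of-fold predictions: each fold is scored by an OLS model fit
    # on the remaining folds -- the cross-fitting step of DML.
    pred = np.empty_like(target)
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        Xtr = np.column_stack([np.ones(train.size), X[train]])
        beta, *_ = np.linalg.lstsq(Xtr, target[train], rcond=None)
        pred[fold] = np.column_stack([np.ones(fold.size), X[fold]]) @ beta
    return pred

# Partialing out: regress the outcome residual on the treatment residual
y_res = y - oof_predict(w, y)
t_res = t - oof_predict(w, t)
ate = (t_res @ y_res) / (t_res @ t_res)
print(f'cross-fit ATE = {ate:.3f}')      # close to the true 0.25
```

&lt;p>Because the model that predicts an observation never saw that observation, overfitting in the nuisance models cannot leak into the residuals; this is the property that makes DML estimates valid.&lt;/p>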
&lt;hr>
&lt;h2 id="7-average-treatment-effects">7. Average treatment effects&lt;/h2>
&lt;p>We estimate ATEs for all 6 NTL contrasts and key conflict contrasts. For the two most important comparisons (NTL 1-0 and NTL 3-1), we show both PO and AIPW estimators. Remaining comparisons use AIPW only.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Runtime.&lt;/strong> Each &lt;code>cate&lt;/code> estimation takes approximately 60&amp;ndash;90 seconds with 5-fold cross-fitting. The full section runs in approximately 15&amp;ndash;20 minutes.&lt;/p>
&lt;/blockquote>
&lt;h3 id="71-ntl-mining-effect-1-0-----po-vs-aipw">7.1 NTL: Mining effect (1-0) &amp;mdash; PO vs AIPW&lt;/h3>
&lt;p>This is the most important contrast: does mining increase nighttime lights? The ground truth is 0.25.&lt;/p>
&lt;pre>&lt;code class="language-stata">* --- PO estimator ---
preserve
keep if treatment == 1 | treatment == 0
gen byte treat_1v0 = (treatment == 1)
cate po (ntl_log $catevars) (treat_1v0), ///
controls($controls) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
estimates store po_ntl_1v0
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 2,700
Estimator: Partialing out Number of folds in cross-fit = 5
Outcome model: Random forest Number of outcome controls = 28
Treatment model: Random forest Number of treatment controls = 28
CATE model: Random forest Number of CATE variables = 10
------------------------------------------------------------------------------
| Robust
ntl_log | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_1v0 |
(Mining ..) |
vs |
No mining) | .1936814 .0097428 19.88 0.000 .1745858 .212777
-------------+----------------------------------------------------------------
POmean |
treat_1v0 |
No mining | -1.142413 .0079236 -144.18 0.000 -1.157943 -1.126883
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* --- AIPW estimator ---
preserve
keep if treatment == 1 | treatment == 0
gen byte treat_1v0 = (treatment == 1)
cate aipw (ntl_log $catevars) (treat_1v0), ///
controls($controls) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
estimates store aipw_ntl_1v0
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 2,700
Estimator: Augmented IPW Number of folds in cross-fit = 5
Outcome model: Random forest Number of outcome controls = 28
Treatment model: Random forest Number of treatment controls = 28
CATE model: Random forest Number of CATE variables = 10
------------------------------------------------------------------------------
| Robust
ntl_log | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_1v0 |
(Mining ..) |
vs |
No mining) | .1489842 .0105686 14.10 0.000 .1282701 .1696983
-------------+----------------------------------------------------------------
POmean |
treat_1v0 |
No mining | -1.142416 .0079187 -144.27 0.000 -1.157936 -1.126896
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The PO ATE is &lt;strong>0.194&lt;/strong> (SE = 0.010) and the AIPW ATE is &lt;strong>0.149&lt;/strong> (SE = 0.011) — both positive and significant, confirming that mining increases nighttime lights. The estimates differ somewhat from each other and from the ground truth (0.25), which is expected with 5-fold cross-fitting on a moderately sized sample. The PO estimate is closer to the truth here, while AIPW is more conservative. Both confirm the directional finding.&lt;/p>
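&lt;p>The gap between the PO and AIPW point estimates comes from the different orthogonal scores they average. As an illustration, the following Python sketch computes the AIPW (doubly robust) score on synthetic data where the nuisance functions are known exactly; &lt;code>cate aipw&lt;/code> instead estimates them with machine learning and cross-fitting:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(12345)
n, tau = 5000, 0.25                      # true ATE, as in the 1-0 contrast
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))                 # true propensity score
t = rng.binomial(1, p)
y = tau * t + x + rng.normal(size=n)

# AIPW score: outcome-model contrast plus an inverse-propensity
# correction for each arm. Averaging it gives a doubly robust ATE.
mu1, mu0 = tau + x, x                    # true E[y|t=1,x] and E[y|t=0,x]
psi = (mu1 - mu0
       + t * (y - mu1) / p
       - (1 - t) * (y - mu0) / (1 - p))
ate = psi.mean()
se = psi.std(ddof=1) / np.sqrt(n)
print(f'AIPW ATE = {ate:.3f} (SE = {se:.3f})')
```

&lt;p>The score is doubly robust: it stays consistent if either the outcome models or the propensity model is correct, which is why AIPW is often preferred when both nuisances are estimated by ML.&lt;/p>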
&lt;h3 id="72-ntl-high-vs-low-prices-3-1-----po-vs-aipw">7.2 NTL: High vs low prices (3-1) &amp;mdash; PO vs AIPW&lt;/h3>
&lt;p>The price effect comparison tests Finding 2. The ground truth is 0.30 &amp;mdash; a large jump from low to high prices. Note that this comparison uses only mining districts (~300 observations), so estimates will be noisier.&lt;/p>
&lt;pre>&lt;code class="language-stata">* --- PO estimator ---
preserve
keep if treatment == 3 | treatment == 1
gen byte treat_3v1 = (treatment == 3)
display as text &amp;quot;N = &amp;quot; _N &amp;quot; observations (mining districts only)&amp;quot;
cate po (ntl_log $catevars) (treat_3v1), ///
controls($controls) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
estimates store po_ntl_3v1
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">N = 300 observations (mining districts only)
Conditional average treatment effects Number of observations = 300
Estimator: Partialing out Number of folds in cross-fit = 5
Outcome model: Random forest Number of outcome controls = 28
Treatment model: Random forest Number of treatment controls = 28
CATE model: Random forest Number of CATE variables = 10
------------------------------------------------------------------------------
| Robust
ntl_log | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_3v1 |
(High price |
vs |
Low price) | .5945629 .0313138 18.99 0.000 .5331891 .6559368
-------------+----------------------------------------------------------------
POmean |
treat_3v1 |
Low price | -1.12839 .0280085 -40.29 0.000 -1.183285 -1.073494
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* --- AIPW estimator ---
preserve
keep if treatment == 3 | treatment == 1
gen byte treat_3v1 = (treatment == 3)
cate aipw (ntl_log $catevars) (treat_3v1), ///
controls($controls) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
estimates store aipw_ntl_3v1
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 300
Estimator: Augmented IPW Number of folds in cross-fit = 5
Outcome model: Random forest Number of outcome controls = 28
Treatment model: Random forest Number of treatment controls = 28
CATE model: Random forest Number of CATE variables = 10
------------------------------------------------------------------------------
| Robust
ntl_log | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_3v1 |
(High price |
vs |
Low price) | .4052631 .0254935 15.90 0.000 .3552968 .4552293
-------------+----------------------------------------------------------------
POmean |
treat_3v1 |
Low price | -1.029871 .0240718 -42.78 0.000 -1.077051 -.9826917
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The PO ATE is &lt;strong>0.595&lt;/strong> (SE = 0.031) and the AIPW ATE is &lt;strong>0.405&lt;/strong> (SE = 0.025) — both large and highly significant, confirming that the price premium from low to high is substantial (ground truth = 0.30). With only 300 observations and 5-fold cross-fitting, estimates are noisier than the 1-0 contrast, and both overshoot the ground truth, but the directional finding is robust.&lt;/p>
&lt;h3 id="73-ntl-remaining-comparisons-aipw-only">7.3 NTL: Remaining comparisons (AIPW only)&lt;/h3>
&lt;p>For the remaining four NTL contrasts we use AIPW with its default lasso nuisance models, which run much faster than random forests.&lt;/p>
&lt;pre>&lt;code class="language-stata">* --- NTL: 2 vs 0 (medium mining vs no mining) ---
preserve
keep if treatment == 2 | treatment == 0
gen byte treat_2v0 = (treatment == 2)
cate aipw (ntl_log $catevars) (treat_2v0), ///
controls($controls) rseed(12345) xfolds(5)
estimates store aipw_ntl_2v0
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 2,700
Estimator: Augmented IPW Number of folds in cross-fit = 5
Outcome model: Linear lasso Number of outcome controls = 28
Treatment model: Logit lasso Number of treatment controls = 28
CATE model: Random forest Number of CATE variables = 10
------------------------------------------------------------------------------
| Robust
ntl_log | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_2v0 |
(1 vs 0) | .2891968 .0250557 11.54 0.000 .2400886 .3383049
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* --- NTL: 3 vs 0 (high mining vs no mining) ---
preserve
keep if treatment == 3 | treatment == 0
gen byte treat_3v0 = (treatment == 3)
cate aipw (ntl_log $catevars) (treat_3v0), ///
controls($controls) rseed(12345) xfolds(5)
estimates store aipw_ntl_3v0
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 2,700
Estimator: Augmented IPW Number of folds in cross-fit = 5
Outcome model: Linear lasso Number of outcome controls = 28
Treatment model: Logit lasso Number of treatment controls = 28
CATE model: Random forest Number of CATE variables = 10
------------------------------------------------------------------------------
| Robust
ntl_log | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_3v0 |
(1 vs 0) | .6111885 .0250606 24.39 0.000 .5620707 .6603063
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* --- NTL: 2 vs 1 (medium vs low prices, within mining) ---
preserve
keep if treatment == 2 | treatment == 1
gen byte treat_2v1 = (treatment == 2)
cate aipw (ntl_log $catevars) (treat_2v1), ///
controls($controls) rseed(12345) xfolds(5)
estimates store aipw_ntl_2v1
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 300
Estimator: Augmented IPW Number of folds in cross-fit = 5
Outcome model: Linear lasso Number of outcome controls = 28
Treatment model: Logit lasso Number of treatment controls = 28
CATE model: Random forest Number of CATE variables = 10
------------------------------------------------------------------------------
| Robust
ntl_log | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_2v1 |
(1 vs 0) | -.0112177 .0883033 -0.13 0.899 -.1842889 .1618535
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* --- NTL: 3 vs 2 (high vs medium prices, within mining) ---
* Note: AIPW fails on this tiny subsample (N=300) due to propensity
* score overlap violations. PO with relaxed tolerance handles this,
* but the estimate is unreliable due to the extremely small sample.
preserve
keep if treatment == 3 | treatment == 2
gen byte treat_3v2 = (treatment == 3)
cate po (ntl_log $catevars) (treat_3v2), ///
controls($controls) rseed(12345) xfolds(5) ///
pstolerance(1e-8)
estimates store aipw_ntl_3v2
restore
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Overlap failure.&lt;/strong> The 3-2 comparison (high vs medium prices) has only 300 observations split roughly evenly between two treated groups. The AIPW estimator fails entirely due to propensity scores near zero. Even the PO estimator with &lt;code>pstolerance(1e-8)&lt;/code> &amp;mdash; which relaxes the minimum acceptable propensity score from the default 1e-5 to 1e-8 &amp;mdash; produces an unreliable ATE of &amp;ndash;43,825 (SE = 43,752, p = 0.317). This comparison is excluded from the summary table below. The remaining five contrasts are well-identified.&lt;/p>
&lt;/blockquote>
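&lt;p>The mechanics of the failure are easy to see in a toy Python example (numbers invented): the AIPW correction divides by the propensity score, so scores near zero produce explosive weights, and a &lt;code>pstolerance()&lt;/code>-style floor screens out the offending observations.&lt;/p>

```python
import numpy as np

# IPW-style weights blow up as the propensity score approaches zero.
p = np.array([0.5, 0.1, 1e-3, 1e-7])      # hypothetical propensity scores
weights = 1 / p
print(weights)                             # the 1e-7 unit would dominate

# A pstolerance-style floor excludes the offending observations.
tolerance = 1e-5                           # cate's default minimum
print(p >= tolerance)                      # last unit fails the check
```

&lt;p>Relaxing the floor (as &lt;code>pstolerance(1e-8)&lt;/code> does above) lets the estimation run, but the surviving extreme weights are exactly why the resulting ATE is unusable.&lt;/p>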
&lt;h3 id="74-conflict-mining-effect-1-0-----po-vs-aipw">7.4 Conflict: Mining effect (1-0) &amp;mdash; PO vs AIPW&lt;/h3>
&lt;p>Does mining increase conflict? We show both estimators for the key contrast. Unlike NTL, the conflict ground truths are not specified in the DGP, so we interpret directionally.&lt;/p>
&lt;pre>&lt;code class="language-stata">* --- Conflict: 1 vs 0 (PO estimator) ---
preserve
keep if treatment == 1 | treatment == 0
gen byte treat_1v0 = (treatment == 1)
cate po (conflict $catevars) (treat_1v0), ///
controls($controls) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
estimates store po_conf_1v0
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 2,700
Estimator: Partialing out Number of folds in cross-fit = 5
Outcome model: Random forest Number of outcome controls = 28
Treatment model: Random forest Number of treatment controls = 28
CATE model: Random forest Number of CATE variables = 10
------------------------------------------------------------------------------
| Robust
conflict | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_1v0 |
(1 vs 0) | .0630853 .0130031 4.85 0.000 .0375997 .0885709
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">* --- Conflict: 1 vs 0 (AIPW estimator) ---
preserve
keep if treatment == 1 | treatment == 0
gen byte treat_1v0 = (treatment == 1)
cate aipw (conflict $catevars) (treat_1v0), ///
controls($controls) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
estimates store aipw_conf_1v0
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 2,700
Estimator: Augmented IPW Number of folds in cross-fit = 5
Outcome model: Random forest Number of outcome controls = 28
Treatment model: Random forest Number of treatment controls = 28
CATE model: Random forest Number of CATE variables = 10
------------------------------------------------------------------------------
| Robust
conflict | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_1v0 |
(1 vs 0) | .0659767 .0122036 5.41 0.000 .042058 .0898954
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Both estimators produce positive and significant ATEs (PO = 0.063, AIPW = 0.066, both p &amp;lt; 0.001), confirming Finding 1: mining increases both nighttime lights &lt;em>and&lt;/em> conflict. The baseline conflict probability for non-mining districts is approximately 10.7%, and mining increases it by about 6.5 percentage points.&lt;/p>
&lt;h3 id="75-conflict-remaining-comparisons-aipw-only">7.5 Conflict: Remaining comparisons (AIPW only)&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Loop over remaining conflict comparisons
local comparisons &amp;quot;2_0 3_0 2_1 3_1 3_2&amp;quot;
foreach comp of local comparisons {
local t_hi = substr(&amp;quot;`comp'&amp;quot;, 1, 1)
local t_lo = substr(&amp;quot;`comp'&amp;quot;, 3, 1)
preserve
keep if treatment == `t_hi' | treatment == `t_lo'
gen byte treat_bin = (treatment == `t_hi')
quietly cate aipw (conflict $catevars) (treat_bin), ///
controls($controls) rseed(12345) xfolds(5)
matrix b = e(b)
matrix V = e(V)
display as result &amp;quot;Conflict `t_hi' vs `t_lo': ATE = &amp;quot; %7.4f b[1,1] ///
&amp;quot; SE = &amp;quot; %7.4f sqrt(V[1,1])
estimates store aipw_conf_`t_hi'v`t_lo'
restore
}
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">=== Conflict: Treatment 2 vs 0 ===
N = 2700
ATE = 0.0728 SE = 0.0330
=== Conflict: Treatment 3 vs 0 ===
N = 2700
ATE = 0.1586 SE = 0.0380
=== Conflict: Treatment 2 vs 1 ===
N = 300
ATE = -0.0677 SE = 0.0497
=== Conflict: Treatment 3 vs 1 ===
N = 300
ATE = 0.1126 SE = 0.0293
=== Conflict: Treatment 3 vs 2 ===
N = 300
ATE = 3.5e+04 SE = 3.5e+04 (overlap failure -- unreliable)
&lt;/code>&lt;/pre>
&lt;h3 id="76-ate-summary">7.6 ATE summary&lt;/h3>
&lt;pre>&lt;code class="language-stata">* Compile NTL AIPW ATEs into a comparison table
display as text &amp;quot;{hline 70}&amp;quot;
display as text &amp;quot;SUMMARY: Average Treatment Effects (NTL Outcome)&amp;quot;
display as text &amp;quot;{hline 70}&amp;quot;
display as text &amp;quot;Contrast&amp;quot; _col(15) &amp;quot;AIPW ATE&amp;quot; _col(30) &amp;quot;SE&amp;quot; _col(42) &amp;quot;Ground Truth&amp;quot;
display as text &amp;quot;{hline 70}&amp;quot;
local comps &amp;quot;1v0 2v0 3v0 2v1 3v1 3v2&amp;quot;
local gts &amp;quot;0.25 0.30 0.55 0.05 0.30 0.25&amp;quot;
local i = 1
foreach comp of local comps {
local gt : word `i' of `gts'
if &amp;quot;`comp'&amp;quot; == &amp;quot;3v2&amp;quot; {
display as result &amp;quot;`comp'&amp;quot; _col(15) &amp;quot;(overlap failure)&amp;quot; _col(42) &amp;quot;`gt'&amp;quot;
local ++i
continue
}
quietly estimates restore aipw_ntl_`comp'
matrix b = e(b)
matrix V = e(V)
local ate = b[1,1]
local se = sqrt(V[1,1])
display as result &amp;quot;`comp'&amp;quot; _col(15) %7.4f `ate' _col(30) %7.4f `se' _col(42) &amp;quot;`gt'&amp;quot;
local ++i
}
display as text &amp;quot;{hline 70}&amp;quot;
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">----------------------------------------------------------------------
SUMMARY: Average Treatment Effects (NTL Outcome)
----------------------------------------------------------------------
Contrast      AIPW ATE      SE          Ground Truth
----------------------------------------------------------------------
1v0 0.1490 0.0106 0.25
2v0 0.2892 0.0251 0.30
3v0 0.6112 0.0251 0.55
2v1 -0.0112 0.0883 0.05
3v1 0.4053 0.0255 0.30
3v2 (overlap failure) 0.25
----------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>Several estimates deviate from the ground truth (e.g., 1v0 = 0.149 vs truth 0.25; 3v1 = 0.405 vs truth 0.30). These deviations reflect finite-sample variability, the particular random seed, and the difficulty of estimating effects with only 150 treated observations per group. The directional patterns are robust: all mining effects are positive, and the price non-linearity is clear. With larger samples or different seeds, estimates should land closer to the DGP values.&lt;/p>
&lt;p>Two findings emerge from the ATE summary:&lt;/p>
&lt;p>&lt;strong>Finding 1: Mining increases nighttime lights.&lt;/strong> All three mining-vs-no-mining comparisons (1-0, 2-0, 3-0) show positive and significant ATEs (0.149, 0.289, 0.611), with magnitudes increasing as the mineral price level rises. The 3-0 contrast (high-price mining vs no mining) is the largest at 0.611 &amp;mdash; the combined effect of mining itself plus the high price premium.&lt;/p>
&lt;p>&lt;strong>Finding 2: Price effects are non-linear.&lt;/strong> The within-mining price contrasts confirm non-linearity: 2-1 (medium vs low prices) is essentially zero (&amp;ndash;0.011, p = 0.90), while 3-1 (high vs low prices) is large and significant (0.405, p &amp;lt; 0.001). Price effects are &lt;em>not&lt;/em> a smooth dose-response &amp;mdash; they &amp;ldquo;jump&amp;rdquo; sharply only at high prices. The step from low to medium prices does nothing; the step from low to high prices does a lot.&lt;/p>
&lt;hr>
&lt;h2 id="8-treatment-effect-heterogeneity-gates">8. Treatment effect heterogeneity (GATEs)&lt;/h2>
&lt;p>The key innovation of causal machine learning is detecting &lt;em>how&lt;/em> treatment effects vary across subgroups. We compute &lt;strong>GATEs (Group Average Treatment Effects)&lt;/strong> by institutional variables to test Finding 3: institutions moderate mining effects but NOT price effects.&lt;/p>
&lt;h3 id="81-gates-by-executive-constraints-mining-effect-1-0">8.1 GATEs by executive constraints: Mining effect (1-0)&lt;/h3>
&lt;pre>&lt;code class="language-stata">preserve
keep if treatment == 1 | treatment == 0
gen byte treat_1v0 = (treatment == 1)
cate aipw (ntl_log $catevars) (treat_1v0), ///
controls($controls) ///
group(exec_con) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
categraph gateplot
estat gatetest
estimates store gate_ntl_1v0_exec
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 2,700
Estimator: Augmented IPW Number of folds in cross-fit = 5
------------------------------------------------------------------------------
| Robust
ntl_log | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
GATE |
exec_con |
1 | .2748407 .0403765 6.81 0.000 .1957042 .3539772
2 | .3155337 .0204714 15.41 0.000 .2754106 .3556569
3 | .1674459 .020837 8.04 0.000 .1266061 .2082857
4 | .1131603 .0263687 4.29 0.000 .0614785 .164842
5 | .0998745 .0296118 3.37 0.001 .0418364 .1579127
6 | .0508165 .0220009 2.31 0.021 .0076955 .0939374
-------------+----------------------------------------------------------------
ATE |
treat_1v0 |
(1 vs 0) | .1517508 .0111077 13.66 0.000 .12998 .1735216
------------------------------------------------------------------------------
Group treatment-effects heterogeneity test
H0: Group average treatment effects are homogeneous
chi2(5) = 96.90
Prob &amp;gt; chi2 = 0.0000
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate2_gate_ntl_1v0_exec.png" alt="GATEs for NTL mining effect (1-0) by Executive Constraints.">&lt;/p>
&lt;p>The GATE plot reveals a &lt;strong>downward slope&lt;/strong>: districts with &lt;em>weaker&lt;/em> executive constraints (lower values on the x-axis) experience larger mining effects on nighttime lights (GATE = 0.275 at exec_con = 1 vs 0.051 at exec_con = 6). The &lt;code>estat gatetest&lt;/code> strongly rejects GATE equality (&lt;strong>chi2(5) = 96.90, p &amp;lt; 0.0001&lt;/strong>), confirming that institutional quality moderates mining effects.&lt;/p>
&lt;p>This pattern &amp;mdash; weaker institutions, larger mining benefit &amp;mdash; differs from the sign that Hodler et al. (2023) found in real Sub-Saharan African data. In the full paper, stronger institutions &lt;em>amplified&lt;/em> the development benefits of mining. In our simulated data, the DGP produces the opposite sign: mining has a larger positive effect on NTL in weakly-governed districts, perhaps because these districts start from a lower baseline and have more room for growth when mining begins. The key takeaway is that &lt;strong>institutional moderation exists&lt;/strong> (the GATEs are clearly heterogeneous), even though the direction differs from the full paper&amp;rsquo;s parametrization.&lt;/p>
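&lt;p>Mechanically, a GATE is just the average of the individual effect predictions within a group, and the homogeneity test is a Wald test on the group contrasts. The Python sketch below uses invented IATE draws that mimic the downward pattern, and applies a simplified base-group version of the test; &lt;code>estat gatetest&lt;/code> uses the joint covariance of all groups instead:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(12345)
# Invented IATE draws with a downward-sloping pattern over a 6-level
# institutions score, mimicking the exec_con plot (values are made up).
true_gate = {1: 0.27, 2: 0.32, 3: 0.17, 4: 0.11, 5: 0.10, 6: 0.05}
groups = rng.integers(1, 7, size=2700)
iate = np.array([true_gate[int(g)] for g in groups]) + rng.normal(0, 0.3, size=2700)

# GATE = mean IATE within each group, with a simple standard error
gates = {g: iate[groups == g].mean() for g in sorted(true_gate)}
ses = {g: iate[groups == g].std(ddof=1) / np.sqrt((groups == g).sum())
       for g in sorted(true_gate)}
for g in sorted(gates):
    print(f'GATE[{g}] = {gates[g]:.3f} (SE = {ses[g]:.3f})')

# Wald-style statistic: squared standardized contrasts against group 1
# (a simplification of the joint test behind estat gatetest)
stat = sum(((gates[g] - gates[1]) / np.hypot(ses[g], ses[1])) ** 2
           for g in range(2, 7))
print(f'chi2-style stat = {stat:.1f} (5% critical value for chi2(5) is 11.07)')
```

&lt;p>A statistic far above the critical value, as here, is the signature of genuine GATE heterogeneity rather than sampling noise.&lt;/p>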
&lt;h3 id="82-gates-by-executive-constraints-price-effect-3-1">8.2 GATEs by executive constraints: Price effect (3-1)&lt;/h3>
&lt;pre>&lt;code class="language-stata">preserve
keep if treatment == 3 | treatment == 1
gen byte treat_3v1 = (treatment == 3)
cate aipw (ntl_log $catevars) (treat_3v1), ///
controls($controls) ///
group(exec_con) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
categraph gateplot
estat gatetest
estimates store gate_ntl_3v1_exec
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 300
Estimator: Augmented IPW Number of folds in cross-fit = 5
------------------------------------------------------------------------------
| Robust
ntl_log | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
GATE |
exec_con |
1 | .3211384 .0790347 4.06 0.000 .1662332 .4760437
2 | .2868582 .0726244 3.95 0.000 .1445171 .4291993
3 | .3729897 .0413647 9.02 0.000 .2919164 .4540629
4 | .5891193 .0542141 10.87 0.000 .4828616 .6953771
5 | .4870458 .0590345 8.25 0.000 .3713403 .6027514
6 | .3400699 .0596378 5.70 0.000 .2231819 .4569579
-------------+----------------------------------------------------------------
ATE |
treat_3v1 |
(1 vs 0) | .4062996 .0252193 16.11 0.000 .3568707 .4557284
------------------------------------------------------------------------------
Group treatment-effects heterogeneity test
H0: Group average treatment effects are homogeneous
chi2(5) = 18.92
Prob &amp;gt; chi2 = 0.0020
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate2_gate_ntl_3v1_exec.png" alt="GATEs for NTL price effect (3-1) by Executive Constraints.">&lt;/p>
&lt;p>The GATE plot for the price effect (3-1) shows a &lt;strong>non-monotone pattern&lt;/strong>: GATEs range from 0.29 to 0.59 across executive constraint levels with no clear directional trend. While the &lt;code>estat gatetest&lt;/code> rejects equality (chi2(5) = 18.92, p = 0.002), the pattern lacks the clear monotone slope seen in the mining effect. The price effect is positive everywhere and does not systematically vary with institutional quality in the same way.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>The key contrast (Finding 3).&lt;/strong> Compare the two GATE plots above. Mining effect (1-0): clear &lt;strong>monotone downward slope&lt;/strong> &amp;mdash; a strong, systematic relationship between institutional quality and the mining effect (chi2 = 96.90). Price effect (3-1): &lt;strong>no monotone pattern&lt;/strong> &amp;mdash; while some variation exists, there is no clear directional relationship between institutions and the price premium. This asymmetry supports the paper&amp;rsquo;s core insight: &lt;strong>institutional quality systematically moderates mining effects, but does not systematically shape how global commodity price shocks affect local economic activity.&lt;/strong>&lt;/p>
&lt;/blockquote>
&lt;h3 id="83-gates-by-quality-of-government-mining-effect-1-0">8.3 GATEs by quality of government: Mining effect (1-0)&lt;/h3>
&lt;p>We repeat the analysis using an alternative institutional measure &amp;mdash; quality of government &amp;mdash; discretized into quartiles.&lt;/p>
&lt;pre>&lt;code class="language-stata">preserve
keep if treatment == 1 | treatment == 0
gen byte treat_1v0 = (treatment == 1)
egen qog_cat = cut(quality_of_govt), group(4) label
cate aipw (ntl_log $catevars) (treat_1v0), ///
controls($controls) ///
group(qog_cat) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
categraph gateplot
estat gatetest
estimates store gate_ntl_1v0_qog
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">GATE |
qog_cat |
.22- | .2978846 .0215258 13.84 0.000 .2556947 .3400745
.32- | .168479 .0205681 8.19 0.000 .1281663 .2087917
.42- | .1080724 .0242792 4.45 0.000 .060486 .1556589
.58- | .0728521 .0179392 4.06 0.000 .037692 .1080123
-------------+----------------------------------------------------------------
ATE |
(1 vs 0) | .1504898 .0107088 14.05 0.000 .1295009 .1714786
------------------------------------------------------------------------------
Group treatment-effects heterogeneity test
H0: Group average treatment effects are homogeneous
chi2(3) = 69.19
Prob &amp;gt; chi2 = 0.0000
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate2_gate_ntl_1v0_qog.png" alt="GATEs for NTL mining effect (1-0) by Quality of Government quartiles.">&lt;/p>
&lt;p>The quality-of-government GATE plot confirms the same &lt;strong>downward pattern&lt;/strong> seen with executive constraints: districts in the lowest QoG quartile (0.22&amp;ndash;0.32) show a GATE of 0.298, while the highest quartile (0.58+) shows only 0.073. The &lt;code>estat gatetest&lt;/code> strongly rejects equality (chi2(3) = 69.19, p &amp;lt; 0.0001). Both institutional measures tell the same story: weaker governance, larger mining effect.&lt;/p>
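&lt;p>The &lt;code>qog_cat&lt;/code> groups come from plain quartile binning of the continuous measure. For readers replicating this outside Stata, an analogous binning in Python (with hypothetical scores) looks like this:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(12345)
qog = rng.uniform(0.2, 0.7, size=2700)    # hypothetical quality-of-govt scores

# Quartile binning, analogous to egen qog_cat = cut(...), group(4)
edges = np.quantile(qog, [0.25, 0.5, 0.75])
qog_cat = np.digitize(qog, edges)         # group labels 0..3
print(np.bincount(qog_cat))               # 675 observations per quartile
```

&lt;p>Equal-sized groups keep each GATE estimated on a comparable number of districts, which is why quartiles are a convenient default for continuous moderators.&lt;/p>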
&lt;h3 id="84-gates-by-quality-of-government-price-effect-3-1">8.4 GATEs by quality of government: Price effect (3-1)&lt;/h3>
&lt;pre>&lt;code class="language-stata">preserve
keep if treatment == 3 | treatment == 1
gen byte treat_3v1 = (treatment == 3)
egen qog_cat = cut(quality_of_govt), group(4) label
cate aipw (ntl_log $catevars) (treat_3v1), ///
controls($controls) ///
group(qog_cat) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
categraph gateplot
estat gatetest
estimates store gate_ntl_3v1_qog
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">GATE |
qog_cat |
.22- | .3224114 .0784869 4.11 0.000 .1685799 .476243
.28- | .3510705 .0408896 8.59 0.000 .2709285 .4312126
.38- | .4843956 .0567094 8.54 0.000 .3732473 .5955438
.48- | .4447202 .039344 11.30 0.000 .3676074 .5218329
-------------+----------------------------------------------------------------
ATE |
(1 vs 0) | .4057735 .0253689 15.99 0.000 .3560514 .4554957
------------------------------------------------------------------------------
Group treatment-effects heterogeneity test
H0: Group average treatment effects are homogeneous
chi2(3) = 5.81
Prob &amp;gt; chi2 = 0.1211
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate2_gate_ntl_3v1_qog.png" alt="GATEs for NTL price effect (3-1) by Quality of Government quartiles.">&lt;/p>
&lt;p>The price effect GATEs by QoG range from 0.322 to 0.484 without a clear monotone pattern, and the &lt;code>estat gatetest&lt;/code> fails to reject equality (chi2(3) = 5.81, p = 0.121). This confirms the asymmetry: institutional quality does not systematically moderate the price premium, unlike the mining effect where the relationship is strong and monotone.&lt;/p>
&lt;h3 id="85-gates-for-conflict-mining-effect-1-0">8.5 GATEs for conflict: Mining effect (1-0)&lt;/h3>
&lt;pre>&lt;code class="language-stata">preserve
keep if treatment == 1 | treatment == 0
gen byte treat_1v0 = (treatment == 1)
cate aipw (conflict $catevars) (treat_1v0), ///
controls($controls) ///
group(exec_con) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
categraph gateplot
estat gatetest
estimates store gate_conf_1v0_exec
restore
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">GATE |
exec_con |
1 | .0924768 .0440987 2.10 0.036 .0060449 .1789087
2 | .032707 .0348295 0.94 0.348 -.0355576 .1009715
3 | .0600398 .0292415 2.05 0.040 .0027275 .1173522
4 | .0486042 .0273151 1.78 0.075 -.0049326 .1021409
5 | .0643314 .0205048 3.14 0.002 .0241427 .1045202
6 | .1057752 .021425 4.94 0.000 .0637831 .1477674
-------------+----------------------------------------------------------------
ATE |
(1 vs 0) | .0648653 .0122278 5.30 0.000 .0408994 .0888313
------------------------------------------------------------------------------
Group treatment-effects heterogeneity test
H0: Group average treatment effects are homogeneous
chi2(5) = 5.00
Prob &amp;gt; chi2 = 0.4162
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate2_gate_conf_1v0_exec.png" alt="GATEs for Conflict mining effect (1-0) by Executive Constraints.">&lt;/p>
&lt;p>For conflict, the GATEs range from 0.033 to 0.106 across executive constraint levels with no clear monotone pattern. The &lt;code>estat gatetest&lt;/code> fails to reject equality (chi2(5) = 5.00, p = 0.416), indicating that institutional quality does &lt;strong>not&lt;/strong> significantly moderate the conflict effect of mining in this simulated dataset. All groups show a positive conflict effect, but without systematic variation.&lt;/p>
&lt;hr>
&lt;h2 id="9-advanced-diagnostics">9. Advanced diagnostics&lt;/h2>
&lt;p>Stata 19&amp;rsquo;s &lt;code>cate&lt;/code> suite provides several postestimation tools that go beyond group-level summaries. This section demonstrates IATE distributions, formal heterogeneity tests, subpopulation ATEs, linear projections, and IATE function plots.&lt;/p>
&lt;h3 id="91-iate-distribution-and-heterogeneity-test">9.1 IATE distribution and heterogeneity test&lt;/h3>
&lt;p>We re-estimate the NTL mining effect (1-0) with &lt;code>i.exec_con&lt;/code> in the catevarlist to enable &lt;code>reestimate group(exec_con)&lt;/code> later.&lt;/p>
&lt;pre>&lt;code class="language-stata">preserve
keep if treatment == 1 | treatment == 0
gen byte treat_1v0 = (treatment == 1)
cate aipw (ntl_log exec_constraints quality_of_govt gdp_pc ///
elevation temperature ruggedness distance_capital ///
agri_suitability population ethnic_frac ///
i.exec_con) (treat_1v0), ///
controls($controls) ///
rseed(12345) xfolds(5) ///
omethod(rforest) tmethod(rforest)
* Distribution of individual effects
categraph histogram
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 2,700
Estimator: Augmented IPW Number of folds in cross-fit = 5
Outcome model: Random forest Number of outcome controls = 34
Treatment model: Random forest Number of treatment controls = 34
CATE model: Random forest Number of CATE variables = 16
------------------------------------------------------------------------------
| Robust
ntl_log | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_1v0 |
(1 vs 0) | .1517508 .0111077 13.66 0.000 .12998 .1735216
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate2_iate_histogram.png" alt="Distribution of IATE predictions for the NTL mining effect (1-0). The spread of this histogram reflects the degree of treatment-effect heterogeneity across districts.">&lt;/p>
&lt;p>The histogram shows the full distribution of estimated individual treatment effects $\hat{\tau}(\mathbf{x}_i)$ across all districts. A wide spread indicates substantial heterogeneity; a spike at one value would indicate near-homogeneity. The distribution is centered around the ATE of approximately 0.15, with meaningful spread reflecting how institutional quality, geography, and economic conditions create different mining effects across districts.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Formal test: are treatment effects heterogeneous?
estat heterogeneity
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects heterogeneity test
H0: Treatment effects are homogeneous
chi2(1) = 53.05
Prob &amp;gt; chi2 = 0.0000
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Interpreting the heterogeneity test.&lt;/strong> The &lt;code>estat heterogeneity&lt;/code> test builds on the generic machine-learning inference approach of Chernozhukov et al. (2018). A significant result (p &amp;lt; 0.05) provides statistical evidence that treatment effects vary across observations &amp;mdash; they are not constant. This justifies the use of CATE methods rather than a simple ATE.&lt;/p>
&lt;/blockquote>
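&lt;p>To make the mechanics concrete, here is a minimal numerical sketch of one common way such a heterogeneity test can be constructed (the BLP-style approach from the generic-ML literature): regress the doubly robust scores on the demeaned IATE predictions and test whether the slope is zero. Everything below is a synthetic stand-in, not Stata&amp;rsquo;s internal computation.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in: the true effect varies with a covariate, so the
# homogeneity null should be rejected
x = rng.normal(size=n)
tau_true = 0.15 + 0.10 * x                            # heterogeneous effect
tau_hat = tau_true + rng.normal(scale=0.05, size=n)   # IATE predictions
scores = tau_true + rng.normal(scale=0.50, size=n)    # doubly robust scores

# BLP-style test: regress the scores on the demeaned IATE predictions;
# under H0 (homogeneous effects) the slope is zero
z = tau_hat - tau_hat.mean()
X = np.column_stack([np.ones(n), z])
beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
resid = scores - X @ beta

# HC0-robust variance of the slope, then a 1-df Wald statistic
bread = np.linalg.inv(X.T @ X)
meat = X.T @ (X * (resid**2)[:, None])
V = bread @ meat @ bread
chi2 = float(beta[1]**2 / V[1, 1])
print(round(chi2, 1))
```

&lt;p>With heterogeneous true effects the statistic is large and the null is rejected, mirroring the chi2(1) = 53.05 result above.&lt;/p>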
&lt;h3 id="92-gate-equality-test-with-reestimate">9.2 GATE equality test with reestimate&lt;/h3>
&lt;p>The &lt;code>reestimate&lt;/code> option recycles the IATE function from the previous estimation. We do NOT refit the (slow) causal forest &amp;mdash; we just recompute group means. This makes it fast to explore different grouping variables.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Recompute GATEs by Executive Constraints from existing IATEs
cate, reestimate group(exec_con)
* Test H0: GATEs are equal across executive constraint levels
estat gatetest
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">GATE |
exec_con |
1 | .2748407 .0403765 6.81 0.000 .1957042 .3539772
2 | .3155337 .0204714 15.41 0.000 .2754106 .3556569
3 | .1674459 .020837 8.04 0.000 .1266061 .2082857
4 | .1131603 .0263687 4.29 0.000 .0614785 .164842
5 | .0998745 .0296118 3.37 0.001 .0418364 .1579127
6 | .0508165 .0220009 2.31 0.021 .0076955 .0939374
Group treatment-effects heterogeneity test
H0: Group average treatment effects are homogeneous
chi2(5) = 96.90
Prob &amp;gt; chi2 = 0.0000
&lt;/code>&lt;/pre>
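&lt;p>Conceptually, the recomputation that &lt;code>reestimate&lt;/code> performs is cheap because GATEs are just within-group means of the already-estimated IATEs. The sketch below illustrates that logic on synthetic numbers; the naive equality check at the end is for intuition only, since Stata&amp;rsquo;s &lt;code>estat gatetest&lt;/code> properly accounts for estimation error in the IATEs.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2700

# Hypothetical stand-ins: a 6-level grouping variable and IATE
# predictions that decline with it (mirroring the pattern above)
exec_con = rng.integers(1, 7, size=n)
tau_hat = 0.35 - 0.05 * exec_con + rng.normal(scale=0.10, size=n)

# GATEs: within-group means of the IATE predictions
gates = {g: float(tau_hat[exec_con == g].mean()) for g in range(1, 7)}
for g in sorted(gates):
    print(g, round(gates[g], 3))

# Naive equality check (ignores first-stage estimation error): compare
# each group mean to the pooled mean, scaled by its sampling variance
chi2 = sum(
    (gates[g] - tau_hat.mean())**2
    / (tau_hat[exec_con == g].var(ddof=1) / (exec_con == g).sum())
    for g in range(1, 7)
)
print(round(float(chi2), 1))
```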
&lt;h3 id="93-ate-for-subpopulations">9.3 ATE for subpopulations&lt;/h3>
&lt;p>We can estimate ATEs for specific subsets of the data using &lt;code>estat ate&lt;/code>. This answers the policy question: &amp;ldquo;What is the average effect of mining &lt;em>specifically for districts with strong (or weak) institutions&lt;/em>?&amp;rdquo;&lt;/p>
&lt;pre>&lt;code class="language-stata">* ATE for districts with strong institutions
estat ate if exec_constraints &amp;gt;= 4
* ATE for districts with weak institutions
estat ate if exec_constraints &amp;lt;= 2
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">--- ATE for districts with exec_constraints &amp;gt;= 4 ---
Treatment-effects estimation Number of obs = 1,526
------------------------------------------------------------------------------
| Robust
| Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_1v0 |
(1 vs 0) | .0922992 .0156897 5.88 0.000 .0615479 .1230506
------------------------------------------------------------------------------
--- ATE for districts with exec_constraints &amp;lt;= 2 ---
Treatment-effects estimation Number of obs = 558
------------------------------------------------------------------------------
| Robust
| Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
treat_1v0 |
(1 vs 0) | .2970104 .0215156 13.80 0.000 .2548407 .3391801
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The subpopulation ATEs reveal a stark difference: districts with &lt;strong>weak institutions&lt;/strong> (exec_constraints $\leq$ 2) have an ATE of &lt;strong>0.297&lt;/strong> (SE = 0.022), while districts with &lt;strong>strong institutions&lt;/strong> (exec_constraints $\geq$ 4) have an ATE of only &lt;strong>0.092&lt;/strong> (SE = 0.016). The mining effect is more than three times larger in weakly-governed districts. This confirms the GATE pattern and demonstrates that institutions systematically moderate the magnitude of mining&amp;rsquo;s developmental impact.&lt;/p>
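&lt;p>Mechanically, a subpopulation ATE is the average of the effect over the selected rows. The sketch below mimics that logic with hypothetical numbers shaped like our results; Stata additionally delivers valid standard errors, which a raw mean of IATE predictions does not.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2700

# Hypothetical stand-ins for the institutions score and the IATEs
exec_constraints = rng.integers(1, 7, size=n)
tau_hat = 0.40 - 0.06 * exec_constraints + rng.normal(scale=0.05, size=n)

# A subpopulation ATE averages the effect over the selected rows,
# which is the idea behind `estat ate if ...`
ate_strong = float(tau_hat[exec_constraints >= 4].mean())
ate_weak = float(tau_hat[exec_constraints <= 2].mean())
print(round(ate_weak, 3), round(ate_strong, 3))
```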
&lt;h3 id="94-linear-projection-of-iates">9.4 Linear projection of IATEs&lt;/h3>
&lt;p>The &lt;code>estat projection&lt;/code> command regresses the estimated $\hat{\tau}_i$ on covariates linearly. This provides an interpretable summary of &lt;em>which variables drive heterogeneity&lt;/em> &amp;mdash; think of it as &amp;ldquo;an OLS view of the function $\tau(\mathbf{x})$.&amp;rdquo;&lt;/p>
&lt;pre>&lt;code class="language-stata">estat projection exec_constraints quality_of_govt gdp_pc elevation temperature
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects linear projection Number of obs = 2,700
F(5, 2694) = 13.95
Prob &amp;gt; F = 0.0000
R-squared = 0.0235
------------------------------------------------------------------------------
| Robust
| Coefficient std. err. t P&amp;gt;|t| [95% conf. interval]
-------------+----------------------------------------------------------------
exec_const~s | -.026133 .0470456 -0.56 0.579 -.1183822 .0661162
quality_of~t | -.862502 .6669174 -1.29 0.196 -2.170224 .4452196
gdp_pc | .000067 .0000388 1.73 0.084 -9.06e-06 .000143
elevation | -.0001258 .0000341 -3.69 0.000 -.0001928 -.0000589
temperature | .0045969 .0022266 2.06 0.039 .0002309 .0089629
_cons | .4351266 .1043275 4.17 0.000 .2305565 .6396967
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The linear projection (R-squared = 0.024) reveals that &lt;strong>elevation&lt;/strong> (coeff = &amp;ndash;0.0001, p &amp;lt; 0.001) and &lt;strong>temperature&lt;/strong> (coeff = 0.005, p = 0.039) are the strongest linear predictors of the individual treatment effect. Institutional variables (&lt;code>exec_constraints&lt;/code> and &lt;code>quality_of_govt&lt;/code>) have negative coefficients but are not individually significant in this linear summary, despite driving the GATE heterogeneity. This is expected &amp;mdash; the relationship between institutions and the treatment effect is nonlinear (as the GATE plots show), and a linear projection cannot capture the full pattern that the random forest identifies.&lt;/p>
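&lt;p>The projection itself is nothing more exotic than OLS with the IATE predictions as the dependent variable. The sketch below reproduces that logic on synthetic data whose signs match the table; the covariate names are stand-ins, not the tutorial&amp;rsquo;s dataset.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2700

# Hypothetical covariates and IATEs whose linear part matches the signs
# in the projection table (negative elevation, positive temperature)
elevation = rng.uniform(0, 3000, size=n)
temperature = rng.uniform(15, 35, size=n)
tau_hat = (0.40 - 0.0001 * elevation + 0.004 * temperature
           + rng.normal(scale=0.05, size=n))

# Linear projection: plain OLS of the IATE predictions on the chosen
# covariates; an interpretable but deliberately incomplete summary
X = np.column_stack([np.ones(n), elevation, temperature])
beta, *_ = np.linalg.lstsq(X, tau_hat, rcond=None)
print([float(round(b, 4)) for b in beta])
```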
&lt;h3 id="95-iate-function-plots">9.5 IATE function plots&lt;/h3>
&lt;p>The &lt;code>categraph iateplot&lt;/code> command shows how the IATE function varies with one covariate at a time, holding all others at their reference values. This is the most intuitive visualization of heterogeneity.&lt;/p>
&lt;pre>&lt;code class="language-stata">categraph iateplot exec_constraints
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate2_iateplot_exec.png" alt="IATE function for NTL mining effect (1-0) as a function of executive constraints. An upward trend confirms that stronger institutions increase the mining benefit.">&lt;/p>
&lt;pre>&lt;code class="language-stata">categraph iateplot quality_of_govt
restore
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate2_iateplot_qog.png" alt="IATE function for NTL mining effect (1-0) as a function of quality of government.">&lt;/p>
&lt;p>The IATE function plots show &lt;strong>downward trends&lt;/strong>: as institutional quality increases (whether measured by executive constraints or quality of government), the predicted treatment effect of mining on nighttime lights &lt;em>decreases&lt;/em>. This provides visual confirmation of the GATE findings and complements the bar charts with a continuous view of the relationship. The downward slope is consistent with the subpopulation ATEs: mining has a larger developmental effect in weakly-governed districts.&lt;/p>
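&lt;p>The logic behind an IATE function plot is a one-dimensional sweep: fix all covariates but one at reference values and trace the fitted $\hat{\tau}(\mathbf{x})$ along a grid. The sketch below uses a made-up linear stand-in for the forest&amp;rsquo;s prediction function.&lt;/p>

```python
import numpy as np

# A hypothetical stand-in for the fitted IATE function of two
# covariates (the real function is whatever `cate` estimated)
def tau_fn(exec_constraints, quality_of_govt):
    return 0.35 - 0.05 * exec_constraints - 0.20 * quality_of_govt

# iateplot logic: sweep one covariate over a grid while holding the
# other at a reference value (here its midpoint), tracing tau-hat
grid = np.arange(1, 7)
curve = tau_fn(grid, quality_of_govt=0.5)
print([float(round(v, 2)) for v in curve])  # downward-sloping in this sketch
```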
&lt;hr>
&lt;h2 id="10-conclusion">10. Conclusion&lt;/h2>
&lt;h3 id="101-mapping-results-to-paper-findings">10.1 Mapping results to paper findings&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Finding&lt;/th>
&lt;th>Paper Result&lt;/th>
&lt;th>Tutorial Evidence&lt;/th>
&lt;th>Stata Command&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>1. Mining increases NTL&lt;/td>
&lt;td>Positive ATEs (1-0, 2-0, 3-0)&lt;/td>
&lt;td>Confirmed: ATEs 0.15&amp;ndash;0.61&lt;/td>
&lt;td>&lt;code>cate aipw ... (treat_1v0)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>2. Non-linear prices&lt;/td>
&lt;td>ATE(2-1) &amp;laquo; ATE(3-1)&lt;/td>
&lt;td>Confirmed: -0.01 vs 0.41&lt;/td>
&lt;td>&lt;code>estimates restore&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3. Institutions moderate mining&lt;/td>
&lt;td>Monotone GATE slope for 1-0&lt;/td>
&lt;td>Confirmed (downward): chi2 = 96.9&lt;/td>
&lt;td>&lt;code>cate ... group(exec_con)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>3. NOT prices&lt;/td>
&lt;td>No monotone GATE for 3-1&lt;/td>
&lt;td>Confirmed: chi2 = 5.81, p = 0.12 (QoG)&lt;/td>
&lt;td>&lt;code>cate ... group(qog_cat)&lt;/code>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;blockquote>
&lt;p>&lt;strong>Note on direction.&lt;/strong> The paper found that &lt;em>stronger&lt;/em> institutions amplify mining benefits (upward slope). Our simulated data shows the opposite sign &amp;mdash; &lt;em>weaker&lt;/em> institutions yield larger mining effects &amp;mdash; but the key structural finding (systematic institutional moderation of mining, not of prices) is reproduced. The direction difference reflects DGP parametrization, not a methodological failure.&lt;/p>
&lt;/blockquote>
&lt;h3 id="102-what-differs-from-the-full-paper">10.2 What differs from the full paper&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Aspect&lt;/th>
&lt;th>Tutorial&lt;/th>
&lt;th>Full Paper&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Data&lt;/td>
&lt;td>3,000 simulated obs&lt;/td>
&lt;td>60,121 real obs&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Districts&lt;/td>
&lt;td>300&lt;/td>
&lt;td>3,800&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Countries&lt;/td>
&lt;td>8 (fictional)&lt;/td>
&lt;td>42 (Sub-Saharan Africa)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Covariates&lt;/td>
&lt;td>10&lt;/td>
&lt;td>60+&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Treatment&lt;/td>
&lt;td>4-level, simulated&lt;/td>
&lt;td>29 minerals, real prices&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Inference&lt;/td>
&lt;td>5-fold cross-fitting&lt;/td>
&lt;td>1,000 bootstrap&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Outcomes&lt;/td>
&lt;td>2 (NTL, Conflict)&lt;/td>
&lt;td>2 (same)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Method&lt;/td>
&lt;td>Stata &lt;code>cate&lt;/code> (GRF)&lt;/td>
&lt;td>MCF (Lechner, 2019)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="103-glossary">10.3 Glossary&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Term&lt;/th>
&lt;th>Definition&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>ATE&lt;/strong>&lt;/td>
&lt;td>Average Treatment Effect &amp;mdash; the mean effect across the entire population&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>CATE&lt;/strong>&lt;/td>
&lt;td>Conditional Average Treatment Effect &amp;mdash; the ATE conditional on characteristics $\mathbf{x}$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>IATE&lt;/strong>&lt;/td>
&lt;td>Individualized ATE &amp;mdash; one effect per observation, $\tau(\mathbf{x}_i)$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>GATE&lt;/strong>&lt;/td>
&lt;td>Group ATE &amp;mdash; average effect within prespecified groups&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>GATES&lt;/strong>&lt;/td>
&lt;td>Sorted Group ATE &amp;mdash; groups formed by quantiles of IATE estimates (data-driven)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>PO&lt;/strong>&lt;/td>
&lt;td>Partialing-Out estimator &amp;mdash; residualizes outcome and treatment, then estimates $\tau(\mathbf{x})$ via GRF&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>AIPW&lt;/strong>&lt;/td>
&lt;td>Augmented IPW &amp;mdash; doubly robust estimator combining outcome model and propensity score&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>GRF&lt;/strong>&lt;/td>
&lt;td>Generalized Random Forest &amp;mdash; the nonparametric method underlying Stata&amp;rsquo;s &lt;code>cate&lt;/code> (Athey et al., 2019)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Cross-fitting&lt;/strong>&lt;/td>
&lt;td>Sample-splitting procedure to prevent overfitting of nuisance models&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Honest tree&lt;/strong>&lt;/td>
&lt;td>Tree that uses separate subsamples for splitting and leaf estimation (enables valid inference)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>catevarlist&lt;/strong>&lt;/td>
&lt;td>Variables in Stata&amp;rsquo;s &lt;code>cate&lt;/code> that drive treatment-effect heterogeneity ($\mathbf{x}$)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>controls()&lt;/strong>&lt;/td>
&lt;td>Additional variables for nuisance models only ($\mathbf{w}$), not for heterogeneity&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="104-key-advantages-of-stata-19s-cate">10.4 Key advantages of Stata 19&amp;rsquo;s &lt;code>cate&lt;/code>&lt;/h3>
&lt;blockquote>
&lt;ol>
&lt;li>&lt;strong>No external packages&lt;/strong> &amp;mdash; everything is built into Stata 19.&lt;/li>
&lt;li>&lt;strong>Formal hypothesis tests&lt;/strong> &amp;mdash; &lt;code>estat heterogeneity&lt;/code> and &lt;code>estat gatetest&lt;/code> provide rigorous inference that many specialized Python packages do not offer natively.&lt;/li>
&lt;li>&lt;strong>Publication-ready visualization&lt;/strong> &amp;mdash; &lt;code>categraph&lt;/code> produces polished plots directly.&lt;/li>
&lt;li>&lt;strong>Integrated workflow&lt;/strong> &amp;mdash; seamlessly combines with Stata&amp;rsquo;s ecosystem (&lt;code>estimates store&lt;/code>, &lt;code>preserve&lt;/code>/&lt;code>restore&lt;/code>, &lt;code>margins&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Doubly robust&lt;/strong> &amp;mdash; AIPW estimator is consistent even if one nuisance model is wrong.&lt;/li>
&lt;/ol>
&lt;/blockquote>
&lt;h3 id="105-exercises">10.5 Exercises&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Change the estimator:&lt;/strong> Re-run the NTL 1-0 comparison using &lt;code>omethod(lasso) tmethod(lasso)&lt;/code> instead of random forest. Do the ATE and GATEs change substantially?&lt;/li>
&lt;li>&lt;strong>Vary cross-fitting folds:&lt;/strong> Try &lt;code>xfolds(10)&lt;/code> instead of 5 for the 1-0 comparison. Is there a precision gain? Does the ATE shift?&lt;/li>
&lt;li>&lt;strong>Data-driven groups:&lt;/strong> Replace &lt;code>group(exec_con)&lt;/code> with &lt;code>group(4)&lt;/code> to let the data discover heterogeneity groups via IATE quantiles (GATES). Do the data-driven groups align with institutional quality?&lt;/li>
&lt;li>&lt;strong>Subpopulation policy analysis:&lt;/strong> Use &lt;code>estat ate if gdp_pc &amp;gt; 6000&lt;/code> to estimate the ATE for richer districts. How does the mining effect compare to the full-sample ATE?&lt;/li>
&lt;li>&lt;strong>Additional heterogeneity:&lt;/strong> Investigate whether GDP per capita moderates treatment effects using &lt;code>categraph iateplot gdp_pc&lt;/code>. Is there a clear pattern?&lt;/li>
&lt;/ol>
&lt;h3 id="106-references">10.6 References&lt;/h3>
&lt;ol>
&lt;li>Hodler, R., Lechner, M., &amp;amp; Raschky, P. A. (2023). Institutions and the resource curse: New insights from causal machine learning. &lt;em>PLoS ONE&lt;/em>, 18(5), e0284968.&lt;/li>
&lt;li>Athey, S., Tibshirani, J., &amp;amp; Wager, S. (2019). Generalized random forests. &lt;em>Annals of Statistics&lt;/em>, 47(2), 1148&amp;ndash;1178.&lt;/li>
&lt;li>Nie, X., &amp;amp; Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. &lt;em>Biometrika&lt;/em>, 108(2), 299&amp;ndash;319.&lt;/li>
&lt;li>Knaus, M. C. (2022). Double machine learning-based programme evaluation under unconfoundedness. &lt;em>Econometrics Journal&lt;/em>, 25(3), 602&amp;ndash;627.&lt;/li>
&lt;li>Kennedy, E. H. (2023). Towards optimal doubly robust estimation of heterogeneous causal effects. &lt;em>Electronic Journal of Statistics&lt;/em>, 17(2), 3008&amp;ndash;3049.&lt;/li>
&lt;li>Sachs, J. D., &amp;amp; Warner, A. M. (1995). Natural resource abundance and economic growth. &lt;em>NBER Working Paper&lt;/em> No. 5398.&lt;/li>
&lt;li>Mehlum, H., Moene, K., &amp;amp; Torvik, R. (2006). Institutions and the resource curse. &lt;em>The Economic Journal&lt;/em>, 116(508), 1&amp;ndash;20.&lt;/li>
&lt;li>StataCorp. (2025). &lt;em>Stata 19 Treatment-Effects Reference Manual: cate&lt;/em>. College Station, TX: Stata Press.&lt;/li>
&lt;/ol></description></item><item><title>Conditional Average Treatment Effects (CATE) with Stata 19</title><link>https://carlos-mendez.org/post/stata_cate/</link><pubDate>Fri, 01 May 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/stata_cate/</guid><description>&lt;h2 id="1-overview">1. Overview&lt;/h2>
&lt;p>The textbook causal-inference workflow ends with a single number — the &lt;strong>Average Treatment Effect (ATE)&lt;/strong>. But policy makers, doctors, and managers rarely care only about the average. They want to know &lt;em>for whom&lt;/em> the program works best, &lt;em>for whom&lt;/em> it does little, and &lt;em>whether&lt;/em> the gains are worth the cost in any particular subgroup. This question — how the treatment effect varies across the covariates — is captured by the &lt;strong>Conditional Average Treatment Effect (CATE)&lt;/strong>, also written $\tau(\mathbf{x}) = E\{y(1) - y(0) \mid \mathbf{X} = \mathbf{x}\}$.&lt;/p>
&lt;p>Until very recently, estimating CATE in Stata required hand-rolled &lt;code>forvalues&lt;/code> loops, careful interactions, and uncomfortably ad hoc inference. Stata 19 changed that with the new &lt;code>cate&lt;/code> command, which builds on the doubly robust scores of Athey, Tibshirani &amp;amp; Wager (2019) and the partialing-out workflow of Chernozhukov et al. (2018). With one command, Stata 19 runs cross-fitted lasso for the nuisance functions, a generalized random forest for the individual-effect function $\tau(x)$, and an honest-tree bootstrap for confidence intervals. Postestimation tools — &lt;code>estat heterogeneity&lt;/code>, &lt;code>estat projection&lt;/code>, &lt;code>categraph gateplot&lt;/code>, &lt;code>estat classification&lt;/code>, &lt;code>estat series&lt;/code> — turn the resulting object into pictures that beginners can read directly.&lt;/p>
&lt;p>This tutorial walks through the full &lt;code>cate&lt;/code> workflow on the canonical 401(k) eligibility study (&lt;code>webuse assets3&lt;/code>, 9,913 households). We start with a single ATE, show that it hides a wide fan of household-level effects, and then peel back the heterogeneity in five complementary ways: a histogram of individual effects, an IATE-by-covariate plot, a GATE on prespecified income groups, GATES on data-driven quartiles, and a smooth nonparametric series fit. The result is a complete picture of &lt;em>who benefits&lt;/em> — and a reusable template you can drop into your own observational data.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Prerequisite.&lt;/strong> This post requires &lt;strong>Stata 19 or later&lt;/strong>. The &lt;code>cate&lt;/code> command does not exist in Stata 18. The do-file aborts on startup if it detects an older Stata.&lt;/p>
&lt;/blockquote>
&lt;h3 id="11-learning-objectives">1.1 Learning objectives&lt;/h3>
&lt;p>By the end of this tutorial you should be able to:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Understand&lt;/strong> why the ATE alone can mislead and how the CATE function $\tau(x)$ describes treatment effect heterogeneity.&lt;/li>
&lt;li>&lt;strong>Implement&lt;/strong> Stata 19&amp;rsquo;s &lt;code>cate&lt;/code> command using both the partialing-out (PO) and the augmented inverse-probability weighting (AIPW) estimators on observational data.&lt;/li>
&lt;li>&lt;strong>Estimate&lt;/strong> group-level effects (GATE) on prespecified groups and data-driven quartiles (GATES) of the predicted effect.&lt;/li>
&lt;li>&lt;strong>Diagnose&lt;/strong> treatment-effect heterogeneity with &lt;code>estat heterogeneity&lt;/code>, summarize who responds with &lt;code>estat projection&lt;/code> and &lt;code>estat classification&lt;/code>, and visualize the dose-response with &lt;code>estat series&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Compare&lt;/strong> doubly robust ML estimates (PO, AIPW) to a parametric &lt;code>teffects aipw&lt;/code> benchmark and judge whether the average is hiding important variation.&lt;/li>
&lt;/ul>
&lt;h3 id="12-methodological-overview">1.2 Methodological overview&lt;/h3>
&lt;p>The diagram below shows the two routes through the &lt;code>cate&lt;/code> command and the postestimation tools that probe the resulting CATE object.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">flowchart TB
A[&amp;quot;assets3 dataset&amp;lt;br/&amp;gt;9,913 households&amp;lt;br/&amp;gt;e401k -&amp;gt; assets&amp;quot;]:::data
A --&amp;gt; B{cate command}:::main
B --&amp;gt;|&amp;quot;&amp;lt;b&amp;gt;cate po&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Partial-linear model&amp;lt;br/&amp;gt;Robust to small propensities&amp;quot;| C[&amp;quot;PO estimator&amp;lt;br/&amp;gt;Cross-fit lasso + causal forest&amp;quot;]:::po
B --&amp;gt;|&amp;quot;&amp;lt;b&amp;gt;cate aipw&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Fully interactive model&amp;lt;br/&amp;gt;Doubly robust, more efficient&amp;quot;| D[&amp;quot;AIPW estimator&amp;lt;br/&amp;gt;Cross-fit lasso + causal forest&amp;quot;]:::aipw
C --&amp;gt; E[&amp;quot;IATE function&amp;lt;br/&amp;gt;tau-hat(x_i) per household&amp;quot;]:::iate
D --&amp;gt; E
E --&amp;gt; F1[&amp;quot;categraph histogram&amp;lt;br/&amp;gt;distribution of effects&amp;quot;]:::post
E --&amp;gt; F2[&amp;quot;categraph iateplot&amp;lt;br/&amp;gt;tau vs covariate&amp;quot;]:::post
E --&amp;gt; F3[&amp;quot;estat heterogeneity&amp;lt;br/&amp;gt;H0: tau(x) constant&amp;quot;]:::post
E --&amp;gt; F4[&amp;quot;estat projection&amp;lt;br/&amp;gt;linear summary of who&amp;quot;]:::post
E --&amp;gt; F5[&amp;quot;GATE / GATES&amp;lt;br/&amp;gt;group-level effects&amp;quot;]:::post
E --&amp;gt; F6[&amp;quot;estat classification&amp;lt;br/&amp;gt;top vs bottom profile&amp;quot;]:::post
E --&amp;gt; F7[&amp;quot;estat series&amp;lt;br/&amp;gt;smooth derivative&amp;quot;]:::post
classDef data fill:#6a9bcc,stroke:#141413,color:#fff
classDef main fill:#141413,stroke:#141413,color:#fff
classDef po fill:#6a9bcc,stroke:#141413,color:#fff
classDef aipw fill:#d97757,stroke:#141413,color:#fff
classDef iate fill:#00d4c8,stroke:#141413,color:#141413
classDef post fill:#f5f5f5,stroke:#141413,color:#141413
&lt;/code>&lt;/pre>
&lt;p>The two branches (PO and AIPW) make different model assumptions but produce the same kind of object: a function $\hat{\tau}(x_i)$ that returns a predicted treatment effect for every household. Postestimation commands then summarize that function in different ways — as a distribution (histogram), a function of one covariate (&lt;code>iateplot&lt;/code>), a test (&lt;code>estat heterogeneity&lt;/code>), a regression summary (&lt;code>estat projection&lt;/code>), or a group-level table (GATE / GATES). All seven postestimation views answer slightly different questions, and the last three sections of this post show why a beginner should look at all of them rather than picking one favorite.&lt;/p>
&lt;h3 id="13-key-concepts-at-a-glance">1.3 Key concepts at a glance&lt;/h3>
&lt;p>The post leans on a small vocabulary repeatedly. The rest of the tutorial assumes you can move between these terms quickly. Each concept below has three parts. The &lt;strong>definition&lt;/strong> is always visible. The &lt;strong>example&lt;/strong> and &lt;strong>analogy&lt;/strong> sit behind clickable cards: open them when you need them, leave them collapsed for a quick scan. If a later section mentions &amp;ldquo;GATE vs GATES&amp;rdquo; or &amp;ldquo;doubly robust&amp;rdquo; and the term feels slippery, this is the section to re-read.&lt;/p>
&lt;p>&lt;strong>1. Potential outcomes&lt;/strong> $Y_i(t)$.
The outcome unit $i$ &lt;strong>would&lt;/strong> take under treatment value $t$. Each household has two potential outcomes here: assets if eligible for a 401(k), assets if not. We observe only one. The other is &lt;em>counterfactual&lt;/em>. It lives in a world we never see.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Take household 1234 with &lt;code>e401k = 1&lt;/code> (eligible). We observe its &lt;code>assets&lt;/code> under eligibility. Its potential outcome under non-eligibility, $Y_{1234}(0)$, is forever invisible. Causal inference is the art of imputing that missing potential outcome from comparable ineligible households.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Every life decision is a fork in the road. You took one fork. The parallel-universe versions of yourself took the other. Their lives are real conceptual objects, but you cannot directly observe them.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>2. CATE&lt;/strong> &amp;mdash; Conditional Average Treatment Effect, $\tau(\mathbf{x})$.
The average treatment effect for households with covariate profile $\mathbf{x}$. The CATE is a &lt;strong>function&lt;/strong> of $\mathbf{x}$, not a single number. Where it bends with $\mathbf{x}$, eligibility helps some households more than others.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>For a high-income household (&lt;code>income&lt;/code> in the top quintile), the CATE is roughly \$20,511. For a low-income household, it is closer to \$4,087. Same &lt;code>e401k = 1&lt;/code>, very different effects on &lt;code>assets&lt;/code>.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A drug&amp;rsquo;s &amp;ldquo;average effect&amp;rdquo; is a 5-point reduction in blood pressure. But a doctor cares about a specific patient. Maybe a 65-year-old male with diabetes. The CATE is that personalized effect.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>3. ATE&lt;/strong> &amp;mdash; Average Treatment Effect, $E[\tau(\mathbf{X})]$.
The CATE averaged across the entire sample. The headline policy number. It answers a single question: if we made everyone eligible, what would the average bump in &lt;code>assets&lt;/code> be?&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>AIPW gives an ATE of \$8,120 (95% CI [\$5,846, \$10,395]) on our 9,913 households. PO gives \$7,937 (95% CI [\$5,677, \$10,197]). The two estimates are within \$200. Their joint message: eligibility raises mean assets by about \$8,000.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>&amp;ldquo;This drug lowers cholesterol by 12 points on average.&amp;rdquo; Single number. Suitable for a press release. Says nothing about who responds best.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>4. GATE&lt;/strong> &amp;mdash; Group Average Treatment Effect.
The CATE averaged inside a &lt;em>pre-specified&lt;/em> subgroup. The subgroup is fixed before estimation. GATEs test moderation hypotheses you formulated in advance: &amp;ldquo;do high-income households benefit more than low-income ones?&amp;rdquo;&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Sort households by &lt;code>incomecat&lt;/code> (lowest to highest income quintile). Average CATEs inside each level. The lowest quintile gets \$4,087. The highest gets \$20,511. The pattern is monotone and steep.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A nationwide marketing campaign lifts sales 5% on average. Before scaling up, you ask: did it work better in cities than rural towns? Same data, broken down by a subgroup you defined in advance.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>5. GATES&lt;/strong> &amp;mdash; Group Average Treatment Effects via &lt;em>predicted&lt;/em> effect quartiles.
A &lt;em>data-driven&lt;/em> version of GATE. Sort households by their estimated CATE $\hat{\tau}_i$, slice into quartiles Q1&amp;ndash;Q4, then average the actual response in each quartile. The contrast Q4-vs-Q1 is the strongest moderation signal a beginner can find without naming the moderator.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>GATES Q1 (lowest predicted effect) = \$17,279. GATES Q4 (highest predicted effect) = \$2,919. The top-to-bottom ratio is 5.9×. Note that GATES is sorted by &lt;em>predicted&lt;/em> effect, so the labels feel inverted: we let the model tell us who responds.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Letting the data sort the patients for you. You do not need to know in advance whether age, gender, or kidney function matters. You ask: &amp;ldquo;based on the model, who is in the top 25% of predicted responders?&amp;rdquo; Then you check whether they actually respond more.&lt;/p>
&lt;/details>
&lt;/div>
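&lt;p>The GATES recipe is easy to state in code: sort by predicted effect, cut at the quartiles, then average the realized (score-based) effect within each slice. A hypothetical sketch with synthetic numbers:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(6)
n = 8000

# Hypothetical predicted effects and noisy realized (score-based) effects
tau_hat = rng.gamma(shape=2.0, scale=4.0, size=n)   # predicted IATEs
scores = tau_hat + rng.normal(scale=5.0, size=n)    # realized effect proxy

# GATES: slice units into quartiles of the PREDICTED effect,
# then average the realized effect inside each quartile
q = np.quantile(tau_hat, [0.25, 0.5, 0.75])
bins = np.digitize(tau_hat, q)                      # 0..3 = Q1..Q4
gates = [float(scores[bins == k].mean()) for k in range(4)]
print([round(g, 1) for g in gates])
```

&lt;p>When the model ranks units well, the quartile averages rise from Q1 to Q4; a flat profile would mean the predicted heterogeneity is noise.&lt;/p>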
&lt;p>&lt;strong>6. PO vs AIPW estimators&lt;/strong>.
Two ways to map nuisance estimates into a CATE. &lt;strong>PO&lt;/strong> (Partialing Out, partial-linear model) residualizes both &lt;code>assets&lt;/code> and &lt;code>e401k&lt;/code> against the covariates, then regresses one residual on the other. Simple, transparent, and robust to extreme propensity scores because it never inverse-weights. &lt;strong>AIPW&lt;/strong> (Augmented Inverse-Probability Weighting) reweights by the inverse propensity and adds a regression correction. More machinery, and sensitive to propensities near 0 or 1, but doubly robust and more efficient under good overlap.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>This post fits both. They land within \$200 (PO \$7,937 vs AIPW \$8,120). When PO and AIPW are close, the model-disagreement diagnostic is green. When they diverge, the overlap is suspect or one of the nuisance models is mis-specified.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Two judges hear the same case via different reasoning. When their verdicts agree, you trust the case. When they disagree, you re-read the evidence.&lt;/p>
&lt;/details>
&lt;/div>
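&lt;p>The PO recipe fits in a few lines. In this sketch, plain OLS stands in for the cross-fitted ML nuisance fits, and the data are synthetic with a known constant effect of 8, so the residual-on-residual regression should land near 8.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Synthetic data with a known constant treatment effect of 8.
# Propensity depends only on a binary covariate, so a linear nuisance
# model is exactly right here.
x0 = rng.integers(0, 2, size=n).astype(float)
x1 = rng.normal(size=n)
e = (rng.random(n) < 0.3 + 0.3 * x0).astype(float)   # treatment
y = 8 * e + 2 * x0 + 1.5 * x1 + rng.normal(size=n)   # outcome

# Partialing out: residualize y and e on the covariates, then regress
# one residual on the other
X = np.column_stack([np.ones(n), x0, x1])
ry = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
re = e - X @ np.linalg.lstsq(X, e, rcond=None)[0]
tau_po = float((re @ ry) / (re @ re))
print(round(tau_po, 2))
```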
&lt;p>&lt;strong>7. Heterogeneity test&lt;/strong>.
A formal $\chi^2$ test that $\tau(\mathbf{x})$ varies with $\mathbf{x}$. The null is constant treatment effects: every household responds the same way. Rejection licenses the CATE / GATE / GATES interpretation.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>After &lt;code>cate&lt;/code>, &lt;code>estat heterogeneity&lt;/code> returns χ²(1) = 4.11 (p = 0.043) for PO and χ²(1) = 5.54 (p = 0.019) for AIPW. Both reject the constant-effect null at conventional levels. The post&amp;rsquo;s heterogeneity story has formal backing.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A metal detector for hidden moderation. It does not tell you &lt;em>where&lt;/em> in the field the metal is buried. It only tells you whether to keep digging.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>8. Doubly robust property&lt;/strong>.
A property of AIPW (and other DR estimators). The estimator stays consistent for the ATE if &lt;strong>either&lt;/strong> the outcome model is correctly specified &lt;strong>or&lt;/strong> the propensity model is correctly specified. Both right is gravy. Only one right is enough. Both wrong is the only failure mode.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>This is why AIPW (\$8,120) is given more weight in our discussion than IPW alone would be. Even if one of our lasso nuisance models is mis-specified, AIPW still recovers the truth as long as the other one is right.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Belt and suspenders. If the belt fails, the suspenders hold. If the suspenders fail, the belt holds. Two failures simultaneously? Time to buy new pants.&lt;/p>
&lt;/details>
&lt;/div>
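&lt;p>The double-robustness claim can be checked numerically. The sketch below (Python, with an assumed data-generating process) feeds the AIPW score a deliberately wrong outcome model (predictions of zero for both arms) together with the true propensity score, and the average still lands on the true ATE.&lt;/p>
&lt;pre>&lt;code class="language-python"># Double robustness, sketched (assumed DGP): a deliberately wrong outcome
# model (predicting zero) plus the true propensity score still gives the ATE.
import random
random.seed(2)

n, tau = 50000, 5.0
x = [random.choice([0, 1]) for _ in range(n)]
p = [0.3 + 0.4 * xi for xi in x]                  # true propensity score
d = [0 if random.random() > pi else 1 for pi in p]
y = [tau * di + 10.0 * xi + random.gauss(0, 1) for di, xi in zip(d, x)]

# AIPW score with both outcome-model predictions forced to zero
scores = [di * yi / pi - (1 - di) * yi / (1 - pi)
          for di, yi, pi in zip(d, y, p)]
ate_hat = sum(scores) / n
print(round(ate_hat, 1))   # close to the true ATE of 5
&lt;/code>&lt;/pre>
&lt;p>Swap in a correct outcome model and a wrong propensity instead and the average still recovers the ATE; break both and it does not.&lt;/p>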
&lt;hr>
&lt;h2 id="2-the-dataset-401k-eligibility-and-household-assets">2. The dataset: 401(k) eligibility and household assets&lt;/h2>
&lt;p>We use &lt;code>assets3&lt;/code>, an excerpt from Chernozhukov &amp;amp; Hansen (2004) shipped with Stata 19. Each row is one household. The outcome is total net financial assets in dollars; the treatment is whether the household head&amp;rsquo;s employer offers a 401(k) plan (i.e. eligibility, not actual participation). The economic question is whether eligibility on its own — independent of contribution choices — increases retirement wealth, and the standard concern is that eligible workers differ systematically from ineligible workers (they earn more, are older, work for larger employers).&lt;/p>
&lt;p>We load the data, declare which variables describe the heterogeneity we care about, and inspect the basic descriptive stats:&lt;/p>
&lt;pre>&lt;code class="language-stata">webuse assets3, clear
* Define the heterogeneity-of-interest covariates and (for this tutorial)
* the same set as nuisance controls.
global catecovars age educ i.incomecat i.pension i.married i.twoearn i.ira i.ownhome
global controls age educ i.incomecat i.pension i.married i.twoearn i.ira i.ownhome
global rseed 12345671
describe asset e401k age educ income incomecat pension married twoearn ira ownhome
summarize asset e401k age educ income, detail
tab e401k, missing
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------
assets float %9.0g Net total financial assets
e401k byte %12.0g lbe401 401(k) eligibility
age byte %9.0g Age
educ byte %9.0g Years of education
income float %9.0g Household income
incomecat byte %9.0g Income category
pension byte %16.0g lbpen Pension benefits
married byte %11.0g lbmar Marital status
twoearn byte %9.0g lbyes Two-earner household
ira byte %9.0g lbyes IRA participation
ownhome byte %9.0g lbyes Homeowner
401(k) |
eligibility | Freq. Percent Cum.
-------------+-----------------------------------
Not eligible | 6,231 62.86 62.86
Eligible | 3,682 37.14 100.00
Total | 9,913 100.00
&lt;/code>&lt;/pre>
&lt;p>The dataset contains &lt;strong>9,913 households&lt;/strong>, of which &lt;strong>3,682 (37.1%) are eligible&lt;/strong> for a 401(k) and &lt;strong>6,231 (62.9%) are not&lt;/strong>. The asset distribution is extraordinarily right-skewed — mean \$18,054 against a median of just \$1,499, with a maximum of \$1.5 million and a minimum of −\$502,302 (households with negative net worth). Income, age, and education show much milder skew. Three key features matter for what follows: the treatment is roughly balanced (37% vs 63%, plenty of overlap on average), the outcome has heavy tails (so the treatment effect almost certainly varies across the distribution), and we have a rich set of demographic covariates to condition on.&lt;/p>
&lt;hr>
&lt;h2 id="3-the-naive-view-and-why-it-fails">3. The naive view (and why it fails)&lt;/h2>
&lt;p>Before reaching for any causal estimator it is healthy to look at the raw mean difference. If &lt;code>e401k&lt;/code> were randomly assigned the comparison would be the ATE. It isn&amp;rsquo;t — eligibility is a function of who chooses what employer — so the raw difference is biased. Showing this gap explicitly motivates everything that follows.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Raw means by eligibility
tabstat asset, by(e401k) statistics(mean sd n)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Summary for variables: assets
Group variable: e401k (401(k) eligibility)
e401k | Mean SD N
-------------+------------------------------
Not eligible | 10789.9 54527.02 6231
Eligible | 30347.39 74800.21 3682
-------------+------------------------------
Total | 18054.17 63528.63 9913
&lt;/code>&lt;/pre>
&lt;p>Eligible households hold an average of &lt;strong>\$30,347&lt;/strong> in net financial assets versus &lt;strong>\$10,790&lt;/strong> for ineligible ones — a raw gap of &lt;strong>\$19,557&lt;/strong>. If we believed in random assignment we would call that the average effect of eligibility. But eligible workers are systematically different: they tend to be older, more educated, and earn substantially more. Some of that \$19,557 is causal, but a meaningful share is just selection. The next section pins down how much of the gap is causal once we adjust for those covariates.&lt;/p>
&lt;hr>
&lt;h2 id="4-a-first-ate-parametric-teffects-aipw">4. A first ATE: parametric &lt;code>teffects aipw&lt;/code>&lt;/h2>
&lt;p>Stata&amp;rsquo;s mature &lt;code>teffects&lt;/code> suite already supports doubly robust ATE estimation with parametric models. We use it here as a familiar, fast benchmark before introducing the new &lt;code>cate&lt;/code> command. The estimand is&lt;/p>
&lt;p>$$\text{ATE} = E\{y(1) - y(0)\}$$&lt;/p>
&lt;p>In words, this is the &lt;em>average&lt;/em> treatment effect across all households in the population. The augmented inverse-probability weighting (AIPW) estimator is doubly robust: it returns the right ATE if either the outcome model or the propensity score model is correctly specified — we don&amp;rsquo;t need both.&lt;/p>
&lt;pre>&lt;code class="language-stata">teffects aipw ///
(asset c.age c.educ i.incomecat i.pension i.married i.twoearn i.ira i.ownhome) ///
(e401k c.age c.educ i.incomecat i.pension i.married i.twoearn i.ira i.ownhome)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects estimation Number of obs = 9,913
Estimator : augmented IPW
Outcome model : linear by ML
Treatment model: logit
------------------------------------------------------------------------------
             |              Robust
      assets | Coefficient  std. err.      z    P&amp;gt;|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
e401k |
(Eligible |
vs |
Not elig..) | 8019.463 1152.038 6.96 0.000 5761.51 10277.42
-------------+----------------------------------------------------------------
POmean |
e401k |
Not eligi.. | 13930.46 817.613 17.04 0.000 12327.97 15532.96
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The doubly robust ATE is &lt;strong>\$8,019&lt;/strong> with a 95% confidence interval of &lt;strong>[\$5,762, \$10,277]&lt;/strong> — about 58% above the average baseline assets of ineligible households (\$13,930). The naive raw gap (\$19,557) was therefore inflated by a factor of 2.4: roughly 60% of the observed asset gap between eligible and ineligible households is selection — they would have held more assets even without the program — and only 40% is the causal effect of eligibility. That said, \$8,019 is still just &lt;em>one&lt;/em> number. The cross-tabulation of mean assets by income category and eligibility (which we computed but suppressed for length here, see &lt;code>analysis.log&lt;/code>) shows differences ranging from \$5,011 in the lowest income category to \$20,949 in the highest — a 4× spread that the ATE flattens out. That spread is the CATE we now estimate properly.&lt;/p>
&lt;hr>
&lt;h2 id="5-the-cate-definition-model-and-the-cate-command">5. The CATE: definition, model, and the &lt;code>cate&lt;/code> command&lt;/h2>
&lt;p>The Conditional Average Treatment Effect at covariate value $x$ is defined as&lt;/p>
&lt;p>$$\tau(\mathbf{x}) = E\{y(1) - y(0) \mid \mathbf{X} = \mathbf{x}\}$$&lt;/p>
&lt;p>In words, this says: among all households whose covariates are $\mathbf{x}$, what is their &lt;em>average&lt;/em> treatment effect? The CATE is a &lt;em>function&lt;/em> of covariates, not a single number. If $\tau(\mathbf{x})$ happened to be constant, we&amp;rsquo;d be back at the ATE. Whenever it varies, the ATE is an average of these subgroup effects weighted by how common each $\mathbf{x}$ is in the data.&lt;/p>
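&lt;p>A two-line numerical sketch of that averaging, with hypothetical group shares and effects:&lt;/p>
&lt;pre>&lt;code class="language-python"># The ATE as a weighted average of group CATEs (hypothetical numbers)
share = {0: 0.6, 1: 0.4}          # how common each covariate group is
cate  = {0: 4000.0, 1: 14000.0}   # tau(x) within each group, in dollars

ate = sum(share[g] * cate[g] for g in share)
print(round(ate))   # 8000
&lt;/code>&lt;/pre>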
&lt;p>To estimate $\tau(\mathbf{x})$ Stata 19 offers two model specifications:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Partial-linear (PO) model.&lt;/strong> Assumes the outcome can be written as&lt;/li>
&lt;/ol>
&lt;p>$$y = d \cdot \tau(\mathbf{x}) + g(\mathbf{x}, \mathbf{w}) + \epsilon, \qquad d = f(\mathbf{x}, \mathbf{w}) + u$$&lt;/p>
&lt;p>In words, the outcome is the treatment $d$ times the per-household effect $\tau(\mathbf{x})$, plus a flexible function $g$ of all covariates, plus noise; and the treatment itself is a flexible function $f$ of those covariates plus its own noise. PO partials out $g$ and $f$ using out-of-sample predictions (cross-fitting), then fits a generalized random forest on the residuals to recover $\tau(\mathbf{x})$. PO is the more robust choice when propensity scores can get close to 0 or 1.&lt;/p>
&lt;ol start="2">
&lt;li>&lt;strong>Fully interactive (AIPW) model.&lt;/strong> Assumes $y(1) = g_1(\mathbf{x}, \mathbf{w}) + \epsilon_1$ and $y(0) = g_0(\mathbf{x}, \mathbf{w}) + \epsilon_0$ — separate outcome models for treated and untreated households — and combines them with the propensity score to form the doubly-robust AIPW score (Section 9). AIPW is more efficient (narrower CIs) when both models are well-specified, but more sensitive to extreme propensities.&lt;/li>
&lt;/ol>
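&lt;p>To make the partialing-out recipe concrete, here is a minimal numerical sketch in Python (not Stata&amp;rsquo;s implementation; the data-generating process and the plain-OLS stand-ins for the lasso nuisance fits are assumptions for illustration): residualize the outcome and the treatment against the covariate, then regress residual on residual.&lt;/p>
&lt;pre>&lt;code class="language-python"># Partialing out, sketched (assumed DGP; OLS stands in for the lasso fits)
import random
random.seed(1)

def slope(x, y):
    # OLS slope of y on x (with intercept)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / sum((a - mx) ** 2 for a in x)

n, tau = 20000, 2.0
x = [random.gauss(0, 1) for _ in range(n)]
d = [xi + random.gauss(0, 1) for xi in x]         # treatment depends on x
y = [tau * di + 3.0 * xi + random.gauss(0, 1)     # outcome depends on both
     for di, xi in zip(d, x)]

# Stage 1: partial the covariate out of both outcome and treatment
bd, by = slope(x, d), slope(x, y)
mx, md, my = sum(x) / n, sum(d) / n, sum(y) / n
rd = [di - md - bd * (xi - mx) for di, xi in zip(d, x)]
ry = [yi - my - by * (xi - mx) for yi, xi in zip(y, x)]

# Stage 2: regress residual on residual to recover tau
tau_hat = slope(rd, ry)
print(round(tau_hat, 2))   # close to the true tau of 2.0
&lt;/code>&lt;/pre>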
&lt;p>We start with PO. The variables in &lt;code>$catecovars&lt;/code> are the inputs to $\tau(\mathbf{x})$ — the dimensions on which we want to see heterogeneity — and the &lt;code>controls&lt;/code> (left at the default, which equals &lt;code>catecovars&lt;/code>) are passed to the nuisance models $g$ and $f$. The &lt;code>rseed()&lt;/code> option fixes the cross-fitting and random-forest internals so the run is reproducible.&lt;/p>
&lt;pre>&lt;code class="language-stata">cate po (asset $catecovars) (e401k), rseed($rseed)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Conditional average treatment effects Number of observations = 9,913
Estimator: Partialing out Number of folds in cross-fit = 10
Outcome model: Linear lasso Number of outcome controls = 17
Treatment model: Logit lasso Number of treatment controls = 17
CATE model: Random forest Number of CATE variables = 17
------------------------------------------------------------------------------
| Robust
assets | Coefficient std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ATE |
e401k |
(Eligible |
vs |
Not elig..) | 7937.182 1153.017 6.88 0.000 5677.309 10197.05
------------------------------------------------------------------------------
&lt;/code>&lt;/pre>
&lt;p>The PO ATE — averaged over the estimated $\hat{\tau}(\mathbf{x}_i)$ across the sample — is &lt;strong>\$7,937&lt;/strong> with a 95% CI of &lt;strong>[\$5,677, \$10,197]&lt;/strong>. That&amp;rsquo;s within \$80 of the parametric &lt;code>teffects aipw&lt;/code> ATE in the previous section, even though &lt;code>cate po&lt;/code> is doing something fundamentally different under the hood (cross-fit lasso for the nuisance models, causal forest for the IATE). When two very different estimators agree on the average, you can trust that average — and you can move on to looking at the heterogeneity.&lt;/p>
&lt;h3 id="51-is-there-heterogeneity-at-all-estat-heterogeneity">5.1 Is there heterogeneity at all? &lt;code>estat heterogeneity&lt;/code>&lt;/h3>
&lt;p>Before exploring how $\tau(\mathbf{x})$ varies, it is worth asking whether it varies. The &lt;code>estat heterogeneity&lt;/code> command tests the null hypothesis that $\tau(\mathbf{x})$ is constant — that there is, in fact, no heterogeneity to study.&lt;/p>
&lt;pre>&lt;code class="language-stata">estat heterogeneity
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects heterogeneity test
H0: Treatment effects are homogeneous
chi2(1) = 4.11
Prob &amp;gt; chi2 = 0.0427
&lt;/code>&lt;/pre>
&lt;p>The test rejects homogeneity at the 5% level: &lt;strong>χ²(1) = 4.11, p = 0.043&lt;/strong>. In plain English: the data have enough information to distinguish the estimated CATE function $\hat{\tau}(\mathbf{x})$ from a constant. The rest of this tutorial is therefore not a hunt for noise — there is real heterogeneity, and the next sections describe what shape it takes.&lt;/p>
&lt;h3 id="52-who-responds-most-estat-projection">5.2 Who responds most? &lt;code>estat projection&lt;/code>&lt;/h3>
&lt;p>A causal forest fits $\hat{\tau}(\mathbf{x})$ flexibly, but a flexible function is hard to summarize in a paragraph. &lt;code>estat projection&lt;/code> regresses $\hat{\tau}_i$ on the covariates linearly. The coefficients are not causal (they&amp;rsquo;re a &lt;em>projection&lt;/em> of an already-estimated nonlinear function onto a linear basis), but they answer the practical question &amp;ldquo;which variables shift the predicted effect, and by how much?&amp;rdquo;.&lt;/p>
&lt;pre>&lt;code class="language-stata">estat projection $catecovars
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Treatment-effects linear projection Number of obs = 9,913
F(11, 9901) = 4.90
Prob &amp;gt; F = 0.0000
age | 205.12 117.98 1.74 0.082 -26.15 436.39
educ | -442.46 488.47 -0.91 0.365 -1399.96 515.05
incomecat 1 | -2439.22 2013.52 -1.21 0.226 -6386.14 1507.69
incomecat 2 | 1874.82 2295.16 0.82 0.414 -2624.15 6373.79
incomecat 3 | 5707.69 3298.34 1.73 0.084 -757.73 12173.11
incomecat 4 | 18194.60 5398.39 3.37 0.001 7612.65 28776.54
pension Y | 3817.36 2454.44 1.56 0.120 -993.84 8628.55
ownhome Y | 3162.65 1669.59 1.89 0.058 -110.08 6435.38
&lt;/code>&lt;/pre>
&lt;p>The single dominant signal is income. Relative to households in the lowest income category, those in the highest income category have a predicted effect that is &lt;strong>\$18,195 higher&lt;/strong> (p = 0.001) — the only coefficient significant at the 1% level. Homeownership lifts the predicted effect by another \$3,163 (p = 0.058) and each additional year of age adds \$205 (p = 0.082); both are borderline. Education, marriage, two-earner status, and IRA participation are essentially flat. The low R² of 0.0045 is not a weakness of the projection — it tells us most of the heterogeneity is genuinely nonlinear (curvature that the random forest captures and a linear projection cannot). The rest of the post zooms into where that nonlinearity lives.&lt;/p>
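&lt;p>Mechanically, the projection is just an OLS of the fitted per-household effects on the covariates. A minimal sketch with made-up numbers (the &lt;code>age&lt;/code> and &lt;code>tau_hat&lt;/code> values below are illustrative, not from the dataset):&lt;/p>
&lt;pre>&lt;code class="language-python"># A linear projection is OLS of the fitted effects on a covariate
# (illustrative numbers only)
def ols_slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / sum((a - mx) ** 2 for a in x)

age     = [25, 30, 35, 40, 45, 50, 55, 60]
tau_hat = [1000, 2500, 3800, 5200, 6400, 8100, 9500, 11000]

beta_age = ols_slope(age, tau_hat)
print(round(beta_age))   # predicted-effect dollars per year of age
&lt;/code>&lt;/pre>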
&lt;hr>
&lt;h2 id="6-the-shape-of-individual-level-heterogeneity">6. The shape of individual-level heterogeneity&lt;/h2>
&lt;p>Before slicing the CATE by groups, it helps to look at the distribution of household-level effects. &lt;code>categraph histogram&lt;/code> plots the predicted $\hat{\tau}_i$ for every household in the sample.&lt;/p>
&lt;pre>&lt;code class="language-stata">categraph histogram, ///
title(&amp;quot;Distribution of individual treatment effects (PO)&amp;quot;) ///
xtitle(&amp;quot;Estimated tau_hat_i (dollars)&amp;quot;) ///
note(&amp;quot;Source: assets3, Stata 19 cate po&amp;quot;)
graph export &amp;quot;stata_cate_iate_histogram_po.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate_iate_histogram_po.png" alt="Histogram of PO-estimated individual treatment effects across 9,913 households">&lt;/p>
&lt;p>The distribution is &lt;strong>strongly right-skewed&lt;/strong>. Most households cluster around a modest positive effect (the bulk of the mass sits near \$5,000–\$10,000), but a long right tail extends to \$80,000 and beyond. A small left tail dips into negative territory: a meaningful minority of households are estimated to gain little or nothing from 401(k) eligibility. This is the visual answer to &amp;ldquo;is the average hiding something?&amp;rdquo; — the average of \$7,937 is genuinely close to the median, but the spread on either side is huge. The next two views — IATE plots and GATE — describe &lt;em>who&lt;/em> sits in the right tail.&lt;/p>
&lt;h3 id="61-how-does-the-effect-vary-with-one-covariate-iate-plots">6.1 How does the effect vary with one covariate? IATE plots&lt;/h3>
&lt;p>The &lt;code>categraph iateplot&lt;/code> command holds all covariates except one fixed at sample-mean (continuous) or base (factor) values, and varies the one covariate of interest. The result is a slice through the multi-dimensional CATE function with confidence bands.&lt;/p>
&lt;pre>&lt;code class="language-stata">categraph iateplot age, ///
title(&amp;quot;Estimated CATE by age&amp;quot;) ///
ytitle(&amp;quot;tau_hat (dollars)&amp;quot;) xtitle(&amp;quot;Age (years)&amp;quot;)
graph export &amp;quot;stata_cate_iateplot_age.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate_iateplot_age.png" alt="Estimated CATE by age, holding all other covariates at means or base values">&lt;/p>
&lt;p>The age slice is broadly increasing. Younger workers (mid-20s to early 30s) have small or even slightly negative predicted effects; the line crosses into clearly positive territory around age 35–40 and continues climbing through the 50s. The intuition is straightforward: 401(k) eligibility is most valuable to workers with the financial slack and the planning horizon to take advantage of tax-deferred saving. Confidence bands narrow in the middle of the age range where most of the data lives and widen at the extremes.&lt;/p>
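&lt;p>The slicing logic behind &lt;code>iateplot&lt;/code> can be sketched directly (the effect function and the sample means below are hypothetical stand-ins for the fitted forest): vary one covariate, hold the rest fixed.&lt;/p>
&lt;pre>&lt;code class="language-python"># An IATE slice (hypothetical stand-in for the fitted forest): vary age,
# hold the other covariates at assumed sample means
def tau_hat(age, educ, income):
    return 200.0 * (age - 40) + 50.0 * educ + 0.2 * income

mean_educ, mean_income = 13.2, 37200.0   # assumed sample means

slice_by_age = [(a, tau_hat(a, mean_educ, mean_income))
                for a in range(25, 66, 10)]
for a, t in slice_by_age:
    print(a, round(t))   # this hypothetical slice rises with age
&lt;/code>&lt;/pre>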
&lt;p>The same exercise with education looks rather different:&lt;/p>
&lt;pre>&lt;code class="language-stata">categraph iateplot educ, ///
title(&amp;quot;Estimated CATE by years of education&amp;quot;) ///
ytitle(&amp;quot;tau_hat (dollars)&amp;quot;) xtitle(&amp;quot;Education (years)&amp;quot;)
graph export &amp;quot;stata_cate_iateplot_educ.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate_iateplot_educ.png" alt="Estimated CATE by years of education">&lt;/p>
&lt;p>The education slice is &lt;strong>broadly flat&lt;/strong> at around \$1,000–\$3,000 across the entire range from 8 to 18 years of schooling. This is consistent with the linear projection (where the education coefficient was small and not significant). It is also a useful negative finding — once you condition on income, education adds little to the predicted effect.&lt;/p>
&lt;hr>
&lt;h2 id="7-group-level-effects-gate-on-prespecified-groups">7. Group-level effects: GATE on prespecified groups&lt;/h2>
&lt;p>Individual-level $\hat{\tau}_i$ is informative but noisy. A common practice is to summarize them by &lt;em>groups&lt;/em> — either prespecified (income category, region, education tier) or data-driven (top vs bottom quartile of predicted effect). Stata&amp;rsquo;s GATE and GATES estimators are the formal versions of these two strategies.&lt;/p>
&lt;p>The Group ATE (GATE) on a prespecified group $g$ is&lt;/p>
&lt;p>$$\tau(g) = E\{\Gamma_i \mid G_i = g\}$$&lt;/p>
&lt;p>In words, this says: the average AIPW orthogonal score $\Gamma_i$ within group $g$ — i.e., the doubly robust per-household effect score, averaged over households assigned to that group. We compute it on the income categories &lt;code>incomecat&lt;/code>. The clever bit is &lt;code>reestimate&lt;/code>: after running &lt;code>cate po&lt;/code> once, we tell Stata to recycle the fitted IATE function and just recompute group means, saving a slow second causal-forest fit.&lt;/p>
&lt;pre>&lt;code class="language-stata">cate, group(incomecat) reestimate
estat gatetest
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">GATE | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
incomecat |
0 | 4087.014 987.7124 4.14 0.000 2151.13 6022.90
1 | 1399.398 1663.193 0.84 0.400 -1860.40 4659.20
2 | 5154.329 1349.842 3.82 0.000 2508.69 7799.97
3 | 8532.238 2287.664 3.73 0.000 4048.50 13015.98
4 | 20510.94 4723.741 4.34 0.000 11252.58 29769.30
Group treatment-effects heterogeneity test
H0: Group average treatment effects are homogeneous
chi2(4) = 18.44
Prob &amp;gt; chi2 = 0.0010
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">categraph gateplot, ///
title(&amp;quot;GATE by income category&amp;quot;) ///
ytitle(&amp;quot;tau_hat (dollars)&amp;quot;) xtitle(&amp;quot;Income category (0 = low, 4 = high)&amp;quot;)
graph export &amp;quot;stata_cate_gate_incomecat.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate_gate_incomecat.png" alt="GATE by income category, with 95% confidence bands">&lt;/p>
&lt;p>The five group-level effects span an order of magnitude: \$4,087 (lowest income), \$1,399 (income category 1, not significant at p = 0.40), \$5,154, \$8,532, and &lt;strong>\$20,511&lt;/strong> in the highest income category — roughly 2.6 times the overall average and five times the lowest-income group&amp;rsquo;s effect. The joint test of equality (&lt;code>estat gatetest&lt;/code>) rejects strongly: &lt;strong>χ²(4) = 18.44, p = 0.001&lt;/strong>. There is one mild departure from monotonicity at category 1, which is interesting but lies just within sampling variability (its CI overlaps zero). Two important policy facts emerge: the marginal household in the top income category gains an average of about \$20,500 from 401(k) eligibility, and the marginal household at the bottom of the distribution gains about \$4,000 — but the middle-low (category 1) gains effectively nothing.&lt;/p>
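&lt;p>Mechanically, each GATE is just the mean of the per-household doubly robust scores within its group. A toy sketch (all numbers invented):&lt;/p>
&lt;pre>&lt;code class="language-python"># GATE mechanics (invented numbers): mean of the doubly robust scores
# within each prespecified group
scores = [3000.0, 5000.0, 4000.0, 18000.0, 22000.0, 20000.0]
group  = [0, 0, 0, 1, 1, 1]               # e.g. low vs high income

gate = {}
for g in sorted(set(group)):
    vals = [s for s, gi in zip(scores, group) if gi == g]
    gate[g] = sum(vals) / len(vals)

print(gate)   # {0: 4000.0, 1: 20000.0}
&lt;/code>&lt;/pre>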
&lt;hr>
&lt;h2 id="8-data-driven-groups-gates-on-quartiles-of-hattau">8. Data-driven groups: GATES on quartiles of $\hat{\tau}$&lt;/h2>
&lt;p>GATE on prespecified groups is principled but presupposes that the analyst already knows which groups matter. &lt;strong>GATES&lt;/strong> (&amp;ldquo;Group Average Treatment Effects Sorted&amp;rdquo;) flips this around: it lets the data sort households by their predicted effect, bins them into quantiles, and reports the mean effect within each bin. Cross-fitting protects against p-hacking — each unit&amp;rsquo;s bin is determined by an out-of-sample prediction, so observations cannot leak their own outcomes into their bin assignment.&lt;/p>
&lt;pre>&lt;code class="language-stata">cate po (asset $catecovars) (e401k), rseed($rseed) group(4)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">GATES | Coefficient Std. err. z P&amp;gt;|z| [95% conf. interval]
rank |
1 | 17278.94 3440.125 5.02 0.000 10536.42 24021.46
2 | 8121.04 1691.008 4.80 0.000 4806.73 11435.35
3 | 3443.83 1437.640 2.40 0.017 626.11 6261.56
4 | 2919.20 2110.320 1.38 0.167 -1216.96 7055.35
ATE | 7938.21 1152.994 6.88 0.000 5678.38 10198.04
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">categraph gateplot, ///
title(&amp;quot;GATES by data-driven quartile of estimated effect&amp;quot;) ///
ytitle(&amp;quot;tau_hat (dollars)&amp;quot;) xtitle(&amp;quot;Quartile (1 = highest tau_hat, 4 = lowest)&amp;quot;)
graph export &amp;quot;stata_cate_gates_quartiles.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate_gates_quartiles.png" alt="GATES by data-driven quartile of the estimated treatment effect">&lt;/p>
&lt;p>The data-driven ladder is &lt;strong>clean and monotonic&lt;/strong>: the top quartile gains an average of &lt;strong>\$17,279&lt;/strong> (CI \$10,536–\$24,021), the second \$8,121, the third \$3,444, and the bottom &lt;strong>\$2,919&lt;/strong> — and the bottom quartile is &lt;em>not&lt;/em> statistically distinguishable from zero (p = 0.167). The top-to-bottom ratio is &lt;strong>5.9×&lt;/strong>. This is the single most informative summary of heterogeneity in the dataset because the bins are constructed by the data itself rather than by a researcher choice. Roughly one in four households in the sample appears to gain little or nothing from 401(k) eligibility, while another quarter gains over twice the average effect.&lt;/p>
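&lt;p>The sorting-and-binning mechanics of GATES can be sketched as follows (the out-of-fold predictions and scores are simulated; the bins here run from lowest to highest predicted effect, the reverse of the rank-1-is-highest convention in the table above):&lt;/p>
&lt;pre>&lt;code class="language-python"># GATES mechanics (simulated): bin units by out-of-fold predicted effect,
# then average the doubly robust scores per bin (lowest bin first here)
import random
random.seed(3)

n = 4000
tau_oof = [random.gauss(8000, 5000) for _ in range(n)]   # out-of-fold predictions
scores  = [t + random.gauss(0, 2000) for t in tau_oof]   # noisy per-unit scores

ranked = sorted(range(n), key=lambda i: tau_oof[i])
q = n // 4
gates = [sum(scores[i] for i in ranked[b * q:(b + 1) * q]) / q
         for b in range(4)]
print([round(g) for g in gates])   # increasing across the four bins
&lt;/code>&lt;/pre>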
&lt;h3 id="81-who-is-in-the-top-vs-the-bottom-quartile-estat-classification">8.1 Who is in the top vs the bottom quartile? &lt;code>estat classification&lt;/code>&lt;/h3>
&lt;p>The data sorted itself; now we can ask what makes the top quartile different. &lt;code>estat classification&lt;/code> runs a two-sample t-test for one variable at a time, comparing its mean in the top-effect rank group against its mean in the bottom-effect rank group.&lt;/p>
&lt;pre>&lt;code class="language-stata">estat classification age
estat classification educ
estat classification income
&lt;/code>&lt;/pre>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Variable&lt;/th>
&lt;th style="text-align:right">Top quartile (n=2,480)&lt;/th>
&lt;th style="text-align:right">Bottom quartile (n=2,471)&lt;/th>
&lt;th style="text-align:right">Difference&lt;/th>
&lt;th style="text-align:right">t-statistic&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Age (years)&lt;/td>
&lt;td style="text-align:right">45.15&lt;/td>
&lt;td style="text-align:right">34.98&lt;/td>
&lt;td style="text-align:right">10.17&lt;/td>
&lt;td style="text-align:right">35.67&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Education (years)&lt;/td>
&lt;td style="text-align:right">14.02&lt;/td>
&lt;td style="text-align:right">12.65&lt;/td>
&lt;td style="text-align:right">1.37&lt;/td>
&lt;td style="text-align:right">18.62&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Income (\$)&lt;/td>
&lt;td style="text-align:right">62,739&lt;/td>
&lt;td style="text-align:right">26,861&lt;/td>
&lt;td style="text-align:right">35,878&lt;/td>
&lt;td style="text-align:right">56.22&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The high-effect quartile is sharply different from the low-effect quartile on every dimension: about &lt;strong>10 years older&lt;/strong> on average (45.1 vs 35.0), with &lt;strong>1.4 more years of education&lt;/strong> (14.0 vs 12.7), and &lt;strong>\$35,878 higher household income&lt;/strong> (\$62,739 vs \$26,861). All three differences are huge in t-statistic terms (36, 19, 56). Income is the dominant marker — exactly what the linear projection and the GATE-by-income picture already suggested. The story behind the numbers: 401(k) eligibility helps people who already have the financial slack and time-horizon to actually use it, and a substantial minority of the population has neither.&lt;/p>
&lt;hr>
&lt;h2 id="9-aipw-a-doubly-robust-contrast">9. AIPW: a doubly-robust contrast&lt;/h2>
&lt;p>So far we have used the partialing-out estimator. The fully interactive AIPW estimator fits separate outcome models for treated and untreated households and combines them with the propensity score via the AIPW orthogonal score:&lt;/p>
&lt;p>$$\Gamma_i = \left[\hat{y}(1)_i + \frac{d_i \, \{y_i - \hat{y}(1)_i\}}{\hat{f}_i}\right] - \left[\hat{y}(0)_i + \frac{(1-d_i) \, \{y_i - \hat{y}(0)_i\}}{1-\hat{f}_i}\right]$$&lt;/p>
&lt;p>In words, this says: the doubly robust per-household effect score is the predicted treated outcome minus the predicted untreated outcome, each corrected by an inverse-propensity weighted residual. It is &amp;ldquo;doubly robust&amp;rdquo; because it stays consistent if &lt;em>either&lt;/em> the outcome models OR the propensity-score model is correct — you only need to get one of them right. The cost is sensitivity to extreme propensities: if some households have $\hat{f}_i$ close to 0 or 1 the inverse weights blow up.&lt;/p>
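&lt;p>Before running the command, it may help to trace the score formula for a single hypothetical household (all numbers assumed):&lt;/p>
&lt;pre>&lt;code class="language-python"># Tracing the AIPW score for one hypothetical household
def aipw_score(y, d, y1_hat, y0_hat, f_hat):
    treated   = y1_hat + d * (y - y1_hat) / f_hat
    untreated = y0_hat + (1 - d) * (y - y0_hat) / (1 - f_hat)
    return treated - untreated

# a treated household: the observed outcome corrects the treated arm only
g = aipw_score(y=32000.0, d=1, y1_hat=28000.0, y0_hat=21000.0, f_hat=0.4)
print(round(g))   # 17000
&lt;/code>&lt;/pre>
&lt;p>With &lt;code>d = 1&lt;/code>, the untreated arm keeps its raw prediction while the treated arm gets an inverse-propensity-scaled correction — which is also where a small &lt;code>f_hat&lt;/code> would blow the score up.&lt;/p>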
&lt;pre>&lt;code class="language-stata">cate aipw (asset $catecovars) (e401k), rseed($rseed)
estat heterogeneity
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">ATE |
e401k |
(Eligible |
vs |
Not elig..) | 8120.264 1160.538 7.00 0.000 5845.652 10394.88
Treatment-effects heterogeneity test
H0: Treatment effects are homogeneous
chi2(1) = 5.54
Prob &amp;gt; chi2 = 0.0186
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-stata">categraph histogram, ///
title(&amp;quot;Distribution of individual treatment effects (AIPW)&amp;quot;) ///
xtitle(&amp;quot;Estimated tau_hat_i (dollars)&amp;quot;) ///
note(&amp;quot;Source: assets3, Stata 19 cate aipw&amp;quot;)
graph export &amp;quot;stata_cate_iate_histogram_aipw.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate_iate_histogram_aipw.png" alt="Histogram of AIPW-estimated individual treatment effects">&lt;/p>
&lt;pre>&lt;code class="language-stata">categraph iateplot educ, ///
title(&amp;quot;Estimated CATE by education (AIPW)&amp;quot;) ///
ytitle(&amp;quot;tau_hat (dollars)&amp;quot;) xtitle(&amp;quot;Education (years)&amp;quot;)
graph export &amp;quot;stata_cate_iateplot_educ_aipw.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate_iateplot_educ_aipw.png" alt="AIPW-estimated CATE by education">&lt;/p>
&lt;p>The AIPW ATE is &lt;strong>\$8,120&lt;/strong> — within \$200 of both the parametric &lt;code>teffects aipw&lt;/code> ATE (\$8,019) and the PO ATE (\$7,937). The heterogeneity test now rejects more strongly (&lt;strong>χ²(1) = 5.54, p = 0.019&lt;/strong>) than under PO (p = 0.043), consistent with AIPW&amp;rsquo;s higher efficiency when both nuisance models are well-specified. The AIPW IATE histogram (Figure 6) has the same right-skewed shape as the PO histogram but a slightly wider support — AIPW puts more mass in the tails because of the inverse-propensity correction, which is the visual signature of the overlap-sensitivity warning above. The AIPW education slice (Figure 7) is essentially identical in shape to the PO version: a broadly flat profile around the average. Across estimators, the substantive story does not change.&lt;/p>
&lt;hr>
&lt;h2 id="10-the-smooth-income-gradient-estat-series">10. The smooth income gradient: &lt;code>estat series&lt;/code>&lt;/h2>
&lt;p>&lt;code>categraph iateplot&lt;/code> showed the CATE as a function of one variable with the others fixed at reference values. &lt;code>estat series&lt;/code> is a complementary view — it fits a flexible smoother (cubic B-spline by default) of the predicted effect against one continuous covariate, marginalizing over the joint distribution of the others. For continuous variables like income this gives the cleanest &amp;ldquo;dose-response&amp;rdquo; picture.&lt;/p>
&lt;pre>&lt;code class="language-stata">estat series income if income &amp;lt;= 150000, graph knots(5)
graph export &amp;quot;stata_cate_series_income.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Nonparametric series regression for IATE
Cubic B-spline estimation Number of obs = 9,884
Number of knots = 5
------------------------------------------------------------------------------
| Robust
| Effect std. err. z P&amp;gt;|z| [95% conf. interval]
-------------+----------------------------------------------------------------
income | .2131162 .0502993 4.24 0.000 .1145313 .311701
------------------------------------------------------------------------------
Note: Effect estimates are averages of derivatives.
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="stata_cate_series_income.png" alt="Cubic B-spline of estimated CATE against household income">&lt;/p>
&lt;p>The reported &amp;ldquo;Effect&amp;rdquo; is the &lt;strong>average derivative&lt;/strong> of the predicted treatment effect with respect to income: &lt;strong>0.213&lt;/strong> (SE 0.050, p &amp;lt; 0.001, 95% CI [0.115, 0.312]). Translated into dollars: &lt;strong>each additional \$1,000 of household income raises the predicted 401(k) treatment effect by about \$213 on average&lt;/strong>. The B-spline fit (Figure 8) reveals that this derivative is not constant — the slope is steepest in the middle of the income distribution and flatter at both ends — which is why a single linear-projection coefficient (\$18,195 for the highest income category) only partially captured the gradient. The series view smooths over the binning entirely.&lt;/p>
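&lt;p>As a sanity check on the average-derivative reading, one can pull the individual CATE predictions back into the data and fit a plain linear slope over the same income range. This is a hedged sketch: the &lt;code>predict&lt;/code> behavior after &lt;code>cate&lt;/code> and the variable name &lt;code>iate_hat&lt;/code> are assumptions; see the &lt;code>cate&lt;/code> postestimation entry for the exact syntax.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Sketch: compare a crude linear slope of the predicted IATEs
* against the 0.213 average derivative reported by estat series
predict double iate_hat                               // assumed default: IATE predictions
regress iate_hat income if income &amp;lt;= 150000, vce(robust)
&lt;/code>&lt;/pre>
&lt;p>The linear slope weights observations differently and ignores curvature, so it will not match the average derivative exactly, but it should land in the same ballpark.&lt;/p>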
&lt;hr>
&lt;h2 id="11-putting-it-all-together-comparison-table">11. Putting it all together: comparison table&lt;/h2>
&lt;p>The three causal estimators we ran agree closely on the average and differ only marginally on the heterogeneity p-value; the naive raw difference is included for contrast:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Estimator&lt;/th>
&lt;th style="text-align:right">ATE&lt;/th>
&lt;th style="text-align:center">95% CI&lt;/th>
&lt;th style="text-align:center">Heterogeneity test&lt;/th>
&lt;th>Notes&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Naive raw difference&lt;/td>
&lt;td style="text-align:right">19,557&lt;/td>
&lt;td style="text-align:center">n/a&lt;/td>
&lt;td style="text-align:center">n/a&lt;/td>
&lt;td>Raw mean gap; mostly selection&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>teffects aipw&lt;/code> (parametric)&lt;/td>
&lt;td style="text-align:right">8,019&lt;/td>
&lt;td style="text-align:center">[5,762, 10,277]&lt;/td>
&lt;td style="text-align:center">—&lt;/td>
&lt;td>Mature, fast benchmark&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>cate po&lt;/code> (lasso + causal forest)&lt;/td>
&lt;td style="text-align:right">7,937&lt;/td>
&lt;td style="text-align:center">[5,677, 10,197]&lt;/td>
&lt;td style="text-align:center">χ²(1) = 4.11, p = 0.043&lt;/td>
&lt;td>Robust to extreme propensities&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;code>cate aipw&lt;/code> (lasso + causal forest, doubly robust)&lt;/td>
&lt;td style="text-align:right">8,120&lt;/td>
&lt;td style="text-align:center">[5,846, 10,395]&lt;/td>
&lt;td style="text-align:center">χ²(1) = 5.54, p = 0.019&lt;/td>
&lt;td>Most efficient; uses AIPW score&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The parametric and ML causal estimators agree on the ATE within a \$183 spread, and both ML estimators reject homogeneity at the 5% level. The naive raw difference of \$19,557 overstated the causal estimate by a factor of 2.4: about \$11,500 of the gap was selection.&lt;/p>
&lt;hr>
&lt;h2 id="12-discussion-answering-the-question">12. Discussion: answering the question&lt;/h2>
&lt;p>We opened with the question, &lt;em>for whom&lt;/em> does 401(k) eligibility increase financial assets? The eight figures and four estimators in this post answer it concretely:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>The average household gains about \$8,000.&lt;/strong> Across three ATE estimators that rest on very different modeling assumptions, the estimates fall in the narrow range of \$7,937 to \$8,120. The naive raw gap of \$19,557 overstated the causal effect by 2.4×.&lt;/li>
&lt;li>&lt;strong>But the average hides substantial heterogeneity.&lt;/strong> Both &lt;code>estat heterogeneity&lt;/code> tests reject a constant CATE at the 5% level; the GATE joint test rejects equality across income groups at p = 0.001; and the GATES quartile ladder spans \$2,919 (bottom quartile, not significant) to \$17,279 (top quartile) — a factor of 5.9.&lt;/li>
&lt;li>&lt;strong>Income is the dominant moderator.&lt;/strong> The linear projection coefficient on the highest income category is \$18,195 (p = 0.001). The smooth B-spline says each extra \$1,000 of income raises the effect by \$213 on average. The classification analysis says households in the top-effect quartile earn \$35,878 more on average than households in the bottom-effect quartile.&lt;/li>
&lt;li>&lt;strong>About a quarter of households gain little or nothing.&lt;/strong> The bottom GATES quartile cannot reject zero (p = 0.167), and a small left tail in both IATE histograms shows households with predicted effects close to zero or even slightly negative.&lt;/li>
&lt;li>&lt;strong>Age and homeownership matter at the margin.&lt;/strong> Older workers and homeowners gain more, but the effects are smaller and more uncertain than the income effect. Education and marital status are essentially flat once income is controlled for.&lt;/li>
&lt;/ul>
&lt;p>The &amp;ldquo;so what?&amp;rdquo; for policy: a 401(k) eligibility expansion targeted at low-income workers will have a much smaller per-capita asset effect than one targeted at high-income workers, but the lowest-income households still gain a real, statistically significant \$4,000 on average, so the program delivers meaningful benefits even there. A blanket analysis that ignores heterogeneity would systematically underestimate the gains to high earners and overestimate the gains to households in the second income decile.&lt;/p>
&lt;hr>
&lt;h2 id="13-summary-and-next-steps">13. Summary and next steps&lt;/h2>
&lt;p>&lt;strong>Method takeaways.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Stata 19&amp;rsquo;s &lt;code>cate&lt;/code> command unifies cross-fit ML nuisance estimation, doubly robust scores, causal-forest IATE estimation, and honest-tree inference into a single workflow. Two estimators (PO and AIPW) and seven postestimation views cover almost all practical heterogeneity questions.&lt;/li>
&lt;li>&lt;strong>PO&lt;/strong> (partialing-out) is more robust to extreme propensity scores; &lt;strong>AIPW&lt;/strong> is more efficient when both nuisance models are well specified. The two agree on the ATE in this dataset (\$7,937 vs \$8,120, a difference of 2.3%), which is a reassuring robustness check.&lt;/li>
&lt;li>The four heterogeneity views — &lt;code>estat heterogeneity&lt;/code>, &lt;code>estat projection&lt;/code>, GATE/GATES, and &lt;code>estat series&lt;/code> — answer different questions. A beginner should look at all of them rather than picking a favorite.&lt;/li>
&lt;/ul>
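&lt;p>For quick reference, the four views can be run back to back after a single &lt;code>cate&lt;/code> fit. This is a hedged recap: the grouping variable &lt;code>inc_group&lt;/code> is illustrative, and exact argument syntax may differ slightly from what is shown; check &lt;em>[CAUSAL] cate postestimation&lt;/em>.&lt;/p>
&lt;pre>&lt;code class="language-stata">* Sketch: the four heterogeneity views after one cate fit
estat heterogeneity                    // global test of a constant CATE
estat projection                       // linear projection of the CATE on moderators
cate, group(inc_group) reestimate      // GATEs for a user-defined grouping
estat series income, graph knots(5)    // smooth dose-response in one covariate
&lt;/code>&lt;/pre>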
&lt;p>&lt;strong>Data takeaways.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The 401(k) eligibility ATE on the assets3 sample is \$8,019 (SE ≈ \$1,150; 95% CI [\$5,762, \$10,277]).&lt;/li>
&lt;li>The CATE varies from \$1,399 (income category 1) to \$20,511 (highest income category) — a 15× spread.&lt;/li>
&lt;li>One in four households shows essentially no effect; one in four shows over twice the average.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Limitations.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The CATE is identified under unconfoundedness (no unmeasured confounders) given the rich set of demographic covariates. If, for instance, employer match rates differ systematically across the income distribution and we don&amp;rsquo;t observe match rates, that would bias the income gradient.&lt;/li>
&lt;li>The bootstrap-of-little-bags inference behind the IATE confidence bands relies on honest forest construction, in which separate subsamples determine the tree splits and the leaf estimates. With the default &lt;code>xfolds(10)&lt;/code> and the default forest settings, runtime is ≈9 minutes on Stata SE 19; StataNow MP cuts this by roughly 3×.&lt;/li>
&lt;li>We did not formally check propensity overlap in this post. As a follow-up, run &lt;code>teffects overlap&lt;/code> after the parametric AIPW or check &lt;code>estat osample&lt;/code> after the &lt;code>cate&lt;/code> command.&lt;/li>
&lt;/ul>
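&lt;p>The overlap check flagged in the limitations takes only a couple of commands. A hedged sketch follows, with the covariate list assumed from the variables used in this post and a hypothetical export filename:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Sketch: refit the parametric benchmark, then inspect propensity overlap
quietly teffects aipw (assets income age educ pension married twoearn ira ownhome) ///
    (e401k income age educ pension married twoearn ira ownhome)
teffects overlap                       // propensity-score densities by treatment group
graph export &amp;quot;stata_overlap_check.png&amp;quot;, replace width(1200)
&lt;/code>&lt;/pre>
&lt;p>Densities piling up near 0 or 1 for either group would warn that the inverse-propensity correction in AIPW is leaning on a few extreme observations.&lt;/p>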
&lt;p>&lt;strong>Next steps.&lt;/strong> Try &lt;code>cate aipw ..., omethod(rforest) tmethod(rforest) oob&lt;/code> for a fully nonparametric specification with out-of-bag inference (faster and more flexible than the lasso default). Or move to the &lt;code>lung&lt;/code> dataset shipped with Stata 19 and explore &lt;code>estat policyeval&lt;/code> to compare expected outcomes under hypothetical assignment policies (e.g., &amp;ldquo;treat only households with predicted positive effect&amp;rdquo;).&lt;/p>
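&lt;p>A hedged sketch of that follow-up specification; the two-equation syntax and the covariate list are assumptions, so verify both against &lt;em>[CAUSAL] cate&lt;/em> before running:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Sketch: fully nonparametric nuisance models with out-of-bag inference
cate aipw (assets income age educ pension married twoearn ira ownhome) ///
    (e401k), omethod(rforest) tmethod(rforest) oob rseed(12345)
estat heterogeneity                    // does the heterogeneity verdict survive?
&lt;/code>&lt;/pre>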
&lt;hr>
&lt;h2 id="14-exercises">14. Exercises&lt;/h2>
&lt;ol>
&lt;li>&lt;strong>Compare specifications.&lt;/strong> Re-run &lt;code>cate po&lt;/code> with &lt;code>omethod(rforest) tmethod(rforest)&lt;/code> (random-forest nuisance instead of lasso). How much do the GATE-by-income estimates change? Use &lt;code>oob&lt;/code> to speed up the run.&lt;/li>
&lt;li>&lt;strong>Build a custom group.&lt;/strong> Create a &amp;ldquo;high-effect candidate&amp;rdquo; indicator that is 1 if &lt;code>age &amp;gt; 40 &amp;amp; income &amp;gt; 50000 &amp;amp; ownhome == 1&lt;/code>, 0 otherwise. Run &lt;code>cate, group(high_eff_candidate) reestimate&lt;/code> and compare the two GATEs to the GATES top vs bottom quartile in this post.&lt;/li>
&lt;li>&lt;strong>Explore another moderator.&lt;/strong> Use &lt;code>categraph iateplot&lt;/code> to plot the predicted CATE against &lt;code>pension&lt;/code>, &lt;code>married&lt;/code>, &lt;code>twoearn&lt;/code>, and &lt;code>ira&lt;/code>. Which one shows the biggest difference between its categories?&lt;/li>
&lt;/ol>
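&lt;p>A starter for Exercise 2, with the condition copied verbatim from the prompt:&lt;/p>
&lt;pre>&lt;code class="language-stata">* Exercise 2 starter: define the candidate group, then reestimate GATEs
* (in real use, guard against missings: Stata treats missing as very large)
generate byte high_eff_candidate = (age &amp;gt; 40 &amp;amp; income &amp;gt; 50000 &amp;amp; ownhome == 1)
cate, group(high_eff_candidate) reestimate
&lt;/code>&lt;/pre>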
&lt;hr>
&lt;h2 id="15-references">15. References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1214/18-AOS1709" target="_blank" rel="noopener">Athey, S., Tibshirani, J., &amp;amp; Wager, S. (2019). Generalized Random Forests. &lt;em>Annals of Statistics&lt;/em>, 47(2), 1148–1178.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/ectj.12097" target="_blank" rel="noopener">Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., &amp;amp; Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. &lt;em>The Econometrics Journal&lt;/em>, 21(1), C1–C68.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1162/0034653041811734" target="_blank" rel="noopener">Chernozhukov, V., &amp;amp; Hansen, C. (2004). The effects of 401(k) participation on the wealth distribution: an instrumental quantile regression analysis. &lt;em>Review of Economics and Statistics&lt;/em>, 86(3), 735–751.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1093/ectj/utac015" target="_blank" rel="noopener">Knaus, M. C. (2022). Double machine learning-based programme evaluation under unconfoundedness. &lt;em>Econometrics Journal&lt;/em>, 25(3), 602–627.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1912705" target="_blank" rel="noopener">Robinson, P. M. (1988). Root-N-consistent semiparametric regression. &lt;em>Econometrica&lt;/em>, 56(4), 931–954.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.stata.com/manuals/causal.pdf" target="_blank" rel="noopener">StataCorp. (2025). &lt;em>Stata 19 Causal Inference and Treatment-Effects Reference Manual: cate&lt;/em>.&lt;/a>&lt;/li>
&lt;/ol></description></item></channel></rss>