<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>econml | Carlos Mendez</title><link>https://carlos-mendez.org/tag/econml/</link><atom:link href="https://carlos-mendez.org/tag/econml/index.xml" rel="self" type="application/rss+xml"/><description>econml</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>Carlos Mendez</copyright><lastBuildDate>Thu, 07 May 2026 00:00:00 +0000</lastBuildDate><image><url>https://carlos-mendez.org/media/icon_huedfae549300b4ca5d201a9bd09a3ecd5_79625_512x512_fill_lanczos_center_3.png</url><title>econml</title><link>https://carlos-mendez.org/tag/econml/</link></image><item><title>Causal Machine Learning and the Resource Curse with Python EconML</title><link>https://carlos-mendez.org/post/python_econml/</link><pubDate>Thu, 07 May 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_econml/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Can natural resource wealth be both a blessing and a curse? And can local institutions determine which way it goes? In this tutorial, we use &lt;strong>EconML&amp;rsquo;s &lt;code>CausalForestDML&lt;/code>&lt;/strong> to estimate &lt;strong>heterogeneous causal effects&lt;/strong> of mining and mineral prices on economic development &amp;mdash; and test whether institutional quality moderates those effects differently for mining versus price shocks.&lt;/p>
&lt;p>We use &lt;strong>simulated data with known ground-truth parameters&lt;/strong> so we can verify that the method recovers the correct answers. The simulated dataset mirrors the structure of Hodler, Lechner &amp;amp; Raschky (2023), who studied 3,800 Sub-Saharan African districts using a Modified Causal Forest. This tutorial focuses on the &lt;strong>DML methodology&lt;/strong>: how the Double Machine Learning framework separates nuisance estimation from causal effect estimation to produce valid, efficient heterogeneous treatment effect estimates.&lt;/p>
&lt;p>For the &lt;strong>economic narrative&lt;/strong> and a companion implementation in Stata 19, see &lt;a href="https://carlos-mendez.org/post/stata_cate2/">Causal Machine Learning and the Resource Curse with Stata 19&lt;/a>.&lt;/p>
&lt;h3 id="learning-objectives">Learning objectives&lt;/h3>
&lt;p>By the end of this tutorial, you will be able to:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Understand&lt;/strong> the Double Machine Learning (DML) framework and the residualization argument that makes it work&lt;/li>
&lt;li>&lt;strong>Distinguish&lt;/strong> heterogeneity features (X) from nuisance controls (W) in &lt;code>CausalForestDML&lt;/code>&lt;/li>
&lt;li>&lt;strong>Configure&lt;/strong> &lt;code>CausalForestDML&lt;/code> for discrete multi-valued treatments with panel data&lt;/li>
&lt;li>&lt;strong>Estimate&lt;/strong> Average Treatment Effects (ATEs) and Group Average Treatment Effects (GATEs), and read the Bootstrap-of-Little-Bags standard errors EconML reports&lt;/li>
&lt;li>&lt;strong>Interpret&lt;/strong> GATE patterns to identify which variables moderate treatment effects&lt;/li>
&lt;li>&lt;strong>Use&lt;/strong> EconML-specific tools like &lt;code>SingleTreeCateInterpreter&lt;/code> for data-driven subgroup discovery&lt;/li>
&lt;li>&lt;strong>Evaluate&lt;/strong> estimated effects against known ground-truth parameters and explain any remaining gap&lt;/li>
&lt;/ol>
&lt;h3 id="key-concepts-at-a-glance">Key concepts at a glance&lt;/h3>
&lt;p>The post leans repeatedly on a small vocabulary. The rest of the tutorial assumes you can move between these terms quickly. Each concept below has three parts. The &lt;strong>definition&lt;/strong> is always visible. The &lt;strong>example&lt;/strong> and &lt;strong>analogy&lt;/strong> sit behind clickable cards: open them when you need them, leave them collapsed for a quick scan. If a later section mentions &amp;ldquo;honest splitting&amp;rdquo; or &amp;ldquo;Neyman orthogonality&amp;rdquo; and the term feels slippery, this is the section to re-read.&lt;/p>
&lt;p>&lt;strong>1. Potential outcomes&lt;/strong> $Y_i(t)$.
The outcome unit $i$ &lt;strong>would&lt;/strong> take under treatment value $t$. Each unit has one potential outcome per treatment level. We observe only one of them: the one matching the treatment actually received. The rest are &lt;em>counterfactual&lt;/em>. They live in worlds we never see.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Take district 47 in 2008. Four potential NTL outcomes exist for it: $Y_{47,2008}(0)$, $Y_{47,2008}(1)$, $Y_{47,2008}(2)$, and $Y_{47,2008}(3)$. They correspond to no mining, low prices, medium prices, and high prices. Only one is in the dataset. It is the one matching whatever treatment that district-year actually had. The other three are forever invisible.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Every life decision is a fork in the road. You took one fork. The parallel-universe versions of yourself took the other forks. Their lives are real conceptual objects. You just cannot directly observe them. Causal inference reconstructs those parallel universes. It does so by looking at people who &lt;em>did&lt;/em> take the other forks.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>2. CATE&lt;/strong> &amp;mdash; Conditional Average Treatment Effect, $\tau(\mathbf{x})$.
The average treatment effect for units with covariate profile $\mathbf{x}$. The CATE is a &lt;strong>function&lt;/strong> of $\mathbf{x}$, not a single number. Where the CATE bends with $\mathbf{x}$, the treatment helps some units more than others.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Take a well-governed district profile in our data: &lt;code>exec_constraints = 6&lt;/code>, &lt;code>quality_of_govt = 0.7&lt;/code>, and so on. For that $\mathbf{x}$ the CATE is $\tau(\mathbf{x}) \approx 0.26$. Mining lifts log-NTL by about 0.26 for that profile. Now move to the weakest-institutions case: &lt;code>exec_constraints = 1&lt;/code>. The same function gives only $\tau(\mathbf{x}) \approx 0.18$. The CATE is what makes this comparison possible.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A drug&amp;rsquo;s &amp;ldquo;average effect&amp;rdquo; might be a 5-point reduction in blood pressure. But a doctor cares about a specific patient. Maybe a 65-year-old male with diabetes. The CATE &lt;em>is&lt;/em> that personalized effect. It takes a patient profile in. It returns the expected effect for someone like them.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>3. GATE&lt;/strong> &amp;mdash; Group Average Treatment Effect.
The CATE averaged over a &lt;em>pre-specified&lt;/em> subgroup. The subgroup is defined by some variable $Z$. GATEs test targeted moderation hypotheses. A typical question: &amp;ldquo;does institutional quality moderate the effect of mining?&amp;rdquo;&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Sort districts by &lt;code>exec_constraints&lt;/code> (1&amp;ndash;6). Average the per-observation CATEs inside each level. At level 1 we get $\widehat{\mathrm{GATE}} \approx 0.18$. The number climbs to $\approx 0.26$ at level 6. That climb is the moderation pattern Finding 3 reports. It is exactly what the GATE plots in this post visualize.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A nationwide marketing campaign might lift sales by 5% on average. Before scaling it up, the company asks a simple question: did it work better in cities than in rural towns? The GATE answers exactly that. It reports the campaign&amp;rsquo;s effect &lt;em>inside&lt;/em> each store type. It surfaces heterogeneity that the headline ATE hides.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>4. ATE&lt;/strong> &amp;mdash; Average Treatment Effect.
The CATE averaged over the entire sample, $E[\tau(\mathbf{X})]$. The headline policy number. It answers a single question: if we turned the treatment on for everyone, what average effect would we see?&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Take our 3,000 district-years. The estimated ATE for the 1-vs-0 contrast (mining at low prices vs. no mining) is $\widehat{\mathrm{ATE}} = 0.240$. On average, mining-at-low-prices raises log-NTL by 0.24. In unlogged NTL, that is about a 27% bump.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>&amp;ldquo;This drug lowers cholesterol by 12 points on average.&amp;rdquo; That is an ATE statement. A single number, suitable for a press release. It says nothing about whether the drug works better in some patients than others. That question belongs to GATEs and CATEs.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>5. Nuisance functions&lt;/strong> $g_0, m_0$.
These are two conditional means: $g_0(\mathbf{x}, \mathbf{w}) = E[Y \mid \mathbf{X}, \mathbf{W}]$ and $m_0(\mathbf{x}, \mathbf{w}) = E[T \mid \mathbf{X}, \mathbf{W}]$. We call them &lt;em>nuisance&lt;/em> because we do not care about their values. We estimate them for one reason only. That reason is to strip out the part of $Y$ and $T$ that is predictable from $(\mathbf{X}, \mathbf{W})$. What remains is the variation that identifies the causal effect.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>$\hat g_0$ is a Gradient Boosting regressor. It predicts a district&amp;rsquo;s log-NTL from elevation, ruggedness, ethnic fractionalization, country, year, and so on. It &lt;em>ignores&lt;/em> mining status. $\hat m_0$ is a Gradient Boosting classifier. It predicts the probability of each treatment level from the same covariates. Both predictions matter only as inputs to the residualization step.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Astronomers photograph faint galaxies in two steps. First, they take a &amp;ldquo;dark frame&amp;rdquo; with the lens cap on. The dark frame records sensor noise. Then they subtract it from the real exposure. Nobody hangs the dark frame on their wall. It exists only to be subtracted. $g_0$ and $m_0$ are dark frames for confounding. Their job is to be subtracted out. That is what lets the real causal signal show through.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>6. Cross-fitting&lt;/strong> (sometimes &amp;ldquo;sample-splitting&amp;rdquo; or &amp;ldquo;out-of-fold prediction&amp;rdquo;).
Estimate the nuisance functions on one fold of the data. Apply them to a held-out fold. Rotate so that every observation is residualized using nuisance models that did not see it. Without this rotation, in-sample residuals come out systematically too small. That bias propagates straight into the second stage.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Setting &lt;code>cv=5&lt;/code> in &lt;code>CausalForestDML&lt;/code> splits the 3,000 observations into five folds of 600. The forest fits $\hat g_0$ and $\hat m_0$ on folds 1&amp;ndash;4. It then residualizes fold 5 using those fitted models. The procedure rotates four more times. The end result: each district-year is residualized by nuisance models trained on a strictly disjoint sample.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Suppose you give a class the same problems for practice and for the final exam. Students who memorized the practice will ace the final. The score reflects memorization, not learning. Hiding the final-exam questions until grading time fixes the problem. Cross-fitting does the same trick. It hides each observation from the very nuisance model that will eventually residualize it.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>7. Honest splitting&lt;/strong> (a property of an &lt;em>honest causal forest&lt;/em>).
A causal tree uses one random subsample to &lt;em>choose&lt;/em> its split structure: which variable, which threshold. It uses a &lt;em>separate&lt;/em> random subsample to &lt;em>estimate&lt;/em> the treatment-effect value in each leaf. The split-chooser and the leaf-estimator never share data. This separation is what licenses valid confidence intervals from the forest.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Consider a single tree inside the forest. With &lt;code>honest=True&lt;/code>, half of its bootstrap sample picks the splits. Maybe the choice is &amp;ldquo;split first on &lt;code>distance_capital&lt;/code>, then on &lt;code>exec_constraints&lt;/code>&amp;rdquo;. The other half computes the average CATE in each resulting leaf. Those leaf-level numbers are unbiased. The reason: the splits were chosen without seeing them.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A jury that hears the evidence should not also write the verdict template. If the same people pick the conclusion language &lt;em>and&lt;/em> hear the case, the verdict reflects their pre-baked preferences. It would not reflect the evidence alone. Splitting the two roles is a basic guard against motivated reasoning. Honesty does the same job inside one tree. Split-choosers and leaf-estimators are different &amp;ldquo;people&amp;rdquo;. The leaf values cannot be tailored to the splits that produced them.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>8. Neyman orthogonality.&lt;/strong>
A property of the DML estimating equation $\psi(W; \tau, \eta)$. Here $\eta = (g_0, m_0)$ collects the nuisance functions. The property is $\left.\partial_\eta E[\psi]\right|_{\eta=\eta_0} = 0$. In words: at the truth, the expected estimating equation is &lt;em>flat&lt;/em> in the nuisance functions. Small errors in $\hat g_0$ and $\hat m_0$ enter the second-stage estimator only at second order.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Suppose $\hat g_0$ misses the true $g_0$ by 10% on average. A naive plug-in two-stage procedure inherits roughly that 10% error in the causal estimate. With Neyman orthogonality, the picture changes. The same 10% nuisance error contributes only on the order of $(0.10)^2 = 0.01$ to the causal estimate. That is one percentage point, an order of magnitude less than the input. This is why a Gradient Boosting first stage works. It converges at a slower-than-parametric rate. Even so, the second-stage estimate of $\tau$ remains $\sqrt{n}$-consistent and asymptotically normal.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Picture a self-righting boat. You can lean over the rail. You can slosh the cargo. You can even slip on the deck. The hull pulls itself upright every time. Stability is built into its geometry, not into never being disturbed. Neyman orthogonality is the hull design. It lets DML stay upright when the nuisance estimates wobble.&lt;/p>
&lt;/details>
&lt;/div>
&lt;h2 id="the-dml-causal-forest">The DML Causal Forest&lt;/h2>
&lt;h3 id="potential-outcomes-and-the-cate">Potential outcomes and the CATE&lt;/h3>
&lt;p>Causal inference rests on the &lt;strong>potential-outcomes&lt;/strong> framework (Rubin, 1974; Imbens &amp;amp; Rubin, 2015). For each unit $i$ and each treatment value $t$, we imagine an outcome $Y_i(t)$ that would be realized if $i$ received treatment $t$. The catch is the &lt;strong>fundamental problem of causal inference&lt;/strong>: only the potential outcome corresponding to the treatment unit $i$ actually receives is observable. All other potential outcomes for that unit are counterfactual &amp;mdash; they live in a world we never see. Causal inference is therefore an exercise in &lt;em>imputation&lt;/em>: using the observed outcomes of comparable units to stand in for the missing counterfactuals.&lt;/p>
&lt;p>The &lt;strong>Conditional Average Treatment Effect&lt;/strong> (CATE) for a unit with covariates $\mathbf{x}$ is&lt;/p>
&lt;p>$$\tau(\mathbf{x}) = E\{Y_i(1) - Y_i(0) \mid \mathbf{X}_i = \mathbf{x}\}.$$&lt;/p>
&lt;p>In words: among units who look like $\mathbf{x}$, what is the average gap between the treated and untreated potential outcomes? When the function $\tau(\cdot)$ is constant across $\mathbf{x}$, every type of unit responds the same way and a single ATE summarizes everything. When $\tau(\cdot)$ bends with $\mathbf{x}$, we have &lt;strong>treatment effect heterogeneity&lt;/strong> &amp;mdash; mining might raise nighttime lights in well-governed districts and barely move them elsewhere. Estimating that bend, not just its average, is the whole point of a causal forest.&lt;/p>
&lt;h3 id="the-partially-linear-model-with-heterogeneous-effects">The partially linear model with heterogeneous effects&lt;/h3>
&lt;p>EconML&amp;rsquo;s &lt;code>CausalForestDML&lt;/code> works inside the &lt;strong>partially linear model&lt;/strong> of Robinson (1988), extended by Chernozhukov et al. (2018) to allow heterogeneous effects:&lt;/p>
&lt;p>$$Y_i = \tau(\mathbf{X}_i)\, T_i + g_0(\mathbf{X}_i, \mathbf{W}_i) + \varepsilon_i, \qquad E[\varepsilon_i \mid \mathbf{X}_i, \mathbf{W}_i] = 0.$$&lt;/p>
&lt;p>$$T_i = m_0(\mathbf{X}_i, \mathbf{W}_i) + v_i, \qquad E[v_i \mid \mathbf{X}_i, \mathbf{W}_i] = 0.$$&lt;/p>
&lt;p>The &lt;strong>outcome equation&lt;/strong> says that $Y_i$ depends on the treatment $T_i$ multiplied by a &lt;em>unit-specific&lt;/em> effect $\tau(\mathbf{X}_i)$, plus an arbitrary, possibly nonlinear function $g_0$ of the controls, plus mean-zero noise. The &amp;ldquo;partially linear&amp;rdquo; name comes from $T$ entering linearly (multiplied by $\tau$) while $g_0$ is allowed to be any flexible function.&lt;/p>
&lt;p>The &lt;strong>treatment equation&lt;/strong> writes $T_i$ as the conditional-mean treatment $m_0(\mathbf{X}_i, \mathbf{W}_i)$ plus a residual $v_i$. For a continuous treatment, $m_0$ is a regression. For our four-level treatment, $m_0$ is a multi-class classifier &amp;mdash; specifically, a &lt;code>GradientBoostingClassifier&lt;/code> &amp;mdash; and &amp;ldquo;$T - m_0$&amp;rdquo; is shorthand for the residual of treatment around its conditional probabilities.&lt;/p>
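&lt;p>A minimal sketch of what that residual looks like in practice (simulated data; the data-generating values are illustrative assumptions, not the tutorial&amp;rsquo;s dataset): the classifier&amp;rsquo;s predicted probabilities play the role of $m_0$, and the residual is the one-hot treatment indicator minus those probabilities.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 1))
# four-level treatment that leans on x: higher x, higher level
t = np.clip(np.round(1.5 + 0.8 * x[:, 0] + rng.normal(size=n)), 0, 3).astype(int)

clf = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(x, t)
probs = clf.predict_proba(x)   # shape (n, 4): estimated P(T = level | x)
one_hot = np.eye(4)[t]         # observed treatment as indicator columns
t_resid = one_hot - probs      # the 'T - m_0' residual, one column per level
&lt;/code>&lt;/pre>
&lt;p>Each row of &lt;code>t_resid&lt;/code> sums to zero, because both the indicators and the predicted probabilities sum to one; what survives is the treatment variation the covariates cannot predict.&lt;/p>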
&lt;p>The functions $g_0$ and $m_0$ are called &lt;strong>nuisance functions&lt;/strong> because we do not care about their values. We estimate them only to &lt;em>remove&lt;/em> the part of $Y$ and $T$ that is predictable from $(\mathbf{X}, \mathbf{W})$, leaving behind the variation that identifies the causal effect.&lt;/p>
&lt;h4 id="why-two-stages-the-residualization-argument">Why two stages? The residualization argument&lt;/h4>
&lt;p>Subtract $E[Y_i \mid \mathbf{X}, \mathbf{W}] = \tau(\mathbf{X}_i) \, m_0(\mathbf{X}_i, \mathbf{W}_i) + g_0(\mathbf{X}_i, \mathbf{W}_i)$ from the outcome equation. The $g_0$ terms cancel, and a line of algebra leaves a regression of residuals on residuals. Define the residualized outcome and treatment as&lt;/p>
&lt;p>$$\tilde Y_i = Y_i - E[Y_i \mid \mathbf{X}, \mathbf{W}], \qquad \tilde T_i = T_i - m_0(\mathbf{X}_i, \mathbf{W}_i).$$&lt;/p>
&lt;p>Plugging these residuals into the partially linear model yields:&lt;/p>
&lt;p>$$\tilde Y_i = \tau(\mathbf{X}_i) \cdot \tilde T_i + \varepsilon_i.$$&lt;/p>
&lt;p>So if we (a) estimate $g_0$ and $m_0$ in a &lt;em>first stage&lt;/em> with any flexible learner, (b) residualize both $Y$ and $T$, and (c) regress $\tilde Y$ on $\tilde T$ with covariate-dependent slope, that slope at point $\mathbf{x}$ recovers $\tau(\mathbf{x})$. This is exactly the &lt;strong>Frisch&amp;ndash;Waugh&amp;ndash;Lovell&lt;/strong> logic &amp;mdash; if you have not seen FWL before, the &lt;a href="https://carlos-mendez.org/post/python_fwl/">tutorial on the Frisch&amp;ndash;Waugh&amp;ndash;Lovell theorem&lt;/a> walks through the linear case in detail.&lt;/p>
&lt;p>The causal forest is the second-stage learner that estimates this covariate-dependent slope from $(\tilde T, \tilde Y, \mathbf{X})$, splitting on $\mathbf{X}$ to find regions where the local slope is approximately constant.&lt;/p>
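&lt;p>The residualization argument can be checked numerically. Below is a minimal sketch with simulated data and a &lt;em>constant&lt;/em> effect, so a plain residual-on-residual slope stands in for the causal forest. The data-generating values ($\tau = 0.7$, a sine-plus-quadratic $g_0$) are illustrative assumptions; the first stage is cross-fit with &lt;code>cross_val_predict&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
t = 0.5 * x + rng.normal(size=n)             # treatment leans on the confounder
tau = 0.7                                    # constant effect for the sketch
y = tau * t + np.sin(x) + x ** 2 + rng.normal(size=n)

X = x.reshape(-1, 1)
gb = dict(n_estimators=200, max_depth=3, random_state=0)
# first stage: out-of-fold predictions of E[y | x] and E[t | x]
y_res = y - cross_val_predict(GradientBoostingRegressor(**gb), X, y, cv=5)
t_res = t - cross_val_predict(GradientBoostingRegressor(**gb), X, t, cv=5)

naive = np.sum(y * t) / np.sum(t * t)                     # confounded slope
tau_hat = np.sum(y_res * t_res) / np.sum(t_res * t_res)   # residual on residual
&lt;/code>&lt;/pre>
&lt;p>The naive slope is pulled well away from 0.7 by the confounder; the residual-on-residual slope recovers it.&lt;/p>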
&lt;h3 id="neyman-orthogonality-why-first-stage-errors-barely-matter">Neyman orthogonality: why first-stage errors barely matter&lt;/h3>
&lt;p>Think of residualization like noise-canceling headphones: the first stage removes the &amp;ldquo;background noise&amp;rdquo; of confounders from both the outcome and the treatment, so the causal forest only hears the &amp;ldquo;signal&amp;rdquo; of the treatment effect.&lt;/p>
&lt;p>The formal version of that intuition is &lt;strong>Neyman orthogonality&lt;/strong>. The DML estimating equation $\psi(W; \tau, \eta)$ &amp;mdash; where $\eta = (g_0, m_0)$ collects the nuisance functions &amp;mdash; satisfies&lt;/p>
&lt;p>$$\left.\frac{\partial}{\partial \eta} E[\psi(W; \tau, \eta)] \right|_{\eta = \eta_0} = 0.$$&lt;/p>
&lt;p>In words: at the truth, the expected estimating equation is &lt;em>flat&lt;/em> in the nuisance functions. Small errors in $\hat g_0$ and $\hat m_0$ enter the second-stage estimator only through second-order terms. The practical consequence is striking: even if Gradient Boosting estimates $g_0$ and $m_0$ at the slow rate $O(n^{-1/4})$, much slower than the parametric $\sqrt{n}$ rate, the resulting estimate of $\tau$ is still $\sqrt{n}$-consistent and asymptotically normal (Chernozhukov et al., 2018, §2.2). A naive plug-in two-stage procedure &amp;mdash; one that does not use the orthogonal moment &amp;mdash; inherits the slower nuisance rate and loses valid inference.&lt;/p>
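&lt;p>The second-order claim can be seen in a toy simulation. Here the &lt;em>true&lt;/em> nuisances are known, so we can perturb them by a controlled amount $\delta \cdot x$ and watch how the bias responds; all data-generating values below are illustrative assumptions.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n = 400_000
x = rng.normal(size=n)
tau = 0.5
t = 0.5 * x + rng.normal(size=n)              # m_0(x) = 0.5 x
y = tau * t + np.sin(x) + rng.normal(size=n)  # g_0(x) = sin(x)

def orthogonal_tau(delta):
    # residualize both y and t with nuisances that are wrong by delta * x
    t_res = t - (0.5 * x + delta * x)
    y_res = y - (tau * 0.5 * x + np.sin(x) + delta * x)   # E[y | x] + error
    return np.sum(y_res * t_res) / np.sum(t_res * t_res)

def plugin_tau(delta):
    # naive plug-in: subtract the g_0 estimate from y but never residualize t
    y_adj = y - (np.sin(x) + delta * x)
    return np.sum(y_adj * t) / np.sum(t * t)

bias_small = abs(orthogonal_tau(0.2) - tau)   # nuisance error delta = 0.2
bias_large = abs(orthogonal_tau(0.4) - tau)   # doubling delta ~quadruples bias
bias_plug = abs(plugin_tau(0.2) - tau)        # first order in delta: much bigger
&lt;/code>&lt;/pre>
&lt;p>Doubling the nuisance error roughly quadruples the orthogonal estimator&amp;rsquo;s bias (quadratic sensitivity), while the plug-in estimator&amp;rsquo;s bias scales linearly in $\delta$ and is several times larger at the same error.&lt;/p>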
&lt;h3 id="three-levels-of-effects">Three levels of effects&lt;/h3>
&lt;p>The causal forest produces per-observation CATE estimates, which aggregate to three levels with different uses:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Level&lt;/th>
&lt;th>Notation&lt;/th>
&lt;th>What it measures&lt;/th>
&lt;th>When to report&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>CATE&lt;/strong>&lt;/td>
&lt;td>$\tau(\mathbf{x})$&lt;/td>
&lt;td>Effect for a unit with covariates $\mathbf{x}$&lt;/td>
&lt;td>Exploratory: feed into a decision tree or partial-dependence plot to see how effects vary.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>GATE&lt;/strong>&lt;/td>
&lt;td>$E[\tau(\mathbf{X}) \mid Z = z]$&lt;/td>
&lt;td>Average CATE in a pre-specified subgroup defined by a variable $Z$&lt;/td>
&lt;td>Theory-driven: testing whether a &lt;em>named&lt;/em> covariate (e.g., institutional quality) moderates the effect.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>ATE&lt;/strong>&lt;/td>
&lt;td>$E[\tau(\mathbf{X})]$&lt;/td>
&lt;td>Overall average across all units&lt;/td>
&lt;td>Policy: the headline number for &amp;ldquo;what happens on average if we turn the treatment on?&amp;rdquo;&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
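&lt;p>Mechanically, the three levels are three aggregations of one CATE vector. A sketch with a &lt;em>hypothetical&lt;/em> set of per-observation CATEs (in the real pipeline they come from the fitted estimator, e.g. &lt;code>est.effect(X)&lt;/code>), shaped to echo the 0.18-to-0.26 pattern quoted earlier:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# hypothetical per-observation CATEs rising with institutional quality
ec = rng.integers(1, 7, size=3000)                    # exec_constraints, 1-6
cate = 0.16 + 0.016 * ec + rng.normal(0, 0.02, size=3000)

d = pd.DataFrame({'exec_constraints': ec, 'cate': cate})
ate = d['cate'].mean()                                # ATE: one headline number
gate = d.groupby('exec_constraints')['cate'].mean()   # GATE: one per level
&lt;/code>&lt;/pre>
&lt;p>&lt;code>ate&lt;/code> is the sample mean of the CATEs; &lt;code>gate&lt;/code> is the same mean taken inside each pre-specified subgroup.&lt;/p>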
&lt;h3 id="dml-pipeline">DML pipeline&lt;/h3>
&lt;pre>&lt;code class="language-mermaid">flowchart LR
A[&amp;quot;&amp;lt;b&amp;gt;Panel Data&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;3,000 obs&amp;quot;]:::data
B[&amp;quot;&amp;lt;b&amp;gt;First Stage&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;GBM nuisance&amp;lt;br/&amp;gt;models&amp;quot;]:::first
C[&amp;quot;&amp;lt;b&amp;gt;Residualize&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Y - E[Y | X,W]&amp;lt;br/&amp;gt;T - E[T | X,W]&amp;quot;]:::resid
D[&amp;quot;&amp;lt;b&amp;gt;Causal Forest&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;500 honest trees&amp;quot;]:::forest
E[&amp;quot;&amp;lt;b&amp;gt;CATEs&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Per-observation&amp;lt;br/&amp;gt;effects&amp;quot;]:::cate
A --&amp;gt; B --&amp;gt; C --&amp;gt; D --&amp;gt; E
classDef data fill:#6a9bcc,stroke:#141413,color:#fff
classDef first fill:#d97757,stroke:#141413,color:#fff
classDef resid fill:#00d4c8,stroke:#141413,color:#141413
classDef forest fill:#141413,stroke:#d97757,color:#fff
classDef cate fill:#6a9bcc,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;h2 id="setup-and-configuration">Setup and configuration&lt;/h2>
&lt;p>We use &lt;code>CausalForestDML&lt;/code> from EconML with Gradient Boosting nuisance models. The ground-truth parameters are defined inline so the tutorial is fully self-contained.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from econml.dml import CausalForestDML
from sklearn.ensemble import (GradientBoostingRegressor,
GradientBoostingClassifier)
# Ground-truth ATEs from the data-generating process
TRUE_ATES = {
'1-0': 0.250, # Mining effect
'2-0': 0.300, # Mining + medium price
'3-0': 0.550, # Mining + high price
'2-1': 0.050, # Medium price premium (small)
'3-1': 0.300, # High price premium (large)
'3-2': 0.250, # High vs medium step
}
&lt;/code>&lt;/pre>
&lt;h2 id="load-the-simulated-data">Load the simulated data&lt;/h2>
&lt;p>The dataset simulates 300 districts across 8 countries observed over 10 years (2003&amp;ndash;2012), following the structure of Hodler, Lechner &amp;amp; Raschky (2023). Treatment has four levels: no mining (0), mining at low prices (1), medium prices (2), and high prices (3).&lt;/p>
&lt;pre>&lt;code class="language-python">DATA_URL = (&amp;quot;https://github.com/cmg777/starter-academic-v501&amp;quot;
&amp;quot;/raw/master/content/post/python_EconML/sim_resource_curse.csv&amp;quot;)
df = pd.read_csv(DATA_URL)
print(f&amp;quot;Dataset: {len(df):,} observations&amp;quot;)
print(f&amp;quot;Districts: {df['district_id'].nunique()}, &amp;quot;
f&amp;quot;Countries: {df['country_id'].nunique()}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Dataset: 3,000 observations
Districts: 300, Countries: 8
&lt;/code>&lt;/pre>
&lt;p>The dataset contains 3,000 district-year observations with a &lt;strong>heavily imbalanced&lt;/strong> treatment: 85% of observations are untreated (no mining), while each of the three mining groups comprises only 5% of the data. This imbalance makes causal inference challenging &amp;mdash; the causal forest must learn from relatively few treated observations.&lt;/p>
&lt;h2 id="descriptive-statistics">Descriptive statistics&lt;/h2>
&lt;h3 id="treatment-distribution">Treatment distribution&lt;/h3>
&lt;pre>&lt;code class="language-python">labels = {0: 'No mining', 1: 'Low prices',
2: 'Med prices', 3: 'High prices'}
for t, n in df['treatment'].value_counts().sort_index().items():
print(f&amp;quot; {t} ({labels[t]}): {n:,} ({n/len(df):.1%})&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> 0 (No mining): 2,550 (85.0%)
1 (Low prices): 150 (5.0%)
2 (Med prices): 150 (5.0%)
3 (High prices): 150 (5.0%)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="python_econml_treatment_dist.png" alt="Treatment distribution across the four groups">
&lt;em>Treatment distribution across the four groups. The 85/5/5/5 imbalance makes causal inference challenging.&lt;/em>&lt;/p>
&lt;p>The 85/5/5/5 split means the causal forest has 2,550 control observations but only 150 per treatment level. For within-mining comparisons (e.g., 3-1), only 300 observations contribute, making standard errors larger for price-effect estimates.&lt;/p>
&lt;h3 id="outcomes-by-treatment-group">Outcomes by treatment group&lt;/h3>
&lt;pre>&lt;code class="language-python">for t in sorted(df['treatment'].unique()):
mask = df['treatment'] == t
m_ntl = df.loc[mask, 'ntl_log'].mean()
m_conf = df.loc[mask, 'conflict'].mean()
print(f&amp;quot; {t} ({labels[t]}): NTL={m_ntl:.3f} Conflict={m_conf:.1%}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> 0 (No mining): NTL=-1.137 Conflict=10.7%
1 (Low prices): NTL=-1.028 Conflict=18.0%
2 (Med prices): NTL=-0.930 Conflict=18.0%
3 (High prices): NTL=-0.615 Conflict=28.0%
&lt;/code>&lt;/pre>
&lt;p>The raw means show a clear gradient: higher treatment levels are associated with higher NTL and higher conflict rates. But these raw comparisons are &lt;strong>biased&lt;/strong> because mining districts differ systematically from non-mining districts in geography, institutions, and economic development.&lt;/p>
&lt;h2 id="naive-comparison-why-we-need-causal-ml">Naive comparison: why we need causal ML&lt;/h2>
&lt;pre>&lt;code class="language-python">for comp in ['1-0', '2-1', '3-1']:
a, b = int(comp[0]), int(comp[2])
naive = df.loc[df['treatment']==a, 'ntl_log'].mean() - \
df.loc[df['treatment']==b, 'ntl_log'].mean()
truth = TRUE_ATES[comp]
print(f&amp;quot; {comp}: Naive={naive:.3f} Truth={truth:.3f} Bias={naive-truth:+.3f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> 1-0: Naive=0.109 Truth=0.250 Bias=-0.141
2-1: Naive=0.098 Truth=0.050 Bias=+0.048
3-1: Naive=0.413 Truth=0.300 Bias=+0.113
&lt;/code>&lt;/pre>
&lt;p>The naive 1-0 estimate of &lt;strong>0.109&lt;/strong> is severely biased downward from the true effect of &lt;strong>0.250&lt;/strong> &amp;mdash; a 56% underestimate. This happens because mining districts tend to have worse geographic and institutional characteristics that independently reduce development. The DML Causal Forest removes this &lt;strong>selection bias&lt;/strong> by residualizing both the outcome and the treatment against observed confounders before estimating the causal effect.&lt;/p>
&lt;h2 id="econml-estimation">EconML estimation&lt;/h2>
&lt;h3 id="configuration">Configuration&lt;/h3>
&lt;p>We separate covariates into two groups with distinct roles in the DML framework:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>X features&lt;/strong> (10 variables): Enter the causal forest and can drive treatment effect heterogeneity. These include &lt;code>exec_constraints&lt;/code>, &lt;code>quality_of_govt&lt;/code>, &lt;code>gdp_pc&lt;/code>, &lt;code>elevation&lt;/code>, &lt;code>temperature&lt;/code>, &lt;code>ruggedness&lt;/code>, &lt;code>distance_capital&lt;/code>, &lt;code>agri_suitability&lt;/code>, &lt;code>population&lt;/code>, and &lt;code>ethnic_frac&lt;/code>.&lt;/li>
&lt;li>&lt;strong>W controls&lt;/strong> (2 variables): Used only in the first-stage nuisance models (&lt;code>country_id&lt;/code>, &lt;code>year&lt;/code>). These absorb country and time fixed effects but do not enter the causal forest.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python">X_COLS = ['exec_constraints', 'quality_of_govt', 'gdp_pc',
'elevation', 'temperature', 'ruggedness',
'distance_capital', 'agri_suitability', 'population',
'ethnic_frac']
W_COLS = ['country_id', 'year']
&lt;/code>&lt;/pre>
&lt;h3 id="fitting-the-model">Fitting the model&lt;/h3>
&lt;pre>&lt;code class="language-python">Y = df['ntl_log'].values
T = df['treatment'].values
X = df[X_COLS].values
W = df[W_COLS].values
est_ntl = CausalForestDML(
model_y=GradientBoostingRegressor(n_estimators=200, max_depth=4,
random_state=42),
model_t=GradientBoostingClassifier(n_estimators=200, max_depth=4,
random_state=42),
discrete_treatment=True,
categories=[0, 1, 2, 3],
n_estimators=500,
min_samples_leaf=10,
honest=True, # Separate split/estimation samples
inference=True, # BLB confidence intervals
cv=5, # 5-fold cross-fitting
n_jobs=1,
random_state=42,
)
est_ntl.fit(Y, T, X=X, W=W, groups=df['district_id'].values)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> NTL: fitted in ~25s
&lt;/code>&lt;/pre>
&lt;p>Several configuration choices deserve explanation.&lt;/p>
&lt;p>&lt;strong>Honest trees&lt;/strong> (&lt;code>honest=True&lt;/code>) split the data inside each tree into two halves. One half is used to &lt;em>choose&lt;/em> the splits &amp;mdash; which variable, which threshold &amp;mdash; and the other half is used to &lt;em>estimate&lt;/em> the leaf means. A standard regression tree uses the same observations for both jobs, which lets the tree pick splits that artificially separate noisy observations and then quote the resulting separation back as if it were signal. The &amp;ldquo;exam writer / exam taker&amp;rdquo; analogy: honesty stops the tree from setting questions it has already memorized the answers to. Operationally, honesty is what licenses asymptotically valid confidence intervals &amp;mdash; without it, the leaf estimates are tighter than they should be and &lt;code>inference=True&lt;/code>&amp;rsquo;s reported standard errors would be misleadingly small. Wager &amp;amp; Athey (2018) formalize the result and prove $\sqrt{n}$-asymptotic normality for honest causal forests.&lt;/p>
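&lt;p>A stylized single honest split makes the mechanics concrete. The simulation below is an illustrative assumption (randomized binary treatment, one covariate, an effect that jumps at $x = 0.5$), not EconML&amp;rsquo;s internals: one half of the sample picks the threshold, the other half estimates the leaf effects.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n = 8000
x = rng.uniform(0, 1, size=n)
t = rng.integers(0, 2, size=n)               # randomized binary treatment
tau = np.where(x &amp;gt; 0.5, 1.0, 0.2)         # true effect jumps at x = 0.5
y = tau * t + rng.normal(size=n)

half = np.arange(n) &amp;lt; n // 2              # split-chooser vs leaf-estimator

def leaf_effects(mask, thr):
    # difference-in-means effect in each leaf, within the given subsample
    eff = lambda m: y[m &amp;amp; (t == 1)].mean() - y[m &amp;amp; (t == 0)].mean()
    return eff(mask &amp;amp; (x &amp;lt;= thr)), eff(mask &amp;amp; (x &amp;gt; thr))

# choose the threshold on one half: maximize the gap between leaf effects
grid = np.linspace(0.1, 0.9, 17)
thr = max(grid, key=lambda c: abs(np.subtract(*leaf_effects(half, c))))
# estimate the leaf effects on the other, untouched half (the honest step)
eff_lo, eff_hi = leaf_effects(~half, thr)
&lt;/code>&lt;/pre>
&lt;p>The chosen threshold lands near the true break at 0.5, and because &lt;code>eff_lo&lt;/code> and &lt;code>eff_hi&lt;/code> are computed on data the split-chooser never saw, they are unbiased for the leaf-level effects.&lt;/p>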
&lt;p>&lt;strong>Cross-fitting&lt;/strong> (&lt;code>cv=5&lt;/code>) addresses a different overfitting risk. When the same data are used to estimate the nuisance functions $\hat g_0, \hat m_0$ and to apply them as residualizers, in-sample residuals are &lt;em>too small&lt;/em> on average and bias the second stage. Cross-fitting splits the data into 5 folds, fits the nuisance models on 4 of them, applies the fitted models to the held-out fold, and rotates. Each observation is residualized using nuisance estimates that did not see it.&lt;/p>
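&lt;p>Cross-fitting is straightforward to reproduce by hand with scikit-learn. The sketch below is a standalone toy, not this tutorial&amp;rsquo;s dataset: it assumes a synthetic confounded DGP with a true effect of 0.5, uses &lt;code>cross_val_predict&lt;/code> to residualize outcome and treatment with out-of-fold nuisance predictions, and then runs the second-stage residual-on-residual regression:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 1000
X_toy = rng.normal(size=(n, 3))
T_toy = X_toy[:, 0] + rng.normal(size=n)    # treatment driven by X
Y_toy = 0.5 * T_toy + X_toy[:, 0] + X_toy[:, 1] + rng.normal(size=n)

naive = np.polyfit(T_toy, Y_toy, 1)[0]      # confounded slope, far above 0.5

# Out-of-fold nuisance predictions: each row is scored by folds it is not in
g_hat = cross_val_predict(GradientBoostingRegressor(), X_toy, Y_toy, cv=5)
m_hat = cross_val_predict(GradientBoostingRegressor(), X_toy, T_toy, cv=5)

# Second stage: residual-on-residual regression (Robinson partialling-out)
Y_res, T_res = Y_toy - g_hat, T_toy - m_hat
tau_hat = (T_res @ Y_res) / (T_res @ T_res)
print(f"naive: {naive:.2f}  cross-fitted tau: {tau_hat:.2f}  truth: 0.50")
&lt;/code>&lt;/pre>
&lt;p>&lt;code>cross_val_predict&lt;/code> is doing exactly the rotate-and-hold-out dance described above: the nuisance model that residualizes a given row never saw that row during fitting.&lt;/p>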
&lt;p>&lt;strong>GroupKFold via &lt;code>groups=district_id&lt;/code>.&lt;/strong> Our panel observes each district across multiple years. Plain $K$-fold would scatter rows from the same district across folds, so the nuisance models would peek at most of a district&amp;rsquo;s rows when predicting one held-out year &amp;mdash; leakage that artificially shrinks first-stage residuals. Passing &lt;code>groups=df['district_id'].values&lt;/code> to &lt;code>fit()&lt;/code> triggers &lt;code>GroupKFold&lt;/code>, which keeps every district inside one fold.&lt;/p>
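&lt;p>The no-straddling property is easy to verify directly. This standalone sketch uses toy panel dimensions (20 districts, 5 years &amp;mdash; hypothetical sizes) but the same column names as the tutorial:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

panel = pd.DataFrame({
    'district_id': np.repeat(np.arange(20), 5),   # 20 districts x 5 years
    'year': np.tile(np.arange(2000, 2005), 20),
})
gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(
        gkf.split(panel, groups=panel['district_id'])):
    test_d = set(panel.loc[test_idx, 'district_id'])
    train_d = set(panel.loc[train_idx, 'district_id'])
    assert test_d.isdisjoint(train_d)   # no district straddles the boundary
    print(f"fold {fold}: {len(test_d)} held-out districts")
&lt;/code>&lt;/pre>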
&lt;p>A common confusion: GroupKFold is &lt;strong>not&lt;/strong> the same as clustered standard errors. It blocks within-district leakage in cross-fitting; it does not adjust the second-stage variance for within-district correlation in the residuals. The standard errors EconML reports are forest-level Bootstrap-of-Little-Bags SEs that treat observations as independent. With panel data, true clustered SEs would typically be larger. We flag this as a limitation again in the Discussion section.&lt;/p>
&lt;h3 id="identification-the-conditional-independence-assumption">Identification: the Conditional Independence Assumption&lt;/h3>
&lt;p>The causal forest leans on the &lt;strong>Conditional Independence Assumption&lt;/strong> (CIA), also called &lt;em>unconfoundedness&lt;/em> or &lt;em>selection on observables&lt;/em>: after conditioning on the observed covariates $(X, W)$, treatment assignment is as good as random, in the sense that&lt;/p>
&lt;p>$$\{Y_i(0), Y_i(1), Y_i(2), Y_i(3)\} \perp T_i \mid (\mathbf{X}_i, \mathbf{W}_i).$$&lt;/p>
&lt;p>In plain English: once we know a district&amp;rsquo;s geography, institutions, demographics, country, and year, knowing whether mining is active there tells us nothing more about what its potential nighttime-lights outcomes would be. Because we built the simulated data ourselves, the CIA holds by construction &amp;mdash; every confounder we created is in $(X, W)$.&lt;/p>
&lt;p>In real data, the CIA is &lt;em>untestable&lt;/em> and easy to violate. Two concrete violation channels for this application:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Mineral surveys.&lt;/strong> Mining companies often arrive in a district &lt;em>because&lt;/em> a geological survey flagged the geology as promising. The same survey may also predict future infrastructure investment unrelated to mining. If those surveys are not in $(X, W)$, both treatment and the potential outcome are correlated with an unobserved confounder.&lt;/li>
&lt;li>&lt;strong>Political connections.&lt;/strong> Districts whose elites are aligned with the central government may both attract mining concessions &lt;em>and&lt;/em> receive non-mining infrastructure (roads, electrification). An analyst without a measure of political alignment would mis-attribute the infrastructure effect to mining.&lt;/li>
&lt;/ul>
&lt;p>Hodler, Lechner &amp;amp; Raschky (2023) defend the CIA in their setting by including a rich set of geological, geographic, and institutional controls; the methodology in this tutorial is no stronger than that defense.&lt;/p>
&lt;h2 id="average-treatment-effects">Average Treatment Effects&lt;/h2>
&lt;p>EconML&amp;rsquo;s &lt;code>ate_inference()&lt;/code> returns the average causal effect for a chosen pair of treatment levels, together with a standard error and a confidence interval.&lt;/p>
&lt;p>The standard error here is the SE of the &lt;em>forest-level&lt;/em> ATE point estimate, not the SE of any one unit&amp;rsquo;s CATE. It comes from the &lt;strong>Bootstrap of Little Bags&lt;/strong> (BLB), a sub-bootstrap procedure (Athey, Tibshirani &amp;amp; Wager, 2019, §4) tailored to forests. Rather than refit hundreds of full forests &amp;mdash; which would cost $O(B \cdot \text{forest})$ &amp;mdash; BLB partitions the existing forest&amp;rsquo;s trees into &amp;ldquo;bags&amp;rdquo;, computes bag-level estimates, and uses the variance across bags as an estimate of the sampling variance of the full-forest ATE. The trick exploits the conditional independence of trees grown on different sub-samples; it returns valid asymptotic confidence intervals at a fraction of the cost of the obvious resampling scheme. EconML enables BLB whenever you pass &lt;code>inference=True&lt;/code> to the constructor.&lt;/p>
&lt;p>We report 90% intervals (&lt;code>alpha=0.1&lt;/code>) by default &amp;mdash; the convention used in Athey, Tibshirani &amp;amp; Wager (2019) and Hodler, Lechner &amp;amp; Raschky (2023). The substantive conclusions are unchanged at 95%, but the wider 95% intervals make the price-effect comparisons (which have low power because only 150 observations per treatment level contribute) look more uncertain than the asymmetric pattern actually warrants.&lt;/p>
&lt;p>We compute all six pairwise treatment contrasts:&lt;/p>
&lt;pre>&lt;code class="language-python">comparisons = [
('1-0', 0, 1), ('2-0', 0, 2), ('3-0', 0, 3),
('2-1', 1, 2), ('3-1', 1, 3), ('3-2', 2, 3),
]
for comp_label, t0, t1 in comparisons:
res = est_ntl.ate_inference(X, T0=t0, T1=t1)
lo, hi = res.conf_int_mean(alpha=0.1)
print(f&amp;quot; {comp_label}: ATE={res.mean_point:.4f} &amp;quot;
f&amp;quot;SE={res.stderr_mean:.4f} 90%CI=[{lo:.3f}, {hi:.3f}]&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> 1-0: ATE=0.2398 SE=0.0701 90%CI=[0.124, 0.355]
2-0: ATE=0.2684 SE=0.0791 90%CI=[0.138, 0.399]
3-0: ATE=0.4598 SE=0.0811 90%CI=[0.326, 0.593]
2-1: ATE=0.0286 SE=0.1008 90%CI=[-0.137, 0.194]
3-1: ATE=0.2200 SE=0.1013 90%CI=[0.053, 0.387]
3-2: ATE=0.1914 SE=0.1093 90%CI=[0.012, 0.371]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Finding 1: Mining raises economic activity, after controlling for confounding.&lt;/strong> All three mining-vs-no-mining contrasts (1-0, 2-0, 3-0) are positive, with point estimates well separated from zero relative to their standard errors. The basic mining effect 1-0 is &lt;strong>0.240&lt;/strong> (SE = 0.070, 90% CI = [0.124, 0.355]) &amp;mdash; comfortably above zero and within sampling error of the ground-truth 0.250. The naive difference-in-means for the same contrast was 0.109; the DML forest has eliminated nearly all of that confounding bias. Because the outcome is log nighttime lights, an effect of 0.24 corresponds to roughly a 27% increase in unlogged NTL ($e^{0.24} - 1 \approx 0.27$).&lt;/p>
&lt;p>&lt;strong>Finding 2: The price gradient is non-linear.&lt;/strong> Comparing medium prices to low prices (2-1) returns an ATE of &lt;strong>0.029&lt;/strong> with an SE of 0.101 &amp;mdash; the 90% interval [-0.137, 0.194] easily contains zero. Medium prices, in this DGP, add nothing detectable beyond the basic mining effect. The high-vs-low contrast (3-1), by comparison, is &lt;strong>0.220&lt;/strong> (SE = 0.101) and significant at the 5% level, with a 90% interval that excludes zero. The high-vs-medium step (3-2) is &lt;strong>0.191&lt;/strong> and significant at 10%. The forest has recovered the qualitative shape of the true price-response curve &amp;mdash; flat at low-to-medium prices, jumping at high prices &amp;mdash; without being told to look for a non-linearity. This is the kind of finding causal ML buys you: shape discovery without functional-form pre-specification.&lt;/p>
&lt;h2 id="treatment-effect-heterogeneity-gates">Treatment effect heterogeneity (GATEs)&lt;/h2>
&lt;h3 id="computing-gates-from-per-observation-cates">Computing GATEs from per-observation CATEs&lt;/h3>
&lt;p>EconML returns per-observation CATEs through &lt;code>effect_inference()&lt;/code>. To form a GATE we average those CATEs within a chosen subgroup, and to form a standard error we propagate the per-observation BLB standard errors. Doing this by hand is more illuminating than a one-line API call &amp;mdash; it makes the relationship between CATE-level heterogeneity and group-level effects visible.&lt;/p>
&lt;pre>&lt;code class="language-python">def compute_gate(est, df, z_var, t0, t1):
inf = est.effect_inference(X, T0=t0, T1=t1)
ite, ite_se = inf.point_estimate, inf.stderr
for z in sorted(df[z_var].unique()):
mask = df[z_var].values == z
gate = ite[mask].mean()
# Propagate BLB standard errors (see derivation below)
gate_se = np.sqrt(np.mean(ite_se[mask]**2) / mask.sum())
&lt;/code>&lt;/pre>
&lt;p>For a subgroup $g$ of size $n_g$, the GATE estimator is the simple average of the per-observation CATE estimates,&lt;/p>
&lt;p>$$\widehat{\mathrm{GATE}}_g = \frac{1}{n_g} \sum_{i \in g} \widehat\tau(\mathbf{X}_i).$$&lt;/p>
&lt;p>If we treat the $\widehat\tau(\mathbf{X}_i)$ as approximately uncorrelated within the group &amp;mdash; a working assumption, since EconML&amp;rsquo;s BLB does not return their full covariance matrix &amp;mdash; the variance of their average is&lt;/p>
&lt;p>$$\mathrm{Var}\left(\widehat{\mathrm{GATE}}_g\right) \approx \frac{1}{n_g^2} \sum_{i \in g} \mathrm{Var}\left(\widehat\tau(\mathbf{X}_i)\right) = \frac{1}{n_g} \cdot \overline{\mathrm{SE}_i^2}.$$&lt;/p>
&lt;p>Taking the square root gives the formula in the code: &lt;code>sqrt(mean(se_i^2) / n_g)&lt;/code>. The CIs we report are point $\pm 1.645 \cdot \widehat{\mathrm{SE}}$ for a 90% level. Two caveats are worth flagging up front: (i) the within-group independence assumption probably understates the SE in panel data where the same district appears multiple times in the same group, and (ii) this SE captures estimation uncertainty in the CATE function only, not sampling variability of the subgroup composition. As with the ATE, the headline qualitative pattern survives at 95% intervals.&lt;/p>
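&lt;p>A quick Monte Carlo check confirms the propagation formula. The sketch below fabricates per-observation standard errors &amp;mdash; an assumption purely for illustration, not real BLB outputs &amp;mdash; simulates repeated draws of the per-observation CATE estimates around a fixed truth, and compares the empirical spread of the group average against the formula:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(42)
n_g = 400                                   # subgroup size
se_i = rng.uniform(0.05, 0.25, n_g)         # hypothetical per-obs BLB SEs
true_tau = 0.25

# 5,000 independent draws of the n_g per-observation CATE estimates
draws = true_tau + rng.normal(0.0, se_i, size=(5000, n_g))
gate_draws = draws.mean(axis=1)             # 5,000 realized GATE estimates

empirical_se = gate_draws.std(ddof=1)
formula_se = np.sqrt(np.mean(se_i**2) / n_g)
print(f"empirical SE of the group mean: {empirical_se:.5f}")
print(f"formula sqrt(mean(se^2)/n_g):   {formula_se:.5f}")
&lt;/code>&lt;/pre>
&lt;p>The two numbers agree to within Monte Carlo error, as they should under the within-group independence assumption baked into both the simulation and the formula.&lt;/p>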
&lt;h3 id="gates-by-executive-constraints">GATEs by Executive Constraints&lt;/h3>
&lt;p>The mining effect (1-0) should vary with institutional quality, while the price effect (3-1) should be flat:&lt;/p>
&lt;p>&lt;img src="python_econml_gate_ntl_1v0_exec.png" alt="GATEs for NTL mining effect (1-0) by Executive Constraints">
&lt;em>GATEs for the mining effect (1-0) by executive constraints. The upward slope shows that stronger institutions amplify the economic benefits of mining.&lt;/em>&lt;/p>
&lt;p>&lt;img src="python_econml_gate_ntl_3v1_exec.png" alt="GATEs for NTL price effect (3-1) by Executive Constraints">
&lt;em>GATEs for the price effect (3-1) by executive constraints. The flat pattern confirms that institutions do not moderate price effects.&lt;/em>&lt;/p>
&lt;pre>&lt;code class="language-text"> 1-0 (Mining vs No Mining):
Exec. Constr. GATE 90% CI N
----------------------------------------------------
1 0.175 [0.168, 0.182] 300
2 0.255 [0.249, 0.262] 330
3 0.240 [0.236, 0.244] 720
4 0.242 [0.238, 0.246] 780
5 0.243 [0.237, 0.250] 420
6 0.264 [0.259, 0.269] 450
Range: 0.089
3-1 (High vs Low Prices):
Exec. Constr. GATE 90% CI N
----------------------------------------------------
1 0.242 [0.232, 0.252] 300
2 0.197 [0.187, 0.206] 330
3 0.217 [0.211, 0.224] 720
4 0.227 [0.221, 0.233] 780
5 0.224 [0.216, 0.231] 420
6 0.211 [0.204, 0.219] 450
Range: 0.045
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Finding 3: Institutions moderate the mining margin but not the price margin.&lt;/strong> The mining-effect GATEs (1-0) span a range of &lt;strong>0.089&lt;/strong> across executive-constraint levels, climbing roughly monotonically from 0.175 at the weakest institutions to 0.264 at the strongest. Read substantively: weaker institutions shave roughly a third off the development gain from mining. The price-effect GATEs (3-1) span only &lt;strong>0.045&lt;/strong> and show no monotone pattern &amp;mdash; a non-finding that is itself the finding. The GATE plot effectively flat-lines because the price step is, by construction, uniform across institutional environments in the DGP.&lt;/p>
&lt;p>This asymmetry &amp;mdash; institutions shaping the mining-vs-no-mining margin but not the price margin &amp;mdash; is the structural prediction of the institutions-and-resources literature (Mehlum, Moene &amp;amp; Torvik, 2006) and the empirical pattern Hodler, Lechner &amp;amp; Raschky (2023) document for Sub-Saharan African districts. A causal forest does not assume the asymmetry; it discovers it. That is the distinguishing payoff of letting the slope $\tau(\mathbf{x})$ be a flexible function rather than fixing it parametrically (e.g., a single $\tau \times \mathrm{exec\_constraints}$ interaction term).&lt;/p>
&lt;h3 id="gates-by-quality-of-government">GATEs by Quality of Government&lt;/h3>
&lt;p>The same pattern appears when we use a continuous institutional measure:&lt;/p>
&lt;p>&lt;img src="python_econml_gate_ntl_1v0_qog.png" alt="GATEs for NTL mining effect (1-0) by Quality of Government">
&lt;em>GATEs for the mining effect (1-0) by quality of government. The positive relationship cross-validates the executive constraints finding.&lt;/em>&lt;/p>
&lt;p>&lt;img src="python_econml_gate_ntl_3v1_qog.png" alt="GATEs for NTL price effect (3-1) by Quality of Government">
&lt;em>GATEs for the price effect (3-1) by quality of government. The flat pattern is consistent across institutional measures.&lt;/em>&lt;/p>
&lt;p>The mining effect (1-0) shows a positive relationship with quality of government, while the price effect (3-1) remains approximately flat across the institutional quality distribution. This cross-validates Finding 3 using a different institutional measure.&lt;/p>
&lt;h2 id="variable-importance">Variable importance&lt;/h2>
&lt;p>EconML reports &lt;code>feature_importances_&lt;/code> for the causal forest &amp;mdash; the normalized contribution of each $X$-variable to treatment-effect &lt;em>heterogeneity&lt;/em> across all splits in all trees:&lt;/p>
&lt;pre>&lt;code class="language-python">importances = est_ntl.feature_importances_
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> distance_capital 0.171
ethnic_frac 0.142
ruggedness 0.135
population 0.126
agri_suitability 0.120
elevation 0.120
temperature 0.120
gdp_pc 0.034
quality_of_govt 0.018
exec_constraints 0.014
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="python_econml_var_importance.png" alt="Feature importance for treatment effect heterogeneity">
&lt;em>Feature importance for treatment effect heterogeneity. Geographic variables dominate splitting frequency, but the GATE plots show that institutional variables are the true moderators in the DGP.&lt;/em>&lt;/p>
&lt;p>This ranking looks paradoxical: the GATE plots above just demonstrated that &lt;code>exec_constraints&lt;/code> is what bends the mining effect, yet &lt;code>exec_constraints&lt;/code> is dead last by importance. The resolution is that &lt;strong>feature importance and moderation are different objects&lt;/strong>.&lt;/p>
&lt;p>A variable $X_j$ is a &lt;strong>moderator&lt;/strong> of the treatment effect if changing it changes the effect:&lt;/p>
&lt;p>$$\frac{\partial \tau(\mathbf{x})}{\partial x_j} \neq 0.$$&lt;/p>
&lt;p>A variable&amp;rsquo;s &lt;strong>forest importance&lt;/strong>, by contrast, is the variance-reduction-weighted frequency with which it is selected as a split variable. The two diverge in a predictable way:&lt;/p>
&lt;ul>
&lt;li>&lt;em>Continuous variables&lt;/em> (e.g., &lt;code>distance_capital&lt;/code>, &lt;code>ethnic_frac&lt;/code>) admit many candidate split thresholds and tend to be picked frequently for fine-grained slicing, even when each individual split contributes only a tiny amount to actual heterogeneity.&lt;/li>
&lt;li>&lt;em>Coarse discrete variables&lt;/em> like &lt;code>exec_constraints&lt;/code> (6 levels) have at most 5 candidate splits. Even when one of those splits captures the dominant moderation pattern, the variable accumulates a smaller total importance than a continuous neighbor that splits 50 times.&lt;/li>
&lt;/ul>
&lt;p>Read importances as a &lt;strong>screening&lt;/strong> signal &amp;mdash; a &amp;ldquo;where might heterogeneity be hiding?&amp;rdquo; first pass. Confirm or reject moderation with a hypothesis-driven GATE, a partial-dependence plot of $\tau(\mathbf{x})$, or the CATE Interpreter described next. The GATE analysis above is what nails the institutional-moderation finding; the importance ranking is what would have made you suspicious enough to draw the GATE plot in the first place.&lt;/p>
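&lt;p>The split-cardinality mechanism is easy to reproduce outside EconML. In the standalone sketch below, a plain scikit-learn regression forest is fit to &lt;em>pure noise&lt;/em>, so neither feature carries any real signal; the continuous feature nevertheless collects most of the impurity-based importance simply because it offers far more candidate thresholds than a 6-level discrete feature:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000
features = np.column_stack([
    rng.uniform(0, 1, n),     # continuous: ~1,000 candidate thresholds
    rng.integers(1, 7, n),    # coarse discrete (6 levels): at most 5 thresholds
])
noise = rng.normal(0, 1, n)   # outcome is pure noise: neither feature matters

rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(features, noise)
imp = dict(zip(['continuous', 'discrete_6'], rf.feature_importances_.round(3)))
print(imp)   # the continuous feature dominates despite zero true signal
&lt;/code>&lt;/pre>
&lt;p>The same bias operates inside the causal forest&amp;rsquo;s importance ranking, which is why a coarse moderator like &lt;code>exec_constraints&lt;/code> can land at the bottom of the list while still driving the heterogeneity.&lt;/p>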
&lt;h2 id="cate-interpreter">CATE Interpreter&lt;/h2>
&lt;p>EconML&amp;rsquo;s &lt;code>SingleTreeCateInterpreter&lt;/code> fits a &lt;em>shallow&lt;/em> decision tree to the estimated CATEs themselves &amp;mdash; the tree&amp;rsquo;s outcome is the model&amp;rsquo;s prediction $\widehat\tau(\mathbf{X}_i)$, not the original $Y_i$. By splitting on $\mathbf{X}$, the tree finds the covariates and thresholds that best separate units with different treatment effects, returning a small set of subgroups summarized by their average $\widehat\tau$. It is a &lt;em>summary&lt;/em> of the forest&amp;rsquo;s heterogeneity surface, not a re-estimation of treatment effects.&lt;/p>
&lt;pre>&lt;code class="language-python">from econml.cate_interpreter import SingleTreeCateInterpreter
intrp = SingleTreeCateInterpreter(max_depth=2, min_samples_leaf=100)
intrp.interpret(est_ntl, X)
intrp.plot(feature_names=X_COLS)
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="python_econml_cate_tree.png" alt="Decision tree summarizing CATE heterogeneity for the mining effect">
&lt;em>Depth-2 decision tree summarizing CATE heterogeneity for the mining effect (1-0). Each leaf reports the mean estimated CATE for the subgroup defined by the splits above it.&lt;/em>&lt;/p>
&lt;p>Two design choices control how interpretable the output is. &lt;strong>Tree depth&lt;/strong> trades off detail against communicability: depth 2 produces at most four leaves and a story you can tell out loud; depth 4 or more reveals interaction structure but rarely fits in a paper figure. &lt;strong>Minimum leaf size&lt;/strong> (&lt;code>min_samples_leaf=100&lt;/code>) prevents the tree from carving out tiny, noisy subgroups whose CATE estimates are statistically unreliable. We pull both into the named module constants &lt;code>CATE_TREE_DEPTH&lt;/code> and &lt;code>CATE_TREE_MIN_LEAF&lt;/code> in &lt;code>script.py&lt;/code> so each choice lives in one place rather than as magic numbers scattered through the code.&lt;/p>
&lt;p>The CATE Interpreter is a complement to, not a substitute for, the GATE analysis. &lt;strong>GATEs are hypothesis-driven&lt;/strong>: you pre-specify the moderating variable (here, &lt;code>exec_constraints&lt;/code>) and test how the effect varies across its values. &lt;strong>The CATE Interpreter is exploratory&lt;/strong>: it asks &amp;ldquo;of all the covariates, which ones &amp;mdash; at which thresholds &amp;mdash; best separate high-effect from low-effect units?&amp;rdquo; Running both is good practice. If the tree&amp;rsquo;s top split corresponds to a pre-specified moderator, your theory is reinforced; if the tree finds a different split, you have learned something the theory did not predict and have a candidate for follow-up GATE plots.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;h3 id="limitations">Limitations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>No clustered standard errors.&lt;/strong> &lt;em>Clustered SEs&lt;/em> allow the residual variance to differ across clusters (here, districts) and absorb arbitrary within-cluster correlation. EconML&amp;rsquo;s &lt;code>inference=True&lt;/code> reports forest-level Bootstrap-of-Little-Bags SEs that treat observations as independent. With panel data &amp;mdash; the same district appearing in multiple years &amp;mdash; the BLB SEs are likely too small. We use &lt;code>GroupKFold&lt;/code> by district to prevent first-stage data leakage, but that is a different problem from second-stage variance estimation. The &lt;a href="https://carlos-mendez.org/post/stata_cate2/">companion Stata tutorial&lt;/a> uses Stata 19&amp;rsquo;s &lt;code>cate&lt;/code> command, which supports &lt;code>vce(cluster district_id)&lt;/code> directly.&lt;/li>
&lt;li>&lt;strong>Contemporaneous outcomes.&lt;/strong> Hodler, Lechner &amp;amp; Raschky (2023) use treatment at time $t$ and outcome at $t+1$, which rules out reverse causality from outcome to treatment within the same year. Our simulated data uses contemporaneous treatment and outcomes; in real applications, lagging the outcome is cheap insurance.&lt;/li>
&lt;li>&lt;strong>Simplified covariate set.&lt;/strong> The real analysis uses 60+ covariates spanning geology, geography, demography, institutions, and pre-treatment outcomes; we use 12. The simulated DGP guarantees that the CIA holds because we control for every confounder we built in. Real-world identification is only as strong as the controls support, and &amp;ldquo;we used a causal forest&amp;rdquo; does not relax the CIA.&lt;/li>
&lt;/ul>
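&lt;p>On the lagged-outcome point: the fix is one &lt;code>groupby&lt;/code>-plus-&lt;code>shift&lt;/code> call in pandas. A sketch on a toy panel (hypothetical values; the column names mirror the tutorial&amp;rsquo;s) that aligns next year&amp;rsquo;s outcome with this year&amp;rsquo;s treatment and covariates:&lt;/p>
&lt;pre>&lt;code class="language-python">import pandas as pd

panel = pd.DataFrame({
    'district_id': [1, 1, 1, 2, 2, 2],
    'year':        [2000, 2001, 2002, 2000, 2001, 2002],
    'ntl_log':     [0.10, 0.20, 0.30, 0.50, 0.60, 0.70],
})
# Shift the outcome back one year within each district: row (d, t) now
# carries Y at t+1 while treatment and covariates stay at t
panel['ntl_log_lead'] = panel.groupby('district_id')['ntl_log'].shift(-1)
# The last year of each district has no lead outcome and drops out
analysis = panel.dropna(subset=['ntl_log_lead'])
print(analysis[['district_id', 'year', 'ntl_log', 'ntl_log_lead']])
&lt;/code>&lt;/pre>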
&lt;h3 id="assumptions">Assumptions&lt;/h3>
&lt;p>The CATE estimates rely on the &lt;strong>Conditional Independence Assumption&lt;/strong>: treatment is independent of potential outcomes given $(X, W)$. The CIA is untestable from data alone &amp;mdash; it asserts something about the &lt;em>unobserved&lt;/em> potential outcomes. In observational work, the standard defense is a combination of (i) institutional knowledge of the treatment-assignment process, (ii) a rich, theory-motivated set of covariates, and (iii) sensitivity analyses (e.g., Rosenbaum bounds, $E$-values) that ask how strong an unobserved confounder would have to be to overturn the conclusion. None of these is a substitute for randomization. In the simulated data here, we know the CIA holds because we built it that way.&lt;/p>
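&lt;p>Of those sensitivity tools, the $E$-value is the quickest to compute. The sketch below implements the standard risk-ratio formula of VanderWeele and Ding (2017); mapping our log-NTL effect onto a risk ratio is a loose analogy made purely for illustration:&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def e_value(rr):
    """Minimum risk-ratio association an unmeasured confounder would need
    with both treatment and outcome to fully explain away an observed rr."""
    rr = max(rr, 1.0 / rr)              # orient away from the null
    return rr + math.sqrt(rr * (rr - 1.0))

# The 1-0 ATE of 0.24 on log NTL implies roughly a 27% increase in unlogged
# NTL; treating that as if it were a risk ratio of 1.27:
print(f"E-value for RR = 1.27: {e_value(1.27):.2f}")
&lt;/code>&lt;/pre>
&lt;p>An unmeasured confounder would need an association of at least that strength with both mining activity and nighttime lights, over and above the included controls, to reduce the estimated effect to the null.&lt;/p>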
&lt;h2 id="summary-and-next-steps">Summary and next steps&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>EconML&amp;rsquo;s &lt;code>CausalForestDML&lt;/code> recovered all three ground-truth findings.&lt;/strong> The ATE for the basic mining effect (1-0 = 0.240) is within sampling error of the true value 0.250 and removes nearly all of the 0.141 confounding bias visible in the naive estimator. Price effects come out non-linear (2-1 = 0.029, n.s.; 3-1 = 0.220, significant at 5%; 3-2 = 0.191, significant at 10%) without any pre-specified non-linearity. GATE patterns reveal that institutions moderate the mining effect (range = 0.089 across executive-constraint levels) but not the price effect (range = 0.045, no monotone pattern).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The DML two-stage residualization argument is what makes the causal forest valid in observational settings.&lt;/strong> Substituting the treatment equation into the outcome equation reduces causal estimation to a regression of $\tilde Y$ on $\tilde T$, where the residualizers $\hat g_0$ and $\hat m_0$ can be any flexible learner. Neyman orthogonality means errors in the residualizers enter only at second order, so $\sqrt n$-consistent estimates of $\tau$ are recoverable even with $O(n^{-1/4})$ first-stage rates.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Feature importance is a screening tool, not a moderation test.&lt;/strong> Continuous variables accumulate importance because they offer many split points, even when they do not bend the treatment effect. The GATE plot of $\tau$ against the suspected moderator is the right tool for confirming moderation; importance is the right tool for identifying candidates worth plotting.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>The CATE Interpreter is the exploratory dual of GATEs.&lt;/strong> A shallow decision tree on the predicted CATEs surfaces data-driven subgroups, complementing the hypothesis-driven GATE analysis. Use both: GATEs test theory, the interpreter audits theory.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>For the economic story behind these findings and a parallel implementation using Stata 19&amp;rsquo;s built-in &lt;code>cate&lt;/code> command, see the companion tutorial: &lt;a href="https://carlos-mendez.org/post/stata_cate2/">Causal Machine Learning and the Resource Curse with Stata 19&lt;/a>.&lt;/p>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Replace the nuisance models.&lt;/strong> Swap &lt;code>GradientBoostingRegressor&lt;/code> with &lt;code>RandomForestRegressor(n_estimators=200)&lt;/code>. Do the ATE and GATE estimates change? Why or why not (think about Neyman orthogonality)?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Vary the number of trees.&lt;/strong> Try &lt;code>n_estimators=100&lt;/code> vs &lt;code>n_estimators=1000&lt;/code>. How do the standard errors and GATE patterns change?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Test the GroupKFold assumption.&lt;/strong> Remove &lt;code>groups=df['district_id'].values&lt;/code> from the &lt;code>fit()&lt;/code> call. What happens to the confidence intervals?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Discretize quality of government.&lt;/strong> Create quartiles of &lt;code>quality_of_govt&lt;/code> and compute GATEs on the quartiles instead of raw values. Do the patterns become clearer?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Explore the CATE interpreter depth.&lt;/strong> Increase &lt;code>max_depth&lt;/code> from 2 to 4 in &lt;code>SingleTreeCateInterpreter&lt;/code>. Do the additional splits reveal meaningful subgroups or just noise?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1371/journal.pone.0284968" target="_blank" rel="noopener">Hodler, R., Lechner, M., &amp;amp; Raschky, P.A. (2023). Institutions and the resource curse: New insights from causal machine learning. &lt;em>PLoS ONE&lt;/em>, 18(6), e0284968.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/ectj.12097" target="_blank" rel="noopener">Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., &amp;amp; Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. &lt;em>The Econometrics Journal&lt;/em>, 21(1), C1&amp;ndash;C68.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1214/18-AOS1709" target="_blank" rel="noopener">Athey, S., Tibshirani, J., &amp;amp; Wager, S. (2019). Generalized Random Forests. &lt;em>The Annals of Statistics&lt;/em>, 47(2), 1148&amp;ndash;1178.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1080/01621459.2017.1319839" target="_blank" rel="noopener">Wager, S. &amp;amp; Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. &lt;em>Journal of the American Statistical Association&lt;/em>, 113(523), 1228&amp;ndash;1242.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.2307/1912705" target="_blank" rel="noopener">Robinson, P.M. (1988). Root-N-Consistent Semiparametric Regression. &lt;em>Econometrica&lt;/em>, 56(4), 931&amp;ndash;954.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1037/h0037350" target="_blank" rel="noopener">Rubin, D.B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. &lt;em>Journal of Educational Psychology&lt;/em>, 66(5), 688&amp;ndash;701.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1017/CBO9781139025751" target="_blank" rel="noopener">Imbens, G.W. &amp;amp; Rubin, D.B. (2015). &lt;em>Causal Inference for Statistics, Social, and Biomedical Sciences&lt;/em>. Cambridge University Press.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.nber.org/papers/w5398" target="_blank" rel="noopener">Sachs, J.D. &amp;amp; Warner, A.M. (1995). Natural Resource Abundance and Economic Growth. &lt;em>NBER Working Paper&lt;/em> No. 5398.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/j.1468-0297.2006.01045.x" target="_blank" rel="noopener">Mehlum, H., Moene, K., &amp;amp; Torvik, R. (2006). Institutions and the Resource Curse. &lt;em>The Economic Journal&lt;/em>, 116(508), 1&amp;ndash;20.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.pywhy.org/EconML/" target="_blank" rel="noopener">EconML Documentation &amp;mdash; PyWhy&lt;/a>&lt;/li>
&lt;/ol></description></item><item><title>Causal Machine Learning for Policy Evaluation: From ATE to IATE to a Better Assignment Rule</title><link>https://carlos-mendez.org/post/python_cml/</link><pubDate>Fri, 01 May 2026 00:00:00 +0000</pubDate><guid>https://carlos-mendez.org/post/python_cml/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>A government runs a job-training programme for unemployed jobseekers and wants to know three things at once. Does the programme actually &lt;em>cause&lt;/em> people to spend more months in employment over the next two and a half years? Does the effect depend on who the jobseeker is — for example, on how well they speak the local language? And if effects differ across people, can we use those differences to send training to the &lt;em>right&lt;/em> jobseekers, rather than to everyone or to no one? These three questions correspond to three causal estimands — the &lt;strong>ATE&lt;/strong>, the &lt;strong>GATE&lt;/strong>, and the &lt;strong>IATE&lt;/strong> — and answering them is the bread-and-butter of &lt;strong>Causal Machine Learning (CML)&lt;/strong>.&lt;/p>
&lt;p>CML combines two ideas. From causal inference, it borrows the careful framing of treatment effects under unconfoundedness and the doubly-robust scoring functions that protect against modelling mistakes. From machine learning, it borrows flexible nuisance estimators — random forests, gradient-boosted trees, neural nets — that learn complicated outcome surfaces without forcing the analyst to specify them by hand. The result is a small toolbox — DoubleML for the average effect, doubly-robust averaging for subgroup effects, causal forests for individual effects — that turns observational data into actionable, &lt;em>personalised&lt;/em> policy recommendations. This tutorial walks through the full toolbox on a synthetic Flemish-ALMP-style cohort of 5,000 jobseekers, modelled on the empirical case study in &lt;a href="https://doi.org/10.1016/j.labeco.2023.102306" target="_blank" rel="noopener">Cockx, Lechner &amp;amp; Bollens (2023)&lt;/a> and the methodological roadmap in &lt;a href="https://doi.org/10.1186/s41937-023-00113-y" target="_blank" rel="noopener">Lechner (2023)&lt;/a>. Because the data are synthetic, the &lt;em>true&lt;/em> treatment effects are known — so every estimator can be benchmarked against the truth.&lt;/p>
&lt;h2 id="the-cml-roadmap">The CML roadmap&lt;/h2>
&lt;p>CML organises a treatment-effect study into a sequence of progressively finer questions. The diagram below shows the four-step roadmap that this tutorial follows: estimate the &lt;em>average&lt;/em> effect, then break it down into &lt;em>group&lt;/em> effects, then go all the way to &lt;em>individual&lt;/em> effects, and finally turn those individual effects into a &lt;em>policy&lt;/em>.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[&amp;quot;&amp;lt;b&amp;gt;1. ATE&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Population&amp;lt;br/&amp;gt;average effect&amp;quot;] --&amp;gt; B[&amp;quot;&amp;lt;b&amp;gt;2. GATE&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Effect by&amp;lt;br/&amp;gt;subgroup&amp;quot;]
B --&amp;gt; C[&amp;quot;&amp;lt;b&amp;gt;3. IATE&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Effect for&amp;lt;br/&amp;gt;each individual&amp;quot;]
C --&amp;gt; D[&amp;quot;&amp;lt;b&amp;gt;4. Policy&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Welfare-optimal&amp;lt;br/&amp;gt;assignment rule&amp;quot;]
style A fill:#6a9bcc,stroke:#141413,color:#fff
style B fill:#d97757,stroke:#141413,color:#fff
style C fill:#00d4c8,stroke:#141413,color:#fff
style D fill:#999999,stroke:#141413,color:#fff
&lt;/code>&lt;/pre>
&lt;p>The arrows are not just decorative. Each step &lt;em>builds&lt;/em> on the previous one: a credible average effect is the floor on which any subgroup analysis stands, and credible group effects are the floor on which any individual analysis stands. Skipping the first step and jumping straight to a fancy heterogeneity model is the most common mistake in applied CML. We will resist that temptation by starting from the simplest possible baseline and only adding complexity when the data warrant it.&lt;/p>
&lt;p>&lt;strong>Learning objectives:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Distinguish&lt;/strong> the three CML estimands — ATE, GATE, IATE — and write each as a formal expectation.&lt;/li>
&lt;li>&lt;strong>Diagnose&lt;/strong> covariate overlap and explain why selection-on-observables matters in observational data.&lt;/li>
&lt;li>&lt;strong>Estimate&lt;/strong> the population-average effect with &lt;code>DoubleMLIRM&lt;/code>, using random-forest nuisances and 5-fold cross-fitting.&lt;/li>
&lt;li>&lt;strong>Estimate&lt;/strong> group effects via doubly-robust pseudo-outcomes and individual effects via &lt;code>CausalForestDML&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Translate&lt;/strong> the individual-level effect estimates into a welfare-maximising training-assignment rule and benchmark it against treat-all and an oracle.&lt;/li>
&lt;/ul>
&lt;h2 id="key-concepts-at-a-glance">Key concepts at a glance&lt;/h2>
&lt;p>The post leans on a small vocabulary repeatedly. The rest of the tutorial assumes you can move between these terms quickly. Each concept below has three parts. The &lt;strong>definition&lt;/strong> is always visible. The &lt;strong>example&lt;/strong> and &lt;strong>analogy&lt;/strong> sit behind clickable cards: open them when you need them, leave them collapsed for a quick scan. If a later section mentions &amp;ldquo;IATE&amp;rdquo; or &amp;ldquo;welfare-maximising rule&amp;rdquo; and the term feels slippery, this is the section to re-read.&lt;/p>
&lt;p>&lt;strong>1. Potential outcomes&lt;/strong> $Y_i(d)$.
The outcome unit $i$ would have under treatment value $d \in \{0, 1\}$. Each unit has two potential outcomes. We observe only one. The other is &lt;em>counterfactual&lt;/em>. It belongs to a world we never see.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>For unemployed worker 7421 with &lt;code>D = 1&lt;/code> (received training), we observe &lt;code>Y&lt;/code> = 22 months employed. Their counterfactual $Y_{7421}(0)$ — the months they would have worked without training — is forever invisible. Causal inference reconstructs it from comparable untrained workers.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Every life decision is a fork in the road. You took one fork. The parallel-universe versions of you took the other. Their lives are real conceptual objects you cannot directly observe.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>2. ATE&lt;/strong> &amp;mdash; Average Treatment Effect, $E[Y(1) - Y(0)]$.
The mean causal effect across everyone in the population. Headline policy number. It answers a single question: if we trained everyone, what would the average bump in employment be?&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>The naive ATE is 5.111 months. The DoubleML estimate is 5.520. The simulation&amp;rsquo;s ground truth is 5.628. DoubleML closes about 79% of the bias the naive estimator carries. The true ATE is the target; DoubleML is the engine.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>&amp;ldquo;This drug lowers cholesterol by 12 points on average.&amp;rdquo; Single number, suitable for a press release. Says nothing about who responds best.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>3. GATE&lt;/strong> &amp;mdash; Group Average Treatment Effect, $E[Y(1) - Y(0) \mid Z = z]$.
The per-person effect averaged over a &lt;em>pre-specified&lt;/em> subgroup defined by $Z$. GATEs surface heterogeneity along axes you name in advance.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Sort workers by &lt;code>dutch_prof&lt;/code> (0=lowest, 3=highest). The GATEs are 7.47, 6.13, 4.50, 2.91 months. Workers with the weakest Dutch benefit most. The training compensates for a labour-market handicap.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A nationwide marketing campaign lifts sales 5% on average. Before scaling up, you ask: did it work better in cities than in rural towns? GATE answers exactly that.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>4. IATE&lt;/strong> &amp;mdash; Individual Average Treatment Effect, $\tau(\mathbf{x})$.
The treatment effect &lt;em>as a function&lt;/em> of the full covariate vector. One per unit. Estimated by Causal Forest DML in this post. The IATE is the input to a personalized assignment rule.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>The Causal Forest produces 5,000 IATEs, one per worker. Mean $\hat\tau$ = 5.456 months. Mean absolute error against truth = 0.40 months. The IATEs feed Step 6&amp;rsquo;s welfare rule.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A drug&amp;rsquo;s &amp;ldquo;average effect&amp;rdquo; is a 5-point reduction in blood pressure. But a doctor cares about a specific patient — maybe a 65-year-old male with diabetes. The IATE is that personalized effect.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>5. Propensity score and overlap&lt;/strong> $\pi(\mathbf{x})$.
The probability of treatment given covariates. &lt;em>Overlap&lt;/em> requires that $\pi(\mathbf{x})$ is bounded away from 0 and 1 for the kinds of units we want to compare. Without overlap there is no counterfactual to estimate from.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Step 1 of the tutorial plots $\hat\pi$ for treated and untreated workers. Densities overlap across most of the support but thin out at the tails. The overlap diagnostic is the &lt;em>first&lt;/em> check before any DR estimator runs.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Casino&amp;rsquo;s odds for the next card. We never see the casino&amp;rsquo;s algorithm directly; we estimate it from many deals. Overlap is the rule that the deck must contain enough cards of every relevant kind.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>6. Cross-fitting&lt;/strong> (K-fold sample-splitting).
Split the data into $K$ folds. Train nuisances on $K-1$ folds; predict on the held-out fold; rotate. The DoubleML library uses 5 folds by default and rotates internally.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>&lt;code>DoubleMLIRM&lt;/code> in Step 3 runs 5-fold cross-fitting on random-forest nuisances. We never invoke train/test splits ourselves; the library wraps the rotation. The orthogonal score is computed on out-of-fold residuals.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Two-pass exam grading. One TA writes the rubric, a different TA applies it. The separation is what makes the grade defensible.&lt;/p>
&lt;/details>
&lt;/div>
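&lt;p>The rotation is easy to sketch by hand. The toy below uses scikit-learn&amp;rsquo;s &lt;code>cross_val_predict&lt;/code> on simulated data as a stand-in for what DoubleML automates internally; the tiny DGP and variable names are illustrative, not part of the tutorial&amp;rsquo;s pipeline.&lt;/p>

```python
# Hedged sketch: out-of-fold nuisance predictions via a 5-fold rotation,
# mimicking (not reproducing) DoubleML's internal cross-fitting.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=500)  # toy outcome

# Each row's prediction comes from a model trained on the other 4 folds,
# so no observation is ever scored by a model that saw it during training.
y_hat_oof = cross_val_predict(
    RandomForestRegressor(n_estimators=50, random_state=0), X, y, cv=5
)
resid = y - y_hat_oof  # out-of-fold residuals feed the orthogonal score
print(resid.shape)  # (500,)
```

&lt;p>Every prediction comes from a model that never saw its own row, which is exactly the separation the orthogonal score needs.&lt;/p>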
&lt;p>&lt;strong>7. Causal forest.&lt;/strong>
A random forest adapted for causal estimation. Built honestly: one subsample chooses splits, a different subsample estimates leaf values. Each leaf approximates a local CATE. Aggregating across trees gives the IATE function.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Step 5 uses &lt;code>CausalForestDML&lt;/code> from EconML with 400 honest trees. The IATE function it returns is what powers Step 6&amp;rsquo;s assignment rule. Variable importance flags &lt;code>dutch_prof&lt;/code> and &lt;code>prior_emp_months&lt;/code> as the strongest moderators.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>A panel of judges, each on a slightly different jury. Each judge votes a verdict for the case in front of them. Average the verdicts to get the panel&amp;rsquo;s call. Honesty ensures no judge writes the rubric they then enforce.&lt;/p>
&lt;/details>
&lt;/div>
&lt;p>&lt;strong>8. Welfare-maximising assignment rule.&lt;/strong>
A policy that treats units with $\hat\tau_i &amp;gt; 0$ and skips those with $\hat\tau_i \le 0$. Maximises predicted welfare given the IATE estimates. Benchmarked against &lt;em>treat-all&lt;/em> and an &lt;em>oracle&lt;/em> rule.&lt;/p>
&lt;div class="concept-pair">
&lt;details class="concept-card concept-example">
&lt;summary>Example&lt;/summary>
&lt;p>Step 6 evaluates three rules on a held-out sample. Treating everyone yields 5.520 months/person; the IATE rule yields 1.749 under the post&amp;rsquo;s welfare metric. The oracle (using true $\tau$) yields a figure similar to the IATE rule&amp;rsquo;s, suggesting the IATE rule is near-optimal under the simulation&amp;rsquo;s structure.&lt;/p>
&lt;/details>
&lt;details class="concept-card concept-analogy">
&lt;summary>Analogy&lt;/summary>
&lt;p>Giving training tickets only to people who would actually use them. Treat-all sends tickets to everyone. The IATE rule keeps tickets for the responders. The oracle is the rule a perfect-information planner would use.&lt;/p>
&lt;/details>
&lt;/div>
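&lt;p>A minimal sketch of how the three rules compare, using synthetic stand-in effect arrays rather than the tutorial&amp;rsquo;s estimates. All names and numbers below are hypothetical; the point is only the mechanics of scoring a rule against the truth.&lt;/p>

```python
# Hedged sketch: predicted welfare of three assignment rules on synthetic
# stand-in effects (NOT the tutorial's data or its welfare numbers).
import numpy as np

rng = np.random.default_rng(1)
tau_hat = rng.normal(loc=2.0, scale=3.0, size=1000)    # estimated effects
tau_true = tau_hat + rng.normal(scale=0.5, size=1000)  # unobserved truth

treat_all = np.ones(1000, dtype=bool)
iate_rule = tau_hat > 0        # treat only predicted responders
oracle_rule = tau_true > 0     # perfect-information benchmark

# Realised welfare per person = true effect when treated, zero otherwise.
for name, rule in [("treat-all", treat_all), ("IATE rule", iate_rule),
                   ("oracle", oracle_rule)]:
    print(f"{name:10s}: {np.where(rule, tau_true, 0.0).mean():.3f}")
```

&lt;p>The oracle is an upper bound by construction; the gap between the IATE rule and the oracle measures how much estimation error costs.&lt;/p>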
&lt;h2 id="setup-and-imports">Setup and imports&lt;/h2>
&lt;p>Before running anything, install the two CML libraries this tutorial depends on. &lt;code>doubleml&lt;/code> provides the cross-fitted, orthogonal-score machinery for averages; &lt;code>econml&lt;/code> provides the causal forest for individual effects.&lt;/p>
&lt;pre>&lt;code class="language-python">pip install doubleml econml # https://docs.doubleml.org https://econml.azurewebsites.net
&lt;/code>&lt;/pre>
&lt;p>The next block imports the stack and fixes the random seed. Setting &lt;code>np.random.seed(RANDOM_SEED)&lt;/code> is &lt;em>not&lt;/em> dead code: DoubleML&amp;rsquo;s internal cross-fit splitter uses the legacy global numpy RNG, so removing this line causes the ATE to drift by O(1e-3) across runs.&lt;/p>
&lt;pre>&lt;code class="language-python">import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from doubleml import DoubleMLData, DoubleMLIRM
from econml.dml import CausalForestDML
# Silence only the predictable noise from the third-party CML stack;
# real deprecation / convergence warnings still surface.
warnings.filterwarnings(&amp;quot;ignore&amp;quot;, category=FutureWarning)
warnings.filterwarnings(&amp;quot;ignore&amp;quot;, category=UserWarning)
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
X_COLS = [&amp;quot;age&amp;quot;, &amp;quot;edu_years&amp;quot;, &amp;quot;prior_emp_months&amp;quot;, &amp;quot;dutch_prof&amp;quot;, &amp;quot;female&amp;quot;, &amp;quot;migrant&amp;quot;]
&lt;/code>&lt;/pre>
&lt;p>The two CSVs read in the next section (&lt;code>cml_data.csv&lt;/code> for the observed columns and &lt;code>cml_truth.csv&lt;/code> for the hidden ground truth) ship with this post&amp;rsquo;s page bundle. If you&amp;rsquo;re following along outside the bundle, you can regenerate them by running &lt;a href="script.py">&lt;code>script.py&lt;/code>&lt;/a> once — it produces both files plus all six figures.&lt;/p>
&lt;h2 id="data-a-synthetic-almp-cohort">Data: a synthetic ALMP cohort&lt;/h2>
&lt;p>The dataset is a synthetic Flemish-ALMP-style cohort of 5,000 jobseekers. Each row records six pre-treatment covariates ($X$) — age, years of education, months employed in the look-back window, Dutch proficiency on a 0–3 scale, sex, and migrant status — a binary treatment indicator $D$ (whether the jobseeker received training), and an outcome $Y$ measuring months employed during a 30-month follow-up window. Because the data are synthetic, a companion file (&lt;code>cml_truth.csv&lt;/code>) stores the &lt;em>true&lt;/em> individual treatment effect $\tau_i$ for every row, which lets us benchmark each estimator. The reader does not need to know how the data were generated; only that the truth is known.&lt;/p>
&lt;pre>&lt;code class="language-python">df = pd.read_csv(&amp;quot;cml_data.csv&amp;quot;)
truth = pd.read_csv(&amp;quot;cml_truth.csv&amp;quot;)
print(f&amp;quot;Sample size : {len(df):,}&amp;quot;)
print(f&amp;quot;Treatment share P(D=1) : {df['D'].mean():.3f}&amp;quot;)
print(f&amp;quot;Mean outcome E[Y] : {df['Y'].mean():.2f} months employed (out of 30)&amp;quot;)
print(df.describe().round(2))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Sample size : 5,000
Treatment share P(D=1) : 0.528
Mean outcome E[Y] : 22.68 months employed (out of 30)
age edu_years prior_emp_months dutch_prof female migrant D Y
count 5000.00 5000.00 5000.00 5000.00 5000.00 5000.00 5000.00 5000.00
mean 39.82 12.02 16.99 1.33 0.49 0.30 0.53 22.68
std 11.54 2.95 9.59 1.02 0.50 0.46 0.50 4.18
min 20.02 6.00 0.37 0.00 0.00 0.00 0.00 9.81
25% 29.78 10.01 9.49 0.00 0.00 0.00 0.00 19.73
50% 39.68 11.94 15.80 1.00 0.00 0.00 1.00 22.81
75% 49.95 14.01 23.33 2.00 1.00 1.00 1.00 25.79
max 59.99 20.00 54.75 3.00 1.00 1.00 1.00 30.00
&lt;/code>&lt;/pre>
&lt;p>The cohort is 5,000 jobseekers aged 20–60 (mean 39.8) with about 12 years of education and 17 months of prior employment in the look-back window. The treatment share of 52.8% is high relative to a real-world ALMP study, but it is calibrated so that propensity scores stay safely inside [0.21, 0.81] and so that overlap is preserved across all four Dutch-proficiency strata. The outcome — months employed in the 30-month window — has a mean of 22.68 and a standard deviation of 4.18, leaving plenty of room for a realistic 5-to-8-month treatment effect to be visible without bumping into the floor of zero or the ceiling of thirty.&lt;/p>
&lt;p>The generating script also pins down the &lt;em>true&lt;/em> parameters for later benchmarking. The true ATE is 5.628 months, and the true GATEs decline monotonically with Dutch proficiency: 7.634 (no Dutch), 6.123 (low), 4.612 (intermediate), 3.130 (native). In words, jobseekers who do not speak Dutch benefit roughly 2.4× more from training than those who already do &amp;mdash; a pattern that mirrors the policy-relevant punchline of the Cockx, Lechner &amp;amp; Bollens (2023) study.&lt;/p>
&lt;h2 id="estimands-ate-gate-and-iate">Estimands: ATE, GATE, and IATE&lt;/h2>
&lt;p>Before estimating anything, we have to be precise about &lt;em>what&lt;/em> we are estimating. Causal Machine Learning targets three estimands of increasing granularity. Throughout the post, $Y(1)$ denotes the &lt;em>potential outcome&lt;/em> under treatment and $Y(0)$ the potential outcome without it. Only one of these is observed for each person; the other is the counterfactual that estimation tries to recover.&lt;/p>
&lt;p>The &lt;strong>Average Treatment Effect (ATE)&lt;/strong> is the mean effect of training across the entire population:&lt;/p>
&lt;p>$$\text{ATE} = E[Y(1) - Y(0)]$$&lt;/p>
&lt;p>In words, this says: average the per-person treatment effect over everyone in the population. In code, this is the quantity &lt;code>DoubleMLIRM&lt;/code> returns in &lt;code>dml_irm.coef[0]&lt;/code> after a single call to &lt;code>.fit()&lt;/code>.&lt;/p>
&lt;p>The &lt;strong>Group Average Treatment Effect (GATE)&lt;/strong> restricts the average to a subgroup defined by a categorical variable $Z$:&lt;/p>
&lt;p>$$\text{GATE}(z) = E[Y(1) - Y(0) \mid Z = z]$$&lt;/p>
&lt;p>In words, this says: average the per-person effect only over people who share the value $Z = z$. We use $Z$ = &lt;code>dutch_prof&lt;/code>, so $z \in \{0, 1, 2, 3\}$. In code, the GATE is computed by averaging the doubly-robust pseudo-outcome (defined later) within each value of &lt;code>df[&amp;quot;dutch_prof&amp;quot;]&lt;/code>.&lt;/p>
&lt;p>The &lt;strong>Individual Average Treatment Effect (IATE)&lt;/strong> goes one level deeper, conditioning on the full covariate vector $X$:&lt;/p>
&lt;p>$$\text{IATE}(x) = E[Y(1) - Y(0) \mid X = x]$$&lt;/p>
&lt;p>In words, this says: at every covariate profile $x$, predict the effect of training for somebody with that profile. In code, the IATE is the per-row prediction returned by &lt;code>cf.effect(X_arr)&lt;/code> after fitting &lt;code>CausalForestDML&lt;/code>.&lt;/p>
&lt;p>The framing of this post is &lt;strong>observational&lt;/strong> — we assume &lt;em>unconfoundedness&lt;/em>: conditional on $X$, treatment assignment is as good as random. The naive difference-in-means is therefore &lt;em>genuinely biased&lt;/em> on these data, not just imprecise. CML methods earn their keep by addressing that confounding through flexible nuisance estimators and orthogonal scores.&lt;/p>
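&lt;p>Because the three estimands are plain expectations over potential outcomes, a toy simulation in which &lt;em>both&lt;/em> $Y(0)$ and $Y(1)$ are visible makes the definitions concrete. The mini-DGP below is hypothetical and unrelated to the tutorial&amp;rsquo;s data; it exists only so the expectations can be computed directly.&lt;/p>

```python
# Hedged sketch: ATE, GATE, IATE on a toy DGP where BOTH potential
# outcomes are visible -- possible only in simulation, never in real data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10_000
z = rng.integers(0, 4, size=n)            # a toy group variable, 0..3
x = rng.normal(size=n)                    # one extra covariate
y0 = 20 + x + rng.normal(size=n)          # potential outcome, untreated
tau = 7 - 1.5 * z + 0.5 * x               # true individual effect
y1 = y0 + tau                             # potential outcome, treated

ate = (y1 - y0).mean()                    # E[Y(1) - Y(0)]
gate = pd.Series(y1 - y0).groupby(z).mean()  # E[Y(1) - Y(0) | Z = z]
print(f"ATE  : {ate:.2f}")
print("GATE :", gate.round(2).tolist())
# The IATE is the function tau(x, z) itself; with the truth in hand it
# needs no estimation. Estimation starts once y0 or y1 is hidden.
```

&lt;p>With the truth visible the estimands are trivial averages; everything after this point in the tutorial is about recovering them when one potential outcome per row is missing.&lt;/p>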
&lt;h2 id="step-1--overlap-diagnostic">Step 1 — Overlap diagnostic&lt;/h2>
&lt;p>Causal estimation under unconfoundedness only works if every covariate profile has a non-trivial chance of being treated &lt;em>and&lt;/em> a non-trivial chance of being untreated. Otherwise the model is forced to extrapolate, and small modelling mistakes blow up. The standard diagnostic is to fit a propensity score $\hat{\pi}(X) = \widehat{P}(D = 1 \mid X)$ and check that the histograms of $\hat{\pi}$ for treated and untreated jobseekers overlap. We use a logistic regression here purely for visualisation — DoubleML and CausalForestDML will fit their own nuisance models later.&lt;/p>
&lt;pre>&lt;code class="language-python">ps_lr = LogisticRegression(max_iter=1000, random_state=RANDOM_SEED).fit(df[X_COLS], df[&amp;quot;D&amp;quot;])
ps_hat = ps_lr.predict_proba(df[X_COLS])[:, 1]
print(f&amp;quot;Propensity range : [{ps_hat.min():.3f}, {ps_hat.max():.3f}]&amp;quot;)
print(f&amp;quot;P(D=1 | X) mean (treated) : {ps_hat[df['D']==1].mean():.3f}&amp;quot;)
print(f&amp;quot;P(D=1 | X) mean (untreat.): {ps_hat[df['D']==0].mean():.3f}&amp;quot;)
fig, ax = plt.subplots(figsize=(8.5, 5))
bins = np.linspace(0, 1, 31)
ax.hist(ps_hat[df[&amp;quot;D&amp;quot;] == 0], bins=bins, alpha=0.65, color=&amp;quot;#6a9bcc&amp;quot;,
label=&amp;quot;Untreated (D=0)&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.hist(ps_hat[df[&amp;quot;D&amp;quot;] == 1], bins=bins, alpha=0.65, color=&amp;quot;#d97757&amp;quot;,
label=&amp;quot;Treated (D=1)&amp;quot;, edgecolor=&amp;quot;white&amp;quot;)
ax.set_xlabel(r&amp;quot;Estimated propensity score $\hat{\pi}(X)$&amp;quot;)
ax.set_ylabel(&amp;quot;Number of individuals&amp;quot;)
ax.set_title(&amp;quot;Covariate overlap: propensity-score distribution by treatment status&amp;quot;)
ax.legend()
plt.savefig(&amp;quot;cml_overlap.png&amp;quot;, dpi=300, bbox_inches=&amp;quot;tight&amp;quot;)
plt.show()
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">Propensity range : [0.208, 0.810]
P(D=1 | X) mean (treated) : 0.551
P(D=1 | X) mean (untreat.): 0.502
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="cml_overlap.png" alt="Histogram of estimated propensity scores split by treatment status; the two distributions overlap heavily across the [0.2, 0.8] range.">&lt;/p>
&lt;p>Estimated propensities fall safely inside [0.21, 0.81], so neither the strict positivity assumption nor the conventional [0.05, 0.95] trimming bounds bind. The treated mean propensity (0.551) sits only 0.049 above the untreated mean (0.502) — a small but real gap that confirms the data are mildly &lt;em>confounded&lt;/em> rather than randomised. That is exactly the regime where doubly-robust methods are designed to outperform a naive baseline: confounding is real, but not so severe that any sensible adjustment will close the gap. Now that overlap is established, we can move on to the simplest possible estimator and watch it fail.&lt;/p>
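&lt;p>If you want the trimming claim as a number rather than a picture, the check is one line of boolean masking. The sketch below runs on a simulated propensity array; with the tutorial&amp;rsquo;s data you would pass the &lt;code>ps_hat&lt;/code> computed above instead.&lt;/p>

```python
# Hedged sketch: share of units outside common trimming bounds, applied to
# a simulated stand-in propensity array (not the tutorial's ps_hat).
import numpy as np

rng = np.random.default_rng(3)
ps = rng.beta(8, 7, size=5000)  # stand-in scores concentrated mid-range

for lo, hi in [(0.05, 0.95), (0.01, 0.99)]:
    share = ((ps < lo) | (ps > hi)).mean()
    print(f"share outside [{lo}, {hi}]: {share:.4f}")
```

&lt;p>A share near zero means trimming discards almost nothing; a large share is a warning that the estimand is being redefined by the trimming rule.&lt;/p>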
&lt;h2 id="step-2--naive-baseline-difference-in-means">Step 2 — Naive baseline: difference-in-means&lt;/h2>
&lt;p>The simplest estimator of an average treatment effect is the difference of two sample means: average $Y$ for the treated, average $Y$ for the untreated, subtract. Under random assignment this would be unbiased; with &lt;em>observational&lt;/em> data, where treatment is correlated with the covariates, it generally is not. We compute it here precisely so we can see the bias.&lt;/p>
&lt;pre>&lt;code class="language-python">y_treated = df.loc[df[&amp;quot;D&amp;quot;] == 1, &amp;quot;Y&amp;quot;].mean()
y_untreated = df.loc[df[&amp;quot;D&amp;quot;] == 0, &amp;quot;Y&amp;quot;].mean()
naive_ate = y_treated - y_untreated
n1, n0 = int((df[&amp;quot;D&amp;quot;] == 1).sum()), int((df[&amp;quot;D&amp;quot;] == 0).sum())
s1, s0 = df.loc[df[&amp;quot;D&amp;quot;] == 1, &amp;quot;Y&amp;quot;].var(ddof=1), df.loc[df[&amp;quot;D&amp;quot;] == 0, &amp;quot;Y&amp;quot;].var(ddof=1)
naive_se = float(np.sqrt(s1 / n1 + s0 / n0))
print(f&amp;quot;True ATE : 5.628&amp;quot;)
print(f&amp;quot;Naive estimate : {naive_ate:.3f} &amp;quot;
f&amp;quot;[95% CI {naive_ate - 1.96 * naive_se:.3f}, {naive_ate + 1.96 * naive_se:.3f}]&amp;quot;)
print(f&amp;quot;Bias : {naive_ate - 5.628:+.3f} months&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">True ATE : 5.628
Naive estimate : 5.111 [95% CI 4.926, 5.296]
Bias : -0.517 months
&lt;/code>&lt;/pre>
&lt;p>The naive difference-in-means delivers 5.111 months with a Welch-style 95% confidence interval of [4.93, 5.30]. Because we know the truth, we can see that the estimator is biased downward by 0.52 months — about 9.2% of the truth — and that its 95% CI &lt;strong>fails to cover&lt;/strong> the true ATE of 5.628. Why? Because in the synthetic DGP, caseworkers steer low-Dutch-proficiency jobseekers (those with the &lt;em>largest&lt;/em> treatment effects) into training, and those same jobseekers also have shorter prior employment and weaker employability. Their outcomes are pulled down by everything the covariates capture, and a simple comparison cannot disentangle the programme&amp;rsquo;s effect from the selection effect. This is a textbook illustration of confounding: &amp;ldquo;the programme seems to work less well than it really does&amp;rdquo; can be an artefact of who got selected into it.&lt;/p>
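&lt;p>The selection story can be reproduced in a few lines. The toy DGP below is a deliberately crude, hypothetical stand-in (not the tutorial&amp;rsquo;s generator): units with the largest effects get the lowest baselines and the highest treatment probability, and the naive contrast duly understates the truth.&lt;/p>

```python
# Hedged sketch: confounding in miniature. High-benefit units have LOW
# baseline outcomes and HIGH treatment probability, so the naive
# difference-in-means is biased downward, as in the tutorial.
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
weak = rng.random(n) < 0.4             # "weak labour-market position"
tau = np.where(weak, 8.0, 4.0)         # weak units benefit more ...
y0 = np.where(weak, 15.0, 24.0) + rng.normal(size=n)  # ... from a lower base
d = rng.random(n) < np.where(weak, 0.7, 0.4)  # caseworkers target weak units
y = y0 + d * tau

naive = y[d].mean() - y[~d].mean()
print(f"true ATE  : {tau.mean():.2f}")
print(f"naive est : {naive:.2f}")  # pulled down by the selection effect
```

&lt;p>The sign of the bias depends on the selection pattern; here, as in the tutorial&amp;rsquo;s DGP, targeting low-baseline units drags the naive estimate below the truth.&lt;/p>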
&lt;h2 id="step-3--ate-via-double-machine-learning">Step 3 — ATE via Double Machine Learning&lt;/h2>
&lt;p>&lt;a href="https://docs.doubleml.org/stable/api/generated/doubleml.DoubleMLIRM.html" target="_blank" rel="noopener">&lt;code>DoubleMLIRM&lt;/code>&lt;/a> implements the &lt;strong>Interactive Regression Model&lt;/strong> of &lt;a href="https://doi.org/10.1111/ectj.12097" target="_blank" rel="noopener">Chernozhukov et al. (2018)&lt;/a>: a cross-fitted, doubly-robust estimator of the ATE under unconfoundedness. Cross-fitting — splitting the data into folds and predicting each fold using nuisance models trained on the other folds — prevents the random forests from overfitting to their own training sample and contaminating the score. A useful analogy is grading homework: imagine assessing each student using a rubric calibrated on &lt;em>other&lt;/em> students' papers, never their own — that way the rubric cannot have been tailored to inflate any individual grade. The doubly-robust score is &lt;em>orthogonal&lt;/em> to small mistakes in either nuisance, which is what gives the estimator its $\sqrt{n}$ rate even when the nuisances are themselves slow-converging machine-learning fits.&lt;/p>
&lt;p>The Interactive Regression Model uses two nuisance functions: an outcome regression $g(d, X) = E[Y \mid D = d, X]$ and a propensity score $m(X) = P(D = 1 \mid X)$. The doubly-robust ATE score, evaluated at observation $i$, is&lt;/p>
&lt;p>$$\psi_i = g_1(X_i) - g_0(X_i) + \frac{D_i \, \bigl(Y_i - g_1(X_i)\bigr)}{m(X_i)} - \frac{(1 - D_i) \, \bigl(Y_i - g_0(X_i)\bigr)}{1 - m(X_i)}.$$&lt;/p>
&lt;p>In words, this says: start from the pure outcome-regression contrast $g_1 - g_0$, and then add a residual correction that weighs each observation by the inverse of its propensity. The clever bit is that $E[\psi_i] = \text{ATE}$ as long as &lt;em>either&lt;/em> $g$ &lt;em>or&lt;/em> $m$ is correctly specified — that is the &amp;ldquo;double&amp;rdquo; in &lt;em>doubly&lt;/em> robust. In code, $g_0(X_i)$ and $g_1(X_i)$ correspond to &lt;code>dml_irm.predictions[&amp;quot;ml_g0&amp;quot;]&lt;/code> and &lt;code>[&amp;quot;ml_g1&amp;quot;]&lt;/code>, $m(X_i)$ to &lt;code>[&amp;quot;ml_m&amp;quot;]&lt;/code>, $D_i$ to &lt;code>df[&amp;quot;D&amp;quot;]&lt;/code>, and $Y_i$ to &lt;code>df[&amp;quot;Y&amp;quot;]&lt;/code>.&lt;/p>
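&lt;p>The &amp;ldquo;double&amp;rdquo; in doubly robust can be verified numerically. In the hypothetical toy below the propensity $m$ is exact while the outcome model $g$ is deliberately wrong (it predicts zero everywhere), yet the mean of $\psi_i$ still lands on the true ATE.&lt;/p>

```python
# Hedged sketch: double robustness on a toy DGP (not the tutorial's data).
# True propensity m, deliberately WRONG outcome model g = 0 everywhere;
# the mean of the DR score psi still recovers the true ATE of 3.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(size=n)
m = 1 / (1 + np.exp(-x))             # true propensity P(D=1|X)
d = rng.random(n) < m
y0 = 2 * x + rng.normal(size=n)      # true E[Y|D=0,X] = 2x
y1 = y0 + 3.0                        # constant true effect: ATE = 3
y = np.where(d, y1, y0)

g0_bad = np.zeros(n)                 # misspecified outcome regressions
g1_bad = np.zeros(n)
psi = (g1_bad - g0_bad
       + d * (y - g1_bad) / m
       - (1 - d) * (y - g0_bad) / (1 - m))
print(f"mean(psi) = {psi.mean():.3f}  (true ATE = 3.000)")
```

&lt;p>The symmetric experiment, correct $g$ with a wrong $m$, works just as well; only when &lt;em>both&lt;/em> nuisances are wrong does the score lose its target.&lt;/p>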
&lt;p>We fit DoubleML with random-forest nuisances and 5-fold cross-fitting, with &lt;code>trimming_threshold=0.01&lt;/code> to discard the (tiny) extreme tails of the propensity.&lt;/p>
&lt;pre>&lt;code class="language-python">dml_data = DoubleMLData(df, y_col=&amp;quot;Y&amp;quot;, d_cols=&amp;quot;D&amp;quot;, x_cols=X_COLS)
ml_g = RandomForestRegressor(n_estimators=200, max_features=&amp;quot;sqrt&amp;quot;,
min_samples_leaf=5, random_state=RANDOM_SEED, n_jobs=-1)
ml_m = RandomForestClassifier(n_estimators=200, max_features=&amp;quot;sqrt&amp;quot;,
min_samples_leaf=5, random_state=RANDOM_SEED, n_jobs=-1)
dml_irm = DoubleMLIRM(
dml_data, ml_g=ml_g, ml_m=ml_m,
n_folds=5, score=&amp;quot;ATE&amp;quot;, trimming_threshold=0.01,
)
dml_irm.fit(store_predictions=True)
ate_dml = float(dml_irm.coef[0])
se_dml = float(dml_irm.se[0])
ci = dml_irm.confint(level=0.95).iloc[0]
ci_low, ci_high = float(ci.iloc[0]), float(ci.iloc[1])
print(f&amp;quot;True ATE : 5.628&amp;quot;)
print(f&amp;quot;DoubleML ATE : {ate_dml:.3f} [95% CI {ci_low:.3f}, {ci_high:.3f}]&amp;quot;)
print(f&amp;quot;95% CI covers truth : {bool(ci_low &amp;lt;= 5.628 &amp;lt;= ci_high)}&amp;quot;)
print(f&amp;quot;Bias : {ate_dml - 5.628:+.3f} months&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">True ATE : 5.628
DoubleML ATE : 5.520 [95% CI 5.361, 5.680]
95% CI covers truth : True
Bias : -0.108 months
&lt;/code>&lt;/pre>
&lt;p>Once the random-forest nuisances absorb the dependence of both treatment assignment and the outcome on the covariates, the residual bias collapses from 0.517 to &lt;strong>0.108 months&lt;/strong> — about a 79% reduction — and the 95% CI [5.36, 5.68] now covers the true ATE. In substantive terms, the corrected estimate raises the implied programme effect from &amp;ldquo;about 5.1 extra months of employment&amp;rdquo; to &amp;ldquo;about 5.5 extra months&amp;rdquo; out of a 30-month window. The standard error also drops from 0.094 (naive) to 0.081, so DoubleML is not just less biased but also slightly &lt;em>more&lt;/em> precise — the cross-fitted nuisance models soak up outcome variance that the naive estimator leaves in the residual.&lt;/p>
&lt;h2 id="step-4--gate-by-dutch-proficiency">Step 4 — GATE by Dutch proficiency&lt;/h2>
&lt;p>The ATE answers &amp;ldquo;what is the average effect across the population?&amp;rdquo; — but a policymaker thinking about who to train wants the next layer down: &amp;ldquo;does the effect depend on who you are?&amp;rdquo;. The cleanest way to extract subgroup effects from a DoubleML fit is to compute the doubly-robust pseudo-outcome $\psi_i$ for every individual, and then &lt;em>average&lt;/em> it within each subgroup. This is the same $\psi_i$ as in the equation above; the trick is that $E[\psi_i \mid Z_i = z] = \text{GATE}(z)$, so a simple group-mean of the pseudo-outcomes is an unbiased estimator of the GATE.&lt;/p>
&lt;pre>&lt;code class="language-python">preds = dml_irm.predictions
g0 = np.asarray(preds[&amp;quot;ml_g0&amp;quot;]).squeeze()
g1 = np.asarray(preds[&amp;quot;ml_g1&amp;quot;]).squeeze()
m = np.asarray(preds[&amp;quot;ml_m&amp;quot;]).squeeze()
y_arr, d_arr = df[&amp;quot;Y&amp;quot;].values, df[&amp;quot;D&amp;quot;].values
psi = (g1 - g0
+ d_arr * (y_arr - g1) / m
- (1 - d_arr) * (y_arr - g0) / (1 - m))
rows = []
for z in [0, 1, 2, 3]:
mask = (df[&amp;quot;dutch_prof&amp;quot;] == z).values
psi_z = psi[mask]
est = psi_z.mean()
se = psi_z.std(ddof=1) / np.sqrt(mask.sum())
rows.append({&amp;quot;dutch_prof&amp;quot;: z, &amp;quot;n&amp;quot;: int(mask.sum()),
&amp;quot;gate_estimate&amp;quot;: est, &amp;quot;std_error&amp;quot;: se,
&amp;quot;ci_low&amp;quot;: est - 1.96 * se, &amp;quot;ci_high&amp;quot;: est + 1.96 * se})
gate_df = pd.DataFrame(rows)
print(gate_df.to_string(index=False, float_format=lambda v: f&amp;quot;{v:7.3f}&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> dutch_prof n gate_estimate std_error ci_low ci_high
0 1302 7.465 0.157 7.157 7.772
1 1469 6.127 0.140 5.852 6.402
2 1504 4.503 0.142 4.225 4.781
3 725 2.910 0.214 2.490 3.329
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="cml_gate_dutch.png" alt="Bar chart comparing the estimated GATE (steel blue) with the true GATE (warm orange) at each level of Dutch proficiency; both decline monotonically and the bars almost coincide.">&lt;/p>
&lt;p>Averaging the cross-fitted doubly-robust pseudo-outcomes within each Dutch-proficiency stratum recovers the monotone decline almost exactly: 7.47 / 6.13 / 4.50 / 2.91 estimated against 7.63 / 6.12 / 4.61 / 3.13 truth. Every estimate is within 0.22 months of its target, all four 95% confidence intervals cover their respective truths, and the ratio of the lowest-proficiency to highest-proficiency effect (≈ 2.6× under the estimates, 2.4× under the truths) lines up with the policy punchline of Cockx, Lechner &amp;amp; Bollens (2023): training delivers the biggest payoff to those who are furthest from the local-language labour market. As expected, standard errors widen for the smallest stratum (n = 725, SE 0.214) and tighten where data are densest (n = 1,504, SE 0.142). With clean group effects in hand, the natural next step is to push down to &lt;em>individual&lt;/em> effects.&lt;/p>
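&lt;p>If you want to formalise &amp;ldquo;the strata differ&amp;rdquo;, a Welch-style z-test on the group means of the pseudo-outcomes is the natural follow-up. The sketch below uses synthetic stand-ins for the per-stratum $\psi_i$ arrays, loosely calibrated to the table above, rather than the actual fitted values.&lt;/p>

```python
# Hedged sketch: testing GATE(z=0) != GATE(z=3) via a Welch-style z-test on
# group means of pseudo-outcomes. The psi arrays are synthetic stand-ins.
import math
import numpy as np

rng = np.random.default_rng(6)
psi_z0 = rng.normal(7.5, 5.5, size=1300)  # stand-in pseudo-outcomes, z = 0
psi_z3 = rng.normal(2.9, 5.5, size=725)   # stand-in pseudo-outcomes, z = 3

diff = psi_z0.mean() - psi_z3.mean()
se = math.sqrt(psi_z0.var(ddof=1) / len(psi_z0)
               + psi_z3.var(ddof=1) / len(psi_z3))
z_stat = diff / se
p_val = math.erfc(abs(z_stat) / math.sqrt(2))  # two-sided normal p-value
print(f"diff = {diff:.2f}, z = {z_stat:.1f}, p = {p_val:.2g}")
```

&lt;p>Because the pseudo-outcomes are (approximately) independent across individuals, the difference of group means comes with an honest standard error for free.&lt;/p>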
&lt;h2 id="step-5--iate-via-causal-forest-dml">Step 5 — IATE via Causal Forest DML&lt;/h2>
&lt;p>The GATE collapses every jobseeker in a Dutch-proficiency stratum into a single number. But two people with the same &lt;code>dutch_prof&lt;/code> value can still differ in age, education, prior employment, and migrant status, and the training programme might help them very differently. The &lt;strong>Individual Average Treatment Effect&lt;/strong> $\tau(x) = E[Y(1) - Y(0) \mid X = x]$ asks for a separate prediction at every covariate profile, and the &lt;a href="https://econml.azurewebsites.net/_autosummary/econml.dml.CausalForestDML.html" target="_blank" rel="noopener">&lt;code>CausalForestDML&lt;/code>&lt;/a> estimator from EconML — a Python implementation of the &lt;a href="https://doi.org/10.1214/18-AOS1709" target="_blank" rel="noopener">generalized random forest&lt;/a> framework of &lt;a href="https://doi.org/10.1214/18-AOS1709" target="_blank" rel="noopener">Athey, Tibshirani &amp;amp; Wager (2019)&lt;/a> — is one of the canonical ways to produce one. Think of a causal forest as a regular random forest, except the trees split on &lt;em>heterogeneity in the treatment effect&lt;/em> rather than on heterogeneity in the outcome — every leaf becomes a small neighbourhood within which the IATE is locally constant, and the forest averages many such trees together.&lt;/p>
&lt;pre>&lt;code class="language-python">cf = CausalForestDML(
model_y=RandomForestRegressor(n_estimators=200, min_samples_leaf=5,
random_state=RANDOM_SEED, n_jobs=-1),
model_t=RandomForestClassifier(n_estimators=200, min_samples_leaf=5,
random_state=RANDOM_SEED, n_jobs=-1),
discrete_treatment=True,
n_estimators=400, min_samples_leaf=15, max_samples=0.5,
random_state=RANDOM_SEED, n_jobs=-1,
)
X_arr = df[X_COLS].values
cf.fit(df[&amp;quot;Y&amp;quot;].values, df[&amp;quot;D&amp;quot;].values, X=X_arr)
iate_hat = np.asarray(cf.effect(X_arr)).ravel()
iate_low, iate_high = cf.effect_interval(X_arr, alpha=0.05)
mae = float(np.abs(iate_hat - truth[&amp;quot;tau&amp;quot;].values).mean())
corr = float(np.corrcoef(iate_hat, truth[&amp;quot;tau&amp;quot;].values)[0, 1])
print(f&amp;quot;True ATE : 5.628&amp;quot;)
print(f&amp;quot;Mean of estimated IATEs : {iate_hat.mean():.3f}&amp;quot;)
print(f&amp;quot;MAE(IATE, truth) : {mae:.3f}&amp;quot;)
print(f&amp;quot;Corr(IATE, truth) : {corr:.3f}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text">True ATE : 5.628
Mean of estimated IATEs : 5.456
MAE(IATE, truth) : 0.397
Corr(IATE, truth) : 0.956
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="cml_iate_scatter.png" alt="Scatter plot of estimated IATE against true individual effect τ for all 5,000 jobseekers, with a 45° reference line; points cluster tightly along the diagonal.">&lt;/p>
&lt;p>The forest produces 5,000 individual-level effect estimates whose Pearson correlation with the &lt;em>true&lt;/em> individual effects is &lt;strong>0.956&lt;/strong> and whose mean absolute error is just &lt;strong>0.40 months&lt;/strong>. The mean of the estimated IATEs (5.456) is within 0.17 months of the true ATE (5.628) — so the forest is not only ranking individuals correctly (the policy-relevant property) but also broadly calibrated in level. The 0.4-month MAE is small relative to the 4.5-month spread of true effects across individuals, which means an assignment rule built on these estimates can hope to identify &lt;em>which&lt;/em> jobseekers benefit most from training, not just whether the average effect is positive.&lt;/p>
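&lt;p>Because the policy rule in Step 7 consumes the &lt;em>ranking&lt;/em> of jobseekers rather than the levels, a rank (Spearman-style) correlation is a useful complement to the Pearson 0.956. The sketch below is self-contained and runs on simulated stand-ins for the true and estimated IATEs &amp;mdash; the 5.6 mean and 0.4 noise scale mirror the numbers above, but the data are illustrative, not the tutorial&amp;rsquo;s:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
tau_true = rng.normal(5.6, 1.2, size=5000)             # stand-in true IATEs
tau_hat = tau_true + rng.normal(0.0, 0.4, size=5000)   # estimates with ~0.4-month noise

def rank_corr(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra = a.argsort().argsort()
    rb = b.argsort().argsort()
    return float(np.corrcoef(ra, rb)[0, 1])

print(f"rank corr: {rank_corr(tau_true, tau_hat):.3f}")
```

&lt;p>A rank correlation close to the Pearson value, as here, is reassurance that the high correlation is not driven by a few extreme effects.&lt;/p>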
&lt;p>To check that the forest also recovers the GATE-style heterogeneity at the individual level, we look at the histogram of estimated IATEs split by Dutch proficiency.&lt;/p>
&lt;p>&lt;img src="cml_iate_distribution.png" alt="Histogram of estimated IATEs by Dutch proficiency (4 colours), with a dashed reference line at the true ATE of 5.63; distributions shift monotonically left as proficiency rises.">&lt;/p>
&lt;p>The four IATE distributions slide leftwards as Dutch proficiency rises — exactly the pattern the GATE bar chart showed at the group level — and their union centres on the true ATE. The forest is internally consistent with the GATE estimates, and the visible spread &lt;em>within&lt;/em> each colour shows that there is meaningful heterogeneity even among jobseekers who share the same &lt;code>dutch_prof&lt;/code> value.&lt;/p>
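&lt;p>The internal-consistency check just described amounts to a one-line groupby: average the estimated IATEs within each &lt;code>dutch_prof&lt;/code> stratum and compare with the GATEs. On the tutorial&amp;rsquo;s objects that would be &lt;code>df.assign(iate=iate_hat).groupby(&amp;quot;dutch_prof&amp;quot;)[&amp;quot;iate&amp;quot;].mean()&lt;/code>; the self-contained sketch below simulates IATEs around the estimated GATEs (7.47 / 6.13 / 4.50 / 2.91) to show the mechanics:&lt;/p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
gates = {0: 7.47, 1: 6.13, 2: 4.50, 3: 2.91}           # GATEs from the tutorial
prof = rng.integers(0, 4, size=5000)                    # simulated dutch_prof levels
iate = np.vectorize(gates.get)(prof) + rng.normal(0.0, 0.8, size=5000)

# within-stratum means of the IATEs should reproduce the GATE pattern
group_means = pd.Series(iate).groupby(prof).mean()
print(group_means.round(2))
```

&lt;p>If the within-stratum means disagreed with the GATEs, that would flag an inconsistency between the forest and the doubly-robust GATE pipeline.&lt;/p>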
&lt;h2 id="step-6--method-comparison">Step 6 — Method comparison&lt;/h2>
&lt;p>We now have three estimators of the ATE and one ground truth. A forest plot puts them side by side and lets the reader judge bias and CI coverage at a glance.&lt;/p>
&lt;pre>&lt;code class="language-python">comp = pd.DataFrame({
&amp;quot;method&amp;quot;: [&amp;quot;Naive (DiM)&amp;quot;, &amp;quot;DoubleML (IRM)&amp;quot;,
&amp;quot;CausalForestDML (mean of IATEs)&amp;quot;, &amp;quot;Truth&amp;quot;],
&amp;quot;estimate&amp;quot;: [naive_ate, ate_dml, iate_hat.mean(), 5.628],
&amp;quot;ci_low&amp;quot;: [4.926, 5.361, iate_hat.mean() - 1.96 * iate_hat.std(ddof=1) / np.sqrt(len(iate_hat)), 5.628],
&amp;quot;ci_high&amp;quot;: [5.296, 5.680, iate_hat.mean() + 1.96 * iate_hat.std(ddof=1) / np.sqrt(len(iate_hat)), 5.628],
})
comp[&amp;quot;bias&amp;quot;] = comp[&amp;quot;estimate&amp;quot;] - 5.628
print(comp.to_string(index=False, float_format=lambda v: f&amp;quot;{v:7.3f}&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> method estimate ci_low ci_high bias
Naive (DiM) 5.111 4.926 5.296 -0.517
DoubleML (IRM) 5.520 5.361 5.680 -0.108
CausalForestDML (mean of IATEs) 5.456 5.416 5.497 -0.172
Truth 5.628 5.628 5.628 0.000
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="cml_method_comparison.png" alt="Forest plot of point estimates and 95% CIs for the Naive (gray), DoubleML (steel blue) and CausalForestDML mean-of-IATEs (teal) estimators, with the truth (orange star) and a dashed reference line at the true ATE of 5.628.">&lt;/p>
&lt;p>The forest plot tells the story in a single panel. The &lt;strong>naive&lt;/strong> interval [4.93, 5.30] sits entirely below the true ATE — visually obvious confounding bias. &lt;strong>DoubleML&amp;rsquo;s&lt;/strong> [5.36, 5.68] straddles the truth and is the only interval among the three that delivers correct coverage. The &lt;strong>CausalForestDML&lt;/strong> mean-of-IATEs interval [5.42, 5.50] is the &lt;em>tightest&lt;/em> of the three — it pools 5,000 individual estimates so the average is precisely pinned — but it is in fact slightly too narrow, and its upper bound of 5.50 sits 0.13 months below truth. The reason is methodological: this CI captures sampling uncertainty in the &lt;em>average of individual predictions&lt;/em>, not in the population ATE itself, so it does not pick up the small downward calibration bias of the forest as a whole. The practical takeaway is to prefer DoubleML when the question is &amp;ldquo;what is the ATE?&amp;rdquo; and reserve CausalForestDML for ranking and heterogeneity.&lt;/p>
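&lt;p>The coverage failure is easy to reproduce in isolation: a CI built from $\text{std}/\sqrt{n}$ of the individual predictions shrinks with $n$, so even a small calibration bias in the predictions eventually pushes the truth outside the interval. A toy illustration, using the numbers above as stand-ins rather than the tutorial&amp;rsquo;s fitted forest:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(2)
TRUTH = 5.628
# stand-in forest IATEs: the 5.456 mean mimics the slight downward calibration bias
preds = rng.normal(5.456, 1.3, size=5000)

se = preds.std(ddof=1) / np.sqrt(len(preds))   # SE of the *average prediction*
lo = preds.mean() - 1.96 * se
hi = preds.mean() + 1.96 * se
covered = min(TRUTH - lo, hi - TRUTH) >= 0     # True iff the CI contains TRUTH
print(f"CI = [{lo:.3f}, {hi:.3f}], covers truth: {covered}")
```

&lt;p>The interval is only a few hundredths of a month wide, so a bias of 0.17 months is enough to miss the truth &amp;mdash; the same mechanism as in the forest plot.&lt;/p>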
&lt;h2 id="step-7--a-welfare-maximising-assignment-rule">Step 7 — A welfare-maximising assignment rule&lt;/h2>
&lt;p>The whole reason to estimate individual treatment effects, rather than stop at the average, is that they enable &lt;em>personalised&lt;/em> policy. Suppose training has a fixed cost equivalent to four months of employment per jobseeker. The welfare-optimal assignment rule is then trivial in principle: train person $i$ if and only if the &lt;em>true&lt;/em> effect $\tau_i$ exceeds the cost. We don&amp;rsquo;t know the truth in practice, so the obvious surrogate is to plug in the IATE estimate $\hat{\tau}_i$ from the causal forest.&lt;/p>
&lt;p>We benchmark four rules: treat &lt;em>no one&lt;/em>, treat &lt;em>everyone&lt;/em>, treat where $\hat{\tau}_i &amp;gt; 4$ (the IATE rule), and an &lt;em>oracle&lt;/em> that has access to the true $\tau_i$. Welfare under any rule is computed as&lt;/p>
&lt;p>$$W(\text{rule}) = E\bigl[\,\text{rule}(X) \cdot (\tau(X) - c)\,\bigr],$$&lt;/p>
&lt;p>where $c = 4$ months is the cost of training. In words, for every person the rule treats, we add their true treatment effect minus the cost; the welfare of a rule is the average of those net contributions across the cohort.&lt;/p>
&lt;pre>&lt;code class="language-python">COST = 4.0
assign_treat_none = np.zeros(len(df), dtype=int)
assign_treat_all = np.ones(len(df), dtype=int)
assign_iate_rule = (iate_hat &amp;gt; COST).astype(int)
assign_oracle = (truth[&amp;quot;tau&amp;quot;].values &amp;gt; COST).astype(int)
def welfare(rule, tau_true, cost):
return float((rule * (tau_true - cost)).mean())
policy = pd.DataFrame({
&amp;quot;rule&amp;quot;: [&amp;quot;Treat none&amp;quot;, &amp;quot;Treat all&amp;quot;,
&amp;quot;IATE rule (treat where iate_hat &amp;gt; cost)&amp;quot;,
&amp;quot;Oracle (treat where true tau &amp;gt; cost)&amp;quot;],
&amp;quot;share_treated&amp;quot;: [assign_treat_none.mean(), assign_treat_all.mean(),
assign_iate_rule.mean(), assign_oracle.mean()],
&amp;quot;avg_welfare&amp;quot;: [welfare(assign_treat_none, truth[&amp;quot;tau&amp;quot;].values, COST),
welfare(assign_treat_all, truth[&amp;quot;tau&amp;quot;].values, COST),
welfare(assign_iate_rule, truth[&amp;quot;tau&amp;quot;].values, COST),
welfare(assign_oracle, truth[&amp;quot;tau&amp;quot;].values, COST)],
})
print(policy.to_string(index=False, float_format=lambda v: f&amp;quot;{v:7.3f}&amp;quot;))
&lt;/code>&lt;/pre>
&lt;pre>&lt;code class="language-text"> rule share_treated avg_welfare
Treat none 0.000 0.000
Treat all 1.000 1.628
IATE rule (treat where iate_hat &amp;gt; cost) 0.839 1.749
Oracle (treat where true tau &amp;gt; cost) 0.838 1.758
&lt;/code>&lt;/pre>
&lt;p>&lt;img src="cml_policy_welfare.png" alt="Bar chart of average net welfare per individual under four rules: treat-none (0.00), treat-all (1.63), IATE rule (1.75), and oracle (1.76), with each bar annotated by the share of the cohort treated.">&lt;/p>
&lt;p>Once we have credible per-person effect estimates, the welfare comparison is striking. Holding training back from everyone yields zero net welfare. Treating everyone yields 1.63 months of net welfare per person — the ATE of 5.63 minus the cost of 4.0. Switching to a &lt;em>targeted&lt;/em> rule that trains only individuals with estimated IATE above the 4-month cost threshold treats 83.9% of the cohort — almost identical to the 83.8% the oracle would treat — and lifts welfare to &lt;strong>1.749 months per person, recovering 99.5% of the oracle&amp;rsquo;s 1.758-month welfare and beating treat-all by 7.4%&lt;/strong>. The IATE rule&amp;rsquo;s small remaining gap (just 0.009 months per person) reflects the 0.4-month MAE in the individual estimates: the rule occasionally treats a person it shouldn&amp;rsquo;t and skips a person it should, but those errors net out to a tiny welfare loss because the misranked individuals are concentrated near the cost cutoff where the welfare slope is shallow.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>We started with three questions. &lt;em>Does training cause more months of employment?&lt;/em> Yes — DoubleML estimates the ATE at 5.520 months [5.36, 5.68], and that 95% CI covers the true 5.628; the simpler naive comparison would have understated the effect by about half a month and produced a CI that misses the truth entirely. &lt;em>Does the effect depend on who the jobseeker is?&lt;/em> Strongly yes — the GATE declines monotonically from 7.47 months for jobseekers with no Dutch to 2.91 months for native speakers, a 2.6× ratio that is a real policy signal, not noise. &lt;em>Can we use those differences to assign training better?&lt;/em> Also yes — feeding the CausalForestDML&amp;rsquo;s IATE estimates into a simple &amp;ldquo;treat where $\hat{\tau}_i &amp;gt; c$&amp;rdquo; rule (with $c$ the per-jobseeker cost of training) captures 99.5% of the welfare an oracle would achieve and improves on treating everyone by 7.4%.&lt;/p>
&lt;p>The methodological discipline behind these answers is what separates CML from a &amp;ldquo;throw a random forest at it&amp;rdquo; approach. DoubleML&amp;rsquo;s cross-fitting and orthogonal scoring give the ATE estimator a $\sqrt{n}$ rate even with slow-converging machine-learning nuisances; the doubly-robust pseudo-outcome lets us reuse those nuisances for an internally consistent GATE without re-fitting; and the causal forest produces individual-level estimates that respect the same identification logic. A practitioner thinking about a real ALMP would now have a defensible answer to the question that matters most: not just &amp;ldquo;should we run this programme?&amp;rdquo; but &amp;ldquo;for whom?&amp;rdquo;.&lt;/p>
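&lt;p>The doubly-robust pseudo-outcome mentioned above has a compact form: $\psi_i = \hat g_1(X_i) - \hat g_0(X_i) + \frac{D_i\,(Y_i - \hat g_1(X_i))}{\hat m(X_i)} - \frac{(1-D_i)\,(Y_i - \hat g_0(X_i))}{1 - \hat m(X_i)}$, and its sample mean estimates the ATE. A toy numerical check with &lt;em>known&lt;/em> nuisances (a made-up DGP, not the tutorial&amp;rsquo;s fitted models) shows it recovering a built-in ATE of 5.6:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
X = rng.uniform(-2, 2, size=n)
m = 1.0 / (1.0 + np.exp(-X))     # true propensity score, bounded away from 0 and 1
D = rng.binomial(1, m)
g1 = 5.6 + X                      # true outcome regressions; built-in ATE = 5.6
g0 = 0.0 + X
Y = np.where(D == 1, g1, g0) + rng.normal(0.0, 2.0, size=n)

# AIPW / doubly-robust pseudo-outcome; averaging it estimates the ATE
psi = g1 - g0 + D * (Y - g1) / m - (1 - D) * (Y - g0) / (1 - m)
print(f"ATE estimate from mean(psi): {psi.mean():.2f}")
```

&lt;p>With estimated rather than true nuisances, cross-fitting keeps this average valid &amp;mdash; which is exactly the discipline the paragraph above describes.&lt;/p>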
&lt;p>The case study also surfaces a subtle but important caveat about &lt;em>which&lt;/em> tool to use for &lt;em>which&lt;/em> question. The CausalForestDML mean-of-IATEs has the tightest 95% CI of any estimator in the comparison, but that interval is for the &lt;em>average of individual predictions&lt;/em>, not for the population ATE. Its upper bound (5.50) does not cover the truth (5.628), and treating it as a competitor to the DoubleML interval would be a methodological mistake. &lt;strong>DoubleML for the ATE; causal forest for ranking and heterogeneity&lt;/strong> — that is the operational division of labour the literature recommends and that this case study demonstrates concretely.&lt;/p>
&lt;h2 id="limitations-and-next-steps">Limitations and next steps&lt;/h2>
&lt;p>The result is encouraging but rests on assumptions that are worth flagging carefully:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Synthetic data with easy overlap.&lt;/strong> Estimated propensities are bounded inside [0.21, 0.81] by construction, so neither the DoubleML &lt;code>trimming_threshold = 0.01&lt;/code> nor the doubly-robust pseudo-outcome&amp;rsquo;s division by $m$ and $1 - m$ is stressed on these data. In a real ALMP cohort, propensities can drift toward 0 or 1, the doubly-robust score becomes sensitive to small denominators, and trimming choices matter much more than they appear to here.&lt;/li>
&lt;li>&lt;strong>Unconfoundedness.&lt;/strong> Every causal claim assumes selection-on-observables: conditional on the six covariates, treatment assignment is as good as random. The synthetic DGP satisfies this by construction; in a real application this is the strong identifying assumption that justifies DoubleML and CausalForestDML over a naive comparison.&lt;/li>
&lt;li>&lt;strong>Treatment share.&lt;/strong> The cohort has 52.8% treated, which is higher than typical real-world ALMP studies. The synthetic DGP is calibrated to keep overlap comfortable in every stratum, so readers should not over-interpret the &lt;em>magnitude&lt;/em> of effects.&lt;/li>
&lt;li>&lt;strong>Forest CI is not a substitute for the DoubleML CI.&lt;/strong> The CausalForestDML mean-of-IATEs interval misses the truth even though the forest is well-calibrated overall. Use it for heterogeneity, not for ATE inference.&lt;/li>
&lt;li>&lt;strong>Cost is fixed and known.&lt;/strong> The welfare comparison takes the four-month cost as given. In practice the cost of an ALMP intervention is itself uncertain and could vary across jobseekers (administrative cost, opportunity cost, displacement effects), and the optimal assignment rule should propagate that uncertainty.&lt;/li>
&lt;/ul>
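&lt;p>The last bullet is straightforward to prototype. The sketch below draws a &lt;em>hypothetical&lt;/em> systematic cost level from a Normal(4, 0.5) distribution (an assumption for illustration, not part of the tutorial&amp;rsquo;s DGP), keeps the assignment rule anchored at the fixed 4-month threshold, and evaluates welfare under each cost scenario; the stand-in IATEs are simulated:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
tau_hat = rng.normal(5.5, 1.5, size=n)               # stand-in IATE estimates
tau_true = tau_hat + rng.normal(0.0, 0.4, size=n)    # stand-in true effects

cost_draws = rng.normal(4.0, 0.5, size=200)          # hypothetical systematic cost levels
rule = (tau_hat > 4.0).astype(int)                   # rule still uses the fixed threshold

# welfare of the fixed rule, re-evaluated under each cost scenario
welfare_draws = np.array([(rule * (tau_true - c)).mean() for c in cost_draws])
print(f"mean welfare: {welfare_draws.mean():.2f}, "
      f"scenario std: {welfare_draws.std():.2f}")
```

&lt;p>The spread across scenarios is roughly the treated share times the cost standard deviation, so the welfare ranking of rules can be fragile when costs are uncertain.&lt;/p>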
&lt;p>&lt;strong>Next steps&lt;/strong> to strengthen and extend the analysis:&lt;/p>
&lt;ul>
&lt;li>Replace the single Dutch-proficiency-based GATE with &lt;strong>policy trees&lt;/strong> (&lt;a href="https://doi.org/10.3982/ECTA15732" target="_blank" rel="noopener">Athey &amp;amp; Wager, 2021&lt;/a>), which learn the assignment rule directly from data rather than relying on a hand-picked stratification variable.&lt;/li>
&lt;li>Compare CausalForestDML against the &lt;strong>Modified Causal Forest (&lt;code>mcf&lt;/code>)&lt;/strong> package used in &lt;a href="https://doi.org/10.1016/j.labeco.2023.102306" target="_blank" rel="noopener">Cockx, Lechner &amp;amp; Bollens (2023)&lt;/a>, which targets exactly this setting.&lt;/li>
&lt;li>Stress-test overlap by drifting the propensity-score distribution toward 0 or 1 and re-running the full pipeline; observe how trimming choices and DR-score variance change.&lt;/li>
&lt;li>Extend to &lt;strong>multi-valued treatments&lt;/strong> (e.g., several training programmes) and use &lt;code>DoubleMLAPO&lt;/code> to estimate the average potential outcome for each arm.&lt;/li>
&lt;li>Run the doubly-robust pipeline on a &lt;strong>real ALMP dataset&lt;/strong> with weaker overlap and check whether the policy-relevant punchline (lower Dutch → larger benefit) survives outside the synthetic DGP.&lt;/li>
&lt;/ul>
&lt;h2 id="takeaways">Takeaways&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Naive difference-in-means is biased on observational data — visibly so.&lt;/strong> It estimates 5.111 months [4.93, 5.30] against a true ATE of 5.628, a 0.52-month downward bias whose 95% CI fails to cover the truth.&lt;/li>
&lt;li>&lt;strong>DoubleML closes 79% of the bias gap&lt;/strong> and delivers correct coverage. The IRM estimate of 5.520 [5.36, 5.68] both covers the true 5.628 and tightens the standard error from 0.094 (naive) to 0.081.&lt;/li>
&lt;li>&lt;strong>Effect heterogeneity by Dutch proficiency is real and policy-relevant.&lt;/strong> Estimated GATEs of 7.47 / 6.13 / 4.50 / 2.91 across levels 0–3 line up against truths 7.63 / 6.12 / 4.61 / 3.13, with all four 95% CIs covering their target.&lt;/li>
&lt;li>&lt;strong>CausalForestDML recovers the individual effect surface with 0.956 correlation and 0.40-month MAE&lt;/strong> — small relative to the 4.5-month spread of true effects across individuals.&lt;/li>
&lt;li>&lt;strong>A simple IATE-based assignment rule recovers 99.5% of oracle welfare&lt;/strong> (1.749 vs 1.758 months per person) and beats treat-all by 7.4% — the central practical reason to estimate individual effects in the first place.&lt;/li>
&lt;li>&lt;strong>CausalForestDML&amp;rsquo;s CI for the &lt;em>average&lt;/em> of IATEs is not a substitute for DoubleML&amp;rsquo;s CI for the ATE.&lt;/strong> The forest interval [5.42, 5.50] misses truth despite the forest being well-calibrated overall — a methodological subtlety worth remembering.&lt;/li>
&lt;li>&lt;strong>Easy overlap in this synthetic DGP is a feature of the case study, not a property of CML.&lt;/strong> Real-world ALMP applications will encounter tighter propensity bounds, and trimming will matter much more than it appears to here.&lt;/li>
&lt;li>&lt;strong>Next step.&lt;/strong> Replace the hand-picked Dutch-proficiency stratification with a learned policy tree to maximise welfare directly; compare CausalForestDML to the &lt;code>mcf&lt;/code> package on a real ALMP cohort.&lt;/li>
&lt;/ul>
&lt;h2 id="exercises">Exercises&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Change the cost.&lt;/strong> Re-run Step 7 with &lt;code>COST = 2.0&lt;/code> and &lt;code>COST = 6.0&lt;/code> months. How does the IATE rule&amp;rsquo;s share-treated change? At what cost does the rule converge to &amp;ldquo;treat all&amp;rdquo; or &amp;ldquo;treat none&amp;rdquo;, and does the welfare gap to the oracle widen or shrink?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Swap the nuisance learner.&lt;/strong> Re-fit &lt;code>DoubleMLIRM&lt;/code> with &lt;code>LassoCV&lt;/code> for &lt;code>ml_g&lt;/code> and &lt;code>LogisticRegressionCV&lt;/code> for &lt;code>ml_m&lt;/code>. Does the ATE estimate change meaningfully? Does the 95% CI still cover the truth, and is the standard error smaller or larger than with random forests?&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Stress-test heterogeneity.&lt;/strong> Compute the IATE separately for $X$ profiles that differ &lt;em>only&lt;/em> in &lt;code>migrant&lt;/code> (holding the other five covariates at their median values). Does the &lt;code>CausalForestDML&lt;/code> predict a clear &lt;code>migrant&lt;/code> effect, and is it consistent with the GATE pattern by Dutch proficiency?&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://doi.org/10.1186/s41937-023-00113-y" target="_blank" rel="noopener">Lechner, M. (2023). Causal Machine Learning and its use for public policy. &lt;em>Swiss Journal of Economics and Statistics&lt;/em>, 159(8).&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1016/j.labeco.2023.102306" target="_blank" rel="noopener">Cockx, B., Lechner, M. &amp;amp; Bollens, J. (2023). Priority to unemployed immigrants? A causal machine learning evaluation of training in Belgium. &lt;em>Labour Economics&lt;/em>, 80, 102306.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1111/ectj.12097" target="_blank" rel="noopener">Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. &amp;amp; Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. &lt;em>The Econometrics Journal&lt;/em>, 21(1), C1–C68.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.1214/18-AOS1709" target="_blank" rel="noopener">Athey, S., Tibshirani, J. &amp;amp; Wager, S. (2019). Generalized random forests. &lt;em>Annals of Statistics&lt;/em>, 47(2), 1148–1178.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://doi.org/10.3982/ECTA15732" target="_blank" rel="noopener">Athey, S. &amp;amp; Wager, S. (2021). Policy Learning with Observational Data. &lt;em>Econometrica&lt;/em>, 89(1), 133–161.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://docs.doubleml.org/" target="_blank" rel="noopener">DoubleML — Python Package for Double Machine Learning.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://econml.azurewebsites.net/" target="_blank" rel="noopener">EconML — Microsoft Research Python Package for Causal ML.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://mcfpy.github.io/mcf/" target="_blank" rel="noopener">Modified Causal Forest (&lt;code>mcf&lt;/code>) — Python Package.&lt;/a>&lt;/li>
&lt;/ol>
&lt;h4 id="acknowledgements">Acknowledgements&lt;/h4>
&lt;p>AI tools (Claude Code, Gemini, NotebookLM) were used to make the contents of this post more accessible to students. Nevertheless, the post may still contain errors, so caution is needed when applying its contents to real research projects.&lt;/p>