Downloads
Each dataset is available as a labeled Stata .dta file and as its source CSV file.
⇩ Download all data (ZIP)
stata_codebook.do
| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| pension_raw | household (cross-section) | 9,915 × 14 | pension_raw.dta | pension_raw.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_doubleml_pension/data/"
use "${BASE}pension_raw.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_doubleml_pension/data/"
df = pd.read_stata(BASE + "pension_raw.dta")
# load every dataset at once
files = ["pension_raw"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "pension_raw.dta", "pension_raw.dta")
df, meta = pyreadstat.read_dta("pension_raw.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_doubleml_pension/data/"
df <- read_dta(paste0(BASE, "pension_raw.dta"))
Overview & sources
Companion data for a hands-on Python tutorial that estimates the causal effect of 401(k) eligibility and participation on household net financial assets using Double Machine Learning (DoubleML). The dataset is the real 401(k) sample drawn from the 1991 U.S. Survey of Income and Program Participation (SIPP): the same nationally representative cross-section of 9,915 households used by Chernozhukov et al. (2018) and distributed with the DoubleML package via fetch_401K (originally constructed from the SIPP by Poterba, Venti & Wise). About 37% of households are eligible for a 401(k) and 26% participate. The post fits three estimators: Partially Linear Regression (PLR) and the doubly robust Interactive Regression Model (IRM), both targeting the Average Treatment Effect of eligibility, and the Interactive IV Model (IIVM), which instruments participation with eligibility to recover the Local Average Treatment Effect on compliers. Each is fit with four ML learners (Lasso, Random Forest, Decision Tree, XGBoost) under cross-fitting.
pension_raw is a cross-section (no time dimension): each of the 9,915 rows is a single U.S. household observed once in the 1991 SIPP. There is no panel/longitudinal structure — dynamic effects of 401(k) participation over time are not captured. e401 is the (conditionally exogenous) treatment / instrument; p401 is the endogenous treatment; net_tfa is the outcome; the remaining columns are demographic and financial covariates.
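The variable roles described above can be written down as a small configuration. This is a minimal sketch: the constant and function names are illustrative and do not come from the tutorial, and the covariate list is the nine-variable base specification named elsewhere on this page.

```python
# Variable roles in pension_raw (constant names are illustrative, not from the post).
OUTCOME = "net_tfa"        # net total financial assets
ELIGIBILITY = "e401"       # treatment in PLR/IRM, instrument in IIVM
PARTICIPATION = "p401"     # endogenous treatment in IIVM
COVARIATES = ["age", "inc", "educ", "fsize", "marr", "twoearn", "db", "pira", "hown"]
UNUSED = ["nifa", "tw"]    # alternative wealth measures, not modeled

ALL_COLUMNS = [OUTCOME, ELIGIBILITY, PARTICIPATION, *COVARIATES, *UNUSED]


def check_columns(columns):
    """Return the expected columns that are missing from `columns`."""
    present = set(columns)
    return [name for name in ALL_COLUMNS if name not in present]
```

With a DataFrame `df` loaded from pension_raw.dta, `check_columns(df.columns)` should return an empty list if all 14 variables are present.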
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| 1991 SIPP (U.S. Census Bureau) | Underlying survey: net financial assets, income, demographics, 401(k) eligibility & participation | U.S. Census Bureau, Survey of Income and Program Participation (SIPP), 1991 wave. https://www.census.gov/programs-surveys/sipp.html |
| DoubleML fetch_401K | The exact analysis-ready extract loaded by the post (14 variables, 9,915 households) | DoubleML Python package, doubleml.datasets.fetch_401K. https://docs.doubleml.org/stable/api/generated/doubleml.datasets.fetch_401K.html |
| Poterba, Venti & Wise (1995) | Original construction of the SIPP 401(k) sample and the eligibility/participation framing | Poterba, J., Venti, S., & Wise, D. (1995). Do 401(k) contributions crowd out other personal saving? Journal of Public Economics, 58(1), 1–32. |
| Chernozhukov et al. (2018) | Method (Double/Debiased Machine Learning) and the canonical use of this 401(k) sample | Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. The Econometrics Journal, 21(1), C1–C68. |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Double Machine Learning with 401(k) Data: From Eligibility Effects to Complier Analysis [Data set]. https://carlos-mendez.org/post/python_doubleml_pension/
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097
Poterba, J., Venti, S., & Wise, D. (1995). Do 401(k) contributions crowd out other personal saving? Journal of Public Economics, 58(1), 1–32. https://doi.org/10.1016/0047-2727(94)01462-W
BibTeX
@misc{mendez2026pythondoublemlpension,
author = {Mendez, Carlos},
title = {Double Machine Learning with 401(k) Data: From Eligibility Effects to Complier Analysis},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/python_doubleml_pension/}},
note = {Data set}
}
@article{chernozhukov2018double,
author = {Chernozhukov, Victor and Chetverikov, Denis and Demirer, Mert and Duflo, Esther and Hansen, Christian and Newey, Whitney and Robins, James},
title = {Double/Debiased Machine Learning for Treatment and Structural Parameters},
journal = {The Econometrics Journal},
volume = {21}, number = {1}, pages = {C1--C68}, year = {2018},
doi = {10.1111/ectj.12097}
}
@article{poterba1995do,
author = {Poterba, James and Venti, Steven and Wise, David},
title = {Do 401(k) contributions crowd out other personal saving?},
journal = {Journal of Public Economics},
volume = {58}, number = {1}, pages = {1--32}, year = {1995},
doi = {10.1016/0047-2727(94)01462-W}
}
Variable explorer (all 14 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| age | continuous | Age of household head (years) | Age of the reference person of the household. | years | pension_raw | 1991 SIPP / fetch_401K |
| db | dummy | Defined-benefit pension (1=yes) | 1 if the household head has a defined-benefit pension, else 0. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| e401 | dummy | 401(k) eligibility (1=eligible); treatment / instrument | 1 if the household is eligible for a 401(k) plan, else 0. Plausibly exogenous after conditioning on X; the treatment in PLR/IRM and the instrument in IIVM. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| educ | continuous | Education (years) | Years of schooling of the household head. | years | pension_raw | 1991 SIPP / fetch_401K |
| fsize | continuous | Family size (count) | Number of household members. | persons | pension_raw | 1991 SIPP / fetch_401K |
| hown | dummy | Home ownership (1=owns) | 1 if the household owns its home, else 0. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| inc | continuous | Household income (US$) | Annual household income; the dominant confounder. | US$ | pension_raw | 1991 SIPP / fetch_401K |
| marr | dummy | Married (1=yes) | Marital status of the household head. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| net_tfa | continuous | Net total financial assets (US$); outcome | Household net total financial assets; the outcome variable in the tutorial. | US$ | pension_raw | 1991 SIPP / fetch_401K |
| nifa | continuous | Non-401(k) financial assets (US$) | Net financial assets held outside 401(k) accounts. | US$ | pension_raw | 1991 SIPP / fetch_401K |
| p401 | dummy | 401(k) participation (1=participates); endogenous treatment | 1 if the household participates in a 401(k) plan, else 0. The endogenous treatment in IIVM (instrumented by e401). | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| pira | dummy | IRA participation (1=yes) | 1 if the household participates in an IRA, else 0. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| tw | continuous | Total wealth (US$) | Household total wealth (broader than net_tfa). | US$ | pension_raw | 1991 SIPP / fetch_401K |
| twoearn | dummy | Two-earner household (1=yes) | 1 if the household has two earners, else 0. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
Cross-file variable index
There is a single file, pension_raw, and all 14 variables appear in it.
Construction & formulas
The post estimates the causal effect of 401(k) access on net financial assets with three Double Machine Learning (DML) models. In every model the nuisance functions are fit by ML learners (Lasso / Random Forest / Decision Tree / XGBoost) under cross-fitting, and the treatment parameter is recovered from a Neyman-orthogonal score.
- Naive benchmark (biased): difference in mean `net_tfa` between groups, `Δ̂ = Ȳ(e401=1) − Ȳ(e401=0)` = $19,559 for eligibility (and $27,372 for participation). It conflates the causal effect with confounding from income, education, etc.
- PLR (Partially Linear Regression; estimand: ATE of eligibility): `Y = θ₀·D + g₀(X) + ε` with `D = m₀(X) + V`. Partial out `g₀` (outcome model) and `m₀` (treatment model), then regress the residuals; assumes a constant additive effect θ₀. Mean ATE ≈ $8,730.
- IRM (Interactive Regression Model; estimand: ATE of eligibility, doubly robust / AIPW): `θ₀ = E[ g₀(1,X) − g₀(0,X) + D·(Y−g₀(1,X))/m₀(X) − (1−D)·(Y−g₀(0,X))/(1−m₀(X)) ]`, where `m₀(X) = P(D=1|X)` is the propensity score; consistent if either g₀ or m₀ is correct; `trimming_threshold = 0.01`. Mean ATE ≈ $8,213.
- IIVM (Interactive IV Model; estimand: LATE on compliers): treatment = `p401` (participation, endogenous), instrument `Z = e401` (eligibility). Wald-type ratio `θ_LATE = (E[Y|Z=1] − E[Y|Z=0]) / (E[D|Z=1] − E[D|Z=0])`. Mean LATE ≈ $11,746.
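To make the formulas above concrete, here is a minimal standard-library sketch of the naive difference in means, the unconditional Wald ratio, and the AIPW score average. It is not the tutorial's code or the DoubleML implementation: there is no learner fitting or cross-fitting, the nuisance predictions `g1`, `g0`, and `m` are taken as given, the function names are hypothetical, and clipping propensities to `[trim, 1 - trim]` is an assumed way of applying the trimming threshold.

```python
from statistics import fmean


def diff_in_means(y, d):
    """Naive benchmark: mean of y where d == 1 minus mean of y where d == 0."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return fmean(treated) - fmean(control)


def wald_late(y, d, z):
    """Unconditional Wald ratio: effect of z on y divided by effect of z on d."""
    return diff_in_means(y, z) / diff_in_means(d, z)


def aipw_ate(y, d, g1, g0, m, trim=0.01):
    """Average of the doubly robust (AIPW) score for the ATE.

    g1, g0: predicted outcomes under D=1 and D=0 (already fitted elsewhere).
    m: predicted propensity scores P(D=1|X), clipped to [trim, 1 - trim].
    """
    scores = []
    for yi, di, g1i, g0i, mi in zip(y, d, g1, g0, m):
        mi = min(max(mi, trim), 1.0 - trim)  # assumed clipping rule
        scores.append(
            g1i - g0i
            + di * (yi - g1i) / mi
            - (1 - di) * (yi - g0i) / (1.0 - mi)
        )
    return fmean(scores)
```

In the tutorial's setting, `y` would be `net_tfa`, `d` would be `e401` (or `p401` for the Wald ratio), and `z` would be `e401`. The DML estimators differ from this sketch in that the nuisance functions are estimated by ML learners on held-out folds and the IIVM conditions on X.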
Covariates (X). The base specification uses 9 covariates
(age, inc, educ, fsize, marr, twoearn, db, pira, hown); a flexible specification adds
quadratic terms for the four continuous covariates (age, inc, educ, fsize) so the Lasso can capture
nonlinear confounding. nifa and tw are alternative wealth measures present
in the source file but not used as outcome or covariates in the tutorial.
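The flexible specification can be sketched as follows, using plain dictionaries for one household record. The `_sq` suffix and the function name are illustrative choices, not names from the tutorial.

```python
BASE_COVARIATES = ["age", "inc", "educ", "fsize", "marr", "twoearn", "db", "pira", "hown"]
CONTINUOUS = ["age", "inc", "educ", "fsize"]


def add_squares(row, columns=CONTINUOUS):
    """Return a copy of one record with a squared term per continuous covariate."""
    out = dict(row)
    for name in columns:
        out[f"{name}_sq"] = row[name] ** 2  # hypothetical column name
    return out


# Feature names of the flexible specification: 9 base covariates + 4 squares.
FLEXIBLE_COVARIATES = BASE_COVARIATES + [f"{name}_sq" for name in CONTINUOUS]
```

With a pandas DataFrame the same step is `df[name + "_sq"] = df[name] ** 2` for each continuous covariate; the dummies are left alone because squaring a 0/1 variable returns the same variable.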
The datasets
Variable dictionary
| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| nifa (continuous) | Non-401(k) financial assets (US$) | Net financial assets held outside 401(k) accounts. | From the SIPP wealth module (financial assets excluding 401(k) balances). | US$ | 1991 SIPP / fetch_401K | 9,915 households |
| net_tfa (continuous) | Net total financial assets (US$); outcome | Household net total financial assets; the outcome variable in the tutorial. | Total financial assets net of liabilities, from the SIPP wealth module. | US$ | 1991 SIPP / fetch_401K | 9,915 households |
| tw (continuous) | Total wealth (US$) | Household total wealth (broader than net_tfa). | Total net worth from the SIPP wealth module; not used in the post's models. | US$ | 1991 SIPP / fetch_401K | 9,915 households |
| age (continuous) | Age of household head (years) | Age of the reference person of the household. | SIPP demographic record; sample restricted to working ages (25–64). | years | 1991 SIPP / fetch_401K | 9,915 households |
| inc (continuous) | Household income (US$) | Annual household income; the dominant confounder. | SIPP income record; covariate X in all models. | US$ | 1991 SIPP / fetch_401K | 9,915 households |
| fsize (continuous) | Family size (count) | Number of household members. | SIPP household roster; covariate X. | persons | 1991 SIPP / fetch_401K | 9,915 households |
| educ (continuous) | Education (years) | Years of schooling of the household head. | SIPP education record; covariate X. | years | 1991 SIPP / fetch_401K | 9,915 households |
| db (dummy) | Defined-benefit pension (1=yes) | 1 if the household head has a defined-benefit pension, else 0. | SIPP pension record; covariate X. | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| marr (dummy) | Married (1=yes) | Marital status of the household head. | SIPP demographic record; covariate X. | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| twoearn (dummy) | Two-earner household (1=yes) | 1 if the household has two earners, else 0. | SIPP income record; covariate X. | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| e401 (dummy) | 401(k) eligibility (1=eligible); treatment / instrument | 1 if the household is eligible for a 401(k) plan, else 0. Plausibly exogenous after conditioning on X; the treatment in PLR/IRM and the instrument in IIVM. | SIPP-derived eligibility flag (employer offers a 401(k)). Eligible: 3,682 / 9,915 (37.1%). | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| p401 (dummy) | 401(k) participation (1=participates); endogenous treatment | 1 if the household participates in a 401(k) plan, else 0. The endogenous treatment in IIVM (instrumented by e401). | SIPP-derived participation flag (household enrolled). Participating: 2,594 / 9,915 (26.2%). | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| pira (dummy) | IRA participation (1=yes) | 1 if the household participates in an IRA, else 0. | SIPP pension/saving record; covariate X. | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| hown (dummy) | Home ownership (1=owns) | 1 if the household owns its home, else 0. | SIPP housing record; covariate X. | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| nifa | 100% | 9,915 | 3,034 | 0 | 13,929 | 1,635.0 | 1,430,298 | 54,905 |
| net_tfa | 100% | 9,915 | 5,168 | -502,302 | 18,052 | 1,499.0 | 1,536,798 | 63,523 |
| tw | 100% | 9,915 | 7,233 | -502,302 | 63,817 | 25,100 | 2,029,910 | 111,530 |
| age | 100% | 9,915 | 40 | 25.00 | 41.06 | 40.00 | 64.00 | 10.34 |
| inc | 100% | 9,915 | 7,334 | -2,652.0 | 37,201 | 31,476 | 242,124 | 24,774 |
| fsize | 100% | 9,915 | 13 | 1.00 | 2.87 | 3.00 | 13.00 | 1.54 |
| educ | 100% | 9,915 | 18 | 1.00 | 13.21 | 12.00 | 18.00 | 2.81 |
| db | 100% | 9,915 | 2 | 0 | 0.271 | 0 | 1.00 | 0.445 |
| marr | 100% | 9,915 | 2 | 0 | 0.605 | 1.00 | 1.00 | 0.489 |
| twoearn | 100% | 9,915 | 2 | 0 | 0.381 | 0 | 1.00 | 0.486 |
| e401 | 100% | 9,915 | 2 | 0 | 0.371 | 0 | 1.00 | 0.483 |
| p401 | 100% | 9,915 | 2 | 0 | 0.262 | 0 | 1.00 | 0.440 |
| pira | 100% | 9,915 | 2 | 0 | 0.242 | 0 | 1.00 | 0.428 |
| hown | 100% | 9,915 | 2 | 0 | 0.635 | 1.00 | 1.00 | 0.481 |
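One quick way to sanity-check the dummy rows of the statistics table: a 0/1 variable with mean p has standard deviation close to sqrt(p(1 − p)). The sketch below copies the means and SDs from the table and compares them; it is an illustrative consistency check, not part of the tutorial.

```python
from math import sqrt

# (mean, reported SD) pairs copied from the statistics table.
DUMMIES = {
    "db": (0.271, 0.445),
    "marr": (0.605, 0.489),
    "twoearn": (0.381, 0.486),
    "e401": (0.371, 0.483),
    "p401": (0.262, 0.440),
    "pira": (0.242, 0.428),
    "hown": (0.635, 0.481),
}


def bernoulli_sd(p):
    """Standard deviation of a 0/1 variable whose mean is p."""
    return sqrt(p * (1.0 - p))


# Largest gap between the implied and the reported SD across all dummies.
max_gap = max(abs(bernoulli_sd(p) - sd) for p, sd in DUMMIES.values())

# Shares implied by the counts in the variable dictionary.
eligible_share = 3682 / 9915
participating_share = 2594 / 9915
```

The gap stays below 0.001 for every dummy, which is within the rounding of the table, and the two shares round to the reported means of `e401` and `p401`.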
Known limitations & caveats
- Real survey data, single snapshot. These are real 1991 SIPP households, not simulated; but the cross-section is one point in time, so dynamic effects of 401(k) participation over time are not identified.
- Conditional exogeneity. The ATE estimates assume eligibility is as good as randomly assigned after conditioning on the observed covariates. Unobserved factors (e.g. financial literacy) that affect both eligibility and savings would bias the estimates.
- Extreme asset values. `net_tfa` ranges from −$502,302 to $1,536,798 (and `tw` up to $2,029,910); a handful of outliers influence the mean-based ATE estimates.
- IIVM identifies the LATE, not the ATE. The ≈$11,746 IIVM estimate applies only to compliers (households who participate because they are eligible). Identifying it already relies on the monotonicity assumption, and it should not be generalized to the full population without further assumptions such as homogeneous effects.
- nifa / tw are extras. The tutorial's outcome is `net_tfa`; `nifa` (non-401(k) financial assets) and `tw` (total wealth) ship in the source file but are not modeled and are included here for completeness.