Interactive data dictionary

Double Machine Learning with 401(k) Pension Data

The real 1991 SIPP 401(k) eligibility & net-wealth sample behind a PLR / IRM / IIVM tutorial.

1 dataset · 14 variables · 9,915 households · SIPP year 1991

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| pension_raw | household (cross-section) | 9,915 × 14 | pension_raw.dta | pension_raw.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_doubleml_pension/data/"
use "${BASE}pension_raw.dta", clear
describe
notes

Python

# Colab / Jupyter only: the leading "!" runs pip as a shell command
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_doubleml_pension/data/"
df = pd.read_stata(BASE + "pension_raw.dta")

# load every dataset at once
files = ["pension_raw"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "pension_raw.dta", "pension_raw.dta")
df, meta = pyreadstat.read_dta("pension_raw.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_doubleml_pension/data/"
df <- read_dta(paste0(BASE, "pension_raw.dta"))

Overview & sources

Companion data for a hands-on Python tutorial that estimates the causal effect of 401(k) eligibility and participation on household net financial assets using Double Machine Learning (DML) with the DoubleML package. The dataset is the real 401(k) sample drawn from the 1991 U.S. Survey of Income and Program Participation (SIPP): the same nationally representative cross-section of 9,915 households used by Chernozhukov et al. (2018) and distributed with the DoubleML package via fetch_401K (originally constructed from the SIPP by Poterba, Venti & Wise). About 37% of households are eligible for a 401(k) and 26% participate. The post fits three estimators. Partially Linear Regression (PLR) and the doubly robust Interactive Regression Model (IRM) both target the Average Treatment Effect of eligibility; the Interactive IV Model (IIVM) instruments participation with eligibility to recover the Local Average Treatment Effect on compliers. Each estimator is fit with four ML learners (Lasso, Random Forest, Decision Tree, XGBoost) under cross-fitting.
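To make "doubly robust" concrete, here is a minimal sketch of the augmented inverse-probability-weighted (AIPW) score that the IRM averages to estimate the ATE. It assumes the nuisance predictions (outcome regressions for treated and untreated, and the propensity score) are already available; in the post they would come from cross-fitted ML learners. The function name aipw_ate and the toy numbers are hypothetical and are not taken from the post or the dataset.

```python
import numpy as np

def aipw_ate(y, d, g0_hat, g1_hat, m_hat):
    """ATE from the doubly robust (AIPW) score.

    y      : outcome (net_tfa in the post)
    d      : binary treatment (e401 in the post)
    g0_hat : predicted E[Y | D=0, X]
    g1_hat : predicted E[Y | D=1, X]
    m_hat  : predicted P(D=1 | X), strictly between 0 and 1
    """
    y, d, g0_hat, g1_hat, m_hat = (
        np.asarray(a, dtype=float) for a in (y, d, g0_hat, g1_hat, m_hat)
    )
    # regression difference plus inverse-probability-weighted residual corrections
    psi = (
        g1_hat - g0_hat
        + d * (y - g1_hat) / m_hat
        - (1.0 - d) * (y - g0_hat) / (1.0 - m_hat)
    )
    ate = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(psi))
    return ate, se

# Toy illustration (made-up numbers, not the SIPP data)
y = [1.0, 3.0, 2.0, 6.0]
d = [0, 1, 0, 1]
ate_hat, se_hat = aipw_ate(y, d, g0_hat=[1.5] * 4, g1_hat=[4.5] * 4, m_hat=[0.5] * 4)
print(ate_hat)
```

The score stays centered on the ATE if either the outcome regressions or the propensity score is estimated well, which is the sense in which the IRM is doubly robust.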

One file, one row per household. pension_raw is a cross-section (no time dimension): each of the 9,915 rows is a single U.S. household observed once in the 1991 SIPP. There is no panel/longitudinal structure — dynamic effects of 401(k) participation over time are not captured. e401 is the (conditionally exogenous) treatment / instrument; p401 is the endogenous treatment; net_tfa is the outcome; the remaining columns are demographic and financial covariates.
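The instrument / endogenous-treatment roles of e401 and p401 can be illustrated with the simplest possible IV estimator, the Wald ratio without covariates. This is only a sketch of the idea behind the LATE; the IIVM in the post additionally conditions on covariates with ML learners. The function name wald_late and the toy arrays are hypothetical.

```python
import numpy as np

def wald_late(y, d, z):
    """Unconditional Wald estimate of the LATE.

    y : outcome (net_tfa in the post)
    d : endogenous binary treatment (p401)
    z : binary instrument (e401)
    """
    y, d, z = (np.asarray(a, dtype=float) for a in (y, d, z))
    # intention-to-treat effect of the instrument on the outcome
    itt_y = y[z == 1].mean() - y[z == 0].mean()
    # first stage: effect of the instrument on treatment take-up
    first_stage = d[z == 1].mean() - d[z == 0].mean()
    return itt_y / first_stage, itt_y, first_stage

# Toy illustration (made-up numbers, not the SIPP data)
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 0, 1, 1, 0, 0]
y = [1, 2, 3, 4, 5, 6, 3, 4]
late, itt, take_up = wald_late(y, d, z)
print(late, itt, take_up)
```

Dividing the eligibility effect by the take-up difference rescales it to the households whose participation actually changes with eligibility (the compliers).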

Data sources

| Source | Provides | Reference / URL |
|---|---|---|
| 1991 SIPP (U.S. Census Bureau) | Underlying survey: net financial assets, income, demographics, 401(k) eligibility & participation | U.S. Census Bureau, Survey of Income and Program Participation (SIPP), 1991 wave. https://www.census.gov/programs-surveys/sipp.html |
| DoubleML fetch_401K | The exact analysis-ready extract loaded by the post (14 variables, 9,915 households) | DoubleML Python package, doubleml.datasets.fetch_401K. https://docs.doubleml.org/stable/api/generated/doubleml.datasets.fetch_401K.html |
| Poterba, Venti & Wise (1995) | Original construction of the SIPP 401(k) sample and the eligibility/participation framing | Poterba, J., Venti, S., & Wise, D. (1995). Do 401(k) contributions crowd out other personal saving? Journal of Public Economics, 58(1), 1–32. |
| Chernozhukov et al. (2018) | Method (Double/Debiased Machine Learning) and the canonical use of this 401(k) sample | Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. The Econometrics Journal, 21(1), C1–C68. |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Double Machine Learning with 401(k) Data: From Eligibility Effects to Complier Analysis [Data set]. https://carlos-mendez.org/post/python_doubleml_pension/

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097

Poterba, J., Venti, S., & Wise, D. (1995). Do 401(k) contributions crowd out other personal saving? Journal of Public Economics, 58(1), 1–32. https://doi.org/10.1016/0047-2727(94)01462-W

BibTeX

@misc{mendez2026pythondoublemlpension,
  author       = {Mendez, Carlos},
  title        = {Double Machine Learning with 401(k) Data: From Eligibility Effects to Complier Analysis},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_doubleml_pension/}},
  note         = {Data set}
}

@article{chernozhukov2018double,
  author  = {Chernozhukov, Victor and Chetverikov, Denis and Demirer, Mert and Duflo, Esther and Hansen, Christian and Newey, Whitney and Robins, James},
  title   = {Double/Debiased Machine Learning for Treatment and Structural Parameters},
  journal = {The Econometrics Journal},
  volume  = {21}, number = {1}, pages = {C1--C68}, year = {2018},
  doi     = {10.1111/ectj.12097}
}
@article{poterba1995do,
  author  = {Poterba, James and Venti, Steven and Wise, David},
  title   = {Do 401(k) contributions crowd out other personal saving?},
  journal = {Journal of Public Economics},
  volume  = {58}, number = {1}, pages = {1--32}, year = {1995},
  doi     = {10.1016/0047-2727(94)01462-W}
}

Variable explorer: all 14 variables


| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|---|
| age | continuous | min 25, median 40, max 64 | Age of household head (years) | Age of the reference person of the household. | years | pension_raw | 1991 SIPP / fetch_401K |
| db | dummy | share coded 1 = 0.271 | Defined-benefit pension (1=yes) | 1 if the household head has a defined-benefit pension, else 0. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| e401 | dummy | share coded 1 = 0.371 | 401(k) eligibility (1=eligible); treatment / instrument | 1 if the household is eligible for a 401(k) plan, else 0. Plausibly exogenous after conditioning on X; the treatment in PLR/IRM and the instrument in IIVM. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| educ | continuous | min 1, median 12, max 18 | Education (years) | Years of schooling of the household head. | years | pension_raw | 1991 SIPP / fetch_401K |
| fsize | continuous | min 1, median 3, max 13 | Family size (count) | Number of household members. | persons | pension_raw | 1991 SIPP / fetch_401K |
| hown | dummy | share coded 1 = 0.635 | Home ownership (1=owns) | 1 if the household owns its home, else 0. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| inc | continuous | min -2.65e+03, median 3.15e+04, max 2.42e+05 | Household income (US$) | Annual household income; the dominant confounder. | US$ | pension_raw | 1991 SIPP / fetch_401K |
| marr | dummy | share coded 1 = 0.605 | Married (1=yes) | Marital status of the household head. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| net_tfa | continuous | min -5.02e+05, median 1.5e+03, max 1.54e+06 | Net total financial assets (US$); outcome | Household net total financial assets; the outcome variable in the tutorial. | US$ | pension_raw | 1991 SIPP / fetch_401K |
| nifa | continuous | min 0, median 1.64e+03, max 1.43e+06 | Non-401(k) financial assets (US$) | Net financial assets held outside 401(k) accounts. | US$ | pension_raw | 1991 SIPP / fetch_401K |
| p401 | dummy | share coded 1 = 0.262 | 401(k) participation (1=participates); endogenous treatment | 1 if the household participates in a 401(k) plan, else 0. The endogenous treatment in IIVM (instrumented by e401). | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| pira | dummy | share coded 1 = 0.242 | IRA participation (1=yes) | 1 if the household participates in an IRA, else 0. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |
| tw | continuous | min -5.02e+05, median 2.51e+04, max 2.03e+06 | Total wealth (US$) | Household total wealth (broader than net_tfa). | US$ | pension_raw | 1991 SIPP / fetch_401K |
| twoearn | dummy | share coded 1 = 0.381 | Two-earner household (1=yes) | 1 if the household has two earners, else 0. | 0/1 | pension_raw | 1991 SIPP / fetch_401K |

Cross-file variable index

Which file each variable appears in (● = present).

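| Variable | pension_raw |
|---|---|
| age | ● |
| db | ● |
| e401 | ● |
| educ | ● |
| fsize | ● |
| hown | ● |
| inc | ● |
| marr | ● |
| net_tfa | ● |
| nifa | ● |
| p401 | ● |
| pira | ● |
| tw | ● |
| twoearn | ● |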

Construction & formulas

The post estimates the causal effect of 401(k) access on net financial assets with three Double Machine Learning (DML) models. In every model the nuisance functions are fit by ML learners (Lasso / Random Forest / Decision Tree / XGBoost) under cross-fitting, and the treatment parameter is recovered from a Neyman-orthogonal score.
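As a rough illustration of cross-fitting and the Neyman-orthogonal (partialling-out) score for the PLR, the sketch below uses only NumPy, with ordinary least squares standing in for the ML learners and simulated data standing in for the SIPP sample. It is not the post's code: the function names, the simulated design, and the true effect of 2.0 are all assumptions made for the example.

```python
import numpy as np

def ols_fit_predict(X_train, y_train, X_test):
    """Least-squares nuisance learner (a stand-in for Lasso, forests, etc.)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef

def plr_cross_fit(y, d, X, n_folds=5, seed=0):
    """Partialling-out PLR estimate of the treatment effect with K-fold cross-fitting."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    d_res = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # nuisance functions are fit on the other folds, then used to
        # residualize the held-out fold
        y_res[test_idx] = y[test_idx] - ols_fit_predict(X[train], y[train], X[test_idx])
        d_res[test_idx] = d[test_idx] - ols_fit_predict(X[train], d[train], X[test_idx])
    # solve the orthogonal score: mean((y_res - theta * d_res) * d_res) = 0
    theta = (d_res @ y_res) / (d_res @ d_res)
    psi = (y_res - theta * d_res) * d_res
    se = np.sqrt(np.mean(psi**2) / np.mean(d_res**2) ** 2 / n)
    return theta, se

# Simulated data with a known effect of 2.0 (not the SIPP data)
rng = np.random.default_rng(42)
n = 4000
X = rng.normal(size=(n, 3))
d = (X[:, 0] + rng.normal(size=n) > 0).astype(float)
y = 2.0 * d + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

theta_hat, se_hat = plr_cross_fit(y, d, X)
print(f"theta_hat = {theta_hat:.3f} (se {se_hat:.3f})")
```

Swapping ols_fit_predict for a flexible learner changes only the nuisance step; the residual-on-residual estimate and its standard error are computed the same way.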

Covariates (X). The base specification uses 9 covariates (age, inc, educ, fsize, marr, twoearn, db, pira, hown); a flexible specification adds quadratic terms for the four continuous covariates (age, inc, educ, fsize) so the Lasso can capture nonlinear confounding. nifa and tw are alternative wealth measures present in the source file but not used as outcome or covariates in the tutorial.
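The two covariate specifications can be sketched with pandas as follows. The column lists come from the paragraph above; the function name, the "_sq" suffix, and the two-row toy frame are hypothetical, and the post may construct its quadratic terms differently (for example with scaling or a polynomial transformer).

```python
import pandas as pd

BASE_COVARIATES = ["age", "inc", "educ", "fsize", "marr", "twoearn", "db", "pira", "hown"]
CONTINUOUS = ["age", "inc", "educ", "fsize"]

def build_flexible_features(df):
    """Return the 9 base covariates plus squared terms for the 4 continuous ones."""
    X = df[BASE_COVARIATES].copy()
    for col in CONTINUOUS:
        X[f"{col}_sq"] = X[col] ** 2  # hypothetical naming convention
    return X

# Toy frame with made-up values (not rows from pension_raw)
toy = pd.DataFrame({
    "age": [30, 45], "inc": [20000, 50000], "educ": [12, 16], "fsize": [2, 4],
    "marr": [0, 1], "twoearn": [0, 1], "db": [0, 1], "pira": [0, 0], "hown": [1, 1],
})
X_flex = build_flexible_features(toy)
print(X_flex.columns.tolist())
```

The dummies are left unsquared because squaring a 0/1 variable reproduces the same column.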

The datasets

Each dataset is listed with its full variable dictionary and a statistics table with distribution summaries and data coverage.


pension_raw · household (cross-section) · 9,915 × 14 · 1991 (SIPP survey year) · 9,915 U.S. households

Panel key: one row per household (no explicit id column)
Purpose: estimate the causal effect of 401(k) eligibility/participation on net financial assets (PLR / IRM / IIVM).

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| nifa (continuous) | Non-401(k) financial assets (US$) | Net financial assets held outside 401(k) accounts. | From the SIPP wealth module (financial assets excluding 401(k) balances). | US$ | 1991 SIPP / fetch_401K | 9,915 households |
| net_tfa (continuous) | Net total financial assets (US$); outcome | Household net total financial assets; the outcome variable in the tutorial. | Total financial assets net of liabilities, from the SIPP wealth module. | US$ | 1991 SIPP / fetch_401K | 9,915 households |
| tw (continuous) | Total wealth (US$) | Household total wealth (broader than net_tfa). | Total net worth from the SIPP wealth module; not used in the post's models. | US$ | 1991 SIPP / fetch_401K | 9,915 households |
| age (continuous) | Age of household head (years) | Age of the reference person of the household. | SIPP demographic record; sample restricted to working ages (25–64). | years | 1991 SIPP / fetch_401K | 9,915 households |
| inc (continuous) | Household income (US$) | Annual household income; the dominant confounder. | SIPP income record; covariate X in all models. | US$ | 1991 SIPP / fetch_401K | 9,915 households |
| fsize (continuous) | Family size (count) | Number of household members. | SIPP household roster; covariate X. | persons | 1991 SIPP / fetch_401K | 9,915 households |
| educ (continuous) | Education (years) | Years of schooling of the household head. | SIPP education record; covariate X. | years | 1991 SIPP / fetch_401K | 9,915 households |
| db (dummy) | Defined-benefit pension (1=yes) | 1 if the household head has a defined-benefit pension, else 0. | SIPP pension record; covariate X. | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| marr (dummy) | Married (1=yes) | Marital status of the household head. | SIPP demographic record; covariate X. | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| twoearn (dummy) | Two-earner household (1=yes) | 1 if the household has two earners, else 0. | SIPP income record; covariate X. | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| e401 (dummy) | 401(k) eligibility (1=eligible); treatment / instrument | 1 if the household is eligible for a 401(k) plan, else 0. Plausibly exogenous after conditioning on X; the treatment in PLR/IRM and the instrument in IIVM. | SIPP-derived eligibility flag (employer offers a 401(k)). Eligible: 3,682 / 9,915 (37.1%). | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| p401 (dummy) | 401(k) participation (1=participates); endogenous treatment | 1 if the household participates in a 401(k) plan, else 0. The endogenous treatment in IIVM (instrumented by e401). | SIPP-derived participation flag (household enrolled). Participating: 2,594 / 9,915 (26.2%). | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| pira (dummy) | IRA participation (1=yes) | 1 if the household participates in an IRA, else 0. | SIPP pension/saving record; covariate X. | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |
| hown (dummy) | Home ownership (1=owns) | 1 if the household owns its home, else 0. | SIPP housing record; covariate X. | 0/1 | 1991 SIPP / fetch_401K | 9,915 households |

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|---|
| nifa | min 0, median 1.64e+03, max 1.43e+06 | 100% | 9,915 | 3,034 | 0 | 13,929 | 1,635.0 | 1,430,298 | 54,905 |
| net_tfa | min -5.02e+05, median 1.5e+03, max 1.54e+06 | 100% | 9,915 | 5,168 | -502,302 | 18,052 | 1,499.0 | 1,536,798 | 63,523 |
| tw | min -5.02e+05, median 2.51e+04, max 2.03e+06 | 100% | 9,915 | 7,233 | -502,302 | 63,817 | 25,100 | 2,029,910 | 111,530 |
| age | min 25, median 40, max 64 | 100% | 9,915 | 40 | 25.00 | 41.06 | 40.00 | 64.00 | 10.34 |
| inc | min -2.65e+03, median 3.15e+04, max 2.42e+05 | 100% | 9,915 | 7,334 | -2,652.0 | 37,201 | 31,476 | 242,124 | 24,774 |
| fsize | min 1, median 3, max 13 | 100% | 9,915 | 13 | 1.00 | 2.87 | 3.00 | 13.00 | 1.54 |
| educ | min 1, median 12, max 18 | 100% | 9,915 | 18 | 1.00 | 13.21 | 12.00 | 18.00 | 2.81 |
| db | share coded 1 = 0.271 | 100% | 9,915 | 2 | 0 | 0.271 | 0 | 1.00 | 0.445 |
| marr | share coded 1 = 0.605 | 100% | 9,915 | 2 | 0 | 0.605 | 1.00 | 1.00 | 0.489 |
| twoearn | share coded 1 = 0.381 | 100% | 9,915 | 2 | 0 | 0.381 | 0 | 1.00 | 0.486 |
| e401 | share coded 1 = 0.371 | 100% | 9,915 | 2 | 0 | 0.371 | 0 | 1.00 | 0.483 |
| p401 | share coded 1 = 0.262 | 100% | 9,915 | 2 | 0 | 0.262 | 0 | 1.00 | 0.440 |
| pira | share coded 1 = 0.242 | 100% | 9,915 | 2 | 0 | 0.242 | 0 | 1.00 | 0.428 |
| hown | share coded 1 = 0.635 | 100% | 9,915 | 2 | 0 | 0.635 | 1.00 | 1.00 | 0.481 |
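A table with these columns could be recomputed from a loaded data frame along the following lines. This is a sketch, not the code that generated the page: the function name summarize is hypothetical, the toy frame holds made-up values, and the SD here is the sample standard deviation (pandas default, ddof=1), which may or may not match the convention used above.

```python
import pandas as pd

def summarize(df):
    """Per-variable coverage, counts, and moments for an all-numeric DataFrame."""
    return pd.DataFrame({
        "Coverage": df.notna().mean(),  # share of non-missing rows
        "N": df.notna().sum(),          # number of non-missing rows
        "Distinct": df.nunique(),       # distinct non-missing values
        "Min": df.min(),
        "Mean": df.mean(),
        "Median": df.median(),
        "Max": df.max(),
        "SD": df.std(),                 # sample SD (ddof=1)
    })

# Toy frame with made-up values (not rows from pension_raw)
toy_stats_input = pd.DataFrame({"e401": [0, 1, 1, 0], "age": [30, 40, 50, 60]})
stats = summarize(toy_stats_input)
print(stats)
```

For a 0/1 dummy the Mean column is the share coded 1, which is why it matches the "share coded 1" entries in the Distribution column.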

Known limitations & caveats