Downloads
Each dataset is available as a labeled Stata .dta file and as its source CSV file.
Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| wfh_simulated_data | employee (cross-section) | 5,000 × 5 | wfh_simulated_data.dta | wfh_simulated_data.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy_intro/data/"
use "${BASE}wfh_simulated_data.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy_intro/data/"
df = pd.read_stata(BASE + "wfh_simulated_data.dta")
# load every dataset at once
files = ["wfh_simulated_data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "wfh_simulated_data.dta", "wfh_simulated_data.dta")
df, meta = pyreadstat.read_dta("wfh_simulated_data.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy_intro/data/"
df <- read_dta(paste0(BASE, "wfh_simulated_data.dta"))
Overview & sources
Companion data for a beginner-friendly causal-inference tutorial that uses DoWhy's four-step framework (Model, Identify, Estimate, Refute) to recover a known average treatment effect from observational data. The dataset is fully synthetic: 5,000 employees with a binary work-from-home treatment, a continuous productivity outcome, two confounders (introversion, number of children) that drive self-selection into treatment, and a subway-disruption instrument that satisfies the exclusion restriction by construction. Because the true causal effect is fixed at 1.0 productivity points, every estimator can be scored against the truth: the naive difference in means overshoots to 1.39, while the backdoor methods (regression, IPW, doubly robust) recover ~1.0 and IV (2SLS) returns 0.89 with a much wider interval. The entire data-generating process is open and reproducible.
wfh_simulated_data is a cross-section — one row per simulated employee (5,000 rows), with no time or panel dimension. There is no identifier column; rows are exchangeable draws from the data-generating process.
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values — simulated via a calibrated data-generating process (open & reproducible) | Mendez, C. (2026). See the post's Python script script.py (function generate_wfh_data) for the full DGP. |
| DoWhy | End-to-end causal-inference library implementing the Model/Identify/Estimate/Refute workflow | Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv:2011.04216. https://www.pywhy.org/dowhy/ |
| Method references | Estimators and identification concepts (backdoor/IV, propensity score, doubly robust) | Pearl (2009); Rosenbaum & Rubin (1983); Robins, Rotnitzky & Zhao (1994); Angrist & Pischke (2009). |
| Tutorial inspiration | Worked DoWhy walkthrough this beginner's guide adapts | DataCamp. Introduction to Causal AI using the DoWhy library in Python. https://www.datacamp.com/tutorial/intro-to-causal-ai-using-the-dowhy-library-in-python |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). A Beginner's Guide to Causal Inference with DoWhy in Python [Data set]. https://carlos-mendez.org/post/python_dowhy_intro/
Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv preprint arXiv:2011.04216.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
BibTeX
@misc{mendez2026pythondowhyintro,
author = {Mendez, Carlos},
title = {A Beginner's Guide to Causal Inference with DoWhy in Python},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/python_dowhy_intro/}},
note = {Data set}
}
@article{sharma2020dowhy,
author = {Sharma, Amit and Kiciman, Emre},
title = {DoWhy: An End-to-End Library for Causal Inference},
journal = {arXiv preprint arXiv:2011.04216},
year = {2020}
}
@book{pearl2009causality,
author = {Pearl, Judea},
title = {Causality: Models, Reasoning, and Inference},
edition = {2nd},
publisher = {Cambridge University Press},
year = {2009}
}
Variable explorer
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| introversion | continuous | Introversion (confounder) | Personality trait; higher values = more introverted. A confounder of WFH and productivity. | scale (~0-10) | wfh_simulated_data | Simulation |
| num_children | continuous | Number of children (confounder) | Count of children in the household. A confounder of WFH and productivity. | count | wfh_simulated_data | Simulation |
| productivity | continuous | Productivity score (outcome) | Continuous outcome: employee productivity, in points. | points | wfh_simulated_data | Simulation |
| subway_disruption | dummy | Subway line disrupted (instrument, 1=yes) | Binary instrument: 1 if the employee lives near a disrupted subway line. Affects treatment only (exclusion restriction). | 0/1 | wfh_simulated_data | Simulation |
| work_from_home | dummy | Works from home (treatment, 1=yes) | Binary treatment: 1 if the employee works from home, 0 if office-based. | 0/1 | wfh_simulated_data | Simulation |
Cross-file variable index
Which file each variable appears in (● = present).
| Variable | wfh_simulated_data |
|---|---|
| introversion | ● |
| num_children | ● |
| productivity | ● |
| subway_disruption | ● |
| work_from_home | ● |
Construction & formulas
The dataset is generated by a known structural model so that every estimator can be checked
against the truth. For employee i the causal graph is
subway_disruption → work_from_home → productivity, with
introversion and num_children pointing into both
treatment and outcome (confounders).
- Confounders. introversion ~ Normal(mean 5, sd 1.5); num_children ~ Poisson(1.5).
- Instrument. subway_disruption ~ Bernoulli(0.4); it affects treatment only.
- Treatment assignment (observational, non-random). logit P(WFH) = −1.5 + 0.3·introversion + 0.2·num_children + 1.0·subway_disruption; work_from_home ~ Bernoulli(P(WFH)).
- Outcome. productivity = 50 + 1.0·work_from_home + 0.8·introversion − 0.5·num_children + ε, with ε ~ Normal(0, 2). The instrument does not enter this equation, so the exclusion restriction holds by construction.
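The structural equations above translate almost line for line into NumPy. The following is a minimal sketch under stated assumptions, not the post's own `generate_wfh_data` function (which is not reproduced here): the helper name `simulate_wfh`, the seed, and the reading of `Normal(0, 2)` as a standard deviation of 2 are all illustrative choices, so the draws will not match the published file row for row.

```python
import numpy as np


def simulate_wfh(n=5000, seed=0):
    """Draw one synthetic sample from the structural equations listed above.

    Hypothetical helper for illustration; name and seed are arbitrary.
    """
    rng = np.random.default_rng(seed)

    # Confounders and instrument
    introversion = rng.normal(5.0, 1.5, n)        # Normal(mean 5, sd 1.5)
    num_children = rng.poisson(1.5, n)            # Poisson(1.5)
    subway_disruption = rng.binomial(1, 0.4, n)   # Bernoulli(0.4)

    # Treatment: logistic model in both confounders and the instrument
    logit = (-1.5 + 0.3 * introversion + 0.2 * num_children
             + 1.0 * subway_disruption)
    p_wfh = 1.0 / (1.0 + np.exp(-logit))
    work_from_home = rng.binomial(1, p_wfh)

    # Outcome: the instrument is deliberately absent (exclusion restriction).
    # The noise sd of 2 is an assumption about how Normal(0, 2) is parameterized.
    eps = rng.normal(0.0, 2.0, n)
    productivity = (50 + 1.0 * work_from_home + 0.8 * introversion
                    - 0.5 * num_children + eps)

    return {
        "work_from_home": work_from_home,
        "productivity": productivity,
        "introversion": introversion,
        "num_children": num_children,
        "subway_disruption": subway_disruption,
    }


data = simulate_wfh()
for name, values in data.items():
    print(f"{name:18s} mean = {values.mean():.3f}")
```

The printed means should land close to the documented summary statistics (for example, roughly two thirds of simulated employees working from home), with small differences due to the different random draws.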
The coefficient on work_from_home in the outcome equation is the
true average treatment effect (ATE = 1.0) — the target every estimator in
the post is judged against. Confounder bias arises because introverts both self-select into WFH
(treatment coefficient 0.3) and are independently more productive (outcome coefficient 0.8), so a
naive WFH−office comparison overstates the effect (1.39 vs the true 1.0).
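To see this bias numerically, the sketch below re-simulates the data-generating process and compares the naive difference in means with a regression that adjusts for both confounders. It uses plain NumPy least squares as a stand-in for the post's backdoor estimators, with an arbitrary seed and an assumed noise standard deviation of 2, so the exact figures will differ somewhat from the 1.39 and ~1.0 reported for the published file.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed
n = 5000

# Re-simulate the documented data-generating process
introversion = rng.normal(5.0, 1.5, n)
num_children = rng.poisson(1.5, n)
subway = rng.binomial(1, 0.4, n)
logit = -1.5 + 0.3 * introversion + 0.2 * num_children + 1.0 * subway
wfh = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
productivity = (50 + 1.0 * wfh + 0.8 * introversion
                - 0.5 * num_children + rng.normal(0.0, 2.0, n))

# Naive estimate: raw WFH minus office difference in mean productivity
naive = productivity[wfh == 1].mean() - productivity[wfh == 0].mean()

# Adjusted estimate: OLS of productivity on treatment plus both confounders
X = np.column_stack([np.ones(n), wfh, introversion, num_children])
beta, *_ = np.linalg.lstsq(X, productivity, rcond=None)
adjusted = beta[1]   # coefficient on work_from_home

# The WFH group is more introverted on average, which drives the gap
intro_gap = introversion[wfh == 1].mean() - introversion[wfh == 0].mean()

print(f"naive difference in means : {naive:.2f}")
print(f"regression-adjusted ATE   : {adjusted:.2f}")
print(f"introversion gap (WFH-off): {intro_gap:.2f}")
```

The naive estimate should sit clearly above the adjusted one, while the adjusted coefficient should fall near the true value of 1.0. The positive introversion gap, multiplied by the outcome coefficient 0.8, accounts for most of the overshoot; the num_children channel pulls slightly in the opposite direction.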
The datasets
Each dataset is documented with a full variable dictionary and a statistics table that includes data coverage.
Variable dictionary
| Variable | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| work_from_home (dummy) | Works from home (treatment, 1=yes) | Binary treatment: 1 if the employee works from home, 0 if office-based. | Bernoulli draw with P from logit(-1.5 + 0.3*introversion + 0.2*num_children + 1.0*subway_disruption); non-random (observational) assignment. | 0/1 | Simulation | 5,000 employees |
| productivity (continuous) | Productivity score (outcome) | Continuous outcome: employee productivity, in points. | 50 + 1.0*work_from_home + 0.8*introversion - 0.5*num_children + Normal(0,2) noise; the coefficient 1.0 on work_from_home is the true ATE. | points | Simulation | 5,000 employees |
| introversion (continuous) | Introversion (confounder) | Personality trait; higher values = more introverted. A confounder of WFH and productivity. | Normal(mean=5, sd=1.5) draw. Raises both the probability of WFH (logit coef 0.3) and productivity (outcome coef 0.8). | scale (~0-10) | Simulation | 5,000 employees |
| num_children (continuous) | Number of children (confounder) | Count of children in the household. A confounder of WFH and productivity. | Poisson(lambda=1.5) draw. Raises the probability of WFH (logit coef 0.2) but lowers productivity (outcome coef -0.5). | count | Simulation | 5,000 employees |
| subway_disruption (dummy) | Subway line disrupted (instrument, 1=yes) | Binary instrument: 1 if the employee lives near a disrupted subway line. Affects treatment only (exclusion restriction). | Bernoulli(0.4) draw. Enters the treatment logit (coef 1.0) but NOT the outcome equation, so the exclusion restriction holds by construction. | 0/1 | Simulation | 5,000 employees |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| work_from_home | 100% | 5,000 | 2 | 0 | 0.662 | 1.00 | 1.00 | 0.473 |
| productivity | 100% | 5,000 | 5,000 | 43.90 | 53.88 | 53.87 | 62.52 | 2.49 |
| introversion | 100% | 5,000 | 5,000 | -0.473 | 4.97 | 4.99 | 10.18 | 1.50 |
| num_children | 100% | 5,000 | 9 | 0 | 1.50 | 1.00 | 8.00 | 1.22 |
| subway_disruption | 100% | 5,000 | 2 | 0 | 0.419 | 0 | 1.00 | 0.493 |
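Because the outcome equation is linear, the reported sample means can be checked against it by hand: plugging the mean treatment rate and mean confounder values into the equation should reproduce the mean productivity, up to rounding and the sample mean of the noise term. The short check below is only a sanity calculation on the figures in the table; the variable names are illustrative.

```python
# Sample means copied from the statistics table
mean_wfh = 0.662
mean_introversion = 4.97
mean_children = 1.50
reported_mean_productivity = 53.88

# Outcome equation evaluated at the means, with the zero-mean noise term dropped
implied = 50 + 1.0 * mean_wfh + 0.8 * mean_introversion - 0.5 * mean_children
gap = implied - reported_mean_productivity

print(f"implied mean productivity: {implied:.3f}")
print(f"gap vs reported mean     : {gap:+.3f}")
```

The implied value comes out at about 53.89, within roughly 0.01 of the reported 53.88, which is what the documented coefficients would lead one to expect.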
Known limitations & caveats
- Synthetic data. There is no real survey behind this dataset; values are drawn from the data-generating process above. Results are internally consistent with the simulation but are not empirical evidence about real-world work-from-home effects.
- Known truth by design. The true ATE is fixed at 1.0 so methods can be scored. On real data the truth is unknown — that is exactly why identification (DoWhy's Step 2) and refutation (Step 4) matter.
- Cross-section, not a panel. Each row is one simulated employee at a single point; there is no time dimension, no repeated measurement, and no row identifier.
- Exclusion restriction is assumed true here. subway_disruption is built to affect productivity only through WFH choice. In practice the exclusion restriction cannot be tested from data and must be justified by domain knowledge.
- Selection on observables is satisfiable here. The two confounders are fully observed, so backdoor methods are unbiased by construction; with a missing confounder they would be biased while IV would remain valid.
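The last caveat can be illustrated with a small simulation: drop introversion from the adjustment set, as if it had never been recorded, and compare the resulting regression with a simple instrumental-variable estimate. This is a sketch under assumptions, not the post's DoWhy workflow: it uses the Wald ratio cov(z, y) / cov(z, d), which coincides with 2SLS for a single instrument and no covariates, an arbitrary seed, a noise standard deviation of 2, and a sample of 200,000 rows (far larger than the 5,000-row dataset) so that IV sampling noise does not obscure the comparison.

```python
import numpy as np

rng = np.random.default_rng(7)   # arbitrary seed
n = 200_000                      # deliberately larger than the 5,000-row dataset

# Re-simulate the documented data-generating process
introversion = rng.normal(5.0, 1.5, n)
num_children = rng.poisson(1.5, n)
z = rng.binomial(1, 0.4, n)                       # subway_disruption
logit = -1.5 + 0.3 * introversion + 0.2 * num_children + 1.0 * z
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))  # work_from_home
y = (50 + 1.0 * d + 0.8 * introversion
     - 0.5 * num_children + rng.normal(0.0, 2.0, n))  # productivity

# Backdoor regression that omits introversion (pretend it is unobserved)
X = np.column_stack([np.ones(n), d, num_children])
ols_missing = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Wald / IV estimate: effect of z on y divided by effect of z on d
iv = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

print(f"OLS without introversion: {ols_missing:.2f}")
print(f"IV (Wald ratio)         : {iv:.2f}")
```

With introversion omitted, the regression coefficient should drift well above 1.0, while the IV estimate should stay near the true effect. At the dataset's actual size of 5,000 rows the IV estimate would be considerably noisier, consistent with the wide interval reported in the overview.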