Data dictionary · Causal Inference with DoWhy: Working From Home & Productivity

Downloads

Each dataset is available as a labeled Stata .dta file and as its source CSV file.

⇩ Download all data (ZIP) · stata_codebook.do

Dataset	Grain	Rows × columns	Stata	Source
`wfh_simulated_data`	employee (cross-section)	5,000 × 5	wfh_simulated_data.dta	wfh_simulated_data.csv

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy_intro/data/"
use "${BASE}wfh_simulated_data.dta", clear
describe
notes

Python

# Notebook / Colab only: pyreadstat is needed just for the pyreadstat section at the end
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy_intro/data/"
df = pd.read_stata(BASE + "wfh_simulated_data.dta")

# load every dataset at once
files = ["wfh_simulated_data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "wfh_simulated_data.dta", "wfh_simulated_data.dta")
df, meta = pyreadstat.read_dta("wfh_simulated_data.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy_intro/data/"
df <- read_dta(paste0(BASE, "wfh_simulated_data.dta"))

Overview & sources

Companion data for a beginner-friendly causal-inference tutorial that uses DoWhy's four-step framework (Model, Identify, Estimate, Refute) to recover a known average treatment effect from observational data. The dataset is fully synthetic: 5,000 employees with a binary work-from-home treatment, a continuous productivity outcome, two confounders (introversion, number of children) that drive self-selection into treatment, and a subway-disruption instrument that satisfies the exclusion restriction by construction. Because the true causal effect is fixed at 1.0 productivity points, every estimator can be scored against the truth: the naive difference in means overshoots to 1.39, while the backdoor methods (regression, IPW, doubly robust) recover ~1.0 and IV (2SLS) returns 0.89 with a much wider interval. The entire data-generating process is open and reproducible.

One file. wfh_simulated_data is a cross-section — one row per simulated employee (5,000 rows), with no time or panel dimension. There is no identifier column; rows are exchangeable draws from the data-generating process.

Data sources

Source	Provides	Reference / URL
Synthetic (this study)	All values — simulated via a calibrated data-generating process (open & reproducible)	Mendez, C. (2026). See the post's Python script script.py (function generate_wfh_data) for the full DGP.
DoWhy	End-to-end causal-inference library implementing the Model/Identify/Estimate/Refute workflow	Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv:2011.04216. https://www.pywhy.org/dowhy/
Method references	Estimators and identification concepts (backdoor/IV, propensity score, doubly robust)	Pearl (2009); Rosenbaum & Rubin (1983); Robins, Rotnitzky & Zhao (1994); Angrist & Pischke (2009).
Tutorial inspiration	Worked DoWhy walkthrough this beginner's guide adapts	DataCamp. Introduction to Causal AI using the DoWhy library in Python. https://www.datacamp.com/tutorial/intro-to-causal-ai-using-the-dowhy-library-in-python

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). A Beginner's Guide to Causal Inference with DoWhy in Python [Data set]. https://carlos-mendez.org/post/python_dowhy_intro/

Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv preprint arXiv:2011.04216.

Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.

BibTeX

@misc{mendez2026pythondowhyintro,
  author       = {Mendez, Carlos},
  title        = {A Beginner's Guide to Causal Inference with DoWhy in Python},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_dowhy_intro/}},
  note         = {Data set}
}

@article{sharma2020dowhy,
  author  = {Sharma, Amit and Kiciman, Emre},
  title   = {DoWhy: An End-to-End Library for Causal Inference},
  journal = {arXiv preprint arXiv:2011.04216},
  year    = {2020}
}

@book{pearl2009causality,
  author    = {Pearl, Judea},
  title     = {Causality: Models, Reasoning, and Inference},
  edition   = {2nd},
  publisher = {Cambridge University Press},
  year      = {2009}
}

Variable explorer (all 5 variables)

Variable	Type	Label	Definition	Units	In files	Source
`introversion`	continuous	Introversion (confounder)	Personality trait; higher values = more introverted. A confounder of WFH and productivity.	scale (~0-10)	wfh_simulated_data	Simulation
`num_children`	continuous	Number of children (confounder)	Count of children in the household. A confounder of WFH and productivity.	count	wfh_simulated_data	Simulation
`productivity`	continuous	Productivity score (outcome)	Continuous outcome: employee productivity, in points.	points	wfh_simulated_data	Simulation
`subway_disruption`	dummy	Subway line disrupted (instrument, 1=yes)	Binary instrument: 1 if the employee lives near a disrupted subway line. Affects treatment only (exclusion restriction).	0/1	wfh_simulated_data	Simulation
`work_from_home`	dummy	Works from home (treatment, 1=yes)	Binary treatment: 1 if the employee works from home, 0 if office-based.	0/1	wfh_simulated_data	Simulation

Cross-file variable index

Which file each variable appears in (● = present).

Variable	wfh_simulated_data
`introversion`	●
`num_children`	●
`productivity`	●
`subway_disruption`	●
`work_from_home`	●

Construction & formulas

The dataset is generated by a known structural model so that every estimator can be checked against the truth. For employee i the causal graph is subway_disruption → work_from_home → productivity, with introversion and num_children pointing into both treatment and outcome (confounders).
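
The roles in this graph can be read off mechanically. The sketch below encodes the edges described above as plain tuples and classifies the parents of the treatment. It is a simplified illustration for this specific graph, not a general identification algorithm and not DoWhy's implementation; the helper name `parents` is hypothetical.

```python
# Edges of the causal graph described above, written as (parent, child).
edges = [
    ("subway_disruption", "work_from_home"),
    ("work_from_home", "productivity"),
    ("introversion", "work_from_home"),
    ("introversion", "productivity"),
    ("num_children", "work_from_home"),
    ("num_children", "productivity"),
]


def parents(node, edges):
    """Return the set of direct causes (parents) of `node`."""
    return {src for src, dst in edges if dst == node}


treatment, outcome = "work_from_home", "productivity"

# Variables pointing into BOTH treatment and outcome: the confounders.
confounders = parents(treatment, edges) & parents(outcome, edges)

# Variables pointing into the treatment but NOT directly into the outcome:
# instrument candidates (sufficient in this small graph only).
instruments = parents(treatment, edges) - parents(outcome, edges)

print("confounders:", sorted(confounders))
print("instruments:", sorted(instruments))
```

In this graph the two sets come out as the confounders and the instrument named in the variable tables; on a larger graph a proper backdoor or IV criterion check would be needed.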

Confounders. introversion ~ Normal(5, 1.5); num_children ~ Poisson(1.5).
Instrument. subway_disruption ~ Bernoulli(0.4) — affects treatment only.
Treatment assignment (observational, non-random): logit P(WFH) = −1.5 + 0.3·introversion + 0.2·num_children + 1.0·subway_disruption; work_from_home ~ Bernoulli(P(WFH)).
Outcome. productivity = 50 + 1.0·work_from_home + 0.8·introversion − 0.5·num_children + ε, with ε ~ Normal(0, 2). The instrument does not enter this equation — the exclusion restriction holds by construction.
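
The four equations above translate almost line for line into NumPy. The following is a minimal sketch under stated assumptions, not the post's `generate_wfh_data` function: the name `simulate_wfh`, the seed, and the use of `numpy.random.default_rng` are illustrative choices, and the second argument of each Normal is read as a standard deviation (consistent with the SDs reported in the statistics table). Values will not match the published file draw for draw.

```python
import numpy as np


def simulate_wfh(n=5000, seed=0):
    """Simulate one cross-section from the DGP described above (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seed is an arbitrary illustrative choice

    # Confounders and instrument
    introversion = rng.normal(5.0, 1.5, n)        # Normal(mean=5, sd=1.5)
    num_children = rng.poisson(1.5, n)            # Poisson(lambda=1.5)
    subway_disruption = rng.binomial(1, 0.4, n)   # Bernoulli(0.4)

    # Treatment: logit P(WFH) is linear in confounders and instrument
    logit = (-1.5 + 0.3 * introversion + 0.2 * num_children
             + 1.0 * subway_disruption)
    p_wfh = 1.0 / (1.0 + np.exp(-logit))          # inverse logit
    work_from_home = rng.binomial(1, p_wfh)

    # Outcome: the instrument is deliberately absent from this equation
    eps = rng.normal(0.0, 2.0, n)                 # Normal(0, sd=2) noise
    productivity = (50.0 + 1.0 * work_from_home + 0.8 * introversion
                    - 0.5 * num_children + eps)

    return {
        "introversion": introversion,
        "num_children": num_children,
        "subway_disruption": subway_disruption,
        "work_from_home": work_from_home,
        "productivity": productivity,
    }


data = simulate_wfh()
print({name: round(float(col.mean()), 3) for name, col in data.items()})
```

The returned dictionary can be passed to `pandas.DataFrame(data)` if a data frame is more convenient.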

The coefficient on work_from_home in the outcome equation is the true average treatment effect (ATE = 1.0) — the target every estimator in the post is judged against. Confounder bias arises because introverts both self-select into WFH (treatment coefficient 0.3) and are independently more productive (outcome coefficient 0.8), so a naive WFH−office comparison overstates the effect (1.39 vs the true 1.0).
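
To see the confounding bias and its removal side by side, the sketch below re-simulates the DGP and compares the naive difference in means with an OLS regression that adjusts for both confounders. It is an illustration under assumptions (arbitrary seed, plain NumPy least squares rather than DoWhy), so the printed numbers will be close to, but not identical to, the figures quoted in the post.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, not the post's
n = 5000

# Re-simulate the DGP described in this section
introversion = rng.normal(5.0, 1.5, n)
num_children = rng.poisson(1.5, n)
subway = rng.binomial(1, 0.4, n)
logit = -1.5 + 0.3 * introversion + 0.2 * num_children + 1.0 * subway
wfh = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
productivity = (50.0 + 1.0 * wfh + 0.8 * introversion
                - 0.5 * num_children + rng.normal(0.0, 2.0, n))

# Naive estimate: raw WFH minus office difference in mean productivity
naive = productivity[wfh == 1].mean() - productivity[wfh == 0].mean()

# Backdoor adjustment via OLS: intercept, treatment, and both confounders
X = np.column_stack([np.ones(n), wfh, introversion, num_children])
beta, *_ = np.linalg.lstsq(X, productivity, rcond=None)
adjusted = beta[1]  # coefficient on work_from_home

print(f"naive difference in means: {naive:.2f}")
print(f"regression-adjusted:       {adjusted:.2f}")
```

The naive estimate should land well above 1.0, while the adjusted coefficient should sit near the true ATE, up to sampling noise.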

The datasets

Each dataset is documented with its full variable dictionary and a statistics table that includes data coverage.

Grain: employee (cross-section) · Size: 5,000 rows × 5 columns · Time span: n/a (no time dimension) · Units: 5,000 simulated employees

Panel key: none (rows are exchangeable; no identifier column) · Purpose: recover a known ATE (= 1.0) of working from home on productivity via DoWhy's four-step framework; compare naive, regression, IPW, doubly robust, and IV estimators.

Variable dictionary

Variable	Label	Definition	Construction	Units	Source	Coverage
`work_from_home` dummy	Works from home (treatment, 1=yes)	Binary treatment: 1 if the employee works from home, 0 if office-based.	Bernoulli draw with P = inverse-logit(-1.5 + 0.3*introversion + 0.2*num_children + 1.0*subway_disruption); non-random (observational) assignment.	0/1	Simulation	5,000 employees
`productivity` continuous	Productivity score (outcome)	Continuous outcome: employee productivity, in points.	50 + 1.0*work_from_home + 0.8*introversion - 0.5*num_children + Normal(0,2) noise; the coefficient 1.0 on work_from_home is the true ATE.	points	Simulation	5,000 employees
`introversion` continuous	Introversion (confounder)	Personality trait; higher values = more introverted. A confounder of WFH and productivity.	Normal(mean=5, sd=1.5) draw. Raises both the probability of WFH (logit coef 0.3) and productivity (outcome coef 0.8).	scale (~0-10)	Simulation	5,000 employees
`num_children` continuous	Number of children (confounder)	Count of children in the household. A confounder of WFH and productivity.	Poisson(lambda=1.5) draw. Raises the probability of WFH (logit coef 0.2) but lowers productivity (outcome coef -0.5).	count	Simulation	5,000 employees
`subway_disruption` dummy	Subway line disrupted (instrument, 1=yes)	Binary instrument: 1 if the employee lives near a disrupted subway line. Affects treatment only (exclusion restriction).	Bernoulli(0.4) draw. Enters the treatment logit (coef 1.0) but NOT the outcome equation, so the exclusion restriction holds by construction.	0/1	Simulation	5,000 employees
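
The treatment logit in the Construction column is the propensity score, and because it is known here it can be plugged straight into an inverse-probability-weighting (IPW) estimator. The sketch below uses the true propensity and the normalized (Hájek) form of IPW; this is an illustrative shortcut, since on real data the propensity must be estimated, and it is not necessarily how DoWhy computes its weighting estimate. The seed and the enlarged sample size are assumptions chosen to keep sampling noise small.

```python
import numpy as np

rng = np.random.default_rng(123)  # arbitrary seed
n = 50_000                        # larger than the 5,000-row file, to reduce noise

# Re-simulate the DGP from the Construction column
introversion = rng.normal(5.0, 1.5, n)
num_children = rng.poisson(1.5, n)
subway = rng.binomial(1, 0.4, n)
logit = -1.5 + 0.3 * introversion + 0.2 * num_children + 1.0 * subway
propensity = 1.0 / (1.0 + np.exp(-logit))   # true P(WFH = 1 | covariates)
wfh = rng.binomial(1, propensity)
productivity = (50.0 + 1.0 * wfh + 0.8 * introversion
                - 0.5 * num_children + rng.normal(0.0, 2.0, n))

# Inverse-probability weights for treated and control units
w_treated = wfh / propensity
w_control = (1 - wfh) / (1.0 - propensity)

# Normalized (Hajek) IPW: weighted mean of treated minus weighted mean of controls
ate_ipw = (np.sum(w_treated * productivity) / np.sum(w_treated)
           - np.sum(w_control * productivity) / np.sum(w_control))

print(f"IPW estimate of the ATE: {ate_ipw:.2f}")
```

Reweighting makes the treated and control groups resemble the full population in their covariates, which is why the estimate should fall close to the true ATE of 1.0.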

Distribution & statistics

Variable	Coverage	N	Distinct	Min	Mean	Median	Max	SD
`work_from_home`	100%	5,000	2	0	0.662	1.00	1.00	0.473
`productivity`	100%	5,000	5,000	43.90	53.88	53.87	62.52	2.49
`introversion`	100%	5,000	5,000	-0.473	4.97	4.99	10.18	1.50
`num_children`	100%	5,000	9	0	1.50	1.00	8.00	1.22
`subway_disruption`	100%	5,000	2	0	0.419	0	1.00	0.493
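
The table can be sanity-checked against the construction formulas with a little arithmetic. The sketch below plugs the reported means into the outcome equation and compares a few theoretical standard deviations with the reported ones; the input numbers are copied from the table above, and the variable names are illustrative.

```python
import math

# Means copied from the statistics table above
mean_wfh = 0.662
mean_introversion = 4.97
mean_children = 1.50
mean_subway = 0.419

# Outcome equation evaluated at the sample means (the noise term averages out)
implied_mean_productivity = (50.0 + 1.0 * mean_wfh
                             + 0.8 * mean_introversion
                             - 0.5 * mean_children)

# A Poisson(1.5) variable has standard deviation sqrt(1.5)
sd_children_theory = math.sqrt(1.5)

# A 0/1 variable with mean p has standard deviation sqrt(p * (1 - p))
sd_wfh_implied = math.sqrt(mean_wfh * (1.0 - mean_wfh))
sd_subway_implied = math.sqrt(mean_subway * (1.0 - mean_subway))

print(f"implied mean productivity: {implied_mean_productivity:.2f} (table: 53.88)")
print(f"Poisson SD of num_children: {sd_children_theory:.2f} (table: 1.22)")
print(f"implied SD of work_from_home: {sd_wfh_implied:.3f} (table: 0.473)")
print(f"implied SD of subway_disruption: {sd_subway_implied:.3f} (table: 0.493)")
```

Agreement to within rounding suggests the reported statistics are consistent with the stated data-generating process.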

Known limitations & caveats

Synthetic data. There is no real survey behind this dataset; values are drawn from the data-generating process above. Results are internally consistent with the simulation but are not empirical evidence about real-world work-from-home effects.
Known truth by design. The true ATE is fixed at 1.0 so methods can be scored. On real data the truth is unknown — that is exactly why identification (DoWhy's Step 2) and refutation (Step 4) matter.
Cross-section, not a panel. Each row is one simulated employee at a single point; there is no time dimension, no repeated measurement, and no row identifier.
Exclusion restriction is assumed true here. subway_disruption is built to affect productivity only through WFH choice. In practice the exclusion restriction cannot be tested from data and must be justified by domain knowledge.
Selection on observables holds here. The two confounders are fully observed, so backdoor methods are unbiased by construction; if a confounder were unobserved, they would be biased, while IV would remain valid as long as the instrument's assumptions hold.
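
The last two caveats can be illustrated together by pretending that introversion was never recorded. In the sketch below, a regression that adjusts only for num_children is compared with the simple Wald instrumental-variable estimator based on subway_disruption. The Wald ratio is used as a covariate-free stand-in for 2SLS, not as the post's exact estimator; the seed and the enlarged sample size are assumptions chosen so that the IV estimate is not swamped by noise.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
n = 200_000                     # large sample: IV estimates are noisy at n = 5,000

# Re-simulate the DGP described in this dictionary
introversion = rng.normal(5.0, 1.5, n)
num_children = rng.poisson(1.5, n)
subway = rng.binomial(1, 0.4, n)
logit = -1.5 + 0.3 * introversion + 0.2 * num_children + 1.0 * subway
wfh = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
productivity = (50.0 + 1.0 * wfh + 0.8 * introversion
                - 0.5 * num_children + rng.normal(0.0, 2.0, n))

# Backdoor regression with introversion treated as UNOBSERVED (omitted)
X = np.column_stack([np.ones(n), wfh, num_children])
beta, *_ = np.linalg.lstsq(X, productivity, rcond=None)
ols_omitted = beta[1]

# Wald IV estimator: reduced-form effect divided by first-stage effect
reduced_form = productivity[subway == 1].mean() - productivity[subway == 0].mean()
first_stage = wfh[subway == 1].mean() - wfh[subway == 0].mean()
wald_iv = reduced_form / first_stage

print(f"OLS omitting introversion: {ols_omitted:.2f}")
print(f"Wald IV estimate:          {wald_iv:.2f}")
```

The regression should drift upward from 1.0 because the omitted confounder raises both treatment uptake and productivity, whereas the IV estimate should stay near 1.0 because the instrument is independent of introversion by construction.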