Interactive data dictionary

Causal Inference with DoWhy: Working From Home & Productivity

A fully synthetic 5,000-employee observational dataset with a known true treatment effect (ATE = 1.0).

1 dataset · 5 variables · 5,000 employees · ATE = 1.0 (true effect)

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × variables | Stata | Source |
|---|---|---|---|---|
| wfh_simulated_data | employee (cross-section) | 5,000 × 5 | wfh_simulated_data.dta | wfh_simulated_data.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy_intro/data/"
use "${BASE}wfh_simulated_data.dta", clear
describe
notes

Python

# Colab/Jupyter only: install pyreadstat (used in the metadata example further down)
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy_intro/data/"
df = pd.read_stata(BASE + "wfh_simulated_data.dta")

# load every dataset at once
files = ["wfh_simulated_data"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files only, so download the file first
import urllib.request
import pyreadstat
urllib.request.urlretrieve(BASE + "wfh_simulated_data.dta", "wfh_simulated_data.dta")
df, meta = pyreadstat.read_dta("wfh_simulated_data.dta")

Copy and paste this snippet into a new Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/python_dowhy_intro/data/"
df <- read_dta(paste0(BASE, "wfh_simulated_data.dta"))

Overview & sources

Companion data for a beginner-friendly causal-inference tutorial that uses DoWhy's four-step framework (Model, Identify, Estimate, Refute) to recover a known average treatment effect from observational data. The dataset is fully synthetic: 5,000 employees with a binary work-from-home treatment, a continuous productivity outcome, two confounders (introversion, number of children) that drive self-selection into treatment, and a subway-disruption instrument that satisfies the exclusion restriction by construction. Because the true causal effect is fixed at 1.0 productivity points, every estimator can be scored against the truth: the naive difference in means overshoots to 1.39, while the backdoor methods (regression, IPW, doubly robust) recover ~1.0 and IV (2SLS) returns 0.89 with a much wider interval. The entire data-generating process is open and reproducible.

One file. wfh_simulated_data is a cross-section — one row per simulated employee (5,000 rows), with no time or panel dimension. There is no identifier column; rows are exchangeable draws from the data-generating process.

Data sources

| Source | Provides | Reference / URL |
|---|---|---|
| Synthetic (this study) | All values: simulated via a calibrated data-generating process (open & reproducible) | Mendez, C. (2026). See the post's Python script script.py (function generate_wfh_data) for the full DGP. |
| DoWhy | End-to-end causal-inference library implementing the Model/Identify/Estimate/Refute workflow | Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv:2011.04216. https://www.pywhy.org/dowhy/ |
| Method references | Estimators and identification concepts (backdoor/IV, propensity score, doubly robust) | Pearl (2009); Rosenbaum & Rubin (1983); Robins, Rotnitzky & Zhao (1994); Angrist & Pischke (2009). |
| Tutorial inspiration | Worked DoWhy walkthrough this beginner's guide adapts | DataCamp. Introduction to Causal AI using the DoWhy library in Python. https://www.datacamp.com/tutorial/intro-to-causal-ai-using-the-dowhy-library-in-python |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). A Beginner's Guide to Causal Inference with DoWhy in Python [Data set]. https://carlos-mendez.org/post/python_dowhy_intro/

Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv preprint arXiv:2011.04216.

Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.

BibTeX

@misc{mendez2026pythondowhyintro,
  author       = {Mendez, Carlos},
  title        = {A Beginner's Guide to Causal Inference with DoWhy in Python},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/python_dowhy_intro/}},
  note         = {Data set}
}

@article{sharma2020dowhy,
  author  = {Sharma, Amit and Kiciman, Emre},
  title   = {DoWhy: An End-to-End Library for Causal Inference},
  journal = {arXiv preprint arXiv:2011.04216},
  year    = {2020}
}

@book{pearl2009causality,
  author    = {Pearl, Judea},
  title     = {Causality: Models, Reasoning, and Inference},
  edition   = {2nd},
  publisher = {Cambridge University Press},
  year      = {2009}
}

Variable explorer

| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|---|
| introversion | continuous | min -0.473, median 4.99, max 10.2 | Introversion (confounder) | Personality trait; higher values = more introverted. A confounder of WFH and productivity. | scale (~0-10) | wfh_simulated_data | Simulation |
| num_children | continuous | min 0, median 1, max 8 | Number of children (confounder) | Count of children in the household. A confounder of WFH and productivity. | count | wfh_simulated_data | Simulation |
| productivity | continuous | min 43.9, median 53.9, max 62.5 | Productivity score (outcome) | Continuous outcome: employee productivity, in points. | points | wfh_simulated_data | Simulation |
| subway_disruption | dummy | share coded 1 = 0.419 | Subway line disrupted (instrument, 1=yes) | Binary instrument: 1 if the employee lives near a disrupted subway line. Affects treatment only (exclusion restriction). | 0/1 | wfh_simulated_data | Simulation |
| work_from_home | dummy | share coded 1 = 0.662 | Works from home (treatment, 1=yes) | Binary treatment: 1 if the employee works from home, 0 if office-based. | 0/1 | wfh_simulated_data | Simulation |

Cross-file variable index

Which file each variable appears in: all five variables appear in the single file, wfh_simulated_data.

Construction & formulas

The dataset is generated by a known structural model so that every estimator can be checked against the truth. For employee i the causal graph is subway_disruption → work_from_home → productivity, with introversion and num_children pointing into both treatment and outcome (confounders).
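
Written out, the two structural equations can be sketched as follows. This is a reconstruction from the Construction column of the variable dictionary on this page, not a copy of the post's own notation. The symbols D_i (work_from_home), Y_i (productivity), Z_i (subway_disruption) and ε_i are labels introduced here; logit(...) is read as the inverse-logit (sigmoid) of the linear index, and Normal(0,2) is read as noise with standard deviation 2.

```latex
\begin{aligned}
\Pr(D_i = 1 \mid \text{introversion}_i, \text{num\_children}_i, Z_i)
  &= \frac{1}{1 + \exp\!\bigl[-(-1.5 + 0.3\,\text{introversion}_i
     + 0.2\,\text{num\_children}_i + 1.0\,Z_i)\bigr]} \\[4pt]
Y_i &= 50 + 1.0\,D_i + 0.8\,\text{introversion}_i
     - 0.5\,\text{num\_children}_i + \varepsilon_i,
     \qquad \varepsilon_i \sim \mathcal{N}(0,\ \sigma = 2)
\end{aligned}
```

Z_i appears only in the first equation, which is what makes the exclusion restriction hold by construction.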

The coefficient on work_from_home in the outcome equation is the true average treatment effect (ATE = 1.0), the target every estimator in the post is judged against. Confounding bias arises because introverts both self-select into WFH (treatment coefficient 0.3) and are independently more productive (outcome coefficient 0.8). num_children pulls the other way (it raises the probability of WFH, coefficient 0.2, but lowers productivity, coefficient -0.5), yet the net bias is upward: the naive WFH-minus-office comparison gives 1.39 against the true 1.0.
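
To make the bias concrete, here is a minimal NumPy sketch that re-simulates the data-generating process from the coefficients listed on this page and compares three estimators. It is not the post's generate_wfh_data function or its DoWhy code: the function name simulate_wfh, the seed, and the reading of Normal(0,2) as standard deviation 2 are assumptions, and the IPW step uses the true propensity score (known only because the data are simulated) rather than an estimated one. Exact numbers will differ from the 1.39 reported above because the random draws differ.

```python
import numpy as np

def simulate_wfh(n=5_000, seed=0):
    """Hypothetical re-implementation of the DGP described on this page."""
    rng = np.random.default_rng(seed)          # seed is an arbitrary choice
    introversion = rng.normal(5.0, 1.5, n)     # Normal(mean=5, sd=1.5)
    num_children = rng.poisson(1.5, n)         # Poisson(lambda=1.5)
    subway = rng.binomial(1, 0.4, n)           # Bernoulli(0.4) instrument
    index = -1.5 + 0.3 * introversion + 0.2 * num_children + 1.0 * subway
    p_wfh = 1.0 / (1.0 + np.exp(-index))       # inverse-logit of the index
    wfh = rng.binomial(1, p_wfh)               # non-random treatment assignment
    noise = rng.normal(0.0, 2.0, n)            # assumed: sd = 2
    productivity = 50.0 + 1.0 * wfh + 0.8 * introversion - 0.5 * num_children + noise
    return introversion, num_children, wfh, productivity, p_wfh

introversion, num_children, wfh, y, p_wfh = simulate_wfh()

# 1) Naive difference in means: ignores the confounders.
naive = y[wfh == 1].mean() - y[wfh == 0].mean()

# 2) Regression adjustment: OLS of productivity on treatment + confounders.
X = np.column_stack([np.ones_like(y), wfh, introversion, num_children])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted = beta[1]                             # coefficient on work_from_home

# 3) Inverse-probability weighting (normalized weights, true propensity).
w1 = wfh / p_wfh
w0 = (1 - wfh) / (1 - p_wfh)
ipw = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

print(f"naive={naive:.2f}  regression={adjusted:.2f}  ipw={ipw:.2f}  truth=1.00")
```

The naive estimate should land above the two adjusted ones, which should both sit close to 1.0.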

The datasets

wfh_simulated_data: employee (cross-section) · 5,000 × 5 · no time dimension · 5,000 simulated employees

Panel key: none (rows are exchangeable; no identifier column)

Purpose: recover a known ATE (= 1.0) of working from home on productivity via DoWhy's four-step framework; compare naive, regression, IPW, doubly robust, and IV estimators.

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| work_from_home (dummy) | Works from home (treatment, 1=yes) | Binary treatment: 1 if the employee works from home, 0 if office-based. | Bernoulli draw with P from logit(-1.5 + 0.3*introversion + 0.2*num_children + 1.0*subway_disruption); non-random (observational) assignment. | 0/1 | Simulation | 5,000 employees |
| productivity (continuous) | Productivity score (outcome) | Continuous outcome: employee productivity, in points. | 50 + 1.0*work_from_home + 0.8*introversion - 0.5*num_children + Normal(0,2) noise; the coefficient 1.0 on work_from_home is the true ATE. | points | Simulation | 5,000 employees |
| introversion (continuous) | Introversion (confounder) | Personality trait; higher values = more introverted. A confounder of WFH and productivity. | Normal(mean=5, sd=1.5) draw. Raises both the probability of WFH (logit coef 0.3) and productivity (outcome coef 0.8). | scale (~0-10) | Simulation | 5,000 employees |
| num_children (continuous) | Number of children (confounder) | Count of children in the household. A confounder of WFH and productivity. | Poisson(lambda=1.5) draw. Raises the probability of WFH (logit coef 0.2) but lowers productivity (outcome coef -0.5). | count | Simulation | 5,000 employees |
| subway_disruption (dummy) | Subway line disrupted (instrument, 1=yes) | Binary instrument: 1 if the employee lives near a disrupted subway line. Affects treatment only (exclusion restriction). | Bernoulli(0.4) draw. Enters the treatment logit (coef 1.0) but NOT the outcome equation, so the exclusion restriction holds by construction. | 0/1 | Simulation | 5,000 employees |
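
Because subway_disruption enters only the treatment equation, it can serve as an instrument. The sketch below shows the simplest IV estimator for a binary instrument, the Wald ratio (reduced form divided by first stage), on data re-simulated from the Construction column above. It is an illustration under stated assumptions, not the post's 2SLS specification: the function name wald_iv and the seed are hypothetical, Normal(0,2) is read as standard deviation 2, and no covariates are included. Since the simulated treatment effect is the same for everyone, the Wald ratio targets the same 1.0 as the ATE.

```python
import numpy as np

def wald_iv(n=5_000, seed=0):
    """Simulate the DGP and return (Wald IV estimate, first-stage difference)."""
    rng = np.random.default_rng(seed)      # seed is an arbitrary choice
    introversion = rng.normal(5.0, 1.5, n)
    num_children = rng.poisson(1.5, n)
    z = rng.binomial(1, 0.4, n)            # subway_disruption (instrument)
    index = -1.5 + 0.3 * introversion + 0.2 * num_children + 1.0 * z
    d = rng.binomial(1, 1.0 / (1.0 + np.exp(-index)))   # work_from_home
    # z is deliberately absent from the outcome equation (exclusion restriction)
    y = 50.0 + 1.0 * d + 0.8 * introversion - 0.5 * num_children + rng.normal(0.0, 2.0, n)

    # First stage: how much the instrument shifts the share working from home.
    first_stage = d[z == 1].mean() - d[z == 0].mean()
    # Reduced form: how much the instrument shifts mean productivity.
    reduced_form = y[z == 1].mean() - y[z == 0].mean()
    return reduced_form / first_stage, first_stage

iv_estimate, first_stage = wald_iv()
print(f"first stage={first_stage:.3f}  Wald IV estimate={iv_estimate:.2f}  truth=1.00")
```

The instrument shifts the WFH share only partially, so the ratio divides by a number well below 1 and amplifies sampling noise. That is in line with the much wider interval reported for IV in the overview; rerunning with a larger n (for example wald_iv(n=400_000)) should tighten the estimate around 1.0.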

Distribution & statistics

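| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|---|
| work_from_home | share coded 1 = 0.662 | 100% | 5,000 | 2 | 0 | 0.662 | 1.00 | 1.00 | 0.473 |
| productivity | min 43.9, median 53.9, max 62.5 | 100% | 5,000 | 5,000 | 43.90 | 53.88 | 53.87 | 62.52 | 2.49 |
| introversion | min -0.473, median 4.99, max 10.2 | 100% | 5,000 | 5,000 | -0.473 | 4.97 | 4.99 | 10.18 | 1.50 |
| num_children | min 0, median 1, max 8 | 100% | 5,000 | 9 | 0 | 1.50 | 1.00 | 8.00 | 1.22 |
| subway_disruption | share coded 1 = 0.419 | 100% | 5,000 | 2 | 0 | 0.419 | 0 | 1.00 | 0.493 |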
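
Several of the reported standard deviations can be checked by hand against the distributions named in the variable dictionary. The short sketch below does that arithmetic; the variable names are illustrative, and the inputs are the rounded shares from the table, so agreement is expected only to the displayed precision.

```python
import math

# A 0/1 variable with share p coded 1 has SD sqrt(p * (1 - p)).
p_wfh, p_subway = 0.662, 0.419            # shares reported in the table
sd_wfh = math.sqrt(p_wfh * (1 - p_wfh))
sd_subway = math.sqrt(p_subway * (1 - p_subway))

# A Poisson(lambda) count has mean lambda and SD sqrt(lambda).
sd_children = math.sqrt(1.5)

print(f"{sd_wfh:.3f}")       # 0.473, matches the work_from_home row
print(f"{sd_subway:.3f}")    # 0.493, matches the subway_disruption row
print(f"{sd_children:.2f}")  # 1.22, matches the num_children row
```

The introversion row (mean 4.97, SD 1.50) likewise sits close to its Normal(mean=5, sd=1.5) design, and the subway_disruption share of 0.419 is close to the Bernoulli(0.4) target.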

Known limitations & caveats