Interactive data dictionary

Latent Group Structures in Panel Data

Source data for the Classifier-LASSO (C-LASSO) tutorial in Stata: a savings panel and a democracy-and-growth panel.

2 datasets · 21 variables · 56 / 98 countries · 1970–2010 years

Downloads

Each dataset is available as a labeled Stata .dta and its source file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| saving | country-year (balanced) | 840 × 7 | saving.dta | saving.dta |
| democracy | country-year (unbalanced) | 4,018 × 15 | democracy.dta | democracy.dta |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_panel_lasso_cluster/data/"
use "${BASE}saving.dta", clear
describe
notes

Python

!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_panel_lasso_cluster/data/"
df = pd.read_stata(BASE + "saving.dta")

# load every dataset at once
files = ["saving", "democracy"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "saving.dta", "saving.dta")
df, meta = pyreadstat.read_dta("saving.dta")

Copy and paste this snippet into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_panel_lasso_cluster/data/"
df <- read_dta(paste0(BASE, "saving.dta"))

Overview & sources

Companion data for a hands-on Stata tutorial that uses the Classifier-LASSO (C-LASSO) method of Su, Shi & Phillips (2016), via the classifylasso command (Huang, Wang & Zhou 2024), to discover latent group structures in panel data: subsets of units that share the same slope coefficients within a group, while the coefficients differ across groups. Two source datasets drive the post. saving is a balanced panel of 56 countries observed over 15 years (840 observations) on savings behavior, from the Su–Shi–Phillips (2016) replication. democracy is the Acemoglu, Naidu, Restrepo & Robinson (2019) democracy-and-growth panel of 98 countries over 1970–2010 (4,018 observations). The tutorial shows that the pooled democracy–growth effect of +1.055 masks a +2.151 effect in 57 countries and a −0.936 effect in 41 countries, a sign reversal exemplifying Simpson's paradox.

Two files. saving is a strongly balanced country panel (one row per country × period, 56 countries × 15 periods = 840 rows; year is coded 1–15, corresponding to 1995–2010); all five economic variables are standardized to mean zero, standard deviation one. democracy is an unbalanced country panel (one row per country × year, 4,018 rows over 1970–2010) carrying log per-capita GDP, a binary democracy indicator, four lags of GDP, and demographic/trade/mortality covariates.
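The balanced/unbalanced distinction above can be checked mechanically. The following is a minimal sketch, assuming a pandas DataFrame with one unit column and one time column; the helper name `is_balanced` and the toy values are hypothetical and not part of the tutorial's code.

```python
import pandas as pd

def is_balanced(df, unit, time):
    """Return True if every unit appears exactly once in every period."""
    n_units = df[unit].nunique()
    n_periods = df[time].nunique()
    no_duplicates = not df.duplicated([unit, time]).any()
    # Balanced: no repeated (unit, period) pairs and rows = units x periods
    return no_duplicates and len(df) == n_units * n_periods

# Toy panel (hypothetical): 3 units x 4 periods = 12 rows
toy = pd.DataFrame(
    [(c, t) for c in range(1, 4) for t in range(1, 5)],
    columns=["code", "year"],
)

balanced = is_balanced(toy, "code", "year")              # full grid
unbalanced = is_balanced(toy.iloc[:-1], "code", "year")  # one row dropped
```

Applied to the loaded savings file, `is_balanced(df, "code", "year")` would be expected to hold, since 56 × 15 = 840 rows; for the democracy file it would not, given the panel is described as unbalanced.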

Data sources

| Source | Provides | Reference / URL |
|---|---|---|
| Su, Shi & Phillips (2016) | C-LASSO method and the savings panel (saving.dta) used for replication | Su, L., Shi, Z., & Phillips, P. C. B. (2016). Identifying latent structures in panel data. Econometrica, 84(6), 2215–2264. https://doi.org/10.3982/ECTA12560 |
| Huang, Wang & Zhou (2024) | The classifylasso Stata command implementing C-LASSO | Huang, W., Wang, Y., & Zhou, L. (2024). Identify latent group structures in panel data: The classifylasso command. Stata Journal, 24(1), 173–203. https://doi.org/10.1177/1536867X241233664 |
| Acemoglu, Naidu, Restrepo & Robinson (2019) | The democracy-and-growth panel (democracy.dta) | Acemoglu, D., Naidu, S., Restrepo, P., & Robinson, J. A. (2019). Democracy does cause growth. Journal of Political Economy, 127(1), 47–100. https://doi.org/10.1086/700936 |
| Method references | Dynamic-panel bias correction and estimators | Dhaene & Jochmans (2015), half-panel jackknife; Nickell (1981), dynamic-panel bias; reghdfe (Correia) for cluster-robust two-way fixed effects. |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Identifying Latent Group Structures in Panel Data: The classifylasso Command in Stata [Data set]. https://carlos-mendez.org/post/stata_panel_lasso_cluster/

Su, L., Shi, Z., & Phillips, P. C. B. (2016). Identifying latent structures in panel data. Econometrica, 84(6), 2215–2264. https://doi.org/10.3982/ECTA12560

Acemoglu, D., Naidu, S., Restrepo, P., & Robinson, J. A. (2019). Democracy does cause growth. Journal of Political Economy, 127(1), 47–100. https://doi.org/10.1086/700936

Huang, W., Wang, Y., & Zhou, L. (2024). Identify latent group structures in panel data: The classifylasso command. Stata Journal, 24(1), 173–203. https://doi.org/10.1177/1536867X241233664

BibTeX

@misc{mendez2026statapanellassocluster,
  author       = {Mendez, Carlos},
  title        = {Identifying Latent Group Structures in Panel Data: The classifylasso Command in Stata},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/stata_panel_lasso_cluster/}},
  note         = {Data set}
}

@article{su2016identifying,
  author  = {Su, Liangjun and Shi, Zhentao and Phillips, Peter C. B.},
  title   = {Identifying Latent Structures in Panel Data},
  journal = {Econometrica},
  volume  = {84}, number = {6}, pages = {2215--2264}, year = {2016},
  doi     = {10.3982/ECTA12560}
}
@article{acemoglu2019democracy,
  author  = {Acemoglu, Daron and Naidu, Suresh and Restrepo, Pascual and Robinson, James A.},
  title   = {Democracy Does Cause Growth},
  journal = {Journal of Political Economy},
  volume  = {127}, number = {1}, pages = {47--100}, year = {2019},
  doi     = {10.1086/700936}
}
@article{huang2024classifylasso,
  author  = {Huang, Wenxin and Wang, Yuan and Zhou, Lin},
  title   = {Identify Latent Group Structures in Panel Data: The classifylasso Command},
  journal = {Stata Journal},
  volume  = {24}, number = {1}, pages = {173--203}, year = {2024},
  doi     = {10.1177/1536867X241233664}
}

Variable explorer

All 21 variables across both datasets.


| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|---|
| Democracy | dummy | share coded 1 = 0.545 | Democracy indicator (ANRR) | Binary democracy measure of Acemoglu, Naidu, Restrepo & Robinson: 1 if democratic country-year, else 0. | 0/1 | democracy | Acemoglu et al. (2019) |
| code | identifier | | Country code (savings panel) | Numeric country identifier in the savings panel (1-56). | integer code | saving | Su, Shi & Phillips (2016) |
| country | identifier | | Numeric country code (democracy panel) | Generated numeric country code; the panel id for the democracy application. | integer code | democracy | Acemoglu et al. (2019) |
| cpi | continuous | min -2.77, median -0.208, max 3.55 | CPI inflation (standardized) | Standardized consumer-price-index inflation; the regressor whose sign reverses across groups. | SD units | saving | Su, Shi & Phillips (2016) |
| gdp | continuous | min -3.55, median 0.194, max 2.46 | GDP growth (standardized) | Standardized GDP growth rate; positive in both savings groups. | SD units | saving | Su, Shi & Phillips (2016) |
| interest | continuous | min -3.6, median 0.00637, max 3.28 | Real interest rate (standardized) | Standardized real interest rate; also exhibits a group sign reversal. | SD units | saving | Su, Shi & Phillips (2016) |
| lagsavings | continuous | min -2.83, median -0.0327, max 2.92 | Lagged savings-to-GDP ratio (standardized) | One-period lag of the standardized savings ratio; the dynamic-model persistence term. | SD units | saving | Su, Shi & Phillips (2016) |
| lnMort | continuous | min 64.2, median 370, max 531 | Log child mortality x100 | Log of the child mortality rate, multiplied by 100. | log x100 | democracy | Acemoglu et al. (2019) |
| lnPGDP | continuous | min 406, median 750, max 1.09e+03 | Log per-capita GDP x100 (2000 USD) | Log of GDP per capita in 2000 constant dollars, multiplied by 100; the growth outcome. | log USD x100 | democracy | Acemoglu et al. (2019) |
| lnTrade | continuous | min 170, median 409, max 607 | Log trade x100 (% of GDP) | Log of trade (exports + imports as % of GDP), multiplied by 100. | log x100 | democracy | Acemoglu et al. (2019) |
| lnpop | continuous | min 12.2, median 16.1, max 21 | Log population | Log of population. | log persons | democracy | Acemoglu et al. (2019) |
| lnpop95 | continuous | min 12.2, median 16.1, max 20.9 | Log population in 1995 | Log of population in 1995 (time-invariant per country). | log persons | democracy | Acemoglu et al. (2019) |
| ly1 | continuous | min 406, median 749, max 1.09e+03 | Lag 1 of log per-capita GDP x100 | First lag of lnPGDP; the dynamic-model persistence term. | log USD x100 | democracy | Acemoglu et al. (2019) |
| ly2 | continuous | min 406, median 748, max 1.09e+03 | Lag 2 of log per-capita GDP x100 | Second lag of lnPGDP (used in the empirical.do lag-structure extensions). | log USD x100 | democracy | Acemoglu et al. (2019) |
| ly3 | continuous | min 406, median 746, max 1.09e+03 | Lag 3 of log per-capita GDP x100 | Third lag of lnPGDP (used in the lag-structure extensions). | log USD x100 | democracy | Acemoglu et al. (2019) |
| ly4 | continuous | min 406, median 746, max 1.09e+03 | Lag 4 of log per-capita GDP x100 | Fourth lag of lnPGDP (used in the lag-structure extensions). | log USD x100 | democracy | Acemoglu et al. (2019) |
| name | identifier | | Country name | Country name string for the democracy panel. | string | democracy | Acemoglu et al. (2019) |
| savings | continuous | min -2.5, median -0.0297, max 2.89 | Savings-to-GDP ratio (standardized) | Standardized savings-to-GDP ratio; the outcome of the savings application. | SD units | saving | Su, Shi & Phillips (2016) |
| transid | dummy | share coded 1 = 0.551 | Transition-country indicator | 1 if the country ever experiences a transition, else 0 (time-invariant per country). | 0/1 | democracy | Acemoglu et al. (2019) |
| transition | dummy | share coded 1 = 0.025 | Transition-year indicator | 1 if the country-year is a democratic-transition year, else 0. | 0/1 | democracy | Acemoglu et al. (2019) |
| year | year | | Time period | Time index of the panel. In the savings panel coded 1-15 (= 1995-2010); in the democracy panel the calendar year. | period / year | saving, democracy | Source panels |

Cross-file variable index

Which file each variable appears in (● = present).

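| Variable | saving | democracy |
|---|---|---|
| Democracy | | ● |
| code | ● | |
| country | | ● |
| cpi | ● | |
| gdp | ● | |
| interest | ● | |
| lagsavings | ● | |
| lnMort | | ● |
| lnPGDP | | ● |
| lnTrade | | ● |
| lnpop | | ● |
| lnpop95 | | ● |
| ly1 | | ● |
| ly2 | | ● |
| ly3 | | ● |
| ly4 | | ● |
| name | | ● |
| savings | ● | |
| transid | | ● |
| transition | | ● |
| year | ● | ● |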

Construction & formulas

C-LASSO minimizes a penalized least-squares objective that shrinks each unit's slope vector toward one of K group centers, then re-estimates each group's coefficients by plain OLS (post-LASSO) for valid inference.
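For reference, the penalized objective described above can be sketched in the notation of Su, Shi & Phillips (2016). The symbols are not defined on this page and are introduced here as assumptions: $N$ units, $T$ periods, within-unit demeaned outcome $\tilde y_{it}$ and regressors $\tilde x_{it}$, unit-specific slopes $\beta_i$, group centers $\alpha_k$, and a tuning parameter $\lambda$. Scaling conventions may differ across implementations.

```latex
Q_{NT,\lambda}^{(K)}(\beta_1,\dots,\beta_N,\alpha_1,\dots,\alpha_K)
  = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}
      \left(\tilde y_{it} - \beta_i'\tilde x_{it}\right)^{2}
  + \frac{\lambda}{N}\sum_{i=1}^{N}\prod_{k=1}^{K}
      \left\lVert \beta_i - \alpha_k \right\rVert
```

The product over $k$ is what makes the penalty act as a classifier: for unit $i$ it vanishes as soon as $\beta_i$ coincides with any one of the $K$ centers, so each unit is pulled toward a single group rather than toward an average of all of them.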

The savings panel is pre-standardized (each variable demeaned and scaled to unit SD), so its coefficients are read in standard-deviation units. The democracy panel's lnPGDP and the GDP lags are log per-capita GDP × 100.
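The two scaling conventions above can be illustrated with a short sketch. It assumes standardization uses the sample standard deviation (pandas' default, `ddof=1`); the page does not say which SD convention was used, and all values and the helper name `standardize` are hypothetical.

```python
import numpy as np
import pandas as pd

def standardize(s):
    """Demean and scale to unit SD (sample SD, ddof=1; an assumption)."""
    return (s - s.mean()) / s.std()

# Hypothetical raw values for one variable pooled across the panel
raw = pd.Series([10.0, 12.0, 15.0, 21.0, 30.0])
z = standardize(raw)  # mean ~0, SD ~1, so coefficients read in SD units

# Log x 100 units: a difference approximates a percent change
gdp_pc = np.array([5000.0, 5100.0])      # hypothetical per-capita GDP levels
ln_x100 = 100 * np.log(gdp_pc)
approx_pct = ln_x100[1] - ln_x100[0]     # 100 * ln(1.02), close to 2
```

Under this reading, a one-unit change in lnPGDP corresponds to roughly a 1% change in per-capita GDP, which is why effects on the ×100 log scale are conveniently read as approximate percentage points.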

The datasets

Each dataset is listed below with its full variable dictionary and a statistics table that includes data coverage.


saving · country-year (balanced) · 840 × 7 · 1995-2010 (coded 1-15) · 56 countries; 840 obs

Panel key: code x year · C-LASSO of savings behavior; pooled/FE baseline and static + dynamic group-specific models.

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| code (identifier) | Country code (savings panel) | Numeric country identifier in the savings panel (1-56). | Anonymized integer code; no country-name mapping is provided in the file. | integer code | Su, Shi & Phillips (2016) | 56 countries |
| year (year) | Time period | Time index of the panel. In the savings panel coded 1-15 (= 1995-2010); in the democracy panel the calendar year. | Sequential period index (savings) / calendar year (democracy). | period / year | Source panels | |
| savings (continuous) | Savings-to-GDP ratio (standardized) | Standardized savings-to-GDP ratio; the outcome of the savings application. | Savings/GDP standardized to mean 0, SD 1 across the panel. | SD units | Su, Shi & Phillips (2016) | 840 obs |
| lagsavings (continuous) | Lagged savings-to-GDP ratio (standardized) | One-period lag of the standardized savings ratio; the dynamic-model persistence term. | L.savings within country; standardized to mean 0, SD 1. | SD units | Su, Shi & Phillips (2016) | 840 obs |
| cpi (continuous) | CPI inflation (standardized) | Standardized consumer-price-index inflation; the regressor whose sign reverses across groups. | CPI inflation standardized to mean 0, SD 1. | SD units | Su, Shi & Phillips (2016) | 840 obs |
| interest (continuous) | Real interest rate (standardized) | Standardized real interest rate; also exhibits a group sign reversal. | Real interest rate standardized to mean 0, SD 1. | SD units | Su, Shi & Phillips (2016) | 840 obs |
| gdp (continuous) | GDP growth (standardized) | Standardized GDP growth rate; positive in both savings groups. | GDP growth standardized to mean 0, SD 1. | SD units | Su, Shi & Phillips (2016) | 840 obs |

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|---|
| code | | 100% | 840 | 56 | | | | | |
| year | | 100% | 840 | 15 | 1 | 8.0 | 8 | 15 | 4.32 |
| savings | min -2.5, median -0.0297, max 2.89 | 100% | 840 | 840 | -2.50 | -2.87e-08 | -0.030 | 2.89 | 1.00 |
| lagsavings | min -2.83, median -0.0327, max 2.92 | 100% | 840 | 840 | -2.83 | 5.81e-08 | -0.033 | 2.92 | 1.00 |
| cpi | min -2.77, median -0.208, max 3.55 | 100% | 840 | 840 | -2.77 | 3.56e-09 | -0.208 | 3.55 | 1.00 |
| interest | min -3.6, median 0.00637, max 3.28 | 100% | 840 | 840 | -3.60 | -7.17e-09 | 0.006 | 3.28 | 1.00 |
| gdp | min -3.55, median 0.194, max 2.46 | 100% | 840 | 840 | -3.55 | 1.06e-08 | 0.194 | 2.46 | 1.00 |

democracy · country-year (unbalanced) · 4,018 × 15 · 1970-2010 · 98 countries; 4,018 obs

Panel key: country x year · Replicate Acemoglu et al. (2019) pooled FE, then C-LASSO to reveal the +2.151 / -0.936 democracy-effect split.

Variable dictionary

| Variable (type) | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|
| name (identifier) | Country name | Country name string for the democracy panel. | Country identifier carried alongside the numeric code. | string | Acemoglu et al. (2019) | 98 countries |
| country (identifier) | Numeric country code (democracy panel) | Generated numeric country code; the panel id for the democracy application. | Generated integer code (embedded label: 'Generated numeric country code'). | integer code | Acemoglu et al. (2019) | 98 countries |
| year (year) | Time period | Time index of the panel. In the savings panel coded 1-15 (= 1995-2010); in the democracy panel the calendar year. | Sequential period index (savings) / calendar year (democracy). | period / year | Source panels | |
| lnPGDP (continuous) | Log per-capita GDP x100 (2000 USD) | Log of GDP per capita in 2000 constant dollars, multiplied by 100; the growth outcome. | 100 x log(real GDP per capita, 2000 constant USD). | log USD x100 | Acemoglu et al. (2019) | 4,018 obs |
| Democracy (dummy) | Democracy indicator (ANRR) | Binary democracy measure of Acemoglu, Naidu, Restrepo & Robinson: 1 if democratic country-year, else 0. | ANRR dichotomous democracy classification. | 0/1 | Acemoglu et al. (2019) | 4,018 obs (55% = 1) |
| ly1 (continuous) | Lag 1 of log per-capita GDP x100 | First lag of lnPGDP; the dynamic-model persistence term. | L1.lnPGDP within country. | log USD x100 | Acemoglu et al. (2019) | 3,920 obs |
| ly2 (continuous) | Lag 2 of log per-capita GDP x100 | Second lag of lnPGDP (used in the empirical.do lag-structure extensions). | L2.lnPGDP within country. | log USD x100 | Acemoglu et al. (2019) | 3,822 obs |
| ly3 (continuous) | Lag 3 of log per-capita GDP x100 | Third lag of lnPGDP (used in the lag-structure extensions). | L3.lnPGDP within country. | log USD x100 | Acemoglu et al. (2019) | 3,724 obs |
| ly4 (continuous) | Lag 4 of log per-capita GDP x100 | Fourth lag of lnPGDP (used in the lag-structure extensions). | L4.lnPGDP within country. | log USD x100 | Acemoglu et al. (2019) | 3,626 obs |
| lnpop (continuous) | Log population | Log of population. | log(population). | log persons | Acemoglu et al. (2019) | 4,014 obs |
| lnpop95 (continuous) | Log population in 1995 | Log of population in 1995 (time-invariant per country). | log(population in 1995). | log persons | Acemoglu et al. (2019) | 95 countries |
| lnTrade (continuous) | Log trade x100 (% of GDP) | Log of trade (exports + imports as % of GDP), multiplied by 100. | 100 x log(trade share of GDP). | log x100 | Acemoglu et al. (2019) | 3,845 obs |
| lnMort (continuous) | Log child mortality x100 | Log of the child mortality rate, multiplied by 100. | 100 x log(child mortality rate). | log x100 | Acemoglu et al. (2019) | 3,932 obs |
| transition (dummy) | Transition-year indicator | 1 if the country-year is a democratic-transition year, else 0. | ANRR transition-year flag. | 0/1 | Acemoglu et al. (2019) | 4,018 obs |
| transid (dummy) | Transition-country indicator | 1 if the country ever experiences a transition, else 0 (time-invariant per country). | ANRR transition-country flag. | 0/1 | Acemoglu et al. (2019) | 4,018 obs |
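The ly1 to ly4 entries above are described as within-country lags of lnPGDP. A minimal pandas sketch of that construction follows, on a hypothetical toy panel; it is an illustration only, not the code that produced democracy.dta. Note that `shift(k)` lags by row position within each country, so it matches Stata's `L.` operator only when a country's years have no gaps.

```python
import pandas as pd

# Toy unbalanced panel (hypothetical values):
# country 1 has 4 years, country 2 has 3
toy = pd.DataFrame({
    "country": [1, 1, 1, 1, 2, 2, 2],
    "year":    [1970, 1971, 1972, 1973, 1971, 1972, 1973],
    "lnPGDP":  [700.0, 702.0, 705.0, 709.0, 650.0, 648.0, 655.0],
})

# Sort so that shifting moves values forward in time within each country
toy = toy.sort_values(["country", "year"])
for k in (1, 2):
    toy[f"ly{k}"] = toy.groupby("country")["lnPGDP"].shift(k)

# Each additional lag loses one observation per country
n_ly1 = int(toy["ly1"].notna().sum())   # 7 rows - 2 countries
n_ly2 = int(toy["ly2"].notna().sum())   # 7 rows - 2 * 2 countries
```

The coverage counts in the dictionary are consistent with this pattern: each successive lag has 98 fewer observations (4,018, 3,920, 3,822, 3,724, 3,626), one per country.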

Distribution & statistics

| Variable | Distribution | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|---|
| name | | 100% | 4,018 | 98 | | | | | |
| country | | 100% | 4,018 | 98 | | | | | |
| year | | 100% | 4,018 | 41 | 1970 | 1990.0 | 1990 | 2010 | 11.83 |
| lnPGDP | min 406, median 750, max 1.09e+03 | 100% | 4,018 | 4,018 | 405.7 | 758.6 | 749.8 | 1,094.0 | 162.9 |
| Democracy | share coded 1 = 0.545 | 100% | 4,018 | 2 | 0 | 0.545 | 1.00 | 1.00 | 0.498 |
| ly1 | min 406, median 749, max 1.09e+03 | 98% | 3,920 | 3,920 | 405.7 | 757.8 | 748.8 | 1,094.0 | 162.7 |
| ly2 | min 406, median 748, max 1.09e+03 | 95% | 3,822 | 3,822 | 405.7 | 757.0 | 747.7 | 1,094.0 | 162.4 |
| ly3 | min 406, median 746, max 1.09e+03 | 93% | 3,724 | 3,724 | 405.7 | 756.2 | 746.4 | 1,094.0 | 162.1 |
| ly4 | min 406, median 746, max 1.09e+03 | 90% | 3,626 | 3,626 | 405.7 | 755.4 | 745.6 | 1,089.1 | 161.7 |
| lnpop | min 12.2, median 16.1, max 21 | 100% | 4,018 | 4,014 | 12.23 | 16.14 | 16.08 | 21.01 | 1.58 |
| lnpop95 | min 12.2, median 16.1, max 20.9 | 97% | 3,895 | 95 | 12.17 | 16.23 | 16.11 | 20.87 | 1.54 |
| lnTrade | min 170, median 409, max 607 | 96% | 3,845 | 3,845 | 169.8 | 408.6 | 409.3 | 607.2 | 59.55 |
| lnMort | min 64.2, median 370, max 531 | 98% | 3,932 | 1,331 | 64.19 | 346.5 | 370.1 | 531.3 | 111.2 |
| transition | share coded 1 = 0.025 | 100% | 4,018 | 2 | 0 | 0.025 | 0 | 1.00 | 0.155 |
| transid | share coded 1 = 0.551 | 100% | 4,018 | 2 | 0 | 0.551 | 1.00 | 1.00 | 0.497 |

Known limitations & caveats