Downloads
Each dataset is available as a labeled Stata .dta file and as its source file.
⇩ Download all data (ZIP)
stata_codebook.do
| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| saving | country-year (balanced) | 840 × 7 | saving.dta | saving.dta |
| democracy | country-year (unbalanced) | 4,018 × 15 | democracy.dta | democracy.dta |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_panel_lasso_cluster/data/"
use "${BASE}saving.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_panel_lasso_cluster/data/"
df = pd.read_stata(BASE + "saving.dta")
# load every dataset at once
files = ["saving", "democracy"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files, so download first
import urllib.request
import pyreadstat
urllib.request.urlretrieve(BASE + "saving.dta", "saving.dta")
df, meta = pyreadstat.read_dta("saving.dta")
Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/stata_panel_lasso_cluster/data/"
df <- read_dta(paste0(BASE, "saving.dta"))
Overview & sources
Companion data for a hands-on Stata tutorial that uses the Classifier-LASSO (C-LASSO) method of Su, Shi & Phillips (2016), via the classifylasso command (Huang, Wang & Zhou 2024), to discover latent group structures in panel data: subsets of units that share slope coefficients within a group, while the coefficients differ across groups. Two source datasets drive the post. saving is a balanced panel of 56 countries observed over 15 periods (840 observations) on savings behavior, from the Su–Shi–Phillips (2016) replication. democracy is the Acemoglu, Naidu, Restrepo & Robinson (2019) democracy-and-growth panel of 98 countries over 1970–2010 (4,018 observations). The tutorial shows that the pooled democracy–growth effect of +1.055 masks a +2.151 effect in 57 countries and a −0.936 effect in 41 countries, a sign reversal that exemplifies Simpson's paradox.
saving is a strongly balanced country panel (one row per country × period, 56 countries × 15 periods = 840 rows; year is coded 1–15, corresponding to 1995–2010); all five economic variables are standardized to mean zero, standard deviation one. democracy is an unbalanced country panel (one row per country × year, 4,018 rows over 1970–2010) carrying log per-capita GDP, a binary democracy indicator, four lags of GDP, and demographic/trade/mortality covariates.
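The balance claim (56 countries × 15 periods = 840 rows, one row per country × period) is easy to check after loading. The sketch below is illustrative only: `panel_shape` is a hypothetical helper, and the rows are a synthetic stand-in with the documented dimensions, not the real data.

```python
from collections import Counter


def panel_shape(rows, unit_key="code", time_key="year"):
    """Return (n_units, n_periods, is_balanced) for a list of dict rows."""
    units = {r[unit_key] for r in rows}
    periods = {r[time_key] for r in rows}
    pairs = Counter((r[unit_key], r[time_key]) for r in rows)
    # Balanced: every unit-period combination appears exactly once.
    is_balanced = (
        len(pairs) == len(units) * len(periods)
        and all(count == 1 for count in pairs.values())
    )
    return len(units), len(periods), is_balanced


# Synthetic stand-in (assumption): codes 1-56 and periods 1-15, as documented.
rows = [{"code": c, "year": t} for c in range(1, 57) for t in range(1, 16)]
n_units, n_periods, balanced = panel_shape(rows)
print(n_units, n_periods, len(rows), balanced)  # 56 15 840 True
```

With the pandas DataFrame from the loading snippet, the same helper could be called as `panel_shape(df.to_dict("records"))`.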
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Su, Shi & Phillips (2016) | C-LASSO method and the savings panel (saving.dta) used for replication | Su, L., Shi, Z., & Phillips, P. C. B. (2016). Identifying latent structures in panel data. Econometrica, 84(6), 2215–2264. https://doi.org/10.3982/ECTA12560 |
| Huang, Wang & Zhou (2024) | The classifylasso Stata command implementing C-LASSO | Huang, W., Wang, Y., & Zhou, L. (2024). Identify latent group structures in panel data: The classifylasso command. Stata Journal, 24(1), 173–203. https://doi.org/10.1177/1536867X241233664 |
| Acemoglu, Naidu, Restrepo & Robinson (2019) | The democracy-and-growth panel (democracy.dta) | Acemoglu, D., Naidu, S., Restrepo, P., & Robinson, J. A. (2019). Democracy does cause growth. Journal of Political Economy, 127(1), 47–100. https://doi.org/10.1086/700936 |
| Method references | Dynamic-panel bias correction and estimators | Dhaene & Jochmans (2015), half-panel jackknife; Nickell (1981), dynamic-panel bias; reghdfe (Correia) for cluster-robust two-way fixed effects. |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Identifying Latent Group Structures in Panel Data: The classifylasso Command in Stata [Data set]. https://carlos-mendez.org/post/stata_panel_lasso_cluster/
Su, L., Shi, Z., & Phillips, P. C. B. (2016). Identifying latent structures in panel data. Econometrica, 84(6), 2215–2264. https://doi.org/10.3982/ECTA12560
Acemoglu, D., Naidu, S., Restrepo, P., & Robinson, J. A. (2019). Democracy does cause growth. Journal of Political Economy, 127(1), 47–100. https://doi.org/10.1086/700936
Huang, W., Wang, Y., & Zhou, L. (2024). Identify latent group structures in panel data: The classifylasso command. Stata Journal, 24(1), 173–203. https://doi.org/10.1177/1536867X241233664
BibTeX
@misc{mendez2026statapanellassocluster,
author = {Mendez, Carlos},
title = {Identifying Latent Group Structures in Panel Data: The classifylasso Command in Stata},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/stata_panel_lasso_cluster/}},
note = {Data set}
}
@article{su2016identifying,
author = {Su, Liangjun and Shi, Zhentao and Phillips, Peter C. B.},
title = {Identifying Latent Structures in Panel Data},
journal = {Econometrica},
volume = {84}, number = {6}, pages = {2215--2264}, year = {2016},
doi = {10.3982/ECTA12560}
}
@article{acemoglu2019democracy,
author = {Acemoglu, Daron and Naidu, Suresh and Restrepo, Pascual and Robinson, James A.},
title = {Democracy Does Cause Growth},
journal = {Journal of Political Economy},
volume = {127}, number = {1}, pages = {47--100}, year = {2019},
doi = {10.1086/700936}
}
@article{huang2024classifylasso,
author = {Huang, Wenxin and Wang, Yuan and Zhou, Lin},
title = {Identify Latent Group Structures in Panel Data: The classifylasso Command},
journal = {Stata Journal},
volume = {24}, number = {1}, pages = {173--203}, year = {2024},
doi = {10.1177/1536867X241233664}
}
Variable explorer (all 21 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| Democracy | dummy | Democracy indicator (ANRR) | Binary democracy measure of Acemoglu, Naidu, Restrepo & Robinson: 1 if democratic country-year, else 0. | 0/1 | democracy | Acemoglu et al. (2019) |
| code | identifier | Country code (savings panel) | Numeric country identifier in the savings panel (1-56). | integer code | saving | Su, Shi & Phillips (2016) |
| country | identifier | Numeric country code (democracy panel) | Generated numeric country code; the panel id for the democracy application. | integer code | democracy | Acemoglu et al. (2019) |
| cpi | continuous | CPI inflation (standardized) | Standardized consumer-price-index inflation; the regressor whose sign reverses across groups. | SD units | saving | Su, Shi & Phillips (2016) |
| gdp | continuous | GDP growth (standardized) | Standardized GDP growth rate; positive in both savings groups. | SD units | saving | Su, Shi & Phillips (2016) |
| interest | continuous | Real interest rate (standardized) | Standardized real interest rate; also exhibits a group sign reversal. | SD units | saving | Su, Shi & Phillips (2016) |
| lagsavings | continuous | Lagged savings-to-GDP ratio (standardized) | One-period lag of the standardized savings ratio; the dynamic-model persistence term. | SD units | saving | Su, Shi & Phillips (2016) |
| lnMort | continuous | Log child mortality x100 | Log of the child mortality rate, multiplied by 100. | log x100 | democracy | Acemoglu et al. (2019) |
| lnPGDP | continuous | Log per-capita GDP x100 (2000 USD) | Log of GDP per capita in 2000 constant dollars, multiplied by 100; the growth outcome. | log USD x100 | democracy | Acemoglu et al. (2019) |
| lnTrade | continuous | Log trade x100 (% of GDP) | Log of trade (exports + imports as % of GDP), multiplied by 100. | log x100 | democracy | Acemoglu et al. (2019) |
| lnpop | continuous | Log population | Log of population. | log persons | democracy | Acemoglu et al. (2019) |
| lnpop95 | continuous | Log population in 1995 | Log of population in 1995 (time-invariant per country). | log persons | democracy | Acemoglu et al. (2019) |
| ly1 | continuous | Lag 1 of log per-capita GDP x100 | First lag of lnPGDP; the dynamic-model persistence term. | log USD x100 | democracy | Acemoglu et al. (2019) |
| ly2 | continuous | Lag 2 of log per-capita GDP x100 | Second lag of lnPGDP (used in the empirical.do lag-structure extensions). | log USD x100 | democracy | Acemoglu et al. (2019) |
| ly3 | continuous | Lag 3 of log per-capita GDP x100 | Third lag of lnPGDP (used in the lag-structure extensions). | log USD x100 | democracy | Acemoglu et al. (2019) |
| ly4 | continuous | Lag 4 of log per-capita GDP x100 | Fourth lag of lnPGDP (used in the lag-structure extensions). | log USD x100 | democracy | Acemoglu et al. (2019) |
| name | identifier | Country name | Country name string for the democracy panel. | string | democracy | Acemoglu et al. (2019) |
| savings | continuous | Savings-to-GDP ratio (standardized) | Standardized savings-to-GDP ratio; the outcome of the savings application. | SD units | saving | Su, Shi & Phillips (2016) |
| transid | dummy | Transition-country indicator | 1 if the country ever experiences a transition, else 0 (time-invariant per country). | 0/1 | democracy | Acemoglu et al. (2019) |
| transition | dummy | Transition-year indicator | 1 if the country-year is a democratic-transition year, else 0. | 0/1 | democracy | Acemoglu et al. (2019) |
| year | year | Time period | Time index of the panel. In the savings panel coded 1-15 (= 1995-2010); in the democracy panel the calendar year. | period / year | saving, democracy | Source panels |
Cross-file variable index
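Which file each variable appears in (● = present).
| Variable | saving | democracy |
|---|---|---|
| Democracy | | ● |
| code | ● | |
| country | | ● |
| cpi | ● | |
| gdp | ● | |
| interest | ● | |
| lagsavings | ● | |
| lnMort | | ● |
| lnPGDP | | ● |
| lnTrade | | ● |
| lnpop | | ● |
| lnpop95 | | ● |
| ly1 | | ● |
| ly2 | | ● |
| ly3 | | ● |
| ly4 | | ● |
| name | | ● |
| savings | ● | |
| transid | | ● |
| transition | | ● |
| year | ● | ● |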
Construction & formulas
C-LASSO minimizes a penalized least-squares objective that shrinks each unit's
slope vector toward one of K group centers, then re-estimates each group by plain OLS
(postlasso) for valid inference.
- Objective: Q = (1/NT) Σ_i Σ_t (y_it − β_i'x_it)² + (λ/N) Σ_i Π_k ‖β_i − α_k‖, that is, the SSR plus a product-form group penalty, where the LASSO penalty weight is λ = λ_NT = c_λ·T^(−1/3).
- Group membership: β_i = α_k if i ∈ G_k, for k = 1,…,K, so each unit takes its group's slope vector.
- Number of groups: chosen to minimize an information criterion (IC); the complexity tuning parameter is ρ_NT = c_ρ·(NT)^(−1/2). K = 2 is selected in all specifications.
- Cluster-robust inference: standard errors are clustered by country (`cluster(country)`), adjusting for within-country serial correlation in the two-way fixed-effects (country + year) postlasso step.
- Dynamic bias correction: with a lagged dependent variable (`lagsavings` / `ly1`), the `dynamic` option applies the half-panel jackknife to remove Nickell bias before grouping.
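To make the formulas above concrete, here is a minimal Python sketch that evaluates the objective Q for given slopes and group centers, computes the two tuning parameters, and applies the standard half-panel jackknife combination (twice the full-sample estimate minus the average of the two half-panel estimates). All function names are hypothetical, the constants c_λ and c_ρ are left as inputs because the page does not state them, and this is not the classifylasso implementation.

```python
import math


def classo_objective(y, x, beta, alpha, lam):
    """Evaluate Q = SSR/(NT) + (lam/N) * sum_i prod_k ||beta_i - alpha_k||.

    y[i][t]  : outcome of unit i in period t
    x[i][t]  : list of regressors of unit i in period t
    beta[i]  : slope vector of unit i
    alpha[k] : slope vector (center) of group k
    """
    n_units, n_periods = len(y), len(y[0])
    ssr = 0.0
    penalty = 0.0
    for i in range(n_units):
        for t in range(n_periods):
            fitted = sum(b * v for b, v in zip(beta[i], x[i][t]))
            ssr += (y[i][t] - fitted) ** 2
        # Product of Euclidean distances to every center: it is zero as soon
        # as beta[i] coincides with any one group center.
        product = 1.0
        for center in alpha:
            product *= math.dist(beta[i], center)
        penalty += product
    return ssr / (n_units * n_periods) + lam / n_units * penalty


def tuning_parameters(n_units, n_periods, c_lambda, c_rho):
    """lambda_NT = c_lambda * T^(-1/3), rho_NT = c_rho * (NT)^(-1/2)."""
    lam = c_lambda * n_periods ** (-1 / 3)
    rho = c_rho * (n_units * n_periods) ** (-1 / 2)
    return lam, rho


def half_panel_jackknife(theta_full, theta_first_half, theta_second_half):
    """Bias-corrected estimate from full-sample and half-panel estimates."""
    return 2 * theta_full - (theta_first_half + theta_second_half) / 2


# Toy data (assumption): two units, two periods, one regressor.
y = [[1.0, 2.0], [-1.0, -2.0]]
x = [[[1.0], [2.0]], [[1.0], [2.0]]]
alpha = [[1.0], [-1.0]]
print(classo_objective(y, x, [[1.0], [-1.0]], alpha, lam=0.5))
print(classo_objective(y, x, [[0.0], [-1.0]], alpha, lam=0.5))
```

The first call places each unit exactly on a group center with a perfect fit, so both the SSR and the penalty vanish; the second moves unit 0 off both centers and pays on both terms.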
The savings panel is pre-standardized (each variable demeaned and scaled to unit SD), so its
coefficients are read in standard-deviation units. The democracy panel's lnPGDP and the
GDP lags are log per-capita GDP × 100.
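The two scalings described here can be sketched as follows. The helper names are hypothetical, and the use of the sample standard deviation is an assumption: the page says only that each savings variable has mean 0 and SD 1, not which SD formula was used.

```python
import math
import statistics


def standardize(values):
    """Demean and scale to unit SD (sample SD; an assumption, see lead-in)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def log_x100(level):
    """Scale used for lnPGDP and its lags: 100 * natural log of the level."""
    return 100 * math.log(level)


def level_from_log_x100(scaled):
    """Invert the x100 log scale back to the original level."""
    return math.exp(scaled / 100)


print(standardize([1.0, 2.0, 3.0]))        # [-1.0, 0.0, 1.0]
print(round(level_from_log_x100(758.6)))   # 1970
```

On the ×100 log scale a one-unit difference is approximately a one percent difference in the level, which is why coefficients on lnPGDP are usually read as percentage effects.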
The datasets
Each dataset below has its full variable dictionary plus a statistics table with data coverage.
saving
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| code | identifier | Country code (savings panel) | Numeric country identifier in the savings panel (1-56). | Anonymized integer code; no country-name mapping is provided in the file. | integer code | Su, Shi & Phillips (2016) | 56 countries |
| year | year | Time period | Time index of the panel. In the savings panel coded 1-15 (= 1995-2010); in the democracy panel the calendar year. | Sequential period index (savings) / calendar year (democracy). | period / year | Source panels | |
| savings | continuous | Savings-to-GDP ratio (standardized) | Standardized savings-to-GDP ratio; the outcome of the savings application. | Savings/GDP standardized to mean 0, SD 1 across the panel. | SD units | Su, Shi & Phillips (2016) | 840 obs |
| lagsavings | continuous | Lagged savings-to-GDP ratio (standardized) | One-period lag of the standardized savings ratio; the dynamic-model persistence term. | L.savings within country; standardized to mean 0, SD 1. | SD units | Su, Shi & Phillips (2016) | 840 obs |
| cpi | continuous | CPI inflation (standardized) | Standardized consumer-price-index inflation; the regressor whose sign reverses across groups. | CPI inflation standardized to mean 0, SD 1. | SD units | Su, Shi & Phillips (2016) | 840 obs |
| interest | continuous | Real interest rate (standardized) | Standardized real interest rate; also exhibits a group sign reversal. | Real interest rate standardized to mean 0, SD 1. | SD units | Su, Shi & Phillips (2016) | 840 obs |
| gdp | continuous | GDP growth (standardized) | Standardized GDP growth rate; positive in both savings groups. | GDP growth standardized to mean 0, SD 1. | SD units | Su, Shi & Phillips (2016) | 840 obs |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| code | 100% | 840 | 56 | – | – | – | – | – |
| year | 100% | 840 | 15 | 1 | 8.0 | 8 | 15 | 4.32 |
| savings | 100% | 840 | 840 | -2.50 | -2.87e-08 | -0.030 | 2.89 | 1.00 |
| lagsavings | 100% | 840 | 840 | -2.83 | 5.81e-08 | -0.033 | 2.92 | 1.00 |
| cpi | 100% | 840 | 840 | -2.77 | 3.56e-09 | -0.208 | 3.55 | 1.00 |
| interest | 100% | 840 | 840 | -3.60 | -7.17e-09 | 0.006 | 3.28 | 1.00 |
| gdp | 100% | 840 | 840 | -3.55 | 1.06e-08 | 0.194 | 2.46 | 1.00 |
democracy
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source | Coverage |
|---|---|---|---|---|---|---|---|
| name | identifier | Country name | Country name string for the democracy panel. | Country identifier carried alongside the numeric code. | string | Acemoglu et al. (2019) | 98 countries |
| country | identifier | Numeric country code (democracy panel) | Generated numeric country code; the panel id for the democracy application. | Generated integer code (embedded label: 'Generated numeric country code'). | integer code | Acemoglu et al. (2019) | 98 countries |
| year | year | Time period | Time index of the panel. In the savings panel coded 1-15 (= 1995-2010); in the democracy panel the calendar year. | Sequential period index (savings) / calendar year (democracy). | period / year | Source panels | |
| lnPGDP | continuous | Log per-capita GDP x100 (2000 USD) | Log of GDP per capita in 2000 constant dollars, multiplied by 100; the growth outcome. | 100 x log(real GDP per capita, 2000 constant USD). | log USD x100 | Acemoglu et al. (2019) | 4,018 obs |
| Democracy | dummy | Democracy indicator (ANRR) | Binary democracy measure of Acemoglu, Naidu, Restrepo & Robinson: 1 if democratic country-year, else 0. | ANRR dichotomous democracy classification. | 0/1 | Acemoglu et al. (2019) | 4,018 obs (55% = 1) |
| ly1 | continuous | Lag 1 of log per-capita GDP x100 | First lag of lnPGDP; the dynamic-model persistence term. | L1.lnPGDP within country. | log USD x100 | Acemoglu et al. (2019) | 3,920 obs |
| ly2 | continuous | Lag 2 of log per-capita GDP x100 | Second lag of lnPGDP (used in the empirical.do lag-structure extensions). | L2.lnPGDP within country. | log USD x100 | Acemoglu et al. (2019) | 3,822 obs |
| ly3 | continuous | Lag 3 of log per-capita GDP x100 | Third lag of lnPGDP (used in the lag-structure extensions). | L3.lnPGDP within country. | log USD x100 | Acemoglu et al. (2019) | 3,724 obs |
| ly4 | continuous | Lag 4 of log per-capita GDP x100 | Fourth lag of lnPGDP (used in the lag-structure extensions). | L4.lnPGDP within country. | log USD x100 | Acemoglu et al. (2019) | 3,626 obs |
| lnpop | continuous | Log population | Log of population. | log(population). | log persons | Acemoglu et al. (2019) | 4,014 obs |
| lnpop95 | continuous | Log population in 1995 | Log of population in 1995 (time-invariant per country). | log(population in 1995). | log persons | Acemoglu et al. (2019) | 95 countries |
| lnTrade | continuous | Log trade x100 (% of GDP) | Log of trade (exports + imports as % of GDP), multiplied by 100. | 100 x log(trade share of GDP). | log x100 | Acemoglu et al. (2019) | 3,845 obs |
| lnMort | continuous | Log child mortality x100 | Log of the child mortality rate, multiplied by 100. | 100 x log(child mortality rate). | log x100 | Acemoglu et al. (2019) | 3,932 obs |
| transition | dummy | Transition-year indicator | 1 if the country-year is a democratic-transition year, else 0. | ANRR transition-year flag. | 0/1 | Acemoglu et al. (2019) | 4,018 obs |
| transid | dummy | Transition-country indicator | 1 if the country ever experiences a transition, else 0 (time-invariant per country). | ANRR transition-country flag. | 0/1 | Acemoglu et al. (2019) | 4,018 obs |
Distribution & statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| name | 100% | 4,018 | 98 | – | – | – | – | – |
| country | 100% | 4,018 | 98 | – | – | – | – | – |
| year | 100% | 4,018 | 41 | 1970 | 1990.0 | 1990 | 2010 | 11.83 |
| lnPGDP | 100% | 4,018 | 4,018 | 405.7 | 758.6 | 749.8 | 1,094.0 | 162.9 |
| Democracy | 100% | 4,018 | 2 | 0 | 0.545 | 1.00 | 1.00 | 0.498 |
| ly1 | 98% | 3,920 | 3,920 | 405.7 | 757.8 | 748.8 | 1,094.0 | 162.7 |
| ly2 | 95% | 3,822 | 3,822 | 405.7 | 757.0 | 747.7 | 1,094.0 | 162.4 |
| ly3 | 93% | 3,724 | 3,724 | 405.7 | 756.2 | 746.4 | 1,094.0 | 162.1 |
| ly4 | 90% | 3,626 | 3,626 | 405.7 | 755.4 | 745.6 | 1,089.1 | 161.7 |
| lnpop | 100% | 4,018 | 4,014 | 12.23 | 16.14 | 16.08 | 21.01 | 1.58 |
| lnpop95 | 97% | 3,895 | 95 | 12.17 | 16.23 | 16.11 | 20.87 | 1.54 |
| lnTrade | 96% | 3,845 | 3,845 | 169.8 | 408.6 | 409.3 | 607.2 | 59.55 |
| lnMort | 98% | 3,932 | 1,331 | 64.19 | 346.5 | 370.1 | 531.3 | 111.2 |
| transition | 100% | 4,018 | 2 | 0 | 0.025 | 0 | 1.00 | 0.155 |
| transid | 100% | 4,018 | 2 | 0 | 0.551 | 1.00 | 1.00 | 0.497 |
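The N column for ly1 to ly4 falls by 98 per lag (3,920, 3,822, 3,724, 3,626). That pattern is what one would expect if each of the 98 countries contributes 41 consecutive years and loses one leading observation per lag. The sketch below reproduces the counts on a synthetic stand-in under that assumption; `lag_within_unit` is a hypothetical helper and the values are arbitrary, not the real data.

```python
def lag_within_unit(rows, unit_key, time_key, value_key, k):
    """Map (unit, time) to the value k periods earlier, or None if absent."""
    lookup = {(r[unit_key], r[time_key]): r[value_key] for r in rows}
    return {
        (unit, time): lookup.get((unit, time - k))
        for (unit, time) in lookup
    }


# Synthetic stand-in (assumption): 98 countries x 41 years, 1970-2010.
rows = [
    {"country": c, "year": y, "lnPGDP": 700.0 + c + (y - 1970)}
    for c in range(1, 99)
    for y in range(1970, 2011)
]

counts = {}
for k in range(1, 5):
    lagged = lag_within_unit(rows, "country", "year", "lnPGDP", k)
    # A lag is missing for the first k years of every country.
    counts[k] = sum(value is not None for value in lagged.values())

print(len(rows), counts)
# 4018 {1: 3920, 2: 3822, 3: 3724, 4: 3626}
```

Because the lag is looked up by (country, year − k) rather than by row position, it never borrows a value from a neighboring country, which is the usual pitfall when lagging a stacked panel.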
Known limitations & caveats
- Standardized savings variables. In saving.dta, savings, lagsavings, cpi, interest, and gdp are each standardized to mean 0, SD 1, so coefficients are in standard-deviation units, not raw economic magnitudes.
- Coded time index. In saving.dta, year is coded 1–15 (not calendar years); these 15 periods correspond to 1995–2010 in the source.
- Numeric country codes. Both panels identify units by numeric code (code in savings; country in democracy). The democracy file carries a name string; the savings file does not, so savings groups cannot be mapped to named countries.
- Scaled GDP variables. In democracy.dta, lnPGDP and its lags ly1–ly4 are log per-capita GDP multiplied by 100; lnTrade and lnMort are likewise scaled by 100.
- Associations, not proven causation. The group-specific C-LASSO estimates are conditional associations within the panel model; a causal reading of the democracy effect requires the same identifying assumptions as Acemoglu et al. (2019).
- Close IC in the democracy model. IC values across K = 1–5 span only 3.267–3.280; K = 2 is optimal but not dominant, so group assignments are sensitive to the tuning parameter ρ.
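Why a narrow IC range makes the choice of K fragile can be illustrated with a small sketch. It assumes an IC of the form log(mean squared residual) + ρ·p·K with ρ = c_ρ·(NT)^(−1/2), which is one common way of writing the criterion and is not confirmed by this page. The dimensions, the SSR values, and both c_ρ settings are hypothetical and are not the tutorial's results.

```python
import math


def information_criterion(ssr, n_units, n_periods, n_regressors, n_groups, c_rho):
    """Assumed form: log(SSR / NT) + rho * p * K, rho = c_rho * (NT)^(-1/2)."""
    n_obs = n_units * n_periods
    rho = c_rho * n_obs ** (-1 / 2)
    return math.log(ssr / n_obs) + rho * n_regressors * n_groups


def select_k(ic_by_k):
    """Return the K with the smallest IC value."""
    return min(ic_by_k, key=ic_by_k.get)


# Hypothetical inputs: N = 100, T = 40, p = 2, and made-up SSRs by K.
ssr_by_k = {1: 106400.0, 2: 104000.0, 3: 102800.0}


def ic_table(c_rho):
    return {
        k: information_criterion(ssr, 100, 40, 2, k, c_rho)
        for k, ssr in ssr_by_k.items()
    }


for c_rho in (2 / 3, 0.25):
    table = ic_table(c_rho)
    print(c_rho, select_k(table), {k: round(v, 4) for k, v in table.items()})
```

In this toy setup the larger c_ρ selects K = 2 and the smaller one selects K = 3, even though the IC values differ only in the second or third decimal, which is the kind of sensitivity the caveat above describes.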