Downloads
Each dataset is available as a labeled Stata .dta and its source file.
Download all data (ZIP) · stata_codebook.do
| Dataset | Grain | Rows | Stata | Source |
|---|---|---|---|---|
| sim_country_panel | country-year | 890 × 19 | sim_country_panel.dta | sim_country_panel.csv |
| sim_regional_gdp | region-year (base year) | 820 × 5 | sim_regional_gdp.dta | sim_regional_gdp.csv |
Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.
Load directly in code
Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.
Stata
* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_kuznets/data/"
use "${BASE}sim_country_panel.dta", clear
describe
notes
Python
!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_kuznets/data/"
df = pd.read_stata(BASE + "sim_country_panel.dta")
# load every dataset at once
files = ["sim_country_panel", "sim_regional_gdp"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}
# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "sim_country_panel.dta", "sim_country_panel.dta")
df, meta = pyreadstat.read_dta("sim_country_panel.dta")
To run this snippet, copy and paste it into a Google Colab notebook: https://colab.research.google.com/notebooks/empty.ipynb
R
# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_kuznets/data/"
df <- read_dta(paste0(BASE, "sim_country_panel.dta"))
Overview & sources
Companion data for a hands-on R tutorial that replicates Lessmann (2014) on a fully synthetic dataset of regional GDP per capita for 56 countries over 1980–2009. The post measures spatial inequality with the population-weighted coefficient of variation (WCV) of regional income and estimates the spatial Kuznets curve — the hypothesis that regional inequality first rises then falls as development increases — using cross-section OLS, two-way fixed effects (fixest), and semiparametric methods (Robinson 1988, Baltagi–Li 2002). The inverted-U is robust across estimators; the high-income upturn (N-shape) appears only in cross-country comparisons, not within countries over time. The entire data-generating process is open and reproducible.
sim_country_panel is an annual country panel (one row per country × year, unbalanced over 1980–2009) carrying the inequality indices and covariates. sim_regional_gdp is the regional micro cross-section (one row per region × country, at each country's base year) from which the indices are built.
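After loading, the stated grain can be confirmed by checking that the key columns identify each row uniquely. The sketch below is a minimal illustration on hand-made rows, not the real files: `duplicate_keys` is a hypothetical helper, the values are invented, and only the column names `country`, `year`, and `region` come from the datasets.

```python
from collections import Counter

def duplicate_keys(rows, keys):
    """Return the key tuples that occur more than once in `rows`."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Toy stand-ins for the two files (illustrative values only).
panel = [
    {"country": "A", "year": 1990},
    {"country": "A", "year": 1991},
    {"country": "B", "year": 1990},
]
regional = [
    {"country": "A", "year": 1990, "region": 1},
    {"country": "A", "year": 1990, "region": 2},
    {"country": "B", "year": 1995, "region": 1},
]

# An empty list means the key columns match the documented grain.
print(duplicate_keys(panel, ["country", "year"]))
print(duplicate_keys(regional, ["country", "region"]))
```

On a pandas DataFrame the equivalent check is `df.duplicated(subset=["country", "year"]).any()`, which should be `False` for the country panel.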
Data sources
| Source | Provides | Reference / URL |
|---|---|---|
| Lessmann (2014) | Replicated study; calibration targets and country scaffold (region counts, areas) | Lessmann, C. (2014). Spatial inequality and development — Is there an inverted-U relationship? Journal of Development Economics, 106, 35–51. |
| Synthetic (this study) | All values — simulated via a calibrated lognormal data-generating process (open & reproducible) | Mendez, C. (2026). See the post's R script analysis.R for the full DGP. |
| Method references | Estimators and concepts | Kuznets (1955); Williamson (1965); Robinson (1988); Baltagi & Li (2002). |
Cite this data
Please cite this dataset as follows.
APA
Mendez, C. (2026). Spatial Inequality and the Kuznets Curve: Parametric and Semiparametric Estimates in R [Data set]. https://carlos-mendez.org/post/r_kuznets/
Lessmann, C. (2014). Spatial inequality and development — Is there an inverted-U relationship? Journal of Development Economics, 106, 35–51.
BibTeX
@misc{mendez2026rkuznets,
author = {Mendez, Carlos},
title = {Spatial Inequality and the Kuznets Curve: Parametric and Semiparametric Estimates in R},
year = {2026},
howpublished = {\url{https://carlos-mendez.org/post/r_kuznets/}},
note = {Data set}
}
@article{lessmann2014spatial,
author = {Lessmann, Christian},
title = {Spatial inequality and development---Is there an inverted-U relationship?},
journal = {Journal of Development Economics},
volume = {106}, pages = {35--51}, year = {2014}
}
Variable explorer (all 22 variables)
| Variable | Type | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|
| area_units | continuous | Area per region (log scale) | Land area relative to the number of regions. | log scale | sim_country_panel | Derived |
| country | identifier | Country name | Synthetic country identifier (real country names used as labels). | string | sim_country_panel, sim_regional_gdp | Scaffold (Lessmann 2014 appendix) |
| cv | continuous | Coefficient of variation (unweighted) | Unweighted spread of regional GDP per capita. | >=0 | sim_country_panel | Computed (this study) |
| ethnic | continuous | Ethnic fractionalization index | Degree of ethnic/linguistic heterogeneity (time-invariant per country). | 0-1 | sim_country_panel | Simulation |
| federal | dummy | Federal-state dummy (1=federal) | 1 if the country has a federal constitution, else 0. | 0/1 | sim_country_panel | Assigned |
| gdp_pc | continuous | Regional GDP per capita | Simulated regional GDP per capita. | US$ | sim_regional_gdp | Simulation |
| gini_reg | continuous | Regional Gini index | Population-weighted Gini of the regional income distribution. | 0-1 | sim_country_panel | Computed (this study) |
| lnGDP | continuous | Log GDP per capita | Natural log of country GDP per capita. | log US$ | sim_country_panel | Simulation |
| lnGDP2 | continuous | Log GDP per capita, squared | Quadratic income term for the Kuznets polynomial. | log^2 | sim_country_panel | Derived |
| lnGDP3 | continuous | Log GDP per capita, cubed | Cubic income term (potential N-shape) for the Kuznets polynomial. | log^3 | sim_country_panel | Derived |
| lnarea | continuous | Log land area (km^2) | Natural log of country land area. | log km^2 | sim_country_panel | Scaffold (Lessmann 2014) |
| lnunits | continuous | Log number of regions | Natural log of the count of territorial units. | log count | sim_country_panel | Scaffold (Lessmann 2014) |
| nonag | continuous | Non-agricultural share of GVA (%) | Share of gross value added outside agriculture (structural change). | % (0-100) | sim_country_panel | Simulation |
| period5 | identifier | 5-year period group | Categorical 5-year period (1=1980-84 ... 6=2005-09). | 1-6 | sim_country_panel | Derived |
| pop_share | continuous | Regional population share | Fraction of country population in the region (sums to 1 per country). | 0-1 | sim_regional_gdp | Simulation |
| region | identifier | Region index within country | Sequential, persistent region identifier (region 1 is typically the capital). | integer | sim_regional_gdp | Simulation |
| region_grp | identifier | World Bank region group | Geographic grouping of the country. | code | sim_country_panel | Assigned |
| trade_gdp | continuous | Trade openness (% of GDP) | Exports + imports as a share of GDP. | % GDP | sim_country_panel | Simulation |
| urbanization | continuous | Urbanization rate (%) | Share of population in urban areas. | % | sim_country_panel | Simulation |
| wcv | continuous | Pop-weighted coefficient of variation | Population-weighted spread of regional GDP per capita / country mean (headline index). | >=0 | sim_country_panel | Computed (this study) |
| wcv_nocap | continuous | WCV excluding the capital region | WCV recomputed after dropping the largest (capital) region. | >=0 | sim_country_panel | Computed (this study) |
| year | year | Calendar year | Annual time index. | year | sim_country_panel, sim_regional_gdp | Simulation |
Cross-file variable index
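country and year appear in both files. region, pop_share, and gdp_pc appear only in sim_regional_gdp. The remaining 17 variables appear only in sim_country_panel.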
Construction & formulas
Inequality is measured within each country-year, across that country's regions, on regional GDP per capita y_j with population shares p_j and country mean ȳ.
- Weighted coefficient of variation (wcv): WCV = (1/ȳ) · [ Σ_j p_j (ȳ − y_j)² ]^(1/2). This is the headline scale-free index.
- Unweighted CV (cv): standard deviation / mean of regional gdp_pc (robustness).
- Regional Gini (gini_reg): population-weighted Gini over sorted regional income (robustness).
- WCV excluding the capital (wcv_nocap): drop region 1, re-normalize shares, recompute WCV.
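The four indices can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the post's code: ȳ is taken to be the population-weighted mean Σ_j p_j y_j, the unweighted CV uses the population standard deviation (divide by n), and the Gini uses the pairwise mean-absolute-difference form rather than the sorted form mentioned above, so values may differ slightly from the post's implementation. The toy incomes and shares are invented.

```python
import math

def weighted_mean(y, p):
    """Population-weighted mean (assumed definition of the country mean)."""
    return sum(pj * yj for pj, yj in zip(p, y))

def wcv(y, p):
    """WCV = (1/ybar) * sqrt(sum_j p_j * (ybar - y_j)^2)."""
    ybar = weighted_mean(y, p)
    return math.sqrt(sum(pj * (ybar - yj) ** 2 for pj, yj in zip(p, y))) / ybar

def cv(y):
    """Unweighted CV: population SD divided by the simple mean (assumption)."""
    mean = sum(y) / len(y)
    return math.sqrt(sum((yj - mean) ** 2 for yj in y) / len(y)) / mean

def gini_weighted(y, p):
    """Population-weighted Gini via pairwise absolute differences."""
    ybar = weighted_mean(y, p)
    mad = sum(pi * pj * abs(yi - yj)
              for yi, pi in zip(y, p) for yj, pj in zip(y, p))
    return mad / (2 * ybar)

def wcv_nocap(y, p):
    """Drop region 1 (first element), re-normalize shares, recompute WCV."""
    rest_y, rest_p = y[1:], p[1:]
    total = sum(rest_p)
    return wcv(rest_y, [pj / total for pj in rest_p])

# Toy country with three regions; region 1 (the capital) comes first.
y = [20000.0, 10000.0, 5000.0]   # regional GDP per capita (invented)
p = [0.5, 0.3, 0.2]              # population shares, summing to 1

print(wcv(y, p), cv(y), gini_weighted(y, p), wcv_nocap(y, p))
```

Because every index divides by a mean, multiplying all regional incomes by a constant leaves the results unchanged, which is what "scale-free" means here.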
Synthetic data-generating process: for region j in country i and year t, y_ijt = exp(lnGDP_it) · exp(δ_it · z_ij − 0.5 · δ_it²), that is, the country mean scaled by a region-specific factor. Here z_ij is a persistent regional position (a rich region stays rich) and δ_it (log-dispersion) follows a structural inverted-U in development. A time-invariant between-country cubic term yields the cross-section N-shape while the panel keeps a clean inverted-U. The polynomial terms lnGDP2 and lnGDP3 are the square and cube of lnGDP.
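The mechanics of this process can be illustrated with a small simulation for a single country. It uses the form given in the variable dictionary, gdp_pc = exp(lnGDP)·exp(δ·z − 0.5·δ²). Everything else is an assumption made for illustration: the regional positions z are standard normal draws, the function `delta` and its parameters are a hypothetical inverted-U, and regions have equal population shares, so the WCV reduces to an unweighted CV. The calibrated process is in the post's analysis.R.

```python
import math
import random

def delta(ln_gdp, peak=8.5, height=0.45, curvature=0.04, floor=0.05):
    """Hypothetical inverted-U for log-dispersion (illustrative parameters)."""
    return max(floor, height - curvature * (ln_gdp - peak) ** 2)

def regional_gdp(ln_gdp, z):
    """gdp_pc_j = exp(lnGDP) * exp(delta * z_j - 0.5 * delta^2)."""
    d = delta(ln_gdp)
    return [math.exp(ln_gdp + d * zj - 0.5 * d ** 2) for zj in z]

def cv_equal_shares(y):
    """Coefficient of variation with equal regional weights."""
    mean = sum(y) / len(y)
    return math.sqrt(sum((yj - mean) ** 2 for yj in y) / len(y)) / mean

rng = random.Random(42)
z = [rng.gauss(0.0, 1.0) for _ in range(20)]  # persistent regional positions

# Dispersion at low, middle, and high development levels.
dispersion = {g: cv_equal_shares(regional_gdp(g, z)) for g in (6.5, 8.5, 10.5)}
for g, value in dispersion.items():
    print(f"lnGDP={g}: dispersion={value:.3f}")
```

Because z is fixed and δ stays positive, the ranking of regions never changes across income levels; only the spread does. When z is standard normal, the −0.5·δ² term keeps the expected multiplier equal to 1, so regional incomes stay centered on the country mean.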
The datasets
Each dataset is documented below with its full variable dictionary and a statistics table with data coverage.
sim_country_panel
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source |
|---|---|---|---|---|---|---|
| country | identifier | Country name | Synthetic country identifier (real country names used as labels). | From the hard-coded 56-country scaffold (region counts/areas from Lessmann 2014). | string | Scaffold (Lessmann 2014 appendix) |
| region_grp | identifier | World Bank region group | Geographic grouping of the country. | Assigned per country: EAP, ECA, LAC, MENA, NA, SA, SSA. | code | Assigned |
| year | year | Calendar year | Annual time index. | Country-specific start/end years (unbalanced coverage). | year | Simulation |
| lnGDP | continuous | Log GDP per capita | Natural log of country GDP per capita. | Simulated: 2009 base value minus growth path plus AR-type residual. | log US$ | Simulation |
| wcv | continuous | Pop-weighted coefficient of variation | Population-weighted spread of regional GDP per capita / country mean (headline index). | WCV = (1/ȳ)·[Σ p_j (ȳ−y_j)²]^(1/2) over regions, per country-year. | >=0 | Computed (this study) |
| cv | continuous | Coefficient of variation (unweighted) | Unweighted spread of regional GDP per capita. | SD / mean of regional gdp_pc. | >=0 | Computed (this study) |
| gini_reg | continuous | Regional Gini index | Population-weighted Gini of the regional income distribution. | Gini over sorted regional income with population shares. | 0-1 | Computed (this study) |
| wcv_nocap | continuous | WCV excluding the capital region | WCV recomputed after dropping the largest (capital) region. | Exclude region 1, re-normalize population shares, recompute WCV. | >=0 | Computed (this study) |
| trade_gdp | continuous | Trade openness (% of GDP) | Exports + imports as a share of GDP. | Country base mean + AR(1) noise, clamped to [15, 171]. | % GDP | Simulation |
| urbanization | continuous | Urbanization rate (%) | Share of population in urban areas. | Country base mean + random-walk drift, clamped to [20, 99]. | % | Simulation |
| nonag | continuous | Non-agricultural share of GVA (%) | Share of gross value added outside agriculture (structural change). | Logistic transform of lnGDP. | % (0-100) | Simulation |
| ethnic | continuous | Ethnic fractionalization index | Degree of ethnic/linguistic heterogeneity (time-invariant per country). | Beta(1.0, 1.7) draw rescaled to [0, 0.75]. | 0-1 | Simulation |
| federal | dummy | Federal-state dummy (1=federal) | 1 if the country has a federal constitution, else 0. | 1 for a fixed set of federal states (US, Canada, Brazil, India, Germany, ...). | 0/1 | Assigned |
| lnunits | continuous | Log number of regions | Natural log of the count of territorial units. | log(n_reg); region counts from Lessmann (2014) appendix. | log count | Scaffold (Lessmann 2014) |
| lnarea | continuous | Log land area (km^2) | Natural log of country land area. | log(area); areas from Lessmann (2014) appendix. | log km^2 | Scaffold (Lessmann 2014) |
| area_units | continuous | Area per region (log scale) | Land area relative to the number of regions. | lnarea / lnunits. | log scale | Derived |
| period5 | identifier | 5-year period group | Categorical 5-year period (1=1980-84 ... 6=2005-09). | Year binned into six 5-year periods. | 1-6 | Derived |
| lnGDP2 | continuous | Log GDP per capita, squared | Quadratic income term for the Kuznets polynomial. | lnGDP^2. | log^2 | Derived |
| lnGDP3 | continuous | Log GDP per capita, cubed | Cubic income term (potential N-shape) for the Kuznets polynomial. | lnGDP^3. | log^3 | Derived |
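The variables marked Derived follow directly from other columns, so they can be recomputed as a consistency check. The sketch below is one reading of the Construction column: `period5` and `derived_terms` are hypothetical helper names, and the binning rule `(year - 1980) // 5 + 1` is an assumption inferred from the period labels (1=1980-84 ... 6=2005-09).

```python
def period5(year: int) -> int:
    """Map a calendar year in 1980-2009 to its 5-year period code (1-6)."""
    if not 1980 <= year <= 2009:
        raise ValueError(f"year {year} is outside 1980-2009")
    return (year - 1980) // 5 + 1

def derived_terms(ln_gdp: float, ln_area: float, ln_units: float) -> dict:
    """Recompute the polynomial income terms and the area-per-region ratio."""
    return {
        "lnGDP2": ln_gdp ** 2,             # quadratic Kuznets term
        "lnGDP3": ln_gdp ** 3,             # cubic (N-shape) term
        "area_units": ln_area / ln_units,  # lnarea / lnunits, as documented
    }

print(period5(1982), period5(2008))
print(derived_terms(2.0, 12.0, 3.0))
```

Comparing these recomputed values with the stored columns, up to rounding, is a quick way to confirm that a file loaded correctly.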
Summary statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| country | 100% | 890 | 56 | – | – | – | – | – |
| region_grp | 95% | 844 | 6 | – | – | – | – | – |
| year | 100% | 890 | 27 | 1982 | 1997.0 | 1998 | 2008 | 6.84 |
| lnGDP | 100% | 890 | 890 | 5.75 | 9.59 | 9.85 | 11.32 | 1.06 |
| wcv | 100% | 890 | 874 | 0.038 | 0.332 | 0.314 | 1.24 | 0.176 |
| cv | 100% | 890 | 875 | 0.041 | 0.416 | 0.389 | 1.65 | 0.231 |
| gini_reg | 100% | 890 | 878 | 0.014 | 0.165 | 0.163 | 0.379 | 0.083 |
| wcv_nocap | 100% | 890 | 875 | 0.030 | 0.307 | 0.254 | 1.31 | 0.210 |
| trade_gdp | 100% | 890 | 890 | 15.00 | 82.10 | 84.20 | 165.0 | 29.96 |
| urbanization | 100% | 890 | 890 | 40.90 | 71.22 | 71.18 | 96.83 | 10.52 |
| nonag | 100% | 890 | 890 | 16.49 | 82.68 | 88.77 | 96.74 | 14.61 |
| ethnic | 100% | 890 | 56 | 0.021 | 0.327 | 0.321 | 0.687 | 0.178 |
| federal | 100% | 890 | 2 | 0 | 0.258 | 0 | 1.00 | 0.438 |
| lnunits | 100% | 890 | 28 | 0.693 | 2.36 | 2.30 | 3.93 | 0.795 |
| lnarea | 100% | 890 | 54 | 5.77 | 12.57 | 12.62 | 16.61 | 1.95 |
| area_units | 100% | 890 | 56 | 3.43 | 6.03 | 5.34 | 18.03 | 2.61 |
| period5 | 100% | 890 | 6 | – | – | – | – | – |
| lnGDP2 | 100% | 890 | 890 | 33.10 | 93.02 | 97.07 | 128.2 | 19.37 |
| lnGDP3 | 100% | 890 | 890 | 190.4 | 912.1 | 956.3 | 1,451.5 | 270.5 |
sim_regional_gdp
Variable dictionary
| Variable | Type | Label | Definition | Construction | Units | Source |
|---|---|---|---|---|---|---|
| country | identifier | Country name | Synthetic country identifier (real country names used as labels). | From the hard-coded 56-country scaffold (region counts/areas from Lessmann 2014). | string | Scaffold (Lessmann 2014 appendix) |
| year | year | Calendar year | Annual time index. | Country-specific start/end years (unbalanced coverage). | year | Simulation |
| region | identifier | Region index within country | Sequential, persistent region identifier (region 1 is typically the capital). | 1..n_reg per country. | integer | Simulation |
| pop_share | continuous | Regional population share | Fraction of country population in the region (sums to 1 per country). | Gamma(0.85) draws sorted descending, adjusted for a capital weight. | 0-1 | Simulation |
| gdp_pc | continuous | Regional GDP per capita | Simulated regional GDP per capita. | exp(lnGDP)·exp(δ·z − 0.5·δ²): country mean scaled by a persistent regional position. | US$ | Simulation |
Summary statistics
| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| country | 100% | 820 | 56 | – | – | – | – | – |
| year | 100% | 820 | 4 | 1982 | 1991.6 | 1993 | 2008 | 7.72 |
| region | 100% | 820 | 51 | – | – | – | – | – |
| pop_share | 100% | 820 | 820 | 4.48e-05 | 0.068 | 0.028 | 0.942 | 0.120 |
| gdp_pc | 100% | 820 | 820 | 85.33 | 12,179 | 5,731.9 | 92,133 | 13,949 |
Known limitations & caveats
- Synthetic data. There is no real data behind this tutorial; results are internally consistent with the calibration but are not empirical evidence about real-world spatial inequality.
- Fragile N-shape. The high-income upturn (cubic term) appears with income in logs and weakens in levels; the robust finding is the inverted-U, not the N-shape.
- Within vs. between. The genuine N-shape is cross-sectional (between countries); the within-country cubic term is insignificant.
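- Significance ≠ shape. With β₁, β₂, and β₃ the coefficients on lnGDP, lnGDP2, and lnGDP3, all three income terms can be statistically significant even when the discriminant D = β₂² − 3·β₁·β₃ is negative (so the fitted cubic has no turning points) or when the turning points fall outside the observed income range. Always check that D > 0 and that the turning points lie within the range.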
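The discriminant check in the last caveat is easy to automate. For a fitted cubic β₁·x + β₂·x² + β₃·x³ in x = lnGDP, the slope β₁ + 2·β₂·x + 3·β₃·x² is zero at x = (−β₂ ± √D) / (3·β₃) with D = β₂² − 3·β₁·β₃. The sketch below applies that algebra. `cubic_shape_check` is a hypothetical helper and the coefficients are invented for illustration, not estimates from the post; the range 5.75 to 11.32 is the observed lnGDP range from the statistics table.

```python
import math

def cubic_shape_check(b1, b2, b3, x_min, x_max):
    """Return D, the turning points of the cubic, and whether each is in range."""
    if b3 == 0:
        raise ValueError("b3 must be non-zero for a cubic specification")
    d = b2 ** 2 - 3 * b1 * b3
    if d <= 0:
        # No two distinct real turning points: the curve is monotone.
        return {"D": d, "turning_points": [], "in_range": []}
    root = math.sqrt(d)
    points = sorted([(-b2 - root) / (3 * b3), (-b2 + root) / (3 * b3)])
    return {
        "D": d,
        "turning_points": points,
        "in_range": [x_min <= x <= x_max for x in points],
    }

# Invented coefficients on lnGDP, lnGDP2, lnGDP3 (illustration only).
n_shape = cubic_shape_check(2.52, -0.2775, 0.01, x_min=5.75, x_max=11.32)
monotone = cubic_shape_check(1.0, 0.1, 0.1, x_min=5.75, x_max=11.32)
print(n_shape)
print(monotone)
```

A specification supports an N-shape only if D is positive and both turning points are flagged as in range. If only the first turning point lies in range, the data support an inverted-U at most.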