Interactive data dictionary

Spatial Inequality and the Kuznets Curve

Parametric and semiparametric estimates in R, on a fully synthetic 56-country panel.

2 datasets · 22 variables · 56 countries · years 1980–2009

Downloads

Each dataset is available as a labeled Stata .dta file and as its source CSV file.

Download all data (ZIP) · stata_codebook.do

| Dataset | Grain | Rows × columns | Stata | Source |
|---|---|---|---|---|
| sim_country_panel | country-year | 890 × 19 | sim_country_panel.dta | sim_country_panel.csv |
| sim_regional_gdp | region-year (base year) | 820 × 5 | sim_regional_gdp.dta | sim_regional_gdp.csv |

Run stata_codebook.do in Stata once to attach long-form per-variable notes to the .dta files.

Load directly in code

Every file loads straight from GitHub (raw URLs). Swap the file name to load any dataset.

Stata

* Stata 14+ : `use` reads an https URL directly
global BASE "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_kuznets/data/"
use "${BASE}sim_country_panel.dta", clear
describe
notes

Python

!pip install -q pyreadstat
import pandas as pd
BASE = "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_kuznets/data/"
df = pd.read_stata(BASE + "sim_country_panel.dta")

# load every dataset at once
files = ["sim_country_panel", "sim_regional_gdp"]
data = {f: pd.read_stata(BASE + f + ".dta") for f in files}

# pyreadstat (richest metadata) reads LOCAL files -> download first
import pyreadstat, urllib.request
urllib.request.urlretrieve(BASE + "sim_country_panel.dta", "sim_country_panel.dta")
df, meta = pyreadstat.read_dta("sim_country_panel.dta")

Copy and paste this snippet into Google Colab: https://colab.research.google.com/notebooks/empty.ipynb

R

# R : haven::read_dta auto-downloads an https URL
library(haven)
BASE <- "https://raw.githubusercontent.com/cmg777/starter-academic-v501/master/content/post/r_kuznets/data/"
df <- read_dta(paste0(BASE, "sim_country_panel.dta"))

Overview & sources

Companion data for a hands-on R tutorial that replicates Lessmann (2014) on a fully synthetic dataset of regional GDP per capita for 56 countries over 1980–2009. The post measures spatial inequality with the population-weighted coefficient of variation (WCV) of regional income and estimates the spatial Kuznets curve — the hypothesis that regional inequality first rises then falls as development increases — using cross-section OLS, two-way fixed effects (fixest), and semiparametric methods (Robinson 1988, Baltagi–Li 2002). The inverted-U is robust across estimators; the high-income upturn (N-shape) appears only in cross-country comparisons, not within countries over time. The entire data-generating process is open and reproducible.

Two files. sim_country_panel is an annual country panel (one row per country × year, unbalanced over 1980–2009) carrying the inequality indices and covariates. sim_regional_gdp is the regional micro cross-section (one row per region × country, at each country's base year) from which the indices are built.

Data sources

| Source | Provides | Reference / URL |
|---|---|---|
| Lessmann (2014) | Replicated study; calibration targets and country scaffold (region counts, areas) | Lessmann, C. (2014). Spatial inequality and development — Is there an inverted-U relationship? Journal of Development Economics, 106, 35–51. |
| Synthetic (this study) | All values, simulated via a calibrated lognormal data-generating process (open & reproducible) | Mendez, C. (2026). See the post's R script analysis.R for the full DGP. |
| Method references | Estimators and concepts | Kuznets (1955); Williamson (1965); Robinson (1988); Baltagi & Li (2002). |

Cite this data

Please cite this dataset as follows.

APA

Mendez, C. (2026). Spatial Inequality and the Kuznets Curve: Parametric and Semiparametric Estimates in R [Data set]. https://carlos-mendez.org/post/r_kuznets/

Lessmann, C. (2014). Spatial inequality and development — Is there an inverted-U relationship? Journal of Development Economics, 106, 35–51.

BibTeX

@misc{mendez2026rkuznets,
  author       = {Mendez, Carlos},
  title        = {Spatial Inequality and the Kuznets Curve: Parametric and Semiparametric Estimates in R},
  year         = {2026},
  howpublished = {\url{https://carlos-mendez.org/post/r_kuznets/}},
  note         = {Data set}
}

@article{lessmann2014spatial,
  author  = {Lessmann, Christian},
  title   = {Spatial inequality and development---Is there an inverted-U relationship?},
  journal = {Journal of Development Economics},
  volume  = {106}, pages = {35--51}, year = {2014}
}

Variable explorer: all 22 variables


| Variable | Type | Distribution | Label | Definition | Units | In files | Source |
|---|---|---|---|---|---|---|---|
| area_units | continuous | min 3.43, median 5.34, max 18 | Area per region (log scale) | Land area relative to the number of regions. | log scale | sim_country_panel | Derived |
| country | identifier | | Country name | Synthetic country identifier (real country names used as labels). | string | sim_country_panel, sim_regional_gdp | Scaffold (Lessmann 2014 appendix) |
| cv | continuous | min 0.0411, median 0.389, max 1.65 | Coefficient of variation (unweighted) | Unweighted spread of regional GDP per capita. | >=0 | sim_country_panel | Computed (this study) |
| ethnic | continuous | min 0.0211, median 0.321, max 0.687 | Ethnic fractionalization index | Degree of ethnic/linguistic heterogeneity (time-invariant per country). | 0-1 | sim_country_panel | Simulation |
| federal | dummy | share coded 1 = 0.258 | Federal-state dummy (1=federal) | 1 if the country has a federal constitution, else 0. | 0/1 | sim_country_panel | Assigned |
| gdp_pc | continuous | min 85.3, median 5.73e+03, max 9.21e+04 | Regional GDP per capita | Simulated regional GDP per capita. | US$ | sim_regional_gdp | Simulation |
| gini_reg | continuous | min 0.0139, median 0.163, max 0.379 | Regional Gini index | Population-weighted Gini of the regional income distribution. | 0-1 | sim_country_panel | Computed (this study) |
| lnGDP | continuous | min 5.75, median 9.85, max 11.3 | Log GDP per capita | Natural log of country GDP per capita. | log US$ | sim_country_panel | Simulation |
| lnGDP2 | continuous | min 33.1, median 97.1, max 128 | Log GDP per capita, squared | Quadratic income term for the Kuznets polynomial. | log^2 | sim_country_panel | Derived |
| lnGDP3 | continuous | min 190, median 956, max 1.45e+03 | Log GDP per capita, cubed | Cubic income term (potential N-shape) for the Kuznets polynomial. | log^3 | sim_country_panel | Derived |
| lnarea | continuous | min 5.77, median 12.6, max 16.6 | Log land area (km^2) | Natural log of country land area. | log km^2 | sim_country_panel | Scaffold (Lessmann 2014) |
| lnunits | continuous | min 0.693, median 2.3, max 3.93 | Log number of regions | Natural log of the count of territorial units. | log count | sim_country_panel | Scaffold (Lessmann 2014) |
| nonag | continuous | min 16.5, median 88.8, max 96.7 | Non-agricultural share of GVA (%) | Share of gross value added outside agriculture (structural change). | % (0-100) | sim_country_panel | Simulation |
| period5 | identifier | | 5-year period group | Categorical 5-year period (1=1980-84 ... 6=2005-09). | 1-6 | sim_country_panel | Derived |
| pop_share | continuous | min 4.48e-05, median 0.0276, max 0.942 | Regional population share | Fraction of country population in the region (sums to 1 per country). | 0-1 | sim_regional_gdp | Simulation |
| region | identifier | | Region index within country | Sequential, persistent region identifier (region 1 is typically the capital). | integer | sim_regional_gdp | Simulation |
| region_grp | identifier | | World Bank region group | Geographic grouping of the country. | code | sim_country_panel | Assigned |
| trade_gdp | continuous | min 15, median 84.2, max 165 | Trade openness (% of GDP) | Exports + imports as a share of GDP. | % GDP | sim_country_panel | Simulation |
| urbanization | continuous | min 40.9, median 71.2, max 96.8 | Urbanization rate (%) | Share of population in urban areas. | % | sim_country_panel | Simulation |
| wcv | continuous | min 0.0384, median 0.314, max 1.24 | Pop-weighted coefficient of variation | Population-weighted spread of regional GDP per capita / country mean (headline index). | >=0 | sim_country_panel | Computed (this study) |
| wcv_nocap | continuous | min 0.0303, median 0.254, max 1.31 | WCV excluding the capital region | WCV recomputed after dropping the largest (capital) region. | >=0 | sim_country_panel | Computed (this study) |
| year | year | | Calendar year | Annual time index. | year | sim_country_panel, sim_regional_gdp | Simulation |

Cross-file variable index

Which file each variable appears in (● = present).

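| Variable | sim_country_panel | sim_regional_gdp |
|---|---|---|
| area_units | ● | |
| country | ● | ● |
| cv | ● | |
| ethnic | ● | |
| federal | ● | |
| gdp_pc | | ● |
| gini_reg | ● | |
| lnGDP | ● | |
| lnGDP2 | ● | |
| lnGDP3 | ● | |
| lnarea | ● | |
| lnunits | ● | |
| nonag | ● | |
| period5 | ● | |
| pop_share | | ● |
| region | | ● |
| region_grp | ● | |
| trade_gdp | ● | |
| urbanization | ● | |
| wcv | ● | |
| wcv_nocap | ● | |
| year | ● | ● |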

Construction & formulas

Inequality is measured within each country-year, across that country's regions, on regional GDP per capita y_j with population shares p_j and country mean ȳ.
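
As a minimal sketch of these indices for a single country-year, the functions below follow the constructions listed in the variable dictionary (WCV = (1/ȳ)·[Σ p_j (ȳ−y_j)²]^(1/2), SD / mean for cv, and dropping region 1 for wcv_nocap). The function names and the toy numbers are hypothetical, and this is not the code in analysis.R. Two choices are assumptions: cv uses the population standard deviation, and the Gini uses the weighted mean-absolute-difference form, which is one common definition of a population-weighted Gini.

```python
import math


def wcv(y, p):
    """Population-weighted coefficient of variation of regional incomes y with shares p."""
    total = sum(p)
    p = [s / total for s in p]                      # re-normalize shares to sum to 1
    ybar = sum(s * v for s, v in zip(p, y))         # population-weighted country mean
    var = sum(s * (ybar - v) ** 2 for s, v in zip(p, y))
    return math.sqrt(var) / ybar


def cv(y):
    """Unweighted coefficient of variation (assumption: population SD, divisor n)."""
    n = len(y)
    mean = sum(y) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in y) / n)
    return sd / mean


def gini_weighted(y, p):
    """Population-weighted Gini via the weighted mean absolute difference (assumed form)."""
    total = sum(p)
    p = [s / total for s in p]
    ybar = sum(s * v for s, v in zip(p, y))
    mad = sum(pi * pj * abs(yi - yj)
              for pi, yi in zip(p, y)
              for pj, yj in zip(p, y))
    return mad / (2 * ybar)


def wcv_nocap(y, p):
    """WCV after dropping region 1 (the first element); wcv() re-normalizes the shares."""
    return wcv(y[1:], p[1:])


# Hypothetical three-region country: a rich capital plus two identical regions.
y = [30000.0, 10000.0, 10000.0]
p = [0.5, 0.25, 0.25]

print(round(wcv(y, p), 4))            # 0.5
print(round(cv(y), 4))
print(round(gini_weighted(y, p), 4))  # 0.25
print(round(wcv_nocap(y, p), 4))      # 0.0
```

In this toy case all of the inequality comes from the capital, so wcv_nocap falls to zero once region 1 is excluded.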

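Synthetic data-generating process: for region j in country i and year t, y_ijt = exp(lnGDP_it) · exp(δ_it · z_ij − 0.5 · δ_it²). Here z_ij is a persistent regional position (a rich region stays rich) and δ_it is the log-dispersion, which follows a structural inverted-U in development. The −0.5 · δ_it² term matches the gdp_pc construction in the variable dictionary; it is the usual lognormal mean correction, so the regional multiplier averages to 1 if z_ij is standard normal. A time-invariant between-country cubic term yields the cross-section N-shape, while the within-country panel keeps a clean inverted-U. The polynomial terms lnGDP2 and lnGDP3 are the square and cube of lnGDP.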
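
The sketch below illustrates the mechanics of this process using the gdp_pc construction from the variable dictionary, exp(lnGDP)·exp(δ·z − 0.5·δ²). Everything numeric here is a hypothetical stand-in: the function delta_inverted_u, its peak, height, and curvature, the standard-normal draw for z, and the seed are assumptions for illustration, not the calibrated values in analysis.R.

```python
import math
import random
import statistics


def delta_inverted_u(ln_gdp, peak=8.5, height=0.6, curvature=0.08):
    """Hypothetical log-dispersion path: concave in lnGDP, floored at 0.05."""
    return max(0.05, height - curvature * (ln_gdp - peak) ** 2)


def simulate_regions(ln_gdp, delta, z):
    """Regional GDP per capita: country level scaled by exp(delta*z - 0.5*delta^2)."""
    return [math.exp(ln_gdp + delta * zj - 0.5 * delta ** 2) for zj in z]


rng = random.Random(42)
# Persistent regional positions, drawn once and reused at every income level (assumed N(0, 1)).
z = [rng.gauss(0.0, 1.0) for _ in range(12)]

results = {}
for ln_gdp in (6.5, 8.5, 10.5):
    delta = delta_inverted_u(ln_gdp)
    y = simulate_regions(ln_gdp, delta, z)
    ranking = sorted(range(len(z)), key=lambda j: y[j])   # poorest to richest region
    log_sd = statistics.pstdev([math.log(v) for v in y])  # dispersion of log income
    results[ln_gdp] = (delta, ranking, log_sd)
    print(f"lnGDP={ln_gdp:4.1f}  delta={delta:.2f}  sd(log y)={log_sd:.3f}")
```

Because z is held fixed, the regional ranking is identical at every income level, while the dispersion of log income is largest at the assumed peak and smaller on either side of it.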

The datasets

Each dataset below has its full variable dictionary plus a statistics table with data coverage.


sim_country_panel · country-year · 890 × 19 · 1980-2009 · 56 countries (unbalanced)

Panel key: country × year. Use: estimate the spatial Kuznets curve (OLS / two-way FE / semiparametric).
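
To make the quadratic specification concrete, here is a minimal sketch of a pooled OLS fit of an inequality index on lnGDP and lnGDP2, followed by the implied turning point −b1 / (2·b2). It covers only the simplest parametric case, not the two-way fixed effects or semiparametric estimators used in the post. The data are a noise-free toy series with hypothetical coefficients, and solve and fit_quadratic are illustrative helper names.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_quadratic(x, y):
    """OLS of y on [1, x, x^2] via the normal equations X'X b = X'y."""
    X = [[1.0, xi, xi * xi] for xi in x]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
    return solve(XtX, Xty)


# Toy inverted-U: hypothetical coefficients, no noise, lnGDP from 6.0 to 11.0.
ln_gdp = [6.0 + 0.25 * k for k in range(21)]
wcv_toy = [-1.70 + 0.50 * x - 0.03 * x * x for x in ln_gdp]

b0, b1, b2 = fit_quadratic(ln_gdp, wcv_toy)
turning_point = -b1 / (2 * b2)   # lnGDP at which fitted inequality peaks
print(f"b1={b1:.4f}  b2={b2:.4f}  turning point (lnGDP)={turning_point:.3f}")
```

An inverted-U requires b1 > 0 and b2 < 0. With the real panel, the same regression would use the wcv, lnGDP, and lnGDP2 columns, and adding lnGDP3 would allow for the N-shape.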

Variable dictionary

| Variable | Type | Label | Definition | Construction | Units | Source |
|---|---|---|---|---|---|---|
| country | identifier | Country name | Synthetic country identifier (real country names used as labels). | From the hard-coded 56-country scaffold (region counts/areas from Lessmann 2014). | string | Scaffold (Lessmann 2014 appendix) |
| region_grp | identifier | World Bank region group | Geographic grouping of the country. | Assigned per country: EAP, ECA, LAC, MENA, NA, SA, SSA. | code | Assigned |
| year | year | Calendar year | Annual time index. | Country-specific start/end years (unbalanced coverage). | year | Simulation |
| lnGDP | continuous | Log GDP per capita | Natural log of country GDP per capita. | Simulated: 2009 base value minus growth path plus AR-type residual. | log US$ | Simulation |
| wcv | continuous | Pop-weighted coefficient of variation | Population-weighted spread of regional GDP per capita / country mean (headline index). | WCV = (1/ȳ)·[Σ p_j (ȳ−y_j)²]^(1/2) over regions, per country-year. | >=0 | Computed (this study) |
| cv | continuous | Coefficient of variation (unweighted) | Unweighted spread of regional GDP per capita. | SD / mean of regional gdp_pc. | >=0 | Computed (this study) |
| gini_reg | continuous | Regional Gini index | Population-weighted Gini of the regional income distribution. | Gini over sorted regional income with population shares. | 0-1 | Computed (this study) |
| wcv_nocap | continuous | WCV excluding the capital region | WCV recomputed after dropping the largest (capital) region. | Exclude region 1, re-normalize population shares, recompute WCV. | >=0 | Computed (this study) |
| trade_gdp | continuous | Trade openness (% of GDP) | Exports + imports as a share of GDP. | Country base mean + AR(1) noise, clamped to [15, 171]. | % GDP | Simulation |
| urbanization | continuous | Urbanization rate (%) | Share of population in urban areas. | Country base mean + random-walk drift, clamped to [20, 99]. | % | Simulation |
| nonag | continuous | Non-agricultural share of GVA (%) | Share of gross value added outside agriculture (structural change). | Logistic transform of lnGDP. | % (0-100) | Simulation |
| ethnic | continuous | Ethnic fractionalization index | Degree of ethnic/linguistic heterogeneity (time-invariant per country). | Beta(1.0, 1.7) draw rescaled to [0, 0.75]. | 0-1 | Simulation |
| federal | dummy | Federal-state dummy (1=federal) | 1 if the country has a federal constitution, else 0. | 1 for a fixed set of federal states (US, Canada, Brazil, India, Germany, ...). | 0/1 | Assigned |
| lnunits | continuous | Log number of regions | Natural log of the count of territorial units. | log(n_reg); region counts from Lessmann (2014) appendix. | log count | Scaffold (Lessmann 2014) |
| lnarea | continuous | Log land area (km^2) | Natural log of country land area. | log(area); areas from Lessmann (2014) appendix. | log km^2 | Scaffold (Lessmann 2014) |
| area_units | continuous | Area per region (log scale) | Land area relative to the number of regions. | lnarea / lnunits. | log scale | Derived |
| period5 | identifier | 5-year period group | Categorical 5-year period (1=1980-84 ... 6=2005-09). | Year binned into six 5-year periods. | 1-6 | Derived |
| lnGDP2 | continuous | Log GDP per capita, squared | Quadratic income term for the Kuznets polynomial. | lnGDP^2. | log^2 | Derived |
| lnGDP3 | continuous | Log GDP per capita, cubed | Cubic income term (potential N-shape) for the Kuznets polynomial. | lnGDP^3. | log^3 | Derived |
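
The derived columns in this dictionary are simple transformations, which the following sketch reproduces from their stated constructions (year binned into six 5-year periods, lnGDP^2, lnGDP^3, and lnarea / lnunits). The function names are hypothetical, the explicit binning rule (year − 1980) // 5 + 1 is an assumption consistent with the labels 1=1980-84 ... 6=2005-09, and the lnarea and lnunits inputs are made-up example values.

```python
def period5(year):
    """Map a calendar year in 1980-2009 to its 5-year period index (1=1980-84 ... 6=2005-09)."""
    if not 1980 <= year <= 2009:
        raise ValueError(f"year {year} is outside 1980-2009")
    return (year - 1980) // 5 + 1


def derived_terms(ln_gdp, ln_area, ln_units):
    """Polynomial income terms and the area_units ratio, as listed in the dictionary."""
    return {
        "lnGDP2": ln_gdp ** 2,
        "lnGDP3": ln_gdp ** 3,
        "area_units": ln_area / ln_units,   # a ratio of logs, per the stated construction
    }


# lnGDP = 5.75 is the reported panel minimum; the other two inputs are hypothetical.
row = derived_terms(ln_gdp=5.75, ln_area=12.0, ln_units=2.0)
print(period5(1984), period5(1985), period5(2009))   # 1 2 6
print(round(row["lnGDP2"], 1), round(row["lnGDP3"], 1), row["area_units"])
```

Squaring and cubing the reported minimum lnGDP of 5.75 gives roughly 33.1 and 190, in line with the reported minima of lnGDP2 and lnGDP3.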

Distribution & statistics

| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| country | 100% | 890 | 56 | | | | | |
| region_grp | 95% | 844 | 6 | | | | | |
| year | 100% | 890 | 27 | 1982 | 1997.0 | 1998 | 2008 | 6.84 |
| lnGDP | 100% | 890 | 890 | 5.75 | 9.59 | 9.85 | 11.32 | 1.06 |
| wcv | 100% | 890 | 874 | 0.038 | 0.332 | 0.314 | 1.24 | 0.176 |
| cv | 100% | 890 | 875 | 0.041 | 0.416 | 0.389 | 1.65 | 0.231 |
| gini_reg | 100% | 890 | 878 | 0.014 | 0.165 | 0.163 | 0.379 | 0.083 |
| wcv_nocap | 100% | 890 | 875 | 0.030 | 0.307 | 0.254 | 1.31 | 0.210 |
| trade_gdp | 100% | 890 | 890 | 15.00 | 82.10 | 84.20 | 165.0 | 29.96 |
| urbanization | 100% | 890 | 890 | 40.90 | 71.22 | 71.18 | 96.83 | 10.52 |
| nonag | 100% | 890 | 890 | 16.49 | 82.68 | 88.77 | 96.74 | 14.61 |
| ethnic | 100% | 890 | 56 | 0.021 | 0.327 | 0.321 | 0.687 | 0.178 |
| federal | 100% | 890 | 2 | 0 | 0.258 | 0 | 1.00 | 0.438 |
| lnunits | 100% | 890 | 28 | 0.693 | 2.36 | 2.30 | 3.93 | 0.795 |
| lnarea | 100% | 890 | 54 | 5.77 | 12.57 | 12.62 | 16.61 | 1.95 |
| area_units | 100% | 890 | 56 | 3.43 | 6.03 | 5.34 | 18.03 | 2.61 |
| period5 | 100% | 890 | 6 | | | | | |
| lnGDP2 | 100% | 890 | 890 | 33.10 | 93.02 | 97.07 | 128.2 | 19.37 |
| lnGDP3 | 100% | 890 | 890 | 190.4 | 912.1 | 956.3 | 1,451.5 | 270.5 |

sim_regional_gdp · region-year (base year) · 820 × 5 · base year per country · 56 countries; 2-51 regions each

Panel key: country × region. Contents: regional incomes and population shares from which the inequality indices are computed.

Variable dictionary

| Variable | Type | Label | Definition | Construction | Units | Source |
|---|---|---|---|---|---|---|
| country | identifier | Country name | Synthetic country identifier (real country names used as labels). | From the hard-coded 56-country scaffold (region counts/areas from Lessmann 2014). | string | Scaffold (Lessmann 2014 appendix) |
| year | year | Calendar year | Annual time index. | Country-specific start/end years (unbalanced coverage). | year | Simulation |
| region | identifier | Region index within country | Sequential, persistent region identifier (region 1 is typically the capital). | 1..n_reg per country. | integer | Simulation |
| pop_share | continuous | Regional population share | Fraction of country population in the region (sums to 1 per country). | Gamma(0.85) draws sorted descending, adjusted for a capital weight. | 0-1 | Simulation |
| gdp_pc | continuous | Regional GDP per capita | Simulated regional GDP per capita. | exp(lnGDP)·exp(δ·z − 0.5·δ²): country mean scaled by a persistent regional position. | US$ | Simulation |

Distribution & statistics

| Variable | Coverage | N | Distinct | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|---|---|---|
| country | 100% | 820 | 56 | | | | | |
| year | 100% | 820 | 4 | 1982 | 1991.6 | 1993 | 2008 | 7.72 |
| region | 100% | 820 | 51 | | | | | |
| pop_share | 100% | 820 | 820 | 4.48e-05 | 0.068 | 0.028 | 0.942 | 0.120 |
| gdp_pc | 100% | 820 | 820 | 85.33 | 12,179 | 5,731.9 | 92,133 | 13,949 |

Known limitations & caveats