On the Validity of Rural-Urban Polarization by Unsupervised and Network Clustering of Counties in the United States

Andrew Smith, May 22nd, 2023

Introduction¶

Abstract¶

It is often stated that urban and rural regions of a country become polarized over time, with increasing disparities not only in political affiliations [1] but also in economic and demographic statistics. For the United States, a very populous nation with the fourth-largest land area of any country [2], this effect may be exaggerated given the large distances between densely populated regions separated by vast suburban and rural areas. However, while the political divide between rural and urban regions of the US has been extensively studied, notably in presidential and gubernatorial election results, there is less research on the validity of dividing the US into two distinct groups [1]. Evidence is therefore needed to determine the effectiveness of various county clusterings of the US based on these factors.

This study performs a clustering analysis of counties in the United States to determine whether the binary view of the US is valid and whether it conforms to the expected rural-urban polarization. Data is collected from the US Census Bureau from 2010 to 2019 and includes only economic and demographic statistics for each county; political variables are excluded from the analysis. Clustering is performed in two batches: unsupervised clustering, using k-means, agglomerative, and expectation-maximization clustering, which are blind to inter-county adjacencies; and network clustering, using spectral clustering to give preference to nearby counties. Performance in each computation is measured using an average silhouette score across all clusters. Results indicate that the best clustering performance comes from considering the continental United States as a single cluster, providing evidence that a non-binary view of the United States is valid. Apart from political views, there is little evidence that the United States differs significantly when comparing rural and urban regions, or any subgroup dividing the country for that matter.

Pre-Processing¶

Data on county statistics is collected from the United States Census Bureau for 2010 to 2019 [3]. A list of each US county and all counties sharing a land border with it is given by the Census Bureau's county adjacency data [4]. The raw data files used for analysis are shown below.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import networkx as nx

from matplotlib import pyplot as plt
from matplotlib import image as mpimg

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples
from sklearn.mixture import GaussianMixture
from sklearn.cluster import SpectralClustering

from warnings import simplefilter

County Statistics¶

The economic and demographic data for each US county is shown below.

In [2]:
counties = pd.read_csv('https://corgis-edu.github.io/corgis/datasets/csv/county_demographics/county_demographics.csv', index_col=0)
counties.head(5)
Out[2]:
State Age.Percent 65 and Older Age.Percent Under 18 Years Age.Percent Under 5 Years Education.Bachelor's Degree or Higher Education.High School or Higher Employment.Nonemployer Establishments Ethnicities.American Indian and Alaska Native Alone Ethnicities.Asian Alone Ethnicities.Black Alone ... Population.Population per Square Mile Sales.Accommodation and Food Services Sales Sales.Retail Sales Employment.Firms.Total Employment.Firms.Women-Owned Employment.Firms.Men-Owned Employment.Firms.Minority-Owned Employment.Firms.Nonminority-Owned Employment.Firms.Veteran-Owned Employment.Firms.Nonveteran-Owned
County
Abbeville County SC 22.4 19.8 4.7 15.6 81.7 1416 0.3 0.4 27.6 ... 51.8 12507 91371 1450 543 689 317 1080 187 1211
Acadia Parish LA 15.8 25.8 6.9 13.3 79.0 4533 0.4 0.3 18.0 ... 94.3 52706 602739 4664 1516 2629 705 3734 388 4007
Accomack County VA 24.6 20.7 5.6 19.5 81.5 2387 0.7 0.8 28.8 ... 73.8 53568 348195 2997 802 1716 335 2560 212 2536
Ada County ID 14.9 23.2 5.6 38.5 95.2 41464 0.8 2.7 1.4 ... 372.8 763099 5766679 41789 14661 19409 3099 36701 3803 35132
Adair County IA 23.0 21.8 5.6 18.5 94.2 609 0.3 0.5 0.6 ... 13.5 -1 63002 914 304 499 0 861 185 679

5 rows × 42 columns

The data is indexed by county name, with a myriad of statistical information collected. Refer to the data source for an explanation of each column header [3]. A complete list of the features is shown below.

In [3]:
cols = pd.Series(counties.columns)
cols
Out[3]:
0                                                 State
1                              Age.Percent 65 and Older
2                            Age.Percent Under 18 Years
3                             Age.Percent Under 5 Years
4                 Education.Bachelor's Degree or Higher
5                       Education.High School or Higher
6                 Employment.Nonemployer Establishments
7     Ethnicities.American Indian and Alaska Native ...
8                               Ethnicities.Asian Alone
9                               Ethnicities.Black Alone
10                       Ethnicities.Hispanic or Latino
11    Ethnicities.Native Hawaiian and Other Pacific ...
12                        Ethnicities.Two or More Races
13                              Ethnicities.White Alone
14     Ethnicities.White Alone\t not Hispanic or Latino
15                           Housing.Homeownership Rate
16                                   Housing.Households
17                                Housing.Housing Units
18         Housing.Median Value of Owner-Occupied Units
19                        Housing.Persons per Household
20                        Income.Median Houseold Income
21                             Income.Per Capita Income
22                           Miscellaneous.Foreign Born
23                              Miscellaneous.Land Area
24    Miscellaneous.Language Other than English at Home
25          Miscellaneous.Living in Same House +1 Years
26                Miscellaneous.Manufacturers Shipments
27               Miscellaneous.Mean Travel Time to Work
28                         Miscellaneous.Percent Female
29                               Miscellaneous.Veterans
30                           Population.2020 Population
31                           Population.2010 Population
32                Population.Population per Square Mile
33          Sales.Accommodation and Food Services Sales
34                                   Sales.Retail Sales
35                               Employment.Firms.Total
36                         Employment.Firms.Women-Owned
37                           Employment.Firms.Men-Owned
38                      Employment.Firms.Minority-Owned
39                   Employment.Firms.Nonminority-Owned
40                       Employment.Firms.Veteran-Owned
41                    Employment.Firms.Nonveteran-Owned
dtype: object

As there may be multiple counties with similar names, e.g. "Adams County" having 12 such instances, the data is instead indexed by both county name and state designation, such as "Adams County, ND." An example is shown below.

In [4]:
counties.reset_index(inplace=True)
counties['County'] = counties['County'].astype(str) +", "+ counties["State"].astype(str) # concatenate columns
counties = counties.set_index("County") # set as index again

counties.drop(columns=["State"], inplace=True)
In [5]:
harco = counties.loc[["Harford County, MD"]]
harco
Out[5]:
Age.Percent 65 and Older Age.Percent Under 18 Years Age.Percent Under 5 Years Education.Bachelor's Degree or Higher Education.High School or Higher Employment.Nonemployer Establishments Ethnicities.American Indian and Alaska Native Alone Ethnicities.Asian Alone Ethnicities.Black Alone Ethnicities.Hispanic or Latino ... Population.Population per Square Mile Sales.Accommodation and Food Services Sales Sales.Retail Sales Employment.Firms.Total Employment.Firms.Women-Owned Employment.Firms.Men-Owned Employment.Firms.Minority-Owned Employment.Firms.Nonminority-Owned Employment.Firms.Veteran-Owned Employment.Firms.Nonveteran-Owned
County
Harford County, MD 16.6 22.2 5.6 36.7 92.7 17530 0.3 3.1 14.8 4.8 ... 560.1 390891 3792912 20330 7328 10727 3134 16495 2473 16889

1 rows × 41 columns

We can now standardize each column of the data. To do this, the z-score is computed for each column, which is equivalent to applying a pipeline with a standard scaler operator.

In [6]:
X = (counties - counties.mean())/counties.std() # convert to z-scores
X.head(5)
Out[6]:
Age.Percent 65 and Older Age.Percent Under 18 Years Age.Percent Under 5 Years Education.Bachelor's Degree or Higher Education.High School or Higher Employment.Nonemployer Establishments Ethnicities.American Indian and Alaska Native Alone Ethnicities.Asian Alone Ethnicities.Black Alone Ethnicities.Hispanic or Latino ... Population.Population per Square Mile Sales.Accommodation and Food Services Sales Sales.Retail Sales Employment.Firms.Total Employment.Firms.Women-Owned Employment.Firms.Men-Owned Employment.Firms.Minority-Owned Employment.Firms.Nonminority-Owned Employment.Firms.Veteran-Owned Employment.Firms.Nonveteran-Owned
County
Abbeville County, SC 0.549196 -0.609584 -0.949522 -0.666870 -0.839228 -0.206578 -0.271796 -0.397138 1.262073 -0.589303 ... -0.120312 -0.210169 -0.279610 -0.221310 -0.201403 -0.227746 -0.129359 -0.284525 -0.252945 -0.213273
Acadia Parish, LA -0.823704 1.120198 1.039138 -0.907208 -1.270673 -0.113855 -0.258327 -0.430981 0.597802 -0.502787 ... -0.095677 -0.169261 -0.165286 -0.127290 -0.126369 -0.120256 -0.106743 -0.134957 -0.171553 -0.121967
Accomack County, VA 1.006829 -0.350117 -0.135979 -0.259340 -0.871187 -0.177693 -0.217920 -0.261765 1.345107 -0.048580 ... -0.107560 -0.168384 -0.222193 -0.176055 -0.181430 -0.170843 -0.128309 -0.201119 -0.242822 -0.170004
Ada County, ID -1.010918 0.370626 -0.135979 1.726063 1.317995 0.984758 -0.204451 0.381257 -0.550832 -0.091837 ... 0.065753 0.553653 0.989189 0.958739 0.887328 0.809485 0.032794 1.722918 1.211310 0.894455
Adair County, IA 0.674005 -0.032990 -0.135979 -0.363835 1.158201 -0.230585 -0.271796 -0.363295 -0.606188 -0.524416 ... -0.142513 -0.222897 -0.285953 -0.236990 -0.219834 -0.238274 -0.147835 -0.296867 -0.253755 -0.230646

5 rows × 41 columns
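As a sanity check (a sketch on a hypothetical toy frame, not part of the original pipeline): pandas' `std` uses the sample standard deviation (ddof=1), while scikit-learn's `StandardScaler` uses the population standard deviation (ddof=0), so the two standardizations agree up to a constant factor of $\sqrt{(n-1)/n}$.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# toy frame standing in for `counties` (hypothetical values)
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

manual = (df - df.mean()) / df.std()                       # pandas: sample std (ddof=1)
piped = make_pipeline(StandardScaler()).fit_transform(df)  # population std (ddof=0)

# the two results differ only by the factor sqrt((n-1)/n)
n = len(df)
print(np.allclose(manual.to_numpy(), piped * np.sqrt((n - 1) / n)))  # True
```

For a dataset of 3139 counties the factor is negligible, so the manual z-score and the pipeline are interchangeable here.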

In [7]:
simplefilter(action='ignore', category=UserWarning) # suppress warnings
X.plot(kind='box', figsize=(20,5))
Out[7]:
<AxesSubplot:>

Note that in this analysis, we do not remove any outliers from the data. This is done for several reasons. First, we are examining the polarization of counties in the United States, which by definition implies the data will have many values that deviate greatly from the median. Second, and most importantly, the plot above shows a large number of outliers in the data; removing them would prevent us from classifying the entirety of the United States as intended. Including the outliers should only provide stronger evidence if a separated view of the US is valid, and if the analysis shows such a result, this assumption can be reconsidered and the computations performed again. The number of counties unaffected by outlying values is computed below to corroborate this statement.

In [8]:
def is_outlier(x):
    Q25, Q75 = x.quantile([.25,.75])
    I = Q75 - Q25
    return (x < Q25 - 1.5*I) |  (x > Q75 + 1.5*I)

outliers = X.transform(is_outlier)

Xreg = X[~outliers] # non-outlier data
Xreg.dropna(inplace = True) # outliers flagged with NaN

print("Number of US counties: " + str(len(X)))
print("Number of counties with no outlying features: " + str(len(Xreg)))
print("Percentage of outlier-free counties: " + f"{100*len(Xreg)/len(X):0.2f}" + "%")
Number of US counties: 3139
Number of counties with no outlying features: 1084
Percentage of outlier-free counties: 34.53%

Therefore, the present analysis retains outliers in the data in an attempt to show that the binary view of the US is invalid even with such extrema present.
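For reference, the 1.5·IQR rule used above corresponds to Tukey's fences; a minimal sketch on toy data (not the county set):

```python
import pandas as pd

def is_outlier(x: pd.Series) -> pd.Series:
    """Flag values beyond the Tukey fences (1.5 * IQR past the quartiles)."""
    q25, q75 = x.quantile([0.25, 0.75])
    iqr = q75 - q25
    return (x < q25 - 1.5 * iqr) | (x > q75 + 1.5 * iqr)

s = pd.Series([1, 2, 3, 4, 100])  # 100 lies far above the upper fence (7.0)
print(is_outlier(s).tolist())  # [False, False, False, False, True]
```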

County Network¶

The adjacency data for each US county, from the US Census Bureau, is presented below.

In [9]:
adr = pd.read_csv('https://data.nber.org/census/geo/county-adjacency/2010/county_adjacency2010.csv')
adr.head(5) # ADjacency data, Raw (before processing)
Out[9]:
countyname fipscounty neighborname fipsneighbor
0 Autauga County, AL 1001 Autauga County, AL 1001
1 Autauga County, AL 1001 Chilton County, AL 1021
2 Autauga County, AL 1001 Dallas County, AL 1047
3 Autauga County, AL 1001 Elmore County, AL 1051
4 Autauga County, AL 1001 Lowndes County, AL 1085

This data presents each inter-county connection as a separate row, which we can convert into a node-node relation. Hence, using the county name and state designation from before becomes significant. The data also includes self-connections, which we exclude from the analysis.

For this analysis, the weight of each county connection is the Euclidean distance between the counties' feature vectors. This uses the standardized columns computed before, with the z-scores preventing any single feature from dominating the metric. A list of each county connection and its Euclidean weight is computed below.
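For two counties with z-scored feature vectors $a$ and $b$, the edge weight is $\lVert a - b \rVert_2$. A sketch with hypothetical feature vectors:

```python
import numpy as np

# hypothetical z-scored feature vectors for two adjacent counties
a = np.array([0.5, -1.0, 2.0])
b = np.array([0.0, -1.0, 0.0])

w = np.linalg.norm(a - b)  # Euclidean distance = sqrt(0.25 + 0 + 4)
print(round(w, 4))  # 2.0616
```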

In [10]:
county = []    # "from"
neighbor = []  # "to"
weight = []    # Euclidean weight from "county" to "neighbor"

for i in range(len(adr)):  # loop over each adjacency row
    cty = adr.iloc[i, 0]   # county to check
    nbr = adr.iloc[i, 2]   # neighbor to check
    if cty != nbr:         # ignore self-connections
        if (cty in X.index) and (nbr in X.index):  # both counties are in the data set
            county.append(cty)
            neighbor.append(nbr)
            A = X.loc[cty].to_numpy()  # feature vector of county
            B = X.loc[nbr].to_numpy()  # feature vector of neighbor
            weight.append(np.linalg.norm(A - B))  # Euclidean distance between features
In [11]:
adjacency = pd.DataFrame({"from" : county, "to" : neighbor, "weight": weight})
adjacency.head(5)
Out[11]:
from to weight
0 Autauga County, AL Chilton County, AL 2.906492
1 Autauga County, AL Dallas County, AL 6.358240
2 Autauga County, AL Elmore County, AL 0.964755
3 Autauga County, AL Lowndes County, AL 6.637880
4 Autauga County, AL Montgomery County, AL 5.984465

This creates the county network shown below. The distribution of edge weights for this network, along with several characteristic statistics, is summarized as well.

In [12]:
adj = nx.from_pandas_edgelist(adjacency, "from", "to", edge_attr="weight") # create weighted network

sns.displot(data=adjacency, x="weight", log_scale=True);

print("Nodes (n): ", adj.number_of_nodes())
print("Edges (m): ", adj.number_of_edges())

degrees = pd.Series( dict(adj.degree) ) # compute degree of each county
print("Average degree (k): ", degrees.mean())

# connected subgraphs of the US (continental US, AK, HI, and island territories)
subs = [adj.subgraph(c).copy() for c in nx.connected_components(adj)]
sub_diam = {i: nx.diameter(g) for i, g in enumerate(subs)}
print("Subgraph diameters (d): ")
print(sub_diam)
Nodes (n):  3134
Edges (m):  9274
Average degree (k):  5.918315252074027
Subgraph diameters (d): 
{0: 68, 1: 5, 2: 3, 3: 1, 4: 1}
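The component/diameter pattern used above can be sketched on a toy graph (hypothetical nodes, not the county network):

```python
import networkx as nx

# toy graph with two connected components
g = nx.Graph()
g.add_edges_from([("A", "B"), ("B", "C"), ("D", "E")])

# split into connected components, then measure each one's diameter
subs = [g.subgraph(c).copy() for c in nx.connected_components(g)]
diams = sorted(nx.diameter(s) for s in subs)
print(diams)  # [1, 2]  (path A-B-C has diameter 2, edge D-E has diameter 1)
```

`nx.diameter` raises an error on a disconnected graph, which is why the county network is split into components first.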

Results¶

Unsupervised Clustering¶

A clustering can be computed based solely on the statistical features, with no reference to the actual connectedness of the counties. This determines which counties are similar across the US regardless of geographic location. Three different clustering methods are employed: k-means, agglomerative, and expectation-maximization clustering.

K-Means¶

A clustering can be created with $k$ clusters using the pre-processed data from before. Each cluster in this algorithm is chosen to reduce the overall inertia of the dataset, but the inertia is expected to decrease monotonically with increasing $k$ as points move closer to their cluster centers. Rather than use the inertia to measure cluster performance, silhouette values are used. The silhouette score for each cluster, as well as the average silhouette score, is computed, which gives a more reliable method to determine the ideal number of clusters.
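As a toy illustration of why the silhouette is preferred (synthetic blobs, not the county data): inertia keeps shrinking as $k$ grows, while the silhouette peaks at the natural number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# two well-separated synthetic blobs (true k = 2)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts)
    # inertia decreases with every k, but the silhouette peaks at k = 2
    print(k, round(km.inertia_, 1), round(silhouette_score(pts, km.labels_), 3))
```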

This process is performed below. The number of initializations is set to 10 to prevent a certain random state biasing the data and cluster performance.

In [13]:
km = KMeans(n_clusters=2, n_init=10, verbose=0, random_state=2002)
km.fit(X)
In [14]:
ks = [] # number of clusters
avg_sil = [] # average silhouette of each cluster

for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, verbose=0, random_state=2002);
    km.fit(X);
    y_hat = km.labels_ # cluster assignments
    ks.append(k)
    
    result = X.copy()
    result["cluster"] = y_hat # add cluster label
    result["sil"] = silhouette_samples(X, y_hat) # compute silhouette of cluster
    
    print("Clusters: " , str(k))
    print(f"inertia: {km.inertia_:.5g}")
    print("silhouette medians by cluster:")
    print( result.groupby("cluster")["sil"].median() )
    
    avg = np.average(result.groupby("cluster")["sil"].median())
    avg_sil.append(avg)
    print("average silhouette: " + str(avg))
    
    sns.catplot(data=result,
                x="cluster", y="sil",
                kind="violin", height=5
            )
    plt.show()
Clusters:  2
inertia: 1.0185e+05
silhouette medians by cluster:
cluster
0    0.782237
1    0.078960
Name: sil, dtype: float64
average silhouette: 0.43059868654818323
Clusters:  3
inertia: 88177
silhouette medians by cluster:
cluster
0    0.556099
1    0.097483
2    0.060976
Name: sil, dtype: float64
average silhouette: 0.23818595882335125
Clusters:  4
inertia: 78931
silhouette medians by cluster:
cluster
0    0.099352
1   -0.047595
2    0.073739
3    0.394433
Name: sil, dtype: float64
average silhouette: 0.12998212092376385

The above findings can be summarized in the table below.

In [15]:
sil_results = pd.DataFrame({"k" : ks, "Avg. Silhouette" : avg_sil})
sil_results = sil_results.set_index("k")

fsil_0 = sil_results.iloc[0][0] # save to final results
fsil_0b = sil_results.iloc[1][0] # save to final results

sil_results
Out[15]:
Avg. Silhouette
k
2 0.430599
3 0.238186
4 0.129982

As observed, the optimal number of clusters in this analysis is just two ($k=2$). The first cluster (0) has a high silhouette score, which gradually decreases as more clusters are added. This suggests there is a group of counties closely related in these economic and demographic statistics, so increasing the number of clusters only divides them further. Likely, this is a select few counties, such as those with large cities, which explains the low deviation in values.

Increasing from $k=2$ to $k=3$, the second cluster (1) is divided into roughly equal parts, with a minimal, yet still positive, change in the silhouette score. Increasing the number of clusters further causes a much lower average silhouette score, so the small gain in the silhouette score for cluster 1 and the newly formed cluster 2 is not worth it. Therefore, the counties of the US are best divided into two groups.

In [16]:
km = KMeans(n_clusters=2, n_init=10, verbose=0, random_state=2002); # best clustering
km.fit(X);
y_hat = km.labels_ # cluster assignments

X2 = X.copy() 
X2["cluster"] = y_hat # add cluster assignments

c_vals = pd.Series(y_hat)
print("Number of counties in each cluster: ")
c_vals.value_counts()
Number of counties in each cluster: 
Out[16]:
0    3084
1      55
dtype: int64

These results can be plotted geographically using Excel; the cluster labels are exported below and the resulting map image is then loaded.

In [17]:
county_labels = pd.DataFrame(y_hat, index=counties.index)
county_labels.head(5)
Out[17]:
0
County
Abbeville County, SC 0
Acadia Parish, LA 0
Accomack County, VA 0
Ada County, ID 0
Adair County, IA 0
In [18]:
#county_labels.to_csv("county_labels.csv") # save data file
In [19]:
plt.figure(figsize=(15,8))
plt.axis('off')
 
image = mpimg.imread("k_means_geo_k=2.png")
plt.imshow(image)
plt.show()

As observed, the result is the opposite of the original assumption. Rather than a few big cities dominating the main cluster, it is the rest of the United States that has a roughly uniform statistical distribution. The small number of city counties (cluster 1) have a larger range of values and contribute to the lower silhouette score in each case.

Agglomeration¶

The preceding analysis used the k-means clustering algorithm to determine which number of clusters had the highest average silhouette score. However, the distances between clusters and the method of measuring cluster separation were not optimized. In this section, an agglomerative clustering is created to determine the optimal number of clusters using a variety of linkage metrics.
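A minimal sketch of the four linkage options compared below, on hypothetical toy points; with two tight, well-separated pairs, every linkage should split them the same way.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# two tight pairs of points; any sensible linkage splits them 2/2
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

for linkage in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(pts)
    print(linkage, labels.tolist())  # each pair shares a label
```

On real data the linkages diverge: single linkage chains clusters together through nearest points, while Ward merges the pair of clusters that least increases within-cluster variance, which is why the results below differ by linkage.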

In [20]:
ks = [] # number of clusters
links = [] # linkage type
avg_sil = [] # average silhouette of each cluster

for k in range(2,5):
    for l in ["single", "complete", "average", "ward"]: 
        # Create and fit cluster
        agg = AgglomerativeClustering(n_clusters=k, linkage=l);
        agg.fit(X); 
        z_hat = agg.labels_ # cluster assignments
               
        results = X.copy()
        results["cluster"] = z_hat # add cluster label
        results["sil"] = silhouette_samples(X, z_hat) # compute silhouette of cluster
               
        avg = np.average(results.groupby("cluster")["sil"].median()) # average cluster score
        
        # Add data to list
        links.append(l)
        ks.append(k)
        avg_sil.append(avg)

# Create dataframe, sort values
agg_results = pd.DataFrame({"k" : ks, "Linkage" : links, "Avg. Silhouette" : avg_sil})
agg_results.sort_values(by="Avg. Silhouette", ascending=False, inplace=True)

fsil_1 = agg_results.iloc[0][2] # save to final results
fsil_1b = agg_results.iloc[4][2] # save to final results
fsil_1c = agg_results.iloc[6][2] # save to final results

agg_results
Out[20]:
k Linkage Avg. Silhouette
3 2 ward 0.496070
0 2 single 0.475134
1 2 complete 0.475134
2 2 average 0.475134
6 3 average 0.405241
5 3 complete 0.382642
10 4 average 0.365941
4 3 single 0.295068
9 4 complete 0.285835
8 4 single 0.221380
7 3 ward 0.199837
11 4 ward 0.173677

As observed, the general trend found with k-means clustering also appears in agglomerative clustering: increasing $k$ generally lowers the average silhouette score. The best clustering uses $k=2$ with Ward linkage, reaching an average silhouette score of almost 0.5, which means this agglomerative method performed 13.2% better based on the average silhouette score across all clusters.

A silhouette plot is created for this optimal clustering.

In [21]:
agg = AgglomerativeClustering(n_clusters=2, linkage="ward");
agg.fit(X); 
z_hat = agg.labels_ # cluster assignments
               
results = X.copy()
results["cluster"] = z_hat # add cluster label
results["sil"] = silhouette_samples(X, z_hat) # compute silhouette of cluster 
print("Silhouette medians by cluster:")
print( pd.Series(results.groupby("cluster")["sil"].median()).to_dict() )

sns.catplot(data=results, x="cluster", y="sil", kind="violin", height=5)
plt.show()

c_vals = pd.Series(z_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
Silhouette medians by cluster:
{0: 0.8564568180909888, 1: 0.13568339782639516}
Number of counties in each cluster: 
{0: 3119, 1: 20}

Between the k-means and agglomerative clusterings shown above, the median silhouette value of each cluster increased, supporting the claim that this is the optimal clustering. Every $k=2$ case in the agglomerative approach raised the average silhouette value across clusters compared to k-means, suggesting that the counties of the United States are best clustered by slowly merging like regions rather than by a nearest-centroid assignment.

Similar to the k-means approach, the majority of the samples fall into cluster 0 with a high silhouette score and low deviation, suggesting that most of the US is similar in economic and demographic data, as expected. The outliers of the data are placed in the "other" cluster, cluster 1, whose members exhibit similar properties to one another but deviate greatly both from cluster 0 and among themselves. As before, the results can be plotted to show the geographic interpretation of this result.

In [22]:
county_labels = pd.DataFrame(z_hat, index=counties.index)
county_labels.head(5)
Out[22]:
0
County
Abbeville County, SC 0
Acadia Parish, LA 0
Accomack County, VA 0
Ada County, ID 0
Adair County, IA 0
In [23]:
#county_labels.to_csv("county_labels_2.csv") # save data file
In [24]:
plt.figure(figsize=(15,8))
plt.axis('off')
 
image = mpimg.imread("agg_ward_geo_k=2.png")
plt.imshow(image)
plt.show()

With this clustering, there is a greater disparity between county assignments, as cluster 1 has fewer members than in the k-means clustering. The counties in cluster 1 appear to be a subset of the counties in cluster 1 from the k-means clustering, suggesting the remainder were on the cusp between the two groupings.

Expectation-Maximization¶

The clusterings above each create one cluster with a small number of elements in order to reduce the overall variance within each group. With expectation-maximization mixture models, however, cluster assignments are made probabilistically, which tends to discourage such small groupings. This is done by iteratively fitting a Gaussian mixture model.

We can perform a grid search to optimize the hyperparameters of such a model and create a more even clustering distribution. For this, the number of clusters and the type of covariance parameterization are varied. The initial weights are determined using the k-means algorithm to give an accurate baseline against which to compare results.
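A minimal sketch of the mixture-model fit on synthetic blobs (hypothetical data), showing the soft assignments that distinguish EM from the hard labels of k-means:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two synthetic Gaussian blobs
pts = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(4, 0.5, (60, 2))])

gm = GaussianMixture(n_components=2, covariance_type="tied",
                     init_params="kmeans", random_state=0).fit(pts)

proba = gm.predict_proba(pts)  # soft assignments; each row sums to 1
print(np.allclose(proba.sum(axis=1), 1.0))  # True
```

The `covariance_type` options varied in the grid search control the shape constraint on each component: "tied" shares one covariance matrix across all components, "full" fits one per component, and "diag"/"spherical" restrict it to diagonal or scalar form.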

In [25]:
ks = [] # number of clusters
covs = [] # type of covariance to use
avg_sil = [] # average silhouette of each cluster

for k in range(2,5):
    for cov in ['full', 'tied', 'diag', 'spherical']: 
        gm = GaussianMixture(n_components=k, covariance_type=cov, init_params='kmeans', random_state=2002)
        gm.fit(X)
        w_hat = gm.predict(X) # probabilistic prediction
        
        results = X.copy()
        results["cluster"] = w_hat # add cluster label
        results["sil"] = silhouette_samples(X, w_hat) # compute silhouette of cluster 
        
        avg = np.average(results.groupby("cluster")["sil"].median()) # average cluster score
        
        # Add data to list
        covs.append(cov)
        ks.append(k)
        avg_sil.append(avg)

# Create dataframe, sort values
em_results = pd.DataFrame({"k" : ks, "Covariance" : covs, "Avg. Silhouette" : avg_sil})
em_results.sort_values(by="Avg. Silhouette", ascending=False, inplace=True)

fsil_2 = em_results.iloc[0][2] # save to final results
fsil_2b = em_results.iloc[1][2] # save to final results

em_results 
Out[25]:
k Covariance Avg. Silhouette
1 2 tied 0.407277
5 3 tied 0.254437
9 4 tied 0.165664
3 2 spherical 0.159767
4 3 full 0.152434
0 2 full 0.145999
2 2 diag 0.133524
8 4 full 0.072426
7 3 spherical 0.018805
11 4 spherical 0.012627
10 4 diag 0.002215
6 3 diag -0.030400

This clustering method is worse overall at maintaining a high average silhouette score, but for $k=2$ clusters with the tied covariance parameterization, the results are comparable to the k-means algorithm. A silhouette plot for the best-performing expectation-maximization clustering is shown below.

In [26]:
gm = GaussianMixture(n_components=2, covariance_type='tied', init_params='kmeans', random_state=2002)
gm.fit(X); 
w_hat = gm.predict(X) # probabilistic prediction

results = X.copy()
results["cluster"] = w_hat # add cluster label
results["sil"] = silhouette_samples(X, w_hat) # compute silhouette of cluster 
    
print("Silhouette medians by cluster:")
print( pd.Series(results.groupby("cluster")["sil"].median()).to_dict() )

sns.catplot(data=results, x="cluster", y="sil", kind="violin", height=5)
plt.show()

c_vals = pd.Series(w_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
Silhouette medians by cluster:
{0: 0.7580883927930174, 1: 0.05646592473471834}
Number of counties in each cluster: 
{0: 3073, 1: 66}

As mentioned before, this clustering method discourages smaller cluster sizes, which has the consequence of a lower average silhouette score. This trade-off yields more interpretable clusters but worse performance, an outcome that provides evidence for the non-binary view of the United States in the sense that the algorithm does a poor job dividing counties and instead treats them as one combined cluster.

Because the EM method discourages smaller cluster sizes, we can plot a geographical view of this clustering as before, now with more samples per cluster. Shown below, the best-performing $k=2$ and $k=3$ clusterings are mapped using the EM method. This algorithm also yields probabilistic assignments, which can be used to determine the likelihood of each county being labeled with a given cluster. Doing so, a wider range of values can be used to assess how binary the county clustering is for both maps.

In [27]:
# k=2, cov = tied
gm = GaussianMixture(n_components=2, covariance_type='tied', init_params='kmeans', random_state=2002)
gm.fit(X); 
w_hat = gm.predict_proba(X) # probabilistic prediction

gmp = pd.DataFrame(w_hat, index=X.index)[0]
gmp = gmp.sort_values(ascending=False) # probability of being in cluster 1
In [28]:
county_labels = pd.DataFrame(w_hat, index=counties.index)
#county_labels.to_csv("county_labels_3.csv") # save data file

The plot below colors each county by its probability of belonging to cluster 0, using the probabilistic assignments from the EM method.

In [29]:
plt.figure(figsize=(15,8))
plt.axis('off')
 
image = mpimg.imread("em_geo_k=2.png")
plt.imshow(image)
plt.show()

w_hat = gm.predict(X)
c_vals = pd.Series(w_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
print("Average cluster silhouette: " + str(em_results.iloc[0]["Avg. Silhouette"]) )
Number of counties in each cluster: 
{0: 3073, 1: 66}
Average cluster silhouette: 0.4072771587638679

There are more counties in cluster 1, extending the trend seen in the agglomerative clustering of capturing counties on the edge of being labeled cluster 0. Similar regions are identified as belonging to one cluster or the other, as the previous maps show.

For the case of $k=3$, coloring by probability cannot be performed as clearly, so each cluster is instead given a separate color.

In [30]:
# k=3, cov = tied
gm = GaussianMixture(n_components=3, covariance_type='tied', init_params='kmeans', random_state=2002)
gm.fit(X); 
w_hat = gm.predict_proba(X) # probabilistic prediction

gmp = pd.DataFrame(w_hat, index=X.index) # probability of being in a given cluster
gmp.head(5)
Out[30]:
0 1 2
County
Abbeville County, SC 1.0 5.604427e-225 1.607282e-21
Acadia Parish, LA 1.0 1.501533e-229 9.592040e-21
Accomack County, VA 1.0 6.369716e-229 1.294897e-22
Ada County, ID 1.0 4.020907e-200 2.160049e-16
Adair County, IA 1.0 3.814114e-226 1.327451e-20
In [31]:
w_hat = gm.predict(X) # rounded prediction, ignore probability
gmp = pd.DataFrame(w_hat, index=X.index)
county_labels = pd.DataFrame(w_hat, index=counties.index)
#county_labels.to_csv("county_labels_5.csv") # save data file
In [32]:
plt.figure(figsize=(15,8))
plt.axis('off')
 
image = mpimg.imread("em_geo_k=3.png")
plt.imshow(image)
plt.show()

c_vals = pd.Series(w_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
print("Average cluster silhouette: " + str(em_results.iloc[1]["Avg. Silhouette"]) )
Number of counties in each cluster: 
{0: 3019, 2: 107, 1: 13}
Average cluster silhouette: 0.2544366630404996

The mapped results of the $k=2$ and $k=3$ EM methods are similar. While the secondary clusters contain more counties, the silhouette score drops with $k=3$, and the labels do not conform to those found with either k-means or agglomerative clustering; instead, the $k=3$ result largely subdivides the earlier cluster 1 set. We can therefore conclude that agglomerative clustering remains the best of these methods for this data, as the EM approach does little to improve the clustering.

Network Analysis¶

Rather than clustering blindly on the economic and demographic data alone, we can instead group counties into a cluster only if they are geographically adjacent. This creates distinct regions in the US joined by county adjacencies, with the feature data providing edge weights between county nodes. An example ego graph for Harford County, Maryland is shown below, where the separation between counties is given by the Euclidean distance between their feature vectors.
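A minimal sketch of how such a weighted adjacency graph might be assembled, with hypothetical mini-counties `A`, `B`, and `C` standing in for the real adjacency list and feature matrix (the weighting scheme here mirrors how the `weight` attribute of `adj` is assumed to be built):

```python
import numpy as np
import networkx as nx

# Hypothetical mini example: three adjacent "counties" with 2-feature vectors.
feats = {
    "A": np.array([1.0, 2.0]),
    "B": np.array([1.5, 2.5]),
    "C": np.array([9.0, 9.0]),
}
adjacent = [("A", "B"), ("B", "C")]  # stand-in for the Census adjacency list

G = nx.Graph()
for u, v in adjacent:
    # Edge weight = Euclidean distance between the counties' feature vectors
    G.add_edge(u, v, weight=float(np.linalg.norm(feats[u] - feats[v])))

# Ego graph around "B": all nodes within a weighted distance of 1.0.
# "A" is close in feature space (weight ~0.71) and is included;
# "C" is far (weight ~9.9) and is excluded.
ego = nx.ego_graph(G, "B", distance="weight", radius=1.0)
print(sorted(ego.nodes))  # ['A', 'B']
```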

In [33]:
ego_harco = nx.ego_graph(adj, "Harford County, MD", distance='weight', radius=1)
nx.draw(ego_harco, with_labels=True, node_size=500, node_color="orange")

Determining where to cut the graph into separate subgraphs is an NP-hard problem and thus requires approximation algorithms; here, spectral clustering will be used. Hyperparameter optimization is employed but significantly simplified to save computation time, giving an accurate yet computationally friendly estimate of the optimal subgraph split.
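The intuition behind spectral clustering can be illustrated on a toy graph: the sign pattern of the Fiedler vector, the eigenvector of the graph Laplacian with the second-smallest eigenvalue, approximates the minimum balanced cut. This sketch uses plain NumPy on a hand-built adjacency matrix rather than scikit-learn's `SpectralClustering`:

```python
import numpy as np

# Adjacency matrix for two triangles (nodes 0-2 and 3-5) joined by a
# single bridge edge between nodes 2 and 3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))
L = D - A                      # unnormalized graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]        # eigenvector of the 2nd-smallest eigenvalue
labels = (fiedler > 0).astype(int)
print(labels)                  # one triangle in one cluster, one in the other
```

Thresholding the Fiedler vector at zero cuts the bridge edge, recovering the two triangles as clusters; scikit-learn's `SpectralClustering` generalizes this idea by embedding with several eigenvectors and running k-means or discretization on the embedding.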

In [34]:
simplefilter(action='ignore', category=UserWarning) # suppress warnings

ns = [] # clusters
ks = [] # nearest neighbors
ls = [] # label assignment method
sils = [] # average silhouette values

for n in range(2,5): # clusters
    for k in range(2, 10): # n-neighbors
        for l in ["kmeans", "discretize"]:
            # Define and fit clustering
            spectral_clustering = SpectralClustering(
                n_clusters=n, 
                random_state = 2002,
                n_init = 6,
                affinity="nearest_neighbors",
                n_neighbors = k,
                assign_labels=l,
                n_jobs=-1
            ) # hyperparameters used
            spc = spectral_clustering.fit(X) # fit clustering
            s_hat = spc.labels_ # get labels
            
            results = X.copy()
            results["cluster"] = s_hat # add cluster label
            results["sil"] = silhouette_samples(X, s_hat) # compute silhouette of cluster 
            
            avg = np.average(results.groupby("cluster")["sil"].median()) # average cluster score
            
            # Add to data
            ns.append(n)
            ks.append(k)
            ls.append(l)
            sils.append(avg)
In [35]:
# Create dataframe
spc_vals = pd.DataFrame({'Clusters' : ns, 'Neighbors' : ks, 'Labels' : ls, 'Avg Silhouette' : sils})
spc_vals = spc_vals.sort_values(by='Avg Silhouette', ascending=False)

fsil_3 = spc_vals.iloc[0]["Avg Silhouette"] # save to final results
fsil_3b = spc_vals.iloc[5]["Avg Silhouette"] # save to final results

spc_vals.head(10)
Out[35]:
Clusters Neighbors Labels Avg Silhouette
4 2 4 kmeans 0.529240
5 2 4 discretize 0.529240
10 2 7 kmeans 0.458866
6 2 5 kmeans 0.458866
8 2 6 kmeans 0.458866
20 3 4 kmeans 0.397198
3 2 3 discretize 0.379289
19 3 3 discretize 0.340084
29 3 8 discretize 0.312214
36 4 4 kmeans 0.309854

The silhouette scores for this clustering can then be found below.

In [36]:
spectral_clustering = SpectralClustering(
    n_clusters=2, 
    random_state = 2002,
    n_init = 10,
    affinity="nearest_neighbors",
    n_neighbors = 4,
    assign_labels="kmeans",
    n_jobs=-1
) # define method used
spc = spectral_clustering.fit(X) # fit clustering
s_hat = spc.labels_ # get labels

results = X.copy()
results["cluster"] = s_hat # add cluster label
results["sil"] = silhouette_samples(X, s_hat) # compute silhouette of cluster 
    
print("Silhouette medians by cluster:")
sils = pd.Series(results.groupby("cluster")["sil"].median())
print( sils.to_dict() )
print("Average silhouette:")
print(np.average(sils))

sns.catplot(data=results, x="cluster", y="sil", kind="violin", height=5)
plt.show()

c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
Silhouette medians by cluster:
{0: 0.7973155485697738, 1: 0.26116522225869837}
Average silhouette:
0.5292403854142361
Number of counties in each cluster: 
{0: 3134, 1: 5}

As observed, this clustering method performs better than previous methods by keeping adjacent counties connected. A geographic plot can be created for these connected counties.

In [37]:
county_labels = pd.DataFrame(s_hat, index=counties.index)
#county_labels.to_csv("county_labels_6.csv") # save data file
In [38]:
plt.figure(figsize=(15,8))
plt.axis('off')
 
image = mpimg.imread("spc_geo_k=2.png")
plt.imshow(image)
plt.show()

c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
print("Average cluster silhouette: " + str(spc_vals.iloc[0]["Avg Silhouette"]) )
Number of counties in each cluster: 
{0: 3134, 1: 5}
Average cluster silhouette: 0.5292403854142361

These results support two noteworthy conclusions. First, the continental United States is best defined as a single cluster, as tested through a variety of clustering methods. Second, spectral clustering suggests Alaska is more closely related to the rest of the United States than Hawaii, which is intuitive given the demographic and economic data. To recreate the earlier results, $k=4$ is needed: one cluster for Alaska, one for Hawaii, and the remaining two to divide the continental United States. The hyperparameters used to create this graph come from the earlier grid optimization, using the best performing settings for $k=4$.

In [39]:
spectral_clustering = SpectralClustering(
    n_clusters=4, 
    random_state = 2002,
    n_init = 10,
    affinity="nearest_neighbors",
    n_neighbors = 50,
    assign_labels="kmeans",
    n_jobs=-1
) # define method used
spc = spectral_clustering.fit(X) # fit clustering
s_hat = spc.labels_ # get labels

results = X.copy()
results["cluster"] = s_hat # add cluster label
results["sil"] = silhouette_samples(X, s_hat) # compute silhouette of cluster 
    
print("Silhouette medians by cluster:")
sils = pd.Series(results.groupby("cluster")["sil"].median())
print( sils.to_dict() )
print("Average silhouette:")
print(np.average(sils))

sns.catplot(data=results, x="cluster", y="sil", kind="violin", height=5)
plt.show()

c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
Silhouette medians by cluster:
{0: 0.26635253096651945, 1: -0.03388292282747894, 2: 0.29310678353792513, 3: 0.2238888604862083}
Average silhouette:
0.18736631304079346
Number of counties in each cluster: 
{3: 2351, 2: 386, 1: 218, 0: 184}
In [40]:
county_labels = pd.DataFrame(s_hat, index=X.index)
#county_labels.to_csv("county_labels_7.csv") # save data file
In [41]:
plt.figure(figsize=(15,8))
plt.axis('off')
 
image = mpimg.imread("spc_geo_k=4.png")
plt.imshow(image)
plt.show()

c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
print("Average cluster silhouette: " + str(spc_vals.iloc[9]["Avg Silhouette"]) )

fsil_3c = spc_vals.iloc[9]["Avg Silhouette"]
Number of counties in each cluster: 
{3: 2351, 2: 386, 1: 218, 0: 184}
Average cluster silhouette: 0.30985386834088685

Regionization¶

The principles discussed above for clustering regions of the United States can be extended to create several connected regions. Similar hyperparameters are used as before, but with a much larger number of nearest neighbors, set to 1/4 of the total number of counties, in an attempt to create four connected regions of like labels.

In [42]:
spectral_clustering = SpectralClustering(
    n_clusters=4, 
    random_state = 2002,
    n_init = 10,
    affinity="nearest_neighbors",
    n_neighbors = int(len(X)/4),
    assign_labels="kmeans",
    n_jobs=-1
) # define method used
spc = spectral_clustering.fit(X) # fit clustering
s_hat = spc.labels_ # get labels

results = X.copy()
results["cluster"] = s_hat # add cluster label
results["sil"] = silhouette_samples(X, s_hat) # compute silhouette of cluster 
    
print("Silhouette medians by cluster:")
sils = pd.Series(results.groupby("cluster")["sil"].median())
print( sils.to_dict() )
print("Average silhouette:")
print(np.average(sils))
fsil_4 = np.average(sils) # save to final results

sns.catplot(data=results, x="cluster", y="sil", kind="violin", height=5)
plt.show()

c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
Silhouette medians by cluster:
{0: -0.16217613847803253, 1: 0.14049076973456662, 2: 0.3304741413054011, 3: -0.08098758967566834}
Average silhouette:
0.05695029572156671
Number of counties in each cluster: 
{2: 1223, 3: 659, 0: 646, 1: 611}

These regions can be mapped as shown before.

In [43]:
county_labels = pd.DataFrame(s_hat, index=X.index)
#county_labels.to_csv("county_labels_8.csv") # save data file
In [44]:
plt.figure(figsize=(15,8))
plt.axis('off')
 
image = mpimg.imread("spc_reg_geo_k=4.png")
plt.imshow(image)
plt.show()

c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
print("Average cluster silhouette: " + str(np.average(sils)) )
Number of counties in each cluster: 
{2: 1223, 3: 659, 0: 646, 1: 611}
Average cluster silhouette: 0.05695029572156671

As expected, this produces a very poor clustering, confirming that the optimal clustering was the one found previously.

Conclusions and Discussion¶

The average silhouette score across all clusters found in each clustering analysis is shown below. The results are ordered from the highest silhouette score, and thus best performing clustering, downward, with the relevant hyperparameters or distinguishing features of each analysis noted.

In [45]:
results_data = {"Method" : ["k-Means", "k-Means", "Agglomerative", "Agglomerative", "Agglomerative", "Expectation-Maximization", "Expectation-Maximization", "Spectral", "Spectral", "Spectral", "Spectral (Regional)"],
                "Clusters" : [2,3,2,3,4,2,3,2,3,4,4],
                "Linkage" : ["-", "-", "Ward", "Average", "Average", "-", "-", "-", "-", "-", "-"],
                "Covariance" : ["-", "-", "-", "-", "-", "Tied", "Tied", "-", "-", "-", "-"],
                "Labelling" : ["-", "-", "-", "-", "-", "-", "-", "k-Means", "k-Means", "k-Means", "k-Means"],
                "Neighbors" : ["-", "-", "-", "-", "-", "-", "-", 4, 4, 4, 784],
                "Avg. Silhouette" : [fsil_0, fsil_0b, fsil_1, fsil_1b, fsil_1c, fsil_2, fsil_2b, fsil_3, fsil_3b, fsil_3c, fsil_4]
}
results = pd.DataFrame(data=results_data)
results = results.sort_values(by="Avg. Silhouette", ascending=False) # sort values
results = results.reset_index() # reset indices
results.drop(columns=["index"], inplace=True)
results
Out[45]:
Method Clusters Linkage Covariance Labelling Neighbors Avg. Silhouette
0 Spectral 2 - - k-Means 4 0.529240
1 Agglomerative 2 Ward - - - 0.496070
2 k-Means 2 - - - - 0.430599
3 Expectation-Maximization 2 - Tied - - 0.407277
4 Agglomerative 3 Average - - - 0.405241
5 Spectral 3 - - k-Means 4 0.397198
6 Agglomerative 4 Average - - - 0.365941
7 Spectral 4 - - k-Means 4 0.309854
8 Expectation-Maximization 3 - Tied - - 0.254437
9 k-Means 3 - - - - 0.238186
10 Spectral (Regional) 4 - - k-Means 784 0.056950

The best performing clustering was found using spectral clustering with the minimum number of clusters, $k=2$. Its results from the previous section are reproduced below.

In [46]:
plt.figure(figsize=(15,8))
plt.axis('off')
 
image = mpimg.imread("spc_geo_k=2.png")
plt.imshow(image)
plt.show()

c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
print("Average cluster silhouette: " + str(spc_vals.iloc[0]["Avg Silhouette"]) )
Number of counties in each cluster: 
{2: 1223, 3: 659, 0: 646, 1: 611}
Average cluster silhouette: 0.5292403854142361

As observed in the above plot, the best performing clustering treats the continental United States as a single cluster. While these results suggest Hawaii is the most isolated from the rest of the US, this likely occurs because the algorithm must choose some subgroup of counties to form the second cluster: Hawaii's 5 counties are similar to one another in demographics and economic statistics yet distinct from the mainland, so they form a relatively high-silhouette cluster. Note that Delaware has the fewest counties of any state, but because Delaware's counties share features with those of neighboring states, forming Delaware as a separate cluster would yield a lower silhouette score than Hawaii.
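This reasoning can be checked on synthetic stand-in data: a small cluster that is internally tight and far from the main mass (Hawaii-like) yields a much higher average silhouette than carving out a small subset that is statistically indistinguishable from the main mass (Delaware-like). The data below is hypothetical and only illustrates the geometry of the argument:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2002)

# Hypothetical stand-ins: a large mainland mass, plus 5 "island" counties
# far away in feature space and 3 "embedded" counties drawn from the
# same distribution as the mainland.
mainland = rng.normal(0.0, 1.0, size=(300, 4))
islands = rng.normal(8.0, 0.3, size=(5, 4))    # distinct and internally tight
embedded = rng.normal(0.0, 1.0, size=(3, 4))   # indistinguishable from mainland

# Partition A: the far-away islands form the second cluster
Xa = np.vstack([mainland, embedded, islands])
ya = np.array([0] * 303 + [1] * 5)

# Partition B: the embedded counties form the second cluster
Xb = np.vstack([mainland, islands, embedded])
yb = np.array([0] * 305 + [1] * 3)

sil_a = silhouette_score(Xa, ya)
sil_b = silhouette_score(Xb, yb)
print(sil_a > sil_b)  # the well-separated island cluster scores higher
```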

This gives strong evidence that the United States is not divided between rural and urban regions and, in general, that a binary view of the US is invalid. The best performing clusterings are those with the fewest clusters and/or the fewest counties in the secondary cluster. Any division of the US into two distinct groups, such as the k-means $k=2$ clustering, did so by separating out densely populated regions around major cities such as New York City, Baltimore, and Miami. However, as shown in the results above, ignoring the big cities and instead clustering the continental US as one large cluster increased the average silhouette score by 6.69%.
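The quoted figure can be verified directly from the summary table, taking the best blind clustering (agglomerative with Ward linkage, $k=2$) as the baseline:

```python
# Relative improvement of the spectral k=2 silhouette over the best
# blind method (agglomerative, Ward, k=2), using values from the
# summary table above.
blind_best = 0.496070
spectral = 0.529240
improvement = (spectral - blind_best) / blind_best * 100
print(round(improvement, 2))  # 6.69
```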

These results also show that a network analysis of the United States, such as spectral clustering, performs better than truly blind clustering with the k-means, agglomerative, or expectation-maximization methods. Regions of the US joined by county adjacencies share similar economic and demographic data, even if they do not cluster best at a purely local scale. When these geographic connections are taken into account, the clustering of the US as a whole improves markedly, suggesting that while it is not accurate to divide the US into separate categories such as rural, suburban, or urban, grouping closely connected counties may be a better approximation.

In summary, while political data may suggest the US is divided into rural and urban counties [1], this analysis indicates that, economically and demographically, the US is best viewed as a whole, undivided country. More research is needed to determine the best divisions of the US beyond the continental/island split shown here. Future work could redraw state lines based on this data to group strong clusters, creating subnetworks that maximize the average silhouette score. This challenging problem, briefly touched upon in this report, could provide evidence for how strongly related certain regions of the country are, how many such regions exist, and so on.

References¶

[1] Love, Hanna, and Tracy Hadden Loh. “The ‘rural-Urban Divide’ Furthers Myths about Race and Poverty-Concealing Effective Policy Solutions.” Brookings, December 8, 2020. https://www.brookings.edu/blog/the-avenue/2020/12/08/the-rural-urban-divide-furthers-myths-about-race-and-poverty-concealing-effective-policy-solutions/.

[2] “Largest Countries in the World (by Area).” Worldometer. Accessed May 18, 2023. https://www.worldometers.info/geography/largest-countries-in-the-world/.

[3] Whitcomb, Ryan, Joung Min Choi, and Bo Guan. “County Demographics CSV File.” CORGIS Datasets Project, from Austin Cory Bart, Dennis Kafura, Clifford A. Shaffer, Javier Tibau, Luke Gusukuma, Eli Tilevich, November 5, 2022. https://corgis-edu.github.io/corgis/csv/county_demographics/.

[4] National Bureau of Economic Research. “County Adjacency.” NBER, from U.S. Census Bureau, May 8, 2017. https://www.nber.org/research/data/county-adjacency.