It is often stated that urban and rural regions of a country become polarized over time, with increasing disparities not only in political affiliations [1] but also in economic and demographic statistics. For the United States, a very populous nation with the fourth largest land area of any country [2], this effect may be exaggerated given the large distances between densely populated regions separated by expansive suburban and rural areas. However, while the political divide between rural and urban regions of the US has been extensively studied, namely in presidential and gubernatorial election results, there is less research on the validity of dividing the US into two distinct groups [1]. Evidence is therefore needed to determine the effectiveness of various county clusterings of the US based on these factors.
This study performs a clustering analysis of counties in the United States to determine whether the binary view of the US is valid and whether it conforms to the expected rural-urban polarization. Data is collected from the US Census Bureau from 2010 to 2019 and includes only economic and demographic statistics for each county; political variables are excluded from the analysis. Clustering is performed in two batches: unsupervised clustering, using k-means, agglomerative, and expectation-maximization clustering, which are blind to inter-county adjacencies; and network clustering, using spectral clustering to give preference to nearby counties. Performance in each computation is measured using an average silhouette score across all clusters. Results indicate that the best clustering performance comes from considering the continental United States as a single cluster, providing evidence that a non-binary view of the United States is valid. Apart from political views, there is little evidence that the United States differs significantly between rural and urban regions, or along any other subgroup dividing the country.
Data on county statistics is collected from the United States Census Bureau from 2010 to 2019 [3]. A list of each US county and all counties sharing a land border with it is given by the county adjacency data, also from the United States Census Bureau [4]. The raw data files used for analysis are shown below.
import numpy as np
import pandas as pd
import seaborn as sns
import networkx as nx
from matplotlib import pyplot as plt
from matplotlib import image as mpimg
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples
from sklearn.mixture import GaussianMixture
from sklearn.cluster import SpectralClustering
from warnings import simplefilter
The economic and demographic data for each US county is shown below.
counties = pd.read_csv('https://corgis-edu.github.io/corgis/datasets/csv/county_demographics/county_demographics.csv', index_col=0)
counties.head(5)
| State | Age.Percent 65 and Older | Age.Percent Under 18 Years | Age.Percent Under 5 Years | Education.Bachelor's Degree or Higher | Education.High School or Higher | Employment.Nonemployer Establishments | Ethnicities.American Indian and Alaska Native Alone | Ethnicities.Asian Alone | Ethnicities.Black Alone | ... | Population.Population per Square Mile | Sales.Accommodation and Food Services Sales | Sales.Retail Sales | Employment.Firms.Total | Employment.Firms.Women-Owned | Employment.Firms.Men-Owned | Employment.Firms.Minority-Owned | Employment.Firms.Nonminority-Owned | Employment.Firms.Veteran-Owned | Employment.Firms.Nonveteran-Owned | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| County | |||||||||||||||||||||
| Abbeville County | SC | 22.4 | 19.8 | 4.7 | 15.6 | 81.7 | 1416 | 0.3 | 0.4 | 27.6 | ... | 51.8 | 12507 | 91371 | 1450 | 543 | 689 | 317 | 1080 | 187 | 1211 |
| Acadia Parish | LA | 15.8 | 25.8 | 6.9 | 13.3 | 79.0 | 4533 | 0.4 | 0.3 | 18.0 | ... | 94.3 | 52706 | 602739 | 4664 | 1516 | 2629 | 705 | 3734 | 388 | 4007 |
| Accomack County | VA | 24.6 | 20.7 | 5.6 | 19.5 | 81.5 | 2387 | 0.7 | 0.8 | 28.8 | ... | 73.8 | 53568 | 348195 | 2997 | 802 | 1716 | 335 | 2560 | 212 | 2536 |
| Ada County | ID | 14.9 | 23.2 | 5.6 | 38.5 | 95.2 | 41464 | 0.8 | 2.7 | 1.4 | ... | 372.8 | 763099 | 5766679 | 41789 | 14661 | 19409 | 3099 | 36701 | 3803 | 35132 |
| Adair County | IA | 23.0 | 21.8 | 5.6 | 18.5 | 94.2 | 609 | 0.3 | 0.5 | 0.6 | ... | 13.5 | -1 | 63002 | 914 | 304 | 499 | 0 | 861 | 185 | 679 |
5 rows × 42 columns
The data is indexed by county name with a myriad of statistical information collected. Refer to the source of the data for an explanation of each column header meaning [3]. A complete list of such features is shown below.
cols = pd.Series(counties.columns)
cols
0 State 1 Age.Percent 65 and Older 2 Age.Percent Under 18 Years 3 Age.Percent Under 5 Years 4 Education.Bachelor's Degree or Higher 5 Education.High School or Higher 6 Employment.Nonemployer Establishments 7 Ethnicities.American Indian and Alaska Native ... 8 Ethnicities.Asian Alone 9 Ethnicities.Black Alone 10 Ethnicities.Hispanic or Latino 11 Ethnicities.Native Hawaiian and Other Pacific ... 12 Ethnicities.Two or More Races 13 Ethnicities.White Alone 14 Ethnicities.White Alone\t not Hispanic or Latino 15 Housing.Homeownership Rate 16 Housing.Households 17 Housing.Housing Units 18 Housing.Median Value of Owner-Occupied Units 19 Housing.Persons per Household 20 Income.Median Houseold Income 21 Income.Per Capita Income 22 Miscellaneous.Foreign Born 23 Miscellaneous.Land Area 24 Miscellaneous.Language Other than English at Home 25 Miscellaneous.Living in Same House +1 Years 26 Miscellaneous.Manufacturers Shipments 27 Miscellaneous.Mean Travel Time to Work 28 Miscellaneous.Percent Female 29 Miscellaneous.Veterans 30 Population.2020 Population 31 Population.2010 Population 32 Population.Population per Square Mile 33 Sales.Accommodation and Food Services Sales 34 Sales.Retail Sales 35 Employment.Firms.Total 36 Employment.Firms.Women-Owned 37 Employment.Firms.Men-Owned 38 Employment.Firms.Minority-Owned 39 Employment.Firms.Nonminority-Owned 40 Employment.Firms.Veteran-Owned 41 Employment.Firms.Nonveteran-Owned dtype: object
As there may be multiple counties with the same name (e.g., "Adams County" has 12 instances), the data is instead indexed by both county name and state designation, such as "Adams County, ND." An example is shown below.
counties.reset_index(inplace=True)
counties['County'] = counties['County'].astype(str) +", "+ counties["State"].astype(str) # concatenate columns
counties = counties.set_index("County") # set as index again
counties.drop(columns=["State"], inplace=True)
harco = counties.loc[["Harford County, MD"]]
harco
| Age.Percent 65 and Older | Age.Percent Under 18 Years | Age.Percent Under 5 Years | Education.Bachelor's Degree or Higher | Education.High School or Higher | Employment.Nonemployer Establishments | Ethnicities.American Indian and Alaska Native Alone | Ethnicities.Asian Alone | Ethnicities.Black Alone | Ethnicities.Hispanic or Latino | ... | Population.Population per Square Mile | Sales.Accommodation and Food Services Sales | Sales.Retail Sales | Employment.Firms.Total | Employment.Firms.Women-Owned | Employment.Firms.Men-Owned | Employment.Firms.Minority-Owned | Employment.Firms.Nonminority-Owned | Employment.Firms.Veteran-Owned | Employment.Firms.Nonveteran-Owned | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| County | |||||||||||||||||||||
| Harford County, MD | 16.6 | 22.2 | 5.6 | 36.7 | 92.7 | 17530 | 0.3 | 3.1 | 14.8 | 4.8 | ... | 560.1 | 390891 | 3792912 | 20330 | 7328 | 10727 | 3134 | 16495 | 2473 | 16889 |
1 rows × 41 columns
We can now standardize each column of the data. To do this, the z-score is computed for each column, which is equivalent to creating a pipeline with a standard scaler operator.
X = (counties - counties.mean())/counties.std() # convert to z-scores
X.head(5)
| Age.Percent 65 and Older | Age.Percent Under 18 Years | Age.Percent Under 5 Years | Education.Bachelor's Degree or Higher | Education.High School or Higher | Employment.Nonemployer Establishments | Ethnicities.American Indian and Alaska Native Alone | Ethnicities.Asian Alone | Ethnicities.Black Alone | Ethnicities.Hispanic or Latino | ... | Population.Population per Square Mile | Sales.Accommodation and Food Services Sales | Sales.Retail Sales | Employment.Firms.Total | Employment.Firms.Women-Owned | Employment.Firms.Men-Owned | Employment.Firms.Minority-Owned | Employment.Firms.Nonminority-Owned | Employment.Firms.Veteran-Owned | Employment.Firms.Nonveteran-Owned | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| County | |||||||||||||||||||||
| Abbeville County, SC | 0.549196 | -0.609584 | -0.949522 | -0.666870 | -0.839228 | -0.206578 | -0.271796 | -0.397138 | 1.262073 | -0.589303 | ... | -0.120312 | -0.210169 | -0.279610 | -0.221310 | -0.201403 | -0.227746 | -0.129359 | -0.284525 | -0.252945 | -0.213273 |
| Acadia Parish, LA | -0.823704 | 1.120198 | 1.039138 | -0.907208 | -1.270673 | -0.113855 | -0.258327 | -0.430981 | 0.597802 | -0.502787 | ... | -0.095677 | -0.169261 | -0.165286 | -0.127290 | -0.126369 | -0.120256 | -0.106743 | -0.134957 | -0.171553 | -0.121967 |
| Accomack County, VA | 1.006829 | -0.350117 | -0.135979 | -0.259340 | -0.871187 | -0.177693 | -0.217920 | -0.261765 | 1.345107 | -0.048580 | ... | -0.107560 | -0.168384 | -0.222193 | -0.176055 | -0.181430 | -0.170843 | -0.128309 | -0.201119 | -0.242822 | -0.170004 |
| Ada County, ID | -1.010918 | 0.370626 | -0.135979 | 1.726063 | 1.317995 | 0.984758 | -0.204451 | 0.381257 | -0.550832 | -0.091837 | ... | 0.065753 | 0.553653 | 0.989189 | 0.958739 | 0.887328 | 0.809485 | 0.032794 | 1.722918 | 1.211310 | 0.894455 |
| Adair County, IA | 0.674005 | -0.032990 | -0.135979 | -0.363835 | 1.158201 | -0.230585 | -0.271796 | -0.363295 | -0.606188 | -0.524416 | ... | -0.142513 | -0.222897 | -0.285953 | -0.236990 | -0.219834 | -0.238274 | -0.147835 | -0.296867 | -0.253755 | -0.230646 |
5 rows × 41 columns
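As a quick check of this equivalence (a sketch on synthetic data, not the county set): pandas' `std()` uses the sample convention (`ddof=1`) while scikit-learn's `StandardScaler` uses the population convention (`ddof=0`), so the two agree up to a factor of $\sqrt{(n-1)/n}$.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

manual = (df - df.mean()) / df.std()         # sample std (ddof=1), as used above
scaled = StandardScaler().fit_transform(df)  # population std (ddof=0)

n = len(df)
print(np.allclose(manual.to_numpy(), scaled * np.sqrt((n - 1) / n)))  # True
```

For clustering purposes the difference is a constant rescaling of every column, so it does not affect the relative distances between counties.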
simplefilter(action='ignore', category=UserWarning) # suppress warnings
X.plot(kind='box', figsize=(20,5))
<AxesSubplot:>
Note that in this analysis, we do not remove any outliers from the data. This is done for several reasons. First, we are examining the polarization of counties in the United States, which by definition implies the data will contain many values that deviate greatly from the median. Second, and most importantly, the plot above shows a large number of outliers in the data; removing them would prevent us from classifying the entirety of the United States as intended. Including the outliers can only provide stronger evidence for a separated view of the US, so if the analysis shows such a result, this assumption can be reconsidered and the computations repeated. The prevalence of outlying values is quantified below to corroborate this statement.
def is_outlier(x):
Q25, Q75 = x.quantile([.25,.75])
I = Q75 - Q25
return (x < Q25 - 1.5*I) | (x > Q75 + 1.5*I)
outliers = X.transform(is_outlier) # boolean mask of outlying values
Xreg = X[~outliers] # outlying values become NaN
Xreg.dropna(inplace = True) # keep only counties with no outlying values
print("Number of US counties: " + str(len(X)))
print("Number of counties with no outlying values: " + str(len(Xreg)))
print("Percentage of such counties: " + f"{100*len(Xreg)/len(X):0.2f}" + "%")
Number of US counties: 3139 Number of counties with no outlying values: 1084 Percentage of such counties: 34.53%
Therefore, the present analysis retains outliers in the data in an attempt to show that the binary view of the US is invalid even with such extrema present.
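For reference, the IQR rule used above flags a value as an outlier when it lies more than 1.5 interquartile ranges outside the middle half of the data. A minimal illustration on a toy series (not the county data):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # 100 lies far outside the middle half
q25, q75 = s.quantile([.25, .75])
iqr = q75 - q25                   # interquartile range
flags = (s < q25 - 1.5 * iqr) | (s > q75 + 1.5 * iqr)
print(flags.tolist())             # [False, False, False, False, True]
```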
The adjacency data for each US county, from the US Census Bureau, is presented below.
adr = pd.read_csv('https://data.nber.org/census/geo/county-adjacency/2010/county_adjacency2010.csv')
adr.head(5) # adjacency data, raw (before processing)
| countyname | fipscounty | neighborname | fipsneighbor | |
|---|---|---|---|---|
| 0 | Autauga County, AL | 1001 | Autauga County, AL | 1001 |
| 1 | Autauga County, AL | 1001 | Chilton County, AL | 1021 |
| 2 | Autauga County, AL | 1001 | Dallas County, AL | 1047 |
| 3 | Autauga County, AL | 1001 | Elmore County, AL | 1051 |
| 4 | Autauga County, AL | 1001 | Lowndes County, AL | 1085 |
This data presents each inter-county connection as a separate row, which we can convert into a node-node relation. Hence, the county name and state designation from before becomes significant. The data also includes self-connections, which we exclude from the analysis.
For this analysis, the weight of each county connection is given by the Euclidean distance between the counties' feature vectors. This uses the standardized (z-scored) columns computed before to prevent a single feature from dominating the metric. A list of each county connection and its Euclidean weight is constructed below.
county=[] # "from"
neighbor=[] # "to"
weight=[] # euclidean weight from "county" to "neighbor"
for i in range(len(adr)): # loop over each row
cty = adr.iloc[i, 0] # county to check
nbr = adr.iloc[i, 2] # neighbor to check
if cty != nbr: # ignore self-intersection
if (cty in X.index) and (nbr in X.index): # both counties are in before data set
county.append(cty)
neighbor.append(nbr)
            A = X.loc[cty].to_numpy() # row of county data
            B = X.loc[nbr].to_numpy() # row of neighbor data
            weight.append(np.linalg.norm(A - B)) # Euclidean distance between feature vectors
adjacency = pd.DataFrame({"from" : county, "to" : neighbor, "weight": weight})
adjacency.head(5)
| from | to | weight | |
|---|---|---|---|
| 0 | Autauga County, AL | Chilton County, AL | 2.906492 |
| 1 | Autauga County, AL | Dallas County, AL | 6.358240 |
| 2 | Autauga County, AL | Elmore County, AL | 0.964755 |
| 3 | Autauga County, AL | Lowndes County, AL | 6.637880 |
| 4 | Autauga County, AL | Montgomery County, AL | 5.984465 |
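The row-by-row loop above can also be vectorized: look up both feature rows for every pair at once and take the row-wise norm. A sketch with toy stand-ins for `X` and the adjacency pairs (the names and values here are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 3 "counties" with 2 standardized features each
X_demo = pd.DataFrame([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
                      index=["A", "B", "C"], columns=["f1", "f2"])
pairs = pd.DataFrame({"from": ["A", "A"], "to": ["B", "C"]})

a = X_demo.loc[pairs["from"]].to_numpy()  # feature rows of each "from" county
b = X_demo.loc[pairs["to"]].to_numpy()    # feature rows of each "to" county
weights = np.linalg.norm(a - b, axis=1)   # all pairwise Euclidean weights at once
print(weights.round(3))                   # [2.828 5.657]
```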
This creates the county network shown below. The distribution of weights for this network, as well as several characteristic statistics, is summarized as well.
adj = nx.from_pandas_edgelist(adjacency, "from", "to", edge_attr="weight") # create weighted network
sns.displot(data=adjacency, x="weight", log_scale=True);
print("Nodes (n): ", adj.number_of_nodes())
print("Edges (m): ", adj.number_of_edges())
degrees = pd.Series( dict(adj.degree) ) # compute degree of each county
print("Average degree (k): ", degrees.mean())
# connected subgraphs of US (AK, HI, continental, Puerto Rico, Guam)
subs = [adj.subgraph(c).copy() for c in nx.connected_components(adj)]
sub_diam = {i: nx.diameter(g) for i, g in enumerate(subs)} # diameter of each component
print("Subgraph diameters (d): ")
print(sub_diam)
Nodes (n): 3134
Edges (m): 9274
Average degree (k): 5.918315252074027
Subgraph diameters (d):
{0: 68, 1: 5, 2: 3, 3: 1, 4: 1}
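One caveat when building the graph: `nx.from_pandas_edgelist` attaches the weight column only if it is passed via `edge_attr`; otherwise distance-based routines (such as the ego graph used later) silently fall back to hop counts. A minimal sketch:

```python
import pandas as pd
import networkx as nx

edges = pd.DataFrame({"from": ["A", "B"], "to": ["B", "C"], "weight": [1.5, 2.5]})
g = nx.from_pandas_edgelist(edges, "from", "to", edge_attr="weight")
print(g["A"]["B"]["weight"])  # 1.5
```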
A clustering can be computed based solely on the statistical features from before, with no relation to the actual connectedness of each county. This will determine which counties are similar across the US regardless of geographic location. Two different clustering methods are employed: k-means and agglomerative clustering.
A clustering can be created with $n$ clusters using the pre-processed data from before. Each cluster in this algorithm is created to reduce the overall inertia of the dataset, but the inertia necessarily decreases with increasing $n$ as the distance within clusters shrinks, making it a poor measure of quality on its own. Rather than inertia, silhouette values are used to measure cluster performance: the silhouette score for each cluster, as well as the average silhouette score, is computed, which gives a more reliable method to determine the ideal number of clusters.
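To illustrate why the silhouette is preferred over inertia (a toy example on synthetic blobs, not the county data): inertia always falls as $k$ grows, while the silhouette peaks at the true number of groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two well-separated blobs, so k=2 should score highest
pts, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts)
    sil = silhouette_score(pts, km.labels_)
    print(k, f"inertia={km.inertia_:.1f}", f"silhouette={sil:.3f}")
```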
This process is performed below. The number of initializations is set to 10 so that a particular random state does not bias the cluster assignments.
ks = [] # number of clusters
avg_sil = [] # average silhouette of each cluster
for k in range(2, 5):
km = KMeans(n_clusters=k, n_init=10, verbose=0, random_state=2002);
km.fit(X);
y_hat = km.labels_ # cluster assignments
ks.append(k)
result = X.copy()
result["cluster"] = y_hat # add cluster label
result["sil"] = silhouette_samples(X, y_hat) # compute silhouette of cluster
print("Clusters: " , str(k))
print(f"inertia: {km.inertia_:.5g}")
print("silhouette medians by cluster:")
print( result.groupby("cluster")["sil"].median() )
avg = np.average(result.groupby("cluster")["sil"].median())
avg_sil.append(avg)
print("average silhouette: " + str(avg))
sns.catplot(data=result,
x="cluster", y="sil",
kind="violin", height=5
)
plt.show()
Clusters: 2 inertia: 1.0185e+05 silhouette medians by cluster: cluster 0 0.782237 1 0.078960 Name: sil, dtype: float64 average silhouette: 0.43059868654818323
Clusters: 3 inertia: 88177 silhouette medians by cluster: cluster 0 0.556099 1 0.097483 2 0.060976 Name: sil, dtype: float64 average silhouette: 0.23818595882335125
Clusters: 4 inertia: 78931 silhouette medians by cluster: cluster 0 0.099352 1 -0.047595 2 0.073739 3 0.394433 Name: sil, dtype: float64 average silhouette: 0.12998212092376385
The above findings can be summarized in the table below.
sil_results = pd.DataFrame({"k" : ks, "Avg. Silhouette" : avg_sil})
sil_results = sil_results.set_index("k")
fsil_0 = sil_results.iloc[0, 0] # save to final results
fsil_0b = sil_results.iloc[1, 0] # save to final results
sil_results
| Avg. Silhouette | |
|---|---|
| k | |
| 2 | 0.430599 |
| 3 | 0.238186 |
| 4 | 0.129982 |
As observed, the optimal number of clusters in this analysis is just two ($k=2$). The first cluster (0) has a high silhouette score that gradually decreases as more clusters are added. This suggests there is a group of counties closely related in these economic and demographic statistics, and increasing the number of clusters only divides them further. Likely this is a select few counties, such as those containing large cities, which explains the low deviation in values.
Increasing from $k=2$ to $k=3$, the second cluster (1) is divided into roughly equal parts with a minimal, though still positive, change in its silhouette score. Increasing the number of clusters further causes a much lower average silhouette score, so the small gain for cluster 1 and the newly formed cluster 2 is not worth it. Therefore, the counties of the US are best divided into two groups.
km = KMeans(n_clusters=2, n_init=10, verbose=0, random_state=2002); # best clustering
km.fit(X);
y_hat = km.labels_ # cluster assignments
X2 = X.copy()
X2["cluster"] = y_hat # add cluster assignments
c_vals = pd.Series(y_hat)
print("Number of counties in each cluster: ")
c_vals.value_counts()
Number of counties in each cluster:
0 3084 1 55 dtype: int64
These results can be visualized geographically; the cluster labels are exported by the script below and mapped externally (here using Excel), producing the plot that follows.
county_labels = pd.DataFrame(y_hat, index=counties.index)
county_labels.head(5)
| 0 | |
|---|---|
| County | |
| Abbeville County, SC | 0 |
| Acadia Parish, LA | 0 |
| Accomack County, VA | 0 |
| Ada County, ID | 0 |
| Adair County, IA | 0 |
#county_labels.to_csv("county_labels.csv") # save data file
plt.figure(figsize=(15,8))
plt.axis('off')
image = mpimg.imread("k_means_geo_k=2.png")
plt.imshow(image)
plt.show()
As observed, the original assumption appears to be the opposite of what is found. Rather than a few big cities dominating their own cluster, it is the rest of the United States that has a roughly uniform statistical distribution. The small number of city counties (cluster 1) have a larger range of values and contribute to the lower silhouette score in each case.
The analysis performed before used the k-means clustering algorithm to determine which number of clusters had the highest average silhouette score. However, the distances between clusters and the method of measuring cluster separation were not optimized. In this section, agglomerative clustering is used to determine the optimal number of clusters across a variety of linkage criteria.
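The linkage criterion controls which inter-cluster distance is minimized at each merge: the minimum pairwise distance for single, the maximum for complete, the mean for average, and the increase in variance for Ward. A brief sketch of the comparison on toy points (illustrative data, not the county set):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two clearly separated groups: a pair on the left, a triple on the right
pts = np.array([[0, 0], [0, 1], [5, 0], [5, 1], [5, 2]])
for link in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(pts)
    print(link, labels.tolist())
```

On well-separated groups like these all four linkages agree; differences emerge on data with chains or bridging points, which is why the grid search below varies the linkage.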
ks = [] # number of clusters
links = [] # linkage type
avg_sil = [] # average silhouette of each cluster
for k in range(2,5):
for l in ["single", "complete", "average", "ward"]:
# Create and fit cluster
agg = AgglomerativeClustering(n_clusters=k, linkage=l);
agg.fit(X);
z_hat = agg.labels_ # cluster assignments
results = X.copy()
results["cluster"] = z_hat # add cluster label
results["sil"] = silhouette_samples(X, z_hat) # compute silhouette of cluster
avg = np.average(results.groupby("cluster")["sil"].median()) # average cluster score
# Add data to list
links.append(l)
ks.append(k)
avg_sil.append(avg)
# Create dataframe, sort values
agg_results = pd.DataFrame({"k" : ks, "Linkage" : links, "Avg. Silhouette" : avg_sil})
agg_results.sort_values(by="Avg. Silhouette", ascending=False, inplace=True)
fsil_1 = agg_results.iloc[0, 2] # save to final results
fsil_1b = agg_results.iloc[4, 2] # save to final results
fsil_1c = agg_results.iloc[6, 2] # save to final results
agg_results
| k | Linkage | Avg. Silhouette | |
|---|---|---|---|
| 3 | 2 | ward | 0.496070 |
| 0 | 2 | single | 0.475134 |
| 1 | 2 | complete | 0.475134 |
| 2 | 2 | average | 0.475134 |
| 6 | 3 | average | 0.405241 |
| 5 | 3 | complete | 0.382642 |
| 10 | 4 | average | 0.365941 |
| 4 | 3 | single | 0.295068 |
| 9 | 4 | complete | 0.285835 |
| 8 | 4 | single | 0.221380 |
| 7 | 3 | ward | 0.199837 |
| 11 | 4 | ward | 0.173677 |
As observed, the pattern found with k-means clustering also appears in agglomerative clustering: increasing $k$ generally lowers the average silhouette score. The best clustering found used $k=2$ with Ward linkage, reaching an average silhouette score of almost 0.5, which means this agglomerative method performed roughly 15% better than k-means based on the average silhouette score across clusters.
A silhouette plot is created for this optimal clustering.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward");
agg.fit(X);
z_hat = agg.labels_ # cluster assignments
results = X.copy()
results["cluster"] = z_hat # add cluster label
results["sil"] = silhouette_samples(X, z_hat) # compute silhouette of cluster
print("Silhouette medians by cluster:")
print( pd.Series(results.groupby("cluster")["sil"].median()).to_dict() )
sns.catplot(data=results, x="cluster", y="sil", kind="violin", height=5)
plt.show()
c_vals = pd.Series(z_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
Silhouette medians by cluster:
{0: 0.8564568180909888, 1: 0.13568339782639516}
Number of counties in each cluster:
{0: 3119, 1: 20}
Between the k-means and agglomerative clusterings shown above, the median silhouette value of each cluster increased, supporting the claim that this is the optimal clustering. Every $k=2$ case of the agglomerative approach increased the average silhouette value across clusters compared to k-means clustering, suggesting the counties of the United States are better clustered by gradually merging like regions than by a nearest-centroid approach.
Similar to the k-means approach, the majority of samples fall into cluster 0 with a high silhouette score and low deviation, suggesting that, as expected, most of the US is similar in economic and demographic data. The outliers of the data are placed in the "other" cluster, cluster 1, whose members deviate greatly both from cluster 0 and from one another. Like before, the results can be plotted to show the geographic interpretation of this result.
county_labels = pd.DataFrame(z_hat, index=counties.index)
county_labels.head(5)
| 0 | |
|---|---|
| County | |
| Abbeville County, SC | 0 |
| Acadia Parish, LA | 0 |
| Accomack County, VA | 0 |
| Ada County, ID | 0 |
| Adair County, IA | 0 |
#county_labels.to_csv("county_labels_2.csv") # save data file
plt.figure(figsize=(15,8))
plt.axis('off')
image = mpimg.imread("agg_ward_geo_k=2.png")
plt.imshow(image)
plt.show()
With this clustering, there is a greater disparity between county assignments, as cluster 1 has fewer members than in the k-means clustering. The counties in cluster 1 appear to be a subset of the k-means cluster 1, suggesting those counties were on the cusp of grouping.
The clusterings above produce one very small cluster in order to reduce the overall variance in the data. Using expectation-maximization (EM) mixture models, however, we can discourage such groupings: iteratively fitting a Gaussian mixture model tends to avoid clusters with very few members.
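The key difference from the hard assignments above is that a Gaussian mixture yields soft responsibilities via `predict_proba`, with each sample's probabilities summing to one. A sketch on synthetic data (not the county set):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

pts, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.6, random_state=0)
gm = GaussianMixture(n_components=2, covariance_type="tied",
                     init_params="kmeans", random_state=0).fit(pts)
proba = gm.predict_proba(pts)  # soft assignments, one row per sample
print(proba.shape, np.allclose(proba.sum(axis=1), 1.0))  # (300, 2) True
```

Hard labels from `predict` are simply the argmax of these responsibilities, which is what the silhouette computations below operate on.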
We can perform a grid search to optimize the hyperparameters of such a model and create a more even clustering distribution. For this, the number of clusters and the covariance type are varied. The initial weights are determined using the k-means algorithm to give an accurate baseline for comparison.
ks = [] # number of clusters
covs = [] # type of covariance to use
avg_sil = [] # average silhouette of each cluster
for k in range(2,5):
for cov in ['full', 'tied', 'diag', 'spherical']:
gm = GaussianMixture(n_components=k, covariance_type=cov, init_params='kmeans', random_state=2002)
gm.fit(X)
w_hat = gm.predict(X) # probabilistic prediction
results = X.copy()
results["cluster"] = w_hat # add cluster label
results["sil"] = silhouette_samples(X, w_hat) # compute silhouette of cluster
avg = np.average(results.groupby("cluster")["sil"].median()) # average cluster score
# Add data to list
covs.append(cov)
ks.append(k)
avg_sil.append(avg)
# Create dataframe, sort values
em_results = pd.DataFrame({"k" : ks, "Covariance" : covs, "Avg. Silhouette" : avg_sil})
em_results.sort_values(by="Avg. Silhouette", ascending=False, inplace=True)
fsil_2 = em_results.iloc[0, 2] # save to final results
fsil_2b = em_results.iloc[1, 2] # save to final results
em_results
| k | Covariance | Avg. Silhouette | |
|---|---|---|---|
| 1 | 2 | tied | 0.407277 |
| 5 | 3 | tied | 0.254437 |
| 9 | 4 | tied | 0.165664 |
| 3 | 2 | spherical | 0.159767 |
| 4 | 3 | full | 0.152434 |
| 0 | 2 | full | 0.145999 |
| 2 | 2 | diag | 0.133524 |
| 8 | 4 | full | 0.072426 |
| 7 | 3 | spherical | 0.018805 |
| 11 | 4 | spherical | 0.012627 |
| 10 | 4 | diag | 0.002215 |
| 6 | 3 | diag | -0.030400 |
This clustering method is worse overall at maintaining a high average silhouette score, but for $k=2$ clusters with the tied covariance type, the results are comparable to the k-means algorithm. A silhouette plot for the best performing expectation-maximization clustering is shown below.
gm = GaussianMixture(n_components=2, covariance_type='tied', init_params='kmeans', random_state=2002)
gm.fit(X);
w_hat = gm.predict(X) # probabilistic prediction
results = X.copy()
results["cluster"] = w_hat # add cluster label
results["sil"] = silhouette_samples(X, w_hat) # compute silhouette of cluster
print("Silhouette medians by cluster:")
print( pd.Series(results.groupby("cluster")["sil"].median()).to_dict() )
sns.catplot(data=results, x="cluster", y="sil", kind="violin", height=5)
plt.show()
c_vals = pd.Series(w_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
Silhouette medians by cluster:
{0: 0.7580883927930174, 1: 0.05646592473471834}
Number of counties in each cluster:
{0: 3073, 1: 66}
As mentioned before, this clustering method discourages small cluster sizes, which has the consequence of a lower average silhouette score. This trade-off yields more interpretable clusters at the cost of worse performance, an outcome that provides evidence for the non-binary view of the United States in that the algorithm does a poor job of dividing up the counties and instead treats them as one combined cluster.
The EM method discourages smaller cluster sizes, so we can plot a geographical view of this clustering like before but now with more samples. Shown below, the best performing $k=2$ and $k=3$ clusterings are mapped using the EM method. This algorithm is also probabilistic and can thus be used to determine the likelihood of each county receiving a given cluster label. Doing so, a wider range of values can be used to assess how binary the county clustering is for both maps.
# k=2, cov = tied
gm = GaussianMixture(n_components=2, covariance_type='tied', init_params='kmeans', random_state=2002)
gm.fit(X);
w_hat = gm.predict_proba(X) # probabilistic prediction
gmp = pd.DataFrame(w_hat, index=X.index)[0]
gmp = gmp.sort_values(ascending=False) # probability of being in cluster 1
county_labels = pd.DataFrame(w_hat, index=counties.index)
#county_labels.to_csv("county_labels_3.csv") # save data file
The plot shown below colors the probability of being in cluster 0 for each county based on the probabilistic interpretation found from the EM method.
plt.figure(figsize=(15,8))
plt.axis('off')
image = mpimg.imread("em_geo_k=2.png")
plt.imshow(image)
plt.show()
w_hat = gm.predict(X)
c_vals = pd.Series(w_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
print("Average cluster silhouette: " + str(em_results.iloc[0]["Avg. Silhouette"]) )
Number of counties in each cluster:
{0: 3073, 1: 66}
Average cluster silhouette: 0.4072771587638679
Cluster 1 now contains a larger number of counties, extending the trend from the agglomerative clustering of capturing counties on the edge of being labelled cluster 0. Similar regions are assigned to one cluster or the other, consistent with the previous results.
For the case of $k=3$, coloring by probability cannot be shown as clearly, so each cluster is instead given a separate color.
# k=3, cov = tied
gm = GaussianMixture(n_components=3, covariance_type='tied', init_params='kmeans', random_state=2002)
gm.fit(X);
w_hat = gm.predict_proba(X) # probabilistic prediction
gmp = pd.DataFrame(w_hat, index=X.index) # probability of being in a given cluster
gmp.head(5)
| 0 | 1 | 2 | |
|---|---|---|---|
| County | |||
| Abbeville County, SC | 1.0 | 5.604427e-225 | 1.607282e-21 |
| Acadia Parish, LA | 1.0 | 1.501533e-229 | 9.592040e-21 |
| Accomack County, VA | 1.0 | 6.369716e-229 | 1.294897e-22 |
| Ada County, ID | 1.0 | 4.020907e-200 | 2.160049e-16 |
| Adair County, IA | 1.0 | 3.814114e-226 | 1.327451e-20 |
w_hat = gm.predict(X) # rounded prediction, ignore probability
gmp = pd.DataFrame(w_hat, index=X.index)
county_labels = pd.DataFrame(w_hat, index=counties.index)
#county_labels.to_csv("county_labels_5.csv") # save data file
plt.figure(figsize=(15,8))
plt.axis('off')
image = mpimg.imread("em_geo_k=3.png")
plt.imshow(image)
plt.show()
c_vals = pd.Series(w_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
print("Average cluster silhouette: " + str(em_results.iloc[1]["Avg. Silhouette"]) )
Number of counties in each cluster:
{0: 3019, 2: 107, 1: 13}
Average cluster silhouette: 0.2544366630404996
The mapped results of the $k=2$ and $k=3$ EM methods are similar. While more counties are assigned to the smaller clusters, the silhouette score drops with this method, and the labels do not conform to those found using either k-means or agglomerative clustering. These results further subdivide the cluster 1 set from before. Therefore, we can conclude agglomerative clustering is still the best method for this data, as these newer methods do little to improve the clustering.
Rather than clustering blindly on the known economic and demographic data alone, we can instead require that counties be grouped only if they are geographically adjacent. This creates distinct regions in the US joined by county adjacencies, with the feature data providing edge weights between county nodes. An example ego graph for Harford County, Maryland is shown below, where the separation between counties is given by the Euclidean distance between their feature vectors.
ego_harco = nx.ego_graph(adj, "Harford County, MD", distance='weight', radius=1)
nx.draw(ego_harco, with_labels=True, node_size=500, node_color="orange")
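To make the graph construction concrete, the sketch below builds a tiny weighted adjacency graph the same way (a minimal sketch with made-up county names and feature vectors; the real analysis uses the NBER adjacency list [4] and the census features in `X`):

```python
import networkx as nx
import numpy as np

# Hypothetical mini-example: three "counties" with 2-d feature vectors.
features = {
    "County A": np.array([0.0, 0.0]),
    "County B": np.array([3.0, 4.0]),
    "County C": np.array([0.0, 1.0]),
}
adjacent = [("County A", "County B"), ("County A", "County C")]  # adjacency pairs

g = nx.Graph()
for u, v in adjacent:
    # Edge weight = Euclidean distance between the two feature vectors
    g.add_edge(u, v, weight=float(np.linalg.norm(features[u] - features[v])))

print(g["County A"]["County B"]["weight"])  # 5.0
```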
Determining where to cut the graph into separate subgraphs is an NP-hard problem and thus requires more advanced algorithms; spectral clustering will be used here. Hyperparameter optimization is employed but significantly simplified to save computational time, giving an accurate yet computationally friendly estimate of the optimal subgraph splitting.
simplefilter(action='ignore', category=UserWarning) # supress warnings
ns = [] # clusters
ks = [] # nearest neighbors
ls = [] # label assignment method
sils = [] # average silhouette values
for n in range(2,5): # clusters
for k in range(2, 10): # n-neighbors
for l in ["kmeans", "discretize"]:
# Define and fit clustering
spectral_clustering = SpectralClustering(
n_clusters=n,
random_state = 2002,
n_init = 6,
affinity="nearest_neighbors",
n_neighbors = k,
assign_labels=l,
n_jobs=-1
) # hyperparameters used
spc = spectral_clustering.fit(X) # fit clustering
s_hat = spc.labels_ # get labels
results = X.copy()
results["cluster"] = s_hat # add cluster label
results["sil"] = silhouette_samples(X, s_hat) # compute silhouette of cluster
avg = np.average(results.groupby("cluster")["sil"].median()) # average cluster score
# Add to data
ns.append(n)
ks.append(k)
ls.append(l)
sils.append(avg)
# Create dataframe
spc_vals = pd.DataFrame({'Clusters' : ns, 'Neighbors' : ks, 'Labels' : ls, 'Avg Silhouette' : sils})
spc_vals = spc_vals.sort_values(by='Avg Silhouette', ascending=False)
fsil_3 = spc_vals.iloc[0]["Avg Silhouette"] # save spectral k=2 score to final results
fsil_3b = spc_vals.iloc[5]["Avg Silhouette"] # save spectral k=3 score to final results
spc_vals.head(10)
| | Clusters | Neighbors | Labels | Avg Silhouette |
|---|---|---|---|---|
| 4 | 2 | 4 | kmeans | 0.529240 |
| 5 | 2 | 4 | discretize | 0.529240 |
| 10 | 2 | 7 | kmeans | 0.458866 |
| 6 | 2 | 5 | kmeans | 0.458866 |
| 8 | 2 | 6 | kmeans | 0.458866 |
| 20 | 3 | 4 | kmeans | 0.397198 |
| 3 | 2 | 3 | discretize | 0.379289 |
| 19 | 3 | 3 | discretize | 0.340084 |
| 29 | 3 | 8 | discretize | 0.312214 |
| 36 | 4 | 4 | kmeans | 0.309854 |
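To clarify what the spectral method is doing under the hood, the sketch below reproduces its usual pipeline on toy data (an assumed simplification of scikit-learn's implementation, not the exact code it runs): build a k-nearest-neighbor affinity graph, embed the nodes with the smallest eigenvectors of the normalized graph Laplacian, and run k-means in that embedding.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

# Toy data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])

A = kneighbors_graph(pts, n_neighbors=4, include_self=False).toarray()
A = np.maximum(A, A.T)                             # symmetrize the affinity matrix
d = A.sum(axis=1)                                  # node degrees
L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))   # normalized graph Laplacian
vals, vecs = np.linalg.eigh(L)                     # eigenvalues in ascending order
embedding = vecs[:, :2]                            # two smallest eigenvectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
```

Because the two blobs form disconnected components of the k-NN graph, the embedding collapses each blob to a single point and k-means separates them perfectly.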
The silhouette scores for this clustering can then be found below.
spectral_clustering = SpectralClustering(
n_clusters=2,
random_state = 2002,
n_init = 10,
affinity="nearest_neighbors",
n_neighbors = 4,
assign_labels="kmeans",
n_jobs=-1
) # define method used
spc = spectral_clustering.fit(X) # fit clustering
s_hat = spc.labels_ # get labels
results = X.copy()
results["cluster"] = s_hat # add cluster label
results["sil"] = silhouette_samples(X, s_hat) # compute silhouette of cluster
print("Silhouette medians by cluster:")
sils = pd.Series(results.groupby("cluster")["sil"].median())
print( sils.to_dict() )
print("Average silhouette:")
print(np.average(sils))
sns.catplot(data=results, x="cluster", y="sil", kind="violin", height=5)
plt.show()
c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
Silhouette medians by cluster:
{0: 0.7973155485697738, 1: 0.26116522225869837}
Average silhouette:
0.5292403854142361
Number of counties in each cluster:
{0: 3134, 1: 5}
As observed, this clustering method performs better than previous methods by keeping adjacent counties connected. A geographic plot can be created for these connected counties.
county_labels = pd.DataFrame(s_hat, index=counties.index)
#county_labels.to_csv("county_labels_6.csv") # save data file
plt.figure(figsize=(15,8))
plt.axis('off')
image = mpimg.imread("spc_geo_k=2.png")
plt.imshow(image)
plt.show()
c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
print("Average cluster silhouette: " + str(spc_vals.iloc[0]["Avg Silhouette"]) )
Number of counties in each cluster:
{0: 3134, 1: 5}
Average cluster silhouette: 0.5292403854142361
These results support two noteworthy conclusions. First, the continental United States is best defined as a single cluster, as tested through a variety of clustering methods. Second, spectral clustering suggests Alaska is more closely related to the continental United States than Hawaii, which intuitively makes sense given the demographic and economic data. To recreate the earlier results, $k=4$ is needed: one cluster for Alaska, one for Hawaii, and the remaining two dividing the continental United States. The hyperparameters used to create this graph come from the earlier grid optimization, using the best performing hyperparameters for $k=4$.
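Reading the best hyperparameters for a given cluster count back out of the grid results is a simple filter-and-sort; the sketch below uses a toy frame mimicking the structure of `spc_vals` (the values are illustrative, not the real grid output):

```python
import pandas as pd

# Toy stand-in for the grid search results frame (illustrative values only).
spc_vals = pd.DataFrame({
    "Clusters":       [2, 2, 3, 4, 4],
    "Neighbors":      [4, 7, 4, 4, 9],
    "Labels":         ["kmeans", "kmeans", "discretize", "kmeans", "discretize"],
    "Avg Silhouette": [0.529, 0.459, 0.379, 0.310, 0.250],
})

# Keep only the k=4 rows, then take the highest-scoring configuration.
best_k4 = (spc_vals[spc_vals["Clusters"] == 4]
           .sort_values("Avg Silhouette", ascending=False)
           .iloc[0])
print(best_k4["Neighbors"], best_k4["Labels"])  # 4 kmeans
```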
spectral_clustering = SpectralClustering(
n_clusters=4,
random_state = 2002,
n_init = 10,
affinity="nearest_neighbors",
n_neighbors = 50,
assign_labels="kmeans",
n_jobs=-1
) # define method used
spc = spectral_clustering.fit(X) # fit clustering
s_hat = spc.labels_ # get labels
results = X.copy()
results["cluster"] = s_hat # add cluster label
results["sil"] = silhouette_samples(X, s_hat) # compute silhouette of cluster
print("Silhouette medians by cluster:")
sils = pd.Series(results.groupby("cluster")["sil"].median())
print( sils.to_dict() )
print("Average silhouette:")
print(np.average(sils))
sns.catplot(data=results, x="cluster", y="sil", kind="violin", height=5)
plt.show()
c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
Silhouette medians by cluster:
{0: 0.26635253096651945, 1: -0.03388292282747894, 2: 0.29310678353792513, 3: 0.2238888604862083}
Average silhouette:
0.18736631304079346
Number of counties in each cluster:
{3: 2351, 2: 386, 1: 218, 0: 184}
county_labels = pd.DataFrame(s_hat, index=X.index)
#county_labels.to_csv("county_labels_7.csv") # save data file
plt.figure(figsize=(15,8))
plt.axis('off')
image = mpimg.imread("spc_geo_k=4.png")
plt.imshow(image)
plt.show()
c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
print("Average cluster silhouette: " + str(spc_vals.iloc[9]["Avg Silhouette"]) )
fsil_3c = spc_vals.iloc[9]["Avg Silhouette"]
Number of counties in each cluster:
{3: 2351, 2: 386, 1: 218, 0: 184}
Average cluster silhouette: 0.30985386834088685
The principles discussed above for clustering regions of the United States can be extended to create several connected regions. Similar optimal hyperparameters are used as before, but with a much larger number of nearest neighbors, set to 1/4 of the total number of counties, to encourage four connected regions of like labels.
spectral_clustering = SpectralClustering(
n_clusters=4,
random_state = 2002,
n_init = 10,
affinity="nearest_neighbors",
n_neighbors = int(len(X)/4),
assign_labels="kmeans",
n_jobs=-1
) # define method used
spc = spectral_clustering.fit(X) # fit clustering
s_hat = spc.labels_ # get labels
results = X.copy()
results["cluster"] = s_hat # add cluster label
results["sil"] = silhouette_samples(X, s_hat) # compute silhouette of cluster
print("Silhouette medians by cluster:")
sils = pd.Series(results.groupby("cluster")["sil"].median())
print( sils.to_dict() )
print("Average silhouette:")
print(np.average(sils))
fsil_4 = np.average(sils) # save to final results
sns.catplot(data=results, x="cluster", y="sil", kind="violin", height=5)
plt.show()
c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
Silhouette medians by cluster:
{0: -0.16217613847803253, 1: 0.14049076973456662, 2: 0.3304741413054011, 3: -0.08098758967566834}
Average silhouette:
0.05695029572156671
Number of counties in each cluster:
{2: 1223, 3: 659, 0: 646, 1: 611}
These regions can be mapped as shown before.
county_labels = pd.DataFrame(s_hat, index=X.index)
#county_labels.to_csv("county_labels_8.csv") # save data file
plt.figure(figsize=(15,8))
plt.axis('off')
image = mpimg.imread("spc_reg_geo_k=4.png")
plt.imshow(image)
plt.show()
c_vals = pd.Series(s_hat)
print("Number of counties in each cluster: ")
print(c_vals.value_counts().to_dict())
print("Average cluster silhouette: " + str(np.average(sils)) )
Number of counties in each cluster:
{2: 1223, 3: 659, 0: 646, 1: 611}
Average cluster silhouette: 0.05695029572156671
As expected, this clustering performs very poorly, confirming that the optimal clustering was found earlier.
The average silhouette score of all clusters found in each clustering analysis is shown below. The results are ordered by highest silhouette score, and thus best performing clustering, with some of the relevant hyperparameters or distinguishing features of each analysis noted.
results_data = {"Method" : ["k-Means", "k-Means", "Agglomerative", "Agglomerative", "Agglomerative", "Expectation-Maximization", "Expectation-Maximization", "Spectral", "Spectral", "Spectral", "Spectral (Regional)"],
"Clusters" : [2,3,2,3,4,2,3,2,3,4,4],
"Linkage" : ["-", "-", "Ward", "Average", "Average", "-", "-", "-", "-", "-", "-"],
"Covariance" : ["-", "-", "-", "-", "-", "Tied", "Tied", "-", "-", "-", "-"],
"Labelling" : ["-", "-", "-", "-", "-", "-", "-", "k-Means", "k-Means", "k-Means", "k-Means"],
"Neighbors" : ["-", "-", "-", "-", "-", "-", "-", 4, 4, 4, 784],
"Avg. Silhouette" : [fsil_0, fsil_0b, fsil_1, fsil_1b, fsil_1c, fsil_2, fsil_2b, fsil_3, fsil_3b, fsil_3c, fsil_4]
}
results = pd.DataFrame(data=results_data)
results = results.sort_values(by="Avg. Silhouette", ascending=False) # sort values
results = results.reset_index() # reset indices
results.drop(columns=["index"], inplace=True)
results
| | Method | Clusters | Linkage | Covariance | Labelling | Neighbors | Avg. Silhouette |
|---|---|---|---|---|---|---|---|
| 0 | Spectral | 2 | - | - | k-Means | 4 | 0.529240 |
| 1 | Agglomerative | 2 | Ward | - | - | - | 0.496070 |
| 2 | k-Means | 2 | - | - | - | - | 0.430599 |
| 3 | Expectation-Maximization | 2 | - | Tied | - | - | 0.407277 |
| 4 | Agglomerative | 3 | Average | - | - | - | 0.405241 |
| 5 | Spectral | 3 | - | - | k-Means | 4 | 0.397198 |
| 6 | Agglomerative | 4 | Average | - | - | - | 0.365941 |
| 7 | Spectral | 4 | - | - | k-Means | 4 | 0.309854 |
| 8 | Expectation-Maximization | 3 | - | Tied | - | - | 0.254437 |
| 9 | k-Means | 3 | - | - | - | - | 0.238186 |
| 10 | Spectral (Regional) | 4 | - | - | k-Means | 784 | 0.056950 |
The best performing clustering was found using spectral clustering with the minimum number of clusters, $k=2$. The results from the previous section are shown again below.
plt.figure(figsize=(15,8))
plt.axis('off')
image = mpimg.imread("spc_geo_k=2.png")
plt.imshow(image)
plt.show()
print("Average cluster silhouette: " + str(spc_vals.iloc[0]["Avg Silhouette"]) )
Average cluster silhouette: 0.5292403854142361
As observed in the above plot, the best performing clustering treats the continental United States as a single cluster. While these results suggest Hawaii is the most isolated from the rest of the US, this likely occurs because the algorithm must choose some subgroup of counties to form the second cluster; Hawaii's five counties are all likely similar to one another in terms of demographics and economic statistics, so they form a cluster with a relatively high silhouette score. Note that Delaware has the fewest counties of any state, but because Delaware's counties share features with those of neighboring states, forming Delaware as a separate cluster would yield a lower silhouette score than Hawaii's.
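A toy example illustrates why a tiny, internally homogeneous group can earn a high silhouette score (made-up data, not census values): 50 scattered "mainland" points against 5 tightly packed, distant "island" points.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(1)
mainland = rng.normal(0.0, 1.0, (50, 2))   # broad, loose cluster
island = rng.normal(10.0, 0.1, (5, 2))     # small, tight, far-away cluster
X_toy = np.vstack([mainland, island])
labels = np.array([0] * 50 + [1] * 5)

sil = silhouette_samples(X_toy, labels)
print(round(sil[50:].mean(), 3))  # island points score near 1
```

Because the island points are much closer to each other than to any mainland point, their silhouette values approach 1 despite the cluster containing only five samples.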
This gives strong evidence that the United States is not divided between rural and urban regions and that, in general, a binary view of the US is invalid. The best performing clusterings are those which consider the US to have the fewest clusters and/or the fewest counties in the secondary cluster. Any division of the US into two distinct groups, such as the $k=2$ k-means clustering, did so by isolating densely populated regions such as the suspected big cities (New York City, Baltimore, Miami, etc.). However, as shown in the results above, ignoring the big cities and instead clustering the continental US as one large cluster increased the average silhouette score by 6.69%.
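As a quick arithmetic check, assuming the 6.69% figure compares the spectral $k=2$ score against the best blind clustering (agglomerative, $k=2$) in the summary table:

```python
# Scores taken from the summary table above.
spectral, agglomerative = 0.529240, 0.496070
gain = 100 * (spectral - agglomerative) / agglomerative  # percent improvement
print(round(gain, 2))  # 6.69
```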
These results also show that a network analysis of the United States, such as spectral clustering, performs better than a truly blind clustering such as the k-means, agglomerative, or expectation-maximization methods. Regions in the US joined by county connections, as approximated by this data, share similar economic and demographic characteristics but may not cluster best on a local scale. Taking these geographic connections into account, the clustering of the US as a whole is much stronger, suggesting that while it is not accurate to divide the US into separate categories such as rural, suburban, or urban, grouping by closely connected counties may be a better approximation.
In summary, while political data may suggest the US is divided into rural and urban counties [1], this analysis indicates that, economically and demographically, the US is best viewed as a whole, undivided country. More research is needed to determine the best divisions of the US beyond the continental/island split shown here. Future work could redraw state lines based on this data to group strong clusters and thus create subnetworks with a maximum average silhouette score. This is a challenging problem, briefly touched upon in this report, but it could provide evidence for how strongly related certain regions of the country are, how many such regions exist, and so on.
[1] Love, Hanna, and Tracy Hadden Loh. “The ‘rural-Urban Divide’ Furthers Myths about Race and Poverty-Concealing Effective Policy Solutions.” Brookings, December 8, 2020. https://www.brookings.edu/blog/the-avenue/2020/12/08/the-rural-urban-divide-furthers-myths-about-race-and-poverty-concealing-effective-policy-solutions/.
[2] “Largest Countries in the World (by Area).” Worldometer. Accessed May 18, 2023. https://www.worldometers.info/geography/largest-countries-in-the-world/.
[3] Whitcomb, Ryan, Joung Min Choi, and Bo Guan. “County Demographics CSV File.” CORGIS Datasets Project, from Austin Cory Bart, Dennis Kafura, Clifford A. Shaffer, Javier Tibau, Luke Gusukuma, Eli Tilevich, November 5, 2022. https://corgis-edu.github.io/corgis/csv/county_demographics/.
[4] National Bureau of Economic Research. “County Adjacency.” NBER, from U.S. Census Bureau, May 8, 2017. https://www.nber.org/research/data/county-adjacency.