Lab 5-1 - Statistical Analysis of Urban Data

Author: Hamish Gibbs

Affiliation: Network Science Institute, Northeastern University

Course: NETS 7983 Computational Urban Science

Published: March 31, 2025

In today’s lab, we will build on our progress from Lab 2 (analysis of census data). In that lab, we used dimensionality reduction with Principal Component Analysis to identify the key axes differentiating areas in Boston based on a subset of census variables.

Today, we will re-analyze some of the Boston census data, aiming to quantify the degree of spatial autocorrelation (clustering) in the underlying census features.

We will then conduct a geodemographic analysis, with the aim of identifying spatially contiguous geographic areas with similar demographic features. Geodemographics has a long history of creating meaningful spatial groupings of individuals with similar demographic characteristics, and has been used to inform public policy, identify potential consumer segments, and characterize urban dynamics.

For our geodemographic clustering, we will compare the results of non-spatial clustering (K-means) with a spatially-aware method (Spatial ’K’luster Analysis by Tree Edge Removal - SKATER [1]).

Data preparation

library(leaflet)
library(tidycensus)
library(tidyverse)
library(tigris)
library(sf)
library(arrow)
library(spdep)
library(ggplot2)
library(factoextra)
library(knitr)
census_api_key(Sys.getenv("CENSUS_API_KEY"), install = TRUE, overwrite=TRUE)
[1] "a4eeffc6464d8c120634bf9e1cfd3b7de47373ff"

The census geospatial boundary API is down (as of Jan 4th 2025)! Luckily, we can use pre-processed census data for Boston, and spatial boundaries we have already downloaded and saved.

census <- read_parquet("/data/CUS/labs/5/14460_acs_2021.parquet")
cbg <- read_parquet("/data/CUS/labs/5/14460_2021_cbg_geo.parquet")

To do this, we just need to add spatial boundaries to our familiar census data. In 14460_2021_cbg_geo.parquet, geometries are stored in “Well-Known Text” (WKT) format. We can parse this with the sf package.

sfc_cbg <- st_as_sfc(cbg$geometry_wkt, crs = 4326) 
cbg$geometry <- sfc_cbg
cbg <- st_as_sf(cbg)

Next, we need to join our spatial boundaries onto our census data using the GEOID column. While we are at it, we can remove rows with missing median_income values, which occur for Census Block Groups (CBGs) with low or no population. We can also remove redundant population columns.

census <- census |>
    drop_na(median_income) |>
    left_join(cbg, by="GEOID") |>
    st_as_sf() |>
    select(-race.total, -employment.total, -geometry_wkt) |>
    select(-matches("^(industry|travel|trans)"))

Let’s quickly visualize our data to make sure everything looks OK:

census |> leaflet() |> addProviderTiles(provider=providers$CartoDB.Positron) |> addPolygons()

Quantifying spatial heterogeneity

In week 2, we looked at the correlation between our identified principal components and the original variables (captured by the PCA loadings). This showed a strong effect of income in defining differences between areas. We can use R to further explore the spatial clustering of the income variable from our original dataset.

In this section, we are especially interested in the degree of spatial autocorrelation in the Income variable: whether places with high or low incomes tend to be located next to places with similar incomes.

To understand this, we will use two techniques:

  • Moran’s I: captures the degree of spatial clustering in an index between -1 (perfect dispersion) and 1 (perfect clustering).

  • Local Moran’s I (LISA): captures the local spatial clustering of an area relative to its neighbors.
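
For reference, given spatial weights \(w_{ij}\) and values \(x_i\) with mean \(\bar{x}\) across \(n\) areas, the global statistic is:

\[ I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2} \]

Intuitively, \(I\) behaves like a correlation between each area’s value and the values of its neighbors.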

Exercise

  • To start out, modify the map above to show the distribution of the median_income variable.
    • Hint: See Lab 1-1: Crash Course on GIS with R for help defining a color scale for leaflet maps.

Computing Moran’s I

Now that we have seen the distribution of income around Boston, we can proceed to calculating the spatial clustering in our data.

The first step to compute Moran’s I (our first clustering measure) is to create an adjacency matrix describing which CBG polygons are next to one another:

nb <- poly2nb(census)

This function returns the neighbors of each CBG based on strict adjacency (whether neighboring polygons intersect each other). For example, see the neighbors of the first CBG:

print(nb[[1]])
[1] 2 3

However, we are getting a warning that some polygons don’t have any neighbors. Let’s inspect only these CBGs:

no_neighbors <- which(card(nb) == 0)
census[no_neighbors, ] |> leaflet() |> addProviderTiles(provider=providers$CartoDB.Positron) |> addPolygons()

These are just two polygons that are disconnected from the others. Because we are defining adjacency strictly by whether polygons intersect one another, we will have to remove these.

census_filtered <- census[!(seq_len(nrow(census)) %in% no_neighbors), ]

Now we can proceed to calculate our adjacency matrix.

nb <- poly2nb(census_filtered)

Note: it is easy to plot spdep neighbors lists with the Base R plot() function.

coords <- st_centroid(st_geometry(census_filtered))
plot(st_geometry(census_filtered), border="grey", main = "Intersection-based neighbors")
plot(nb, coords, add=TRUE)

Next, we need to convert the neighbors list into a set of spatial weights defining the strength of connection between areas.

lw <- nb2listw(nb, style = "W")
lw
Characteristics of weights list object:
Neighbour list object:
Number of regions: 3233 
Number of nonzero links: 18402 
Percentage nonzero weights: 0.1760571 
Average number of links: 5.691927 

Weights style: W 
Weights constants summary:
     n       nn   S0       S1       S2
W 3233 10452289 3233 1217.944 13326.31

Here, we are creating row-standardized weights (style = "W"). However, there are a number of different options. From the spdep package documentation: B is the basic binary coding, W is row standardised (sums over all links to n), C is globally standardised (sums over all links to n), U is equal to C divided by the number of neighbours (sums over all links to unity), while S is the variance-stabilizing coding scheme proposed by Tiefelsdorf et al. 1999, p. 167-168 (sums over all links to n).
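
For intuition, we can compare two coding schemes for the same CBG (a quick sketch reusing the nb object from above):

lw_B <- nb2listw(nb, style = "B")
lw_B$weights[[1]]   # binary coding: each neighbor weighted 1
lw$weights[[1]]     # row-standardized: this CBG's weights sum to 1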

Remember how we discussed the sensitivity of spatial statistics to definitions of distance and adjacency? There are a lot of choices that go into defining the spatial structure of your data!

Now that we have our spatial weights, we can compute Moran’s I for our areas:

moran_test <- moran.test(census_filtered$median_income, lw)
print(moran_test)

    Moran I test under randomisation

data:  census_filtered$median_income  
weights: lw    

Moran I statistic standard deviate = 46.058, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.4964156362     -0.0003094059      0.0001163104 

Remember, Moran’s I varies between -1 (perfect dispersion) and 1 (perfect clustering). Here, we have a Moran’s I statistic of \(0.49\) with a p-value of \(< 2.2e-16\). This indicates that our median income values exhibit statistically significant, relatively strong spatial clustering.

If we plot the values of median_income against spatially lagged values, we can see a positive correlation (indicating spatial clustering).

census_filtered$median_income_lag <- lag.listw(lw, census_filtered$median_income)

ggplot(census_filtered, aes(x = median_income, y = median_income_lag)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Moran's I Scatterplot",
       x = "Median Income",
       y = "Spatially Lagged Median Income") +
  theme_minimal()

Defining Spatial Structure

Let’s drill a bit more into the choices we have made regarding how we encode the spatial structure of our data. The basic process is to define which polygons are neighbors of one another, then build a matrix of spatial weights informed by the adjacency structure. Is our approach of defining adjacency based on exact spatial intersection appropriate? With so many parks and natural barriers in Boston, it could be more appropriate to choose a different way to represent adjacency.

We have a few choices, including Delaunay triangulation-based adjacency:

coords <- st_centroid(st_geometry(census))
tri_nb <- tri2nb(coords)
plot(st_geometry(census), border="grey", main = "Triangulation-based neighbors")
plot(tri_nb, coords, add=TRUE)

And distance-based adjacency:

k1 <- knn2nb(knearneigh(coords))                    # k-nearest-neighbor adjacency (k = 1)
nnb <- dnearneigh(coords, 0, 0.05, longlat = TRUE)  # neighbors within 0.05 km

plot(st_geometry(census), border="grey", main = "Distance-based neighbors" )
plot(nnb, coords, add=TRUE)

Exercise

  • Try playing around with these distance values and see the effect that different definitions of adjacency have on the spatial structure of our data.

Once we have defined adjacency, we can also consider how we build our spatial weights. These weights define how “far” or “near” observations are to one another. Above, we used row-standardized weights based on the binary adjacency matrix (i.e., are two polygons next to one another or not?). We can also define custom costs between areas with the nbcosts function, as sketched below.
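
As a sketch of the custom-cost approach (reusing nb and census_filtered from above, with median income as an illustrative attribute), nbcosts computes the dissimilarity between each pair of neighbors, which we can then pass to nb2listw via the glist argument:

# Dissimilarity-based edge costs between neighboring CBGs (illustrative)
costs <- nbcosts(nb, data = data.frame(income = census_filtered$median_income))
lw_cost <- nb2listw(nb, glist = costs, style = "B")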

Why does it matter?

All this talk of spatial autocorrelation can seem niche, but metrics like Moran’s I can be very important as exploratory measures, either to understand the presence of spatial clustering in variables, or to diagnose issues of spatial dependence which can limit the validity of statistical inferences.

Next week, we will look at geographically weighted regression (GWR) models which account for spatial nonstationarity (variation of statistical relationships across space) by adapting local regression coefficients based on the relationships of an area’s neighbors.

This week, we will try a simple example showing how Moran’s I can reveal spatial dependence in the results of a regression analysis.

Our methodology is as follows:

  • Create a regression model using two variables that we know are correlated. In our case, we will use the proportion of the population with a Bachelor’s degree as our independent variable, and median income as our dependent variable.
  • Use our model to predict the variable across all areas.
  • Extract and compute Moran’s I for the residuals of our model.

If we find spatial clustering in our residuals, this indicates a violation of the assumption of independently distributed residuals, and that we need to take a geographically-aware approach to regression analysis.

First, we need to transform our independent variable:

census_lm <- census_filtered |>
  mutate(prop_bachelor = education.bachelor / age.total)

Next, we can fit our regression model.

lm_model <- lm(median_income ~ prop_bachelor, data = census_lm)
summary(lm_model)

Call:
lm(formula = median_income ~ prop_bachelor, data = census_lm)

Residuals:
    Min      1Q  Median      3Q     Max 
-144541  -27652   -4372   23233  177514 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)      52583       1667   31.54   <2e-16 ***
prop_bachelor   161675       4044   39.98   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41920 on 3231 degrees of freedom
Multiple R-squared:  0.3309,    Adjusted R-squared:  0.3307 
F-statistic:  1598 on 1 and 3231 DF,  p-value: < 2.2e-16

An adjusted R-squared of \(0.33\) is notable: there is some statistical relationship between education and income, as we hypothesized. However, what we are interested in are the model residuals:

census_lm$residuals <- residuals(lm_model)

Now, following the same procedure as above, we build our adjacency matrix, define our spatial weights, and calculate Moran’s I.

nb <- poly2nb(census_lm)
listw <- nb2listw(nb, style = "W")
moran_test <- moran.test(census_lm$residuals, listw)
print(moran_test)

    Moran I test under randomisation

data:  census_lm$residuals  
weights: listw    

Moran I statistic standard deviate = 31.944, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.3441793385     -0.0003094059      0.0001163005 

This shows a statistically significant Moran’s I value of \(0.34\), indicating some spatial clustering in our residuals. This means that, in order to accurately model the relationship between income and education, we must use a regression model that explicitly accounts for the spatial dependence in our data.
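
As a pointer ahead (a sketch, not part of this week’s required analysis): spdep provides lm.morantest(), which adjusts the residual test for the fact that the residuals come from a fitted model, and the spatialreg package (assumed to be installed; it is not otherwise used in this lab) implements models that incorporate the spatial weights directly:

# Residual Moran test adjusted for the fitted linear model
lm.morantest(lm_model, listw)

# A spatial lag model using the same weights (requires spatialreg)
library(spatialreg)
lag_model <- lagsarlm(median_income ~ prop_bachelor, data = census_lm, listw = listw)
summary(lag_model)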

Exercise

  • Repeat this analysis with other variables that you hypothesize will show even more spatial autocorrelation in the model residuals. What do you find?

Computing LISA

Now for another application of measures of spatial autocorrelation. Moran’s I gives us an indication of the level of global spatial autocorrelation in our data, but what if we want to look at an indicator of the location of clusters of high and low income? For this, we can use Local Moran’s I (also known as LISA), which also takes our variable of interest and our spatial weights.
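
For each area \(i\), the local statistic is:

\[ I_i = \frac{x_i - \bar{x}}{\frac{1}{n}\sum_{j}(x_j - \bar{x})^2} \sum_{j} w_{ij}(x_j - \bar{x}) \]

With row-standardized weights, the average of the local statistics equals the global Moran’s I, which you can check against the mean of Ii in the summary below.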

lisa_test <- localmoran(census_filtered$median_income, lw)
summary(lisa_test)
       Ii                E.Ii                Var.Ii             Z.Ii         
 Min.   :-3.27218   Min.   :-2.233e-03   Min.   :0.00000   Min.   :-5.37911  
 1st Qu.:-0.00395   1st Qu.:-3.776e-04   1st Qu.:0.01444   1st Qu.:-0.07482  
 Median : 0.16784   Median :-1.294e-04   Median :0.07648   Median : 0.83394  
 Mean   : 0.49642   Mean   :-3.094e-04   Mean   :0.20189   Mean   : 0.90499  
 3rd Qu.: 0.68291   3rd Qu.:-2.535e-05   3rd Qu.:0.23023   3rd Qu.: 1.81553  
 Max.   : 7.21719   Max.   : 0.000e+00   Max.   :3.60053   Max.   : 7.21406  
 Pr(z != E(Ii))   
 Min.   :0.00000  
 1st Qu.:0.05455  
 Median :0.25777  
 Mean   :0.34797  
 3rd Qu.:0.61474  
 Max.   :0.99991  

Because LISA produces an estimate for each area, this returns a matrix of size \(N_{cbg} \times 5\). The columns contain the Local Moran’s I test statistic Ii, the expected value of Local Moran’s I under the null hypothesis (no clustering) E.Ii, the variance Var.Ii, the Z-score Z.Ii, and the p-value Pr(z != E(Ii)).

We can transform these values into categorical variables which capture the relationship between areas and their neighbors. For more info, see this excellent (epidemiology focused) tutorial.

colnames(lisa_test)[5] <- 'pvalue'

census_filtered$lag <- lag.listw(lw, var = census_filtered$median_income)
census_filtered$lm_pv <- lisa_test[, 'pvalue']

census_filtered_lisa <- census_filtered |>
  mutate(raw_std = as.numeric(scale(median_income)),
         lag_std = as.numeric(scale(lag)),
         lm_quad = factor(case_when(
           raw_std >= 0 & lag_std >= 0 & lm_pv < 0.05 ~ 'High-High',
           raw_std <= 0 & lag_std <= 0 & lm_pv < 0.05 ~ 'Low-Low',
           raw_std <= 0 & lag_std >= 0 & lm_pv < 0.05 ~ 'Low-High',
           raw_std >= 0 & lag_std <= 0 & lm_pv < 0.05 ~ 'High-Low',
           lm_pv >= 0.05 ~ 'Non-significant'),
           levels = c('High-High','Low-Low','Low-High','High-Low','Non-significant')))
head(census_filtered_lisa |> select(GEOID, lm_quad, geometry))
Simple feature collection with 6 features and 2 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -70.94334 ymin: 42.41313 xmax: -70.89246 ymax: 42.48502
Geodetic CRS:  WGS 84
# A tibble: 6 × 3
  GEOID        lm_quad                                                  geometry
  <chr>        <fct>                                               <POLYGON [°]>
1 250092011001 Non-significant ((-70.9301 42.4265, -70.93008 42.42684, -70.9300…
2 250092011002 Non-significant ((-70.94106 42.42335, -70.9397 42.42555, -70.939…
3 250092011003 Non-significant ((-70.94328 42.43815, -70.94324 42.4382, -70.943…
4 250092021011 Low-High        ((-70.91511 42.47505, -70.91486 42.47522, -70.91…
5 250092021012 Non-significant ((-70.91808 42.46869, -70.91795 42.46879, -70.91…
6 250092021013 Non-significant ((-70.92295 42.46878, -70.92292 42.4688, -70.922…

Now that we’ve categorized our data, we can make a map.

ggplot() +
    geom_sf(data=census_filtered_lisa, aes(fill=lm_quad)) + 
    scale_fill_manual(values=c("High-High"="red", "Low-Low"="blue", "Low-High"="green", "High-Low"="purple", "Non-significant"="#ffffb3")) + 
    theme_void()

Pretty interesting! Here we can see clear spatial clustering for areas with significantly lower income levels (“Low-Low”) and clusters of high income (“High-High”), as well as a few spatial outliers (“Low-High” and “High-Low”).

Exercise

  • Repeat our calculations of Moran’s I and LISA for two more variables that you hypothesize will have higher and lower spatial autocorrelation compared to median_income.

    • Hint: Consider normalizing some variables by total population.

    • Were your hypotheses about spatial clustering correct?

  • Repeat Moran’s I calculations but explore the effect of different choices in building the adjacency / spatial weights. Is the statistic robust to multiple different choices?

Geodemographic analysis

Now that we have had some introduction to measures that quantify the degree of spatial autocorrelation in our data (and seen some promising potential clusters with LISA), we can move on to another analysis of spatial variation in our census data: delimiting areas with statistically similar demographic characteristics.

Here, our aim is to build on previous work from Week 2, defining contiguous spatial areas which share demographic similarities. This is called geodemographic analysis, and it aims to classify spatial areas which are statistically similar to one another with respect to their demographic attributes, on the assumption that demographic similarity is indicative of meaningful behavioral similarities between residential areas.

Note: I have adapted this practical from Analyzing US Census Data §8.5: Classification and clustering of ACS data, a great resource for learning about spatial statistics.

Exercise

  • Before we proceed, consider the assumptions underlying geodemographic classification and the methods proposed in our reading about the UK’s 2011 Output Area Classification [2]. These classifications have been used to guide anti-poverty interventions and other social programs, identify consumer segments for marketing and retail site location, and explain consumption and employment patterns in cities. The assumption that geographically clustered demographic groups have similar and identifiable behaviors has a long history in geography, sociology, and urban studies. See, for example: “The Predictive Postcode” [3]. What do you think about the assumptions underlying geodemographic analysis? How could we employ modern behavioral data to enhance insights into human behavior?

Data Preparation

Here, we need to repeat some steps from Week 2. Our aim is to get back the set of principal components we ended with in our Social Area Analysis. To begin, let’s create a matrix of only the numeric variables we will use in our analysis, removing the geometry, spatial identifiers, and the variables which solely capture total CBG populations.

Also, because the spatial clustering is relatively computationally intensive, we will restrict our analysis to only the counties around Boston. To do this, we can use the census data from Week 2.

census <- read_parquet("/data/CUS/labs/2/14460_acs_2021_filtered_boston.parquet")
cbg <- read_parquet("/data/CUS/labs/5/14460_2021_cbg_geo.parquet")
sfc_cbg <- st_as_sfc(cbg$geometry_wkt, crs = 4326) 
cbg$geometry <- sfc_cbg
cbg <- st_as_sf(cbg)
census <- census |>
    drop_na(median_income) |>
    left_join(cbg, by="GEOID") |>
    st_as_sf() |>
    select(-race.total, -employment.total, -age.total, -geometry_wkt) |>
    select(-matches("^(industry|travel|trans)"))

Later on, we will use a spatial weights matrix. Let’s create our adjacency matrix and remove any CBGs without neighbors.

nb <- poly2nb(census) 
no_neighbors <- which(card(nb) == 0) 
census_filtered <- census[!(seq_len(nrow(census)) %in% no_neighbors), ]
nb <- poly2nb(census_filtered) 
lw <- nb2listw(nb, style = "W")
census_no_geo <- census_filtered |>
  select(-GEOID) |>
  st_drop_geometry()
print(colnames(census_no_geo))
 [1] "age.u18"                         "age.u1825"                      
 [3] "age.u2564"                       "age.a64"                        
 [5] "race.white"                      "race.black"                     
 [7] "race.native"                     "race.asian"                     
 [9] "median_income"                   "ratio_poverty"                  
[11] "education.grade_0_9"             "education.grade_9_12"           
[13] "education.some_college"          "education.bachelor"             
[15] "employment.labor_force"          "employment.civilian_labor_force"
[17] "employment.civilian_employed"    "employment.civilian_unemployed" 
[19] "employment.armed_forces"         "employment.not_labor_force"     

Now we can re-run our Principal Component Analysis using this matrix.

pca <- prcomp(
  formula = ~., 
  data = census_no_geo, 
  scale. = TRUE, 
  center = TRUE
)

summary(pca)
Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     2.5268 1.8899 1.40473 1.05567 1.01904 0.99542 0.95532
Proportion of Variance 0.3192 0.1786 0.09866 0.05572 0.05192 0.04954 0.04563
Cumulative Proportion  0.3192 0.4978 0.59647 0.65219 0.70412 0.75366 0.79929
                           PC8     PC9   PC10    PC11    PC12    PC13    PC14
Standard deviation     0.91368 0.83373 0.7590 0.73525 0.68861 0.58073 0.53806
Proportion of Variance 0.04174 0.03476 0.0288 0.02703 0.02371 0.01686 0.01448
Cumulative Proportion  0.84103 0.87579 0.9046 0.93162 0.95533 0.97219 0.98666
                          PC15    PC16    PC17    PC18      PC19      PC20
Standard deviation     0.45256 0.23178 0.07876 0.04438 1.629e-15 2.613e-16
Proportion of Variance 0.01024 0.00269 0.00031 0.00010 0.000e+00 0.000e+00
Cumulative Proportion  0.99691 0.99959 0.99990 1.00000 1.000e+00 1.000e+00

Let’s use a simple scree plot to look at the variance explained by our identified components.

fviz_eig(pca, addlabels = TRUE, ylim = c(0, 50), ncp=Inf)

Now we need to attach our identified components back onto our original data:

components <- predict(pca, census_no_geo)

census_pca <- census_filtered |>
  cbind(components) 

We can also extract the PCA “rotations”, the linear transformations applied to each original variable for our first \(n\) principal components. These indicate the contribution of the original variables to each component.

pca_tibble <- pca$rotation |>
  as_tibble(rownames = "predictor")

pca_tibble |>
  select(predictor:PC5) |>
  pivot_longer(PC1:PC5, names_to = "component", values_to = "value") |>
  ggplot(aes(x = value, y = predictor)) + 
  geom_col(fill = "darkblue", color = "darkblue", alpha = 0.5) + 
  facet_wrap(~component, nrow = 1) + 
  labs(y = NULL, x = "Value") + 
  theme_minimal()

Just like in our last analysis, the second component shows clear spatial heterogeneity, mirroring the LISA results.

ggplot(census_pca, aes(fill = PC2)) +
  geom_sf(color = NA) +
  theme_void() +
  scale_fill_viridis_c()

Exercise

  • Explore and plot the other principal components. The contributors to PC2 are relatively clear (lower income, higher black population, lower bachelor’s education). What factors are defining other components?

Geodemographic analysis: non-spatial clustering with K-means

Both of our clustering algorithms will use our PCA results as input to define clusters of CBGs with similar characteristics. Along with removing correlation between input variables that could influence the clustering results, selecting a smaller number of transformed principal components also reduces the computational complexity of the clustering.
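
As a quick check on how many components to keep, we can look at the cumulative share of variance explained (reusing the pca object from above):

# Cumulative proportion of variance explained by each component; the first 8
# account for roughly 84%, consistent with the PCA summary above
round(cumsum(pca$sdev^2) / sum(pca$sdev^2), 3)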

First, we have to prepare the data. We will use the first 8 components, and run the k-means algorithm with \(k=6\).

census_kmeans <- census_pca |>
  st_drop_geometry() |>
  select(PC1:PC8) |>
  kmeans(centers = 6)

table(census_kmeans$cluster)

  1   2   3   4   5   6 
240  55 117 140  68 308 

The table above shows the number of CBGs in each cluster. But what if we plot the spatial distribution of these clusters?

census_kmeans <- census_pca |>
  mutate(cluster = as.character(census_kmeans$cluster))

ggplot(census_kmeans, aes(fill = cluster)) + 
  geom_sf(size = 0) + 
  scale_fill_brewer(palette = "Set2") + 
  theme_void() + 
  labs(fill = "Cluster")

This map shows some spatial structure in the cluster labels, for example, a cluster roughly aligning with PC2 in South Boston is visible, as are differentiated clusters in more affluent neighborhoods. To check the differentiation of our clusters from one another, we can also plot the cluster labels against our input principal components.

ggplot(census_kmeans, 
    aes(x = PC1, y = PC2, color = cluster)) + 
    geom_point(size=1) + 
    scale_color_brewer(palette = "Set2") + 
    theme_minimal()

This shows the rather noisy boundaries defining our clusters. Likely, our data do not separate into cleanly differentiated clusters in this component space.

Exercise

  • Does any more explicit spatial structure arise for smaller values of \(k\)? Try repeating the k-means clustering with different values of \(k\). You can also try to find an optimal \(k\) using a diagnostic like an elbow plot; a starting sketch follows below.
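
As a starting point (a sketch, not a full solution), factoextra’s fviz_nbclust can draw an elbow plot of the total within-cluster sum of squares across a range of \(k\):

# Elbow plot for k = 1..10 using our 8 principal components
census_pca |>
  st_drop_geometry() |>
  select(PC1:PC8) |>
  fviz_nbclust(kmeans, method = "wss", k.max = 10)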

Geodemographic analysis: spatially-constrained clustering with SKATER

The noisy clustering result from k-means is partly driven by the absence of spatial information in the input data. To address this, we can try a spatially-explicit clustering technique which will produce geographically contiguous areas with similar demographic characteristics.

This algorithm again takes our first 8 principal components, but this time, we also provide a minimum spanning tree built from our spatial weights.

input_vars <- census_pca |>
  select(PC1:PC8) |>
  st_drop_geometry() |>
  as.data.frame()

mst <- mstree(lw)

regions <- skater(
  mst[,1:2], 
  input_vars, 
  ncuts = 7,
  crit = 10
)

Now, by encoding the spatial structure of our data, we have created spatially contiguous regions, defined by common underlying demographic features.

census_pca$region <- as.character(regions$group)

ggplot(census_pca, aes(fill = region)) + 
  geom_sf(size = 0) + 
  scale_fill_brewer(palette = "Set1") + 
  theme_void() + 
  labs(fill = "Cluster")

Note that for these regions, the degree of statistical similarity between CBGs in the same cluster is necessarily lower because of the spatial constraints. However, there are cases where spatially contiguous regions are crucial, either for interpretability of spatial clusters (a goal of the 2011 Output Area Classification, for example), or because of physical constraints requiring the identified areas to have clear spatial structure (like defining districts for a policy intervention).

Exercise

  • What are the limitations of the spatially-explicit clustering? What if there is no meaningful regional structure in your data? Is there a danger of mis-interpreting contiguous spatial areas with little statistical similarity?

  • Can you plot the normalized distribution of census variables for each SKATER cluster? How do the identified clusters relate to our input components?

  • Consider: what is the role played by spatial scale? How will the clusters change if we change our area of analysis?

Challenge

  • SKATER is only one option for a spatially explicit clustering algorithm. In R, most clustering algorithms have very similar inputs. Try creating a new set of regions with the classic DBSCAN algorithm mentioned in our lecture.

    • Hint: DBSCAN is available in the dbscan R package (big surprise). For this, you will have to create a distance matrix between input polygons (remember sf::st_distance() from practical 2?). A starting sketch follows below.
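
As a starting point (a sketch only; the dbscan package is assumed to be installed, and eps and minPts are values you will need to tune), one way to feed a distance matrix to DBSCAN:

# Pairwise centroid distances in meters; dbscan() accepts a dist object
library(dbscan)
d <- st_distance(st_centroid(st_geometry(census_pca)))
db <- dbscan(as.dist(units::drop_units(d)), eps = 2000, minPts = 5)
table(db$cluster)  # cluster 0 is DBSCAN's "noise" label

Note that clustering on geographic distance alone ignores the demographic components; combining spatial distance with the principal components is the real challenge here.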

References

[1]
R. M. Assunção, M. C. Neves, G. Câmara, and C. Da Costa Freitas, “Efficient regionalization techniques for socio-economic geographical units using minimum spanning trees,” International Journal of Geographical Information Science, vol. 20, no. 7, pp. 797–811, Aug. 2006, doi: 10.1080/13658810600665111.
[2]
C. G. Gale, A. D. Singleton, A. G. Bates, and P. A. Longley, “Creating the 2011 area classification for output areas (2011 OAC),” Journal of Spatial Information Science, vol. 2016, no. 12, pp. 1–27, 2016, doi: 10.5311/JOSIS.2016.12.232.
[3]
R. Webber and R. Burrows, The Predictive Postcode: The Geodemographic Classification of British Society. 55 City Road, London: SAGE Publications Ltd, 2018. doi: 10.4135/9781529714685.