Lab 10-1 - Urban social network analysis and modeling

Author
Affiliations

Esteban Moro

Network Science Institute, Northeastern University

NETS 7983 Computational Urban Science

Published

March 31, 2025

Objectives

In this lab, we will analyze the structure of urban social networks. We will:

  • Use Facebook’s Social Connectivity Index in an urban area.
  • Model the social network as a graph and analyze its properties.
  • Visualize the network and key metrics.
  • Understand the geographical dependence of the social network.

Load some libraries and settings we will use

library(tidyverse)
library(arrow) # for efficient dataframes
library(DT) # for interactive tables
library(knitr) # for tables
library(ggthemes) # for ggplot themes
library(sf) # for spatial data
library(tigris) # for US geospatial data
library(tidycensus) # for US census data
library(stargazer) # for tables
library(leaflet) # for interactive maps
options(tigris_use_cache = TRUE) # use cache for tigris
theme_set(theme_hc() + theme(axis.title.y = element_text(angle = 90))) 

Load the social network data

As in Lab 4-2 we will use the Social Connectedness Index between the different areas in the US. This is a Facebook dataset in which they calculated the index between geographies i and j as

SCIij=FBijFBi×FBj where FBij is the number of friendships on Facebook between county i and county j and FBi is the number of users on Facebook that live in county i. The SCIij measures the normalized probability of having a friendship between i and j.

The data can be downloaded from here https://data.humdata.org/dataset/social-connectedness-index, which contains different files for different geographical levels (e.g., country, county, zip code, etc.). Read the documentation of the data to understand its structure.

Specifically, we will model the social connections in an urban Area.

files <- Sys.glob("/data/CUS/FB_SCI/us-zip-code-us-zip-code-fb-social-connectedness-index-october-2021/*.tsv")
sci <- open_tsv_dataset(files,
                           schema = schema(
                             user_loc = utf8(),
                             fr_loc = utf8(),
                             scaled_sci = int64()
                           ), skip = 1) 

To get the zip codes in the Boston Metropolitan area, we will use a crosswalk between zip codes and metro areas.

crosswalk <- read_csv("/data/CUS/labs/10/map_cbsa_zips.csv")

With that, let’s get the SCI for the zip codes in the Boston Metro Area

boston_zips <- crosswalk |> filter(area_code==14460) |> pull(zips)
sci_boston <- sci |>
  filter(user_loc %in% boston_zips) |>
  filter(fr_loc %in% boston_zips) |>
  collect()

We also collect census information from those zips codes using the tidycensus package

vars <- c(total_population = "B01003_001",
          median_income = "B19013_001",
          white_population = "B02001_002",
          black_population = "B02001_003")
census_boston <- get_acs(geography = "zcta", 
                         variables = vars,
                         output = "wide",
                         geometry = TRUE, 
                         year=2019) |>
  filter(GEOID %in% boston_zips) |>
  mutate(pop = total_populationE,
         income = median_incomeE,
         fwhite = white_populationE/pop,
         fblack = black_populationE/pop) |>
  select(GEOID,pop,income,fwhite,fblack) |>
  drop_na() |>
  # add the centroid coordinates for each zip
  mutate(lon = st_coordinates(st_centroid(geometry))[,1],
         lat = st_coordinates(st_centroid(geometry))[,2])

With the geometries of the zip codes, let’s add the geographical distance between the zip codes to the SCI

tabla_zips <- st_centroid(census_boston$geometry)

and calculate the distance between them and put it as a table

dd <-  st_distance(tabla_zips) |> as.tibble()
colnames(dd) <- census_boston$GEOID
dist_zips <- dd |>
  mutate(GEOID = census_boston$GEOID) |>
  pivot_longer(-GEOID, names_to = "fr_loc", values_to = "distance") |>
  rename(user_loc = GEOID) |>
  mutate(distance = as.numeric(distance))

Add it to the SCI

sci_boston <- sci_boston |>
  left_join(dist_zips, by = c("user_loc","fr_loc")) 

Social Network Analysis

Let’s analyze the social network properties. First we create the social network using the igraph package. To do that we first clean the network to have significant values of SCI between areas and remove outliers next to zero, as we did in Lab 4-2.

sci_boston <- sci_boston |> filter(scaled_sci >= 10)

Next, we create the graph from the data frame.

library(igraph)
g <- sci_boston |> filter(user_loc <= fr_loc) |>
  filter(user_loc %in% census_boston$GEOID & fr_loc %in% census_boston$GEOID) |>
  graph_from_data_frame(directed = FALSE,
                        vertices = census_boston |> st_drop_geometry()) 
g
IGRAPH 3a2a0bb UN-- 336 55108 -- 
+ attr: name (v/c), pop (v/n), income (v/n), fwhite (v/n), fblack
| (v/n), lon (v/n), lat (v/n), scaled_sci (e/n), distance (e/n)
+ edges from 3a2a0bb (vertex names):
 [1] 01420--01420 01420--01430 01431--01420 01432--01420 01434--01420
 [6] 01420--01450 01420--01451 01460--01420 01462--01420 01420--01463
[11] 01464--01420 01420--01469 01420--01474 01420--01503 01504--01420
[16] 01420--01523 01420--01532 01568--01420 01420--01581 01420--01701
[21] 01702--01420 01719--01420 01720--01420 01721--01420 01420--01730
[26] 01731--01420 01740--01420 01420--01741 01420--01742 01420--01746
[31] 01747--01420 01748--01420 01420--01749 01420--01752 01754--01420
+ ... omitted several edges

Let’s visualize the graph. As layout we will use the coordinates for the centroid of each zip code. For simplicity we only plot the top 5000 edges by scaled_sci and use the ggraph package for visualization.

g_vis <- delete_edges(g,which(E(g)$distance==0))
g_vis <- g_vis |>
  delete_edges(E(g_vis)[order(E(g_vis)$scaled_sci, 
                              decreasing = TRUE)][5000:length(E(g_vis))]
  )
require(ggraph)
ggraph(g_vis, layout = "manual", x = V(g_vis)$lon, y = V(g_vis)$lat) + 
  geom_edge_link(aes(alpha = log(scaled_sci)/20)) + 
  geom_node_point(aes(x = lon, y = lat), size = 0) + 
  coord_equal() +  
  theme_void()

Local properties

In igraph there are some reserved names for the properties of the nodes. One of them is for the weight of the edges. In our case that is the scaled_sci value. We can set it as the weight of the edges to facilitate the use of it in the analysis.

E(g)$weight <- E(g)$scaled_sci

Since SCI measures all the social connections between users in different zip codes, the graph is very dense. It contains

cat("Number of nodes:", vcount(g), "\n")
Number of nodes: 336 

and

cat("Number of edges:", ecount(g), "\n")
Number of edges: 55108 

Thus, the average degree of nodes is very large

cat("Average node degree:", mean(degree(g)), "\n")
Average node degree: 328.0238 

And here is the distribution

degree(g) |> enframe(name = "node", value = "degree") |>
  ggplot(aes(x=degree)) + geom_density(fill="lightgray",alpha=0.5) + 
  scale_x_log10()

Let’s get the largest connected component

gcc <- g |>
  induced_subgraph(which(components(g)$membership == which.max(components(g)$csize)))

Our network is weighted, so we can also build the strength of the nodes

si=jwij

strength(gcc) |> 
  enframe(name = "node", value = "strength") |>
  ggplot(aes(x=strength)) + geom_density(fill="lightgray",alpha=0.5) +
  ggtitle("Strength distribution") + xlab("Strength") + ylab("Density") +
  scale_x_log10()

The strength or SCI of each edge depends on the distance.

g |> get.data.frame("edges") |> filter(distance > 0) |>
  ggplot(aes(x=distance,y=scaled_sci)) + geom_bin2d() +
  scale_fill_gradient(low="lightgray",high="black",trans="log10") +
  scale_x_log10() + scale_y_log10() + labs(x="d_ij",y="SCI_ij") +
  labs(x="Distance (km)",y="SCI (scaled)") + theme_minimal() +
  geom_line(aes(x=distance,y=1e12/(distance+2000)^1.5))

As in Lab 4.1, we find a strong correlation between distance and the social connectivity

cor.test(g |> get.data.frame("edges") |> filter(distance > 0) |> pull(distance),
         g |> get.data.frame("edges") |> filter(distance > 0) |> pull(scaled_sci))

    Pearson's product-moment correlation

data:  pull(filter(get.data.frame(g, "edges"), distance > 0), distance) and pull(filter(get.data.frame(g, "edges"), distance > 0), scaled_sci)
t = -82.854, df = 54774, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3411426 -0.3262590
sample estimates:
       cor 
-0.3337216 

Note however that it is smaller that we found for counties.

Clustering

Let’s see the clustering coefficient of the network. There are several generalizations of the clustering coefficient for weighted graphs. Here we use the definition by A. Barrat: Ciw=1si(ki1)j,hwijwih2AijAihAjh where si is the strength of node i, ki is the degree of node i, wij is the weight of the edge between nodes i and j, and Aij is the adjacency matrix of the graph.

local_clustering <- transitivity(gcc,type="barrat",weights=E(gcc)$scaled_sci)
strength <- strength(gcc)

As with other spatial weighted networks (see )

cor.test(local_clustering,strength)

    Pearson's product-moment correlation

data:  local_clustering and strength
t = -12.747, df = 330, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.6423261 -0.4975245
sample estimates:
       cor 
-0.5744016 
data.frame(local_clustering,strength) |>
  ggplot(aes(x=strength,y=local_clustering )) + geom_bin2d() + geom_smooth() + 
  scale_fill_gradient(low="lightgray",high="black",trans="log10") +
  scale_x_log10() + scale_y_log10() + labs(x="d_ij",y="SCI_ij") +
  labs(x="Node Strength",y="Local Clustering") + theme_minimal()

which means that zip codes with smaller strength, i.e. with less probability to have social relationships with other zip codes, have more clustered environments.

Assortativity

Next we investigate the assortativity in urban social connections by different type of demographic traits. Since we have a weighted network, we cannot use the traditional assortativity coefficient defined by M.E.J. Newman. we use the extension by

ρx,y=i,jwij(xix¯)(yjy¯)Wσx2 where xi is the value of the attribute of source node i and yi is the value of the atribute of target node of each edge. Here is the function

assortativity_weighted <- function(g,x,weights){
  x <- as.numeric(x)
  weights <- as.numeric(weights)
  values_df <- data.frame(name=V(g)$name,x)
  get.data.frame(g,what="edges") |>
    mutate(weight = weights) |>
    left_join(values_df, by = c("from" = "name")) |>
    rename(x_from = x) |>
    left_join(values_df, by = c("to" = "name")) |>
    rename(x_to = x) |>
    summarize(
      sum(weight*(x_from - mean(x_from))*(x_to - mean(x_to)))/(sum(weight)*sd(x_from)*sd(x_to))
    ) |> pull()
}

Here is the assortativity

data.frame(
  rho_pop = assortativity_weighted(gcc, V(gcc)$pop, E(gcc)$weight),
  rho_inc = assortativity_weighted(gcc, V(gcc)$income, E(gcc)$weight),
  rho_fwhite = assortativity_weighted(gcc, V(gcc)$fwhite, E(gcc)$weight),
  rho_fblack = assortativity_weighted(gcc, V(gcc)$fblack, E(gcc)$weight)
)
    rho_pop  rho_inc rho_fwhite rho_fblack
1 0.3409484 0.396625  0.4193051  0.2933085

As we can see the network has a large assortativity by income and race, which means that zip codes with similar income are more likely to have social connections, i.e. social segregation.

Centrality Measures

Let’s see how central are the zip codes in the network. We will use the betweenness centrality, which measures the fraction of shortest paths that pass through a node. Since SCI measures the probability of having a social connection between two zip codes, we will use the inverse of the SCI as the weight of the edges.

bc <- betweenness(gcc,weights=1/E(gcc)$scaled_sci,directed = T) |> 
  enframe(name="node",value="bc")
bc |> arrange(desc(bc))
# A tibble: 332 × 2
   node     bc
   <chr> <dbl>
 1 02047 7876 
 2 01922 7046 
 3 01952 6626.
 4 03874 5608.
 5 02071 5489 
 6 01827 5246.
 7 01523 4591 
 8 03858 4546.
 9 01450 4496 
10 01503 4211 
# ℹ 322 more rows

Let’s see the distribution of centrality in a choropleth map

cb <- census_boston |> left_join(bc, by = c("GEOID" = "node")) |>
  st_transform(4326) 
pal = colorNumeric("YlOrRd", sqrt(cb$bc))
leaflet(cb) |> addTiles() |>  
  addPolygons(fillColor = ~pal(sqrt(bc)),
              fillOpacity = 0.7,
              color = "white",
              weight = 1) |>
  addLegend("bottomright", pal = pal,
            values = sqrt(cb$bc), title = "Betweenness centrality")
Betweenness centrality
01020304050607080

NA

Interestingly, the zip codes with more centrality are in the peripheral areas of the Boston Metro Area.

Community structure

Let’s see the community structure of the network. We will use the Louvain algorithm to find the communities in the network.

communities <- cluster_louvain(gcc,weights=E(gcc)$scaled_sci,resolution=1)

Let’s see the communities in a choropleth map

cb <- census_boston |> 
  left_join(tibble(name=V(gcc)$name,
                   comm=communities$membership), 
            by = c("GEOID" = "name")) |>
  st_transform(4326)
pal = colorFactor("Set3", cb$comm)
leaflet(cb) |> addProviderTiles(provider=providers$CartoDB.Positron) |>  
  addPolygons(fillColor = ~pal(comm),
              fillOpacity = 0.7,
              color = "white",
              weight = 1) |>
  addLegend("bottomright", pal = pal,
            values = cb$comm, title = "Community")
Community
1
2
3
4
5
6
7
NA
Leaflet | © OpenStreetMap contributors © CARTO

Your turn

Check that the results obtained are different when the network structure is independent of the weights.

Your turn

Check that the results obtained are can be or not reproduced by a model that only includes the dependence of the SCI on the distance between the zip codes.

fit <- nls(log(scaled_sci) ~ a - gamma*log(distance + d), 
           start=list(a=1000, gamma=1, d=1000), 
                   algorithm="port",
                   lower = list(a=.01, gamma = 0, d=100), 
                   upper = list(a= 1000, gamma = 10, d=10000),
           data = gcc |> get.data.frame("edges"))
E(gcc)$scaled_sci_model <- exp(predict(fit, newdata = g |> get.data.frame("edges")))
  • Compare the strength of the real network with the model
compare_strength <- data.frame(
  strength = strength(gcc),
  strength_model = strength(gcc,weights=E(gcc)$scaled_sci_model)
)
  • Compare the assortativity of the model with the real network
assortativity_weighted(gcc, V(gcc)$pop, E(gcc)$scaled_sci_model)
[1] 0.275388
  • What about centrality?

  • Does the model reproduce the community structure of the real network?

communities <- cluster_louvain(gcc,weights=E(gcc)$scaled_sci_model,resolution=1)

Conclusions

Urban social networks have a strong dependence on geography and, in particular, on distance. They show:

  • Large local clustering.
  • Demographic assortativity.
  • Big differences in centrality.
  • Community structure that is related to the geography of the area.

Some of these properties might be explained just by distance and the geographical distribution of the population.

Further reading

References

[1]
A. Barrat, M. Barthélemy, R. Pastor-Satorras, and A. Vespignani, “The architecture of complex weighted networks,” Proceedings of the National Academy of Sciences, vol. 101, no. 11, pp. 3747–3752, Mar. 2004, doi: 10.1073/pnas.0400087101.
[2]
Y. Yuan, J. Yan, and P. Zhang, “Assortativity measures for weighted and directed networks,” Journal of Complex Networks, vol. 9, no. 2, p. cnab017, Apr. 2021, doi: 10.1093/comnet/cnab017.
[3]
M. Barthélemy, “Spatial networks,” Physics reports, vol. 499, no. 1–3, pp. 1–101, 2011.
[4]
V. D. Blondel, A. Decuyper, and G. Krings, “A survey of results on mobile phone datasets analysis,” EPJ Data Science, vol. 4, no. 1, p. 10, Dec. 2015, doi: 10.1140/epjds/s13688-015-0046-0.
[5]
M. Bailey, R. Cao, T. Kuchler, J. Stroebel, and A. Wong, “Social Connectedness: Measurement, Determinants, and Effects,” Journal of Economic Perspectives, vol. 32, no. 3, pp. 259–280, Aug. 2018, doi: 10.1257/jep.32.3.259.