Lab 10-1 - Urban social network analysis and modeling
Author
Affiliations
Esteban Moro
Network Science Institute, Northeastern University
NETS 7983 Computational Urban Science
Published
March 31, 2025
Objectives
In this lab, we will analyze the structure of urban social networks. We will:
Use Facebook’s Social Connectedness Index in an urban area.
Model the social network as a graph and analyze its properties.
Visualize the network and key metrics.
Understand the geographical dependence of the social network.
Load some libraries and settings we will use
library(tidyverse)
library(arrow)      # for efficient dataframes
library(DT)         # for interactive tables
library(knitr)      # for tables
library(ggthemes)   # for ggplot themes
library(sf)         # for spatial data
library(tigris)     # for US geospatial data
library(tidycensus) # for US census data
library(stargazer)  # for tables
library(leaflet)    # for interactive maps
options(tigris_use_cache = TRUE) # use cache for tigris
theme_set(theme_hc() + theme(axis.title.y = element_text(angle = 90)))
Load the social network data
As in Lab 4-2 we will use the Social Connectedness Index (SCI) between the different areas in the US. This is a Facebook dataset in which the index between geographies $i$ and $j$ is calculated as
$$SCI_{i,j} = \frac{\text{Connections}_{i,j}}{\text{Users}_i \times \text{Users}_j}$$
where $\text{Connections}_{i,j}$ is the number of friendships on Facebook between county $i$ and county $j$, and $\text{Users}_i$ is the number of users on Facebook that live in county $i$. The $SCI_{i,j}$ measures the normalized probability of having a friendship between $i$ and $j$.
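To make the definition concrete, here is a minimal sketch that computes the index for three toy areas. The user counts, friendship counts, and the column name `n_friendships` are made up for illustration and do not come from the Facebook dataset.

```r
# Toy inputs (hypothetical numbers, for illustration only)
users <- c(A = 1000, B = 500, C = 2000)   # users living in each area
connections <- data.frame(
  user_loc      = c("A", "A", "B"),
  fr_loc        = c("B", "C", "C"),
  n_friendships = c(50, 20, 100)          # friendships between the two areas
)

# SCI_ij = Connections_ij / (Users_i * Users_j)
connections$sci <- unname(
  connections$n_friendships /
    (users[connections$user_loc] * users[connections$fr_loc])
)

connections
```

Note that A-B and B-C get the same index even though B-C has twice as many friendships, because the index normalizes by the product of the two populations.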
The data can be downloaded from here https://data.humdata.org/dataset/social-connectedness-index, which contains different files for different geographical levels (e.g., country, county, zip code, etc.). Read the documentation of the data to understand its structure.
Specifically, we will model the social connections in an urban area.
sci_boston <- sci_boston |>
  left_join(dist_zips, by = c("user_loc", "fr_loc"))
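The code that builds `dist_zips` is not shown here. One possible way to obtain such a table is to compute the great-circle distance between zip code centroids. The sketch below assumes centroids given as longitude/latitude in degrees and returns kilometers; the helper `haversine_km`, the toy centroids, and the choice of units are assumptions, not the lab's actual code.

```r
# Great-circle (haversine) distance in km between lon/lat points in degrees
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical centroids for three areas
centroids <- data.frame(
  zip = c("z1", "z2", "z3"),
  lon = c(-71.06, -71.11, -71.06),
  lat = c(42.36, 42.37, 43.36)
)

# All ordered pairs of areas, with the same key columns used in the join
idx <- expand.grid(i = seq_len(nrow(centroids)), j = seq_len(nrow(centroids)))
dist_zips_sketch <- data.frame(
  user_loc = centroids$zip[idx$i],
  fr_loc   = centroids$zip[idx$j],
  distance = haversine_km(centroids$lon[idx$i], centroids$lat[idx$i],
                          centroids$lon[idx$j], centroids$lat[idx$j])
)
dist_zips_sketch
```

Pairs of an area with itself get distance 0, which is consistent with the later filters on `distance > 0`.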
Social Network Analysis
Let’s analyze the social network properties. First we create the social network using the igraph package. To do that we first clean the network to have significant values of SCI between areas and remove outliers next to zero, as we did in Lab 4-2.
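As a rough sketch of this step, the code below filters a toy edge list and builds an undirected graph from it. The cutoff `min_sci`, the toy values, and the choice of an undirected graph are assumptions for illustration; the lab's actual cleaning rule follows Lab 4-2 and is not reproduced here.

```r
library(igraph)

# Toy edge list with the same column names as sci_boston (values are made up)
edges <- data.frame(
  user_loc   = c("02115", "02115", "02139", "02139"),
  fr_loc     = c("02139", "02134", "02134", "02139"),
  scaled_sci = c(5000, 1200, 3, 90000),
  distance   = c(3.1, 5.4, 4.0, 0)
)

min_sci <- 10  # hypothetical cutoff for "values next to zero"

# Keep only pairs with a non-negligible SCI
edges_clean <- edges[edges$scaled_sci >= min_sci, ]

# The first two columns define the edges; the rest become edge attributes
g_toy <- graph_from_data_frame(edges_clean, directed = FALSE)

vcount(g_toy)
ecount(g_toy)
edge_attr_names(g_toy)
```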
Let’s visualize the graph. As layout we will use the coordinates for the centroid of each zip code. For simplicity we only plot the top 5000 edges by scaled_sci and use the ggraph package for visualization.
# drop connections of a zip code with itself (distance 0)
g_vis <- delete_edges(g, which(E(g)$distance == 0))
# keep only the 5000 edges with the largest scaled_sci
g_vis <- g_vis |>
  delete_edges(
    E(g_vis)[order(E(g_vis)$scaled_sci, decreasing = TRUE)][5001:length(E(g_vis))]
  )
library(ggraph)
ggraph(g_vis, layout = "manual", x = V(g_vis)$lon, y = V(g_vis)$lat) +
  geom_edge_link(aes(alpha = log(scaled_sci) / 20)) +
  geom_node_point(aes(x = lon, y = lat), size = 0) +
  coord_equal() +
  theme_void()
Local properties
In igraph there are some reserved names for the attributes of nodes and edges. One of them is weight, which igraph treats as the weight of the edges. In our case that is the scaled_sci value. We can set it as the weight of the edges so that it is used in the analysis.
E(g)$weight <- E(g)$scaled_sci
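A small sketch of what setting this attribute changes, on a toy triangle with made-up `scaled_sci` values: once `weight` exists, igraph reports the graph as weighted and functions such as `strength()` pick it up by default.

```r
library(igraph)

# Toy triangle with made-up SCI values
g_toy <- graph_from_data_frame(
  data.frame(from = c("A", "A", "B"),
             to   = c("B", "C", "C"),
             scaled_sci = c(10, 20, 30)),
  directed = FALSE
)

weighted_before <- is_weighted(g_toy)   # no "weight" attribute yet

E(g_toy)$weight <- E(g_toy)$scaled_sci
weighted_after <- is_weighted(g_toy)

# strength() now sums the weights of the edges incident to each node
s <- strength(g_toy)
s
```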
Since SCI measures all the social connections between users in different zip codes, the graph is very dense. It contains
As in Lab 4.1, we find a statistically significant negative correlation between distance and social connectivity:
cor.test(
  g |> get.data.frame("edges") |> filter(distance > 0) |> pull(distance),
  g |> get.data.frame("edges") |> filter(distance > 0) |> pull(scaled_sci)
)
Pearson's product-moment correlation
data: pull(filter(get.data.frame(g, "edges"), distance > 0), distance) and pull(filter(get.data.frame(g, "edges"), distance > 0), scaled_sci)
t = -82.854, df = 54774, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3411426 -0.3262590
sample estimates:
cor
-0.3337216
Note, however, that the correlation is weaker than the one we found for counties.
Clustering
Let’s see the clustering coefficient of the network. There are several generalizations of the clustering coefficient for weighted graphs. Here we use the definition by A. Barrat et al. [1]:
$$c_i^w = \frac{1}{s_i (k_i - 1)} \sum_{j,h} \frac{w_{ij} + w_{ih}}{2} a_{ij} a_{ih} a_{jh}$$
where $s_i$ is the strength of node $i$, $k_i$ is the degree of node $i$, $w_{ij}$ is the weight of the edge between nodes $i$ and $j$, and $a_{ij}$ is the adjacency matrix of the graph.
The clustering coefficient decreases with node strength, which means that zip codes with smaller strength, i.e., with lower probability of having social relationships with other zip codes, have more clustered environments.
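As an illustration of the Barrat definition, the sketch below evaluates the formula by hand on a toy weighted graph and compares it with igraph's `transitivity(type = "barrat")`. The toy graph and the helper `barrat_manual` are ours; the code actually used in the lab is not shown in this text.

```r
library(igraph)

# Toy weighted graph: a triangle A-B-C plus a pendant node D attached to A
edges <- data.frame(
  from   = c("A", "A", "B", "A"),
  to     = c("B", "C", "C", "D"),
  weight = c(1, 2, 3, 4)
)
g_toy <- graph_from_data_frame(edges, directed = FALSE)

# Direct evaluation of the Barrat formula, summing over ordered pairs (j, h)
barrat_manual <- function(g) {
  a <- as_adjacency_matrix(g, sparse = FALSE)                   # a_ij
  w <- as_adjacency_matrix(g, attr = "weight", sparse = FALSE)  # w_ij
  s <- rowSums(w)                                               # strength s_i
  k <- rowSums(a)                                               # degree k_i
  n <- nrow(a)
  unname(sapply(seq_len(n), function(i) {
    total <- 0
    for (j in seq_len(n)) for (h in seq_len(n)) {
      total <- total + (w[i, j] + w[i, h]) / 2 * a[i, j] * a[i, h] * a[j, h]
    }
    total / (s[i] * (k[i] - 1))   # NaN for nodes with degree 1
  }))
}

c_manual <- barrat_manual(g_toy)
c_igraph <- transitivity(g_toy, type = "barrat")  # uses the weight attribute

data.frame(node = V(g_toy)$name, strength = strength(g_toy),
           c_manual, c_igraph)
```

In this toy graph the node with the largest strength (A) has the lowest clustering, because its heaviest edge goes to a node that closes no triangle.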
Assortativity
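Next we investigate the assortativity of urban social connections by different types of demographic traits. Since we have a weighted network, we cannot use the traditional assortativity coefficient defined by M.E.J. Newman. Instead, we use the extension by [2]:
$$r^w = \frac{\sum_{e} w_e \,(x_e - \bar{x})(y_e - \bar{y})}{\left(\sum_{e} w_e\right) \sigma_x \sigma_y}$$
where $w_e$ is the weight of edge $e$, $x_e$ is the value of the attribute of the source node and $y_e$ is the value of the attribute of the target node of each edge, and $\bar{x}$, $\bar{y}$, $\sigma_x$, $\sigma_y$ are the means and standard deviations of those values over the edges. Here is the function: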
assortativity_weighted <- function(g, x, weights) {
  x <- as.numeric(x)
  weights <- as.numeric(weights)
  # attribute value of each node, keyed by node name
  values_df <- data.frame(name = V(g)$name, x)
  get.data.frame(g, what = "edges") |>
    mutate(weight = weights) |>
    # attach the attribute of the source and target node of each edge
    left_join(values_df, by = c("from" = "name")) |>
    rename(x_from = x) |>
    left_join(values_df, by = c("to" = "name")) |>
    rename(x_to = x) |>
    summarize(
      sum(weight * (x_from - mean(x_from)) * (x_to - mean(x_to))) /
        (sum(weight) * sd(x_from) * sd(x_to))
    ) |>
    pull()
}
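To see how the weights enter this coefficient, here is a stand-alone sketch of the same formula applied directly to edge-level vectors. The helper `weighted_assortativity_edges` and the toy attribute values are hypothetical. As in the function above, the means and standard deviations are taken over the edges without weighting.

```r
# Same formula as above, written for edge-level vectors (base R only)
weighted_assortativity_edges <- function(x_from, x_to, w) {
  sum(w * (x_from - mean(x_from)) * (x_to - mean(x_to))) /
    (sum(w) * sd(x_from) * sd(x_to))
}

# Toy attribute (e.g. income in thousands) at both ends of five edges
x_from <- c(30, 35, 80, 90, 40)
x_to   <- c(32, 40, 85, 70, 75)

# Equal weights versus weights concentrated on the edges joining similar nodes
w_equal     <- rep(1, 5)
w_homophily <- c(10, 10, 10, 10, 1)

r_equal     <- weighted_assortativity_edges(x_from, x_to, w_equal)
r_homophily <- weighted_assortativity_edges(x_from, x_to, w_homophily)

c(r_equal = r_equal, r_homophily = r_homophily)
```

With equal weights the coefficient reduces to the Pearson correlation across edges multiplied by $(n-1)/n$, where $n$ is the number of edges, because `sd()` uses the $n-1$ denominator. Putting more weight on edges between similar nodes increases the coefficient.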
As we can see, the network has a large assortativity by income and race, which means that zip codes with similar income and racial composition are more likely to have social connections, i.e., there is social segregation.
Centrality Measures
Let’s see how central the zip codes are in the network. We will use the betweenness centrality, which measures the fraction of shortest paths that pass through a node. Since SCI measures the probability of having a social connection between two zip codes, a larger SCI should correspond to a shorter path, so we will use the inverse of the SCI as the weight (length) of the edges.
bc <- betweenness(gcc, weights = 1 / E(gcc)$scaled_sci, directed = TRUE) |>
  enframe(name = "node", value = "bc")
bc |> arrange(desc(bc))
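The effect of using inverse SCI as edge length can be seen on a toy triangle (made-up values, undirected for simplicity). A and C are directly connected, but only weakly, so the shortest weighted path between them goes through B.

```r
library(igraph)

# Toy triangle: A-B and B-C are strong ties, A-C is a weak tie
g_toy <- graph_from_data_frame(
  data.frame(from = c("A", "B", "A"),
             to   = c("B", "C", "C"),
             sci  = c(10, 10, 1)),
  directed = FALSE
)

# Weighted: edge length = 1 / sci, so A -> B -> C (0.2) beats A -> C (1)
bc_weighted <- betweenness(g_toy, weights = 1 / E(g_toy)$sci, directed = FALSE)

# Unweighted: there is no "weight" attribute, so every edge has length 1
bc_unweighted <- betweenness(g_toy, directed = FALSE)

rbind(bc_weighted, bc_unweighted)
```

In the weighted case B lies on the only shortest path between A and C, while in the unweighted case every pair is directly connected and no node has any betweenness.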
Check that the results are different when betweenness is computed from the network structure alone, ignoring the weights.
Your turn
Check whether the results obtained can be reproduced by a model that only includes the dependence of the SCI on the distance between the zip codes.
fit <- nls(
  log(scaled_sci) ~ a - gamma * log(distance + d),
  start = list(a = 1000, gamma = 1, d = 1000),
  algorithm = "port",
  lower = list(a = .01, gamma = 0, d = 100),
  upper = list(a = 1000, gamma = 10, d = 10000),
  data = gcc |> get.data.frame("edges")
)
# predict on the edges of gcc so that the values line up with E(gcc)
E(gcc)$scaled_sci_model <- exp(predict(fit, newdata = gcc |> get.data.frame("edges")))
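To build intuition for the parameters, here is a simplified sketch on synthetic data. It fixes the offset at `d = 0`, in which case the model becomes linear in log-log scale and can be fitted with `lm()` instead of `nls()`. The synthetic distances, the true values `a_true` and `gamma_true`, and the noise level are all made up.

```r
set.seed(1)

# Synthetic edges: distances spread log-uniformly between 100 and 10000
n <- 200
distance <- exp(runif(n, log(100), log(10000)))

# Power-law decay with multiplicative noise: sci = exp(a) * distance^(-gamma)
a_true     <- 12
gamma_true <- 1.5
scaled_sci <- exp(a_true - gamma_true * log(distance) + rnorm(n, sd = 0.1))

# With d = 0 the model is a straight line in log-log scale
fit_lm <- lm(log(scaled_sci) ~ log(distance))

a_hat     <- unname(coef(fit_lm)["(Intercept)"])
gamma_hat <- -unname(coef(fit_lm)["log(distance)"])
c(a_hat = a_hat, gamma_hat = gamma_hat)
```

The offset `d` in the full model flattens the curve at short distances, which this simplified version cannot capture.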
Compare the strength of the real network with the model
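One way to set up this comparison is to compute the node strength twice, once with the observed weights and once with the modeled ones, and put them side by side. The sketch below uses a toy graph with made-up values for `scaled_sci` and `scaled_sci_model`.

```r
library(igraph)

# Toy graph carrying both observed and modeled weights (made-up values)
edges <- data.frame(
  from = c("A", "A", "B"),
  to   = c("B", "C", "C"),
  scaled_sci       = c(100, 10, 50),
  scaled_sci_model = c(80, 20, 40)
)
g_toy <- graph_from_data_frame(edges, directed = FALSE)

# Strength of each node under the real and the modeled weights
strength_df <- data.frame(
  node  = V(g_toy)$name,
  real  = as.numeric(strength(g_toy, weights = E(g_toy)$scaled_sci)),
  model = as.numeric(strength(g_toy, weights = E(g_toy)$scaled_sci_model))
)
strength_df$ratio <- strength_df$model / strength_df$real
strength_df
```

On the real network, a scatter plot of `model` against `real`, or a map of `ratio`, would show where distance alone over- or under-predicts social connectivity.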
Social connectedness in urban areas, by Bailey et al. [5]
References
[1] A. Barrat, M. Barthélemy, R. Pastor-Satorras, and A. Vespignani, “The architecture of complex weighted networks,” Proceedings of the National Academy of Sciences, vol. 101, no. 11, pp. 3747–3752, Mar. 2004, doi: 10.1073/pnas.0400087101.
[2] Y. Yuan, J. Yan, and P. Zhang, “Assortativity measures for weighted and directed networks,” Journal of Complex Networks, vol. 9, no. 2, p. cnab017, Apr. 2021, doi: 10.1093/comnet/cnab017.
[3] M. Barthélemy, “Spatial networks,” Physics Reports, vol. 499, no. 1–3, pp. 1–101, 2011.
[4] V. D. Blondel, A. Decuyper, and G. Krings, “A survey of results on mobile phone datasets analysis,” EPJ Data Science, vol. 4, no. 1, p. 10, Dec. 2015, doi: 10.1140/epjds/s13688-015-0046-0.
[5] M. Bailey, R. Cao, T. Kuchler, J. Stroebel, and A. Wong, “Social Connectedness: Measurement, Determinants, and Effects,” Journal of Economic Perspectives, vol. 32, no. 3, pp. 259–280, Aug. 2018, doi: 10.1257/jep.32.3.259.
Social Network Analysis
Let’s analyze the social network properties. First we create the social network using the igraph package. To do that we first clean the network to have significant values of SCI between areas and remove outliers next to zero, as we did in Lab 4-2. Next, we create the graph from the data frame.
Let’s visualize the graph. As layout we will use the coordinates for the centroid of each zip code. For simplicity we only plot the top 5000 edges by scaled_sci and use the ggraph package for visualization.
Local properties
In igraph there are some reserved names for the attributes of nodes and edges. One of them is weight, which igraph treats as the weight of the edges. In our case that is the scaled_sci value. We can set it as the weight of the edges so that it is used in the analysis.
Since SCI measures all the social connections between users in different zip codes, the graph is very dense. It contains
and
Thus, the average degree of nodes is very large
And here is the distribution
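The numbers and the plot for this step are not reproduced in this text. As a sketch of how these quantities can be obtained with igraph, the code below uses a toy near-complete graph in place of the real network.

```r
library(igraph)

# Toy dense graph: complete graph on 5 nodes with one edge removed
g_toy <- delete_edges(make_full_graph(5), 1)

n_nodes <- vcount(g_toy)
n_edges <- ecount(g_toy)

# Average degree and fraction of possible edges that are present
deg        <- degree(g_toy)
avg_degree <- mean(deg)
density    <- edge_density(g_toy)

c(nodes = n_nodes, edges = n_edges, avg_degree = avg_degree, density = density)

# Degree distribution as a frequency table
table(deg)
```

For the real network, `ggplot(data.frame(deg), aes(deg)) + geom_histogram()` is one way to draw the distribution.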
Let’s get the largest connected component
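A minimal sketch of this step with igraph, on a toy graph with two components. The name `gcc_toy` mirrors the `gcc` object used elsewhere in the lab; how `gcc` itself was built is not shown in this text.

```r
library(igraph)

# Toy graph with two components: A-B-C and D-E
g_toy <- graph_from_data_frame(
  data.frame(from = c("A", "B", "D"), to = c("B", "C", "E")),
  directed = FALSE
)

# Component membership of each node and size of each component
comp    <- components(g_toy)
largest <- which.max(comp$csize)

# Keep only the nodes of the largest component, with their edges
gcc_toy <- induced_subgraph(g_toy, which(comp$membership == largest))

V(gcc_toy)$name
```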
Our network is weighted, so we can also build the strength of the nodes
The strength or SCI of each edge depends on the distance.
As in Lab 4.1, we find a statistically significant negative correlation between distance and social connectivity.
Note, however, that the correlation is weaker than the one we found for counties.
Clustering
Let’s see the clustering coefficient of the network. There are several generalizations of the clustering coefficient for weighted graphs. Here we use the definition by A. Barrat et al. [1]:
$$c_i^w = \frac{1}{s_i (k_i - 1)} \sum_{j,h} \frac{w_{ij} + w_{ih}}{2} a_{ij} a_{ih} a_{jh}$$
where $s_i$ is the strength of node $i$, $k_i$ is the degree of node $i$, $w_{ij}$ is the weight of the edge between nodes $i$ and $j$, and $a_{ij}$ is the adjacency matrix of the graph.
As with other spatial weighted networks (see [1]), the clustering coefficient decreases with node strength, which means that zip codes with smaller strength, i.e., with lower probability of having social relationships with other zip codes, have more clustered environments.
Assortativity
Next we investigate the assortativity of urban social connections by different types of demographic traits. Since we have a weighted network, we cannot use the traditional assortativity coefficient defined by M.E.J. Newman. Instead, we use the extension by [2].
Here is the assortativity
As we can see, the network has a large assortativity by income and race, which means that zip codes with similar income and racial composition are more likely to have social connections, i.e., there is social segregation.
Centrality Measures
Let’s see how central the zip codes are in the network. We will use the betweenness centrality, which measures the fraction of shortest paths that pass through a node. Since SCI measures the probability of having a social connection between two zip codes, a larger SCI should correspond to a shorter path, so we will use the inverse of the SCI as the weight (length) of the edges.
Let’s see the distribution of centrality in a choropleth map
Interestingly, the zip codes with more centrality are in the peripheral areas of the Boston Metro Area.