Lecture 6
Spatial Inference and
Machine Learning
in CUS

Esteban Moro

Network Science Institute | Northeastern University

NETS 7983 Computational Urban Science

2025-03-31

Welcome!

This week:

Spatial Inference and Machine Learning for Urban Science

GNNs to understand spatial structures. From [1]

Aims

  • Understand the role of spatial data properties in statistical inference and machine learning.
  • Overview of statistical inference and machine learning methods for spatial data.
  • Practical: spatial regression for inference and prediction models for urban areas.

Spatial modeling in CUS

Most data in Computational Social Science contains spatially embedded features of cities.

Spatial data has specific properties that must be considered when making inferences or using machine learning.

  • Spatial Autocorrelation: Similar values tend to be close to each other.
  • Spatial Heterogeneity: The relationship between variables can change over space.
  • Spatial Context: The value at a location can depend on the variables in surrounding locations.

Failure to incorporate these properties can lead to biased estimates and predictions in our models.

Geography as a tool

Remember:

  • Geography is a feature: it is not just a place but a feature that can help us understand the data and inform our models
    • For example, exposure to fast food environments as we move around during the day can impact how much we consume and our health [2]. Thus, we might need to incorporate the spatial distribution of fast-food restaurants into our models.
  • Geography is also an instrument: it can help us build better models by incorporating spatial relationships.
    • For example, since pollution is spatially correlated, we can use pollution levels in nearby areas as an instrument to estimate the effect of pollution in a given area.
  • Geography is a source: it can help us build new variables by incorporating new spatial relationships between our observed predictors and maps, resources, census, etc.
    • For example, we can use the distance to the nearest park as a new variable in our model to predict the price of a house.

Inference vs prediction in CUS

CUS is more than just large datasets in the urban context. It changes our way of describing, modeling, understanding, and predicting urban phenomena.

  • Inference: we want to understand the relationship between variables and how they change over space.

\[ y = f(x) + \varepsilon \]

  • For example, we might want to understand the relationship between income and education over space and how it is affected by the presence of parks.

  • The priority here is the interpretation of the coefficients and the uncertainty of our estimates. For example, we are more interested in \(\partial y / \partial x\) than in the value of \(y\).

Inference vs prediction in CUS

  • Prediction: we want to predict the value of a variable in a given location based on the values of other variables in that location and in nearby locations.

\[ y = f(x) + \varepsilon \]

  • For example, we might want to predict the price of a house based on the number of rooms, the distance to the nearest park, and the average price of houses in the neighborhood.

  • The priority here is the accuracy of our predictions and the ability to generalize to new data. That is, we are interested in \(|y - f(x)|\), especially when \(x\) is new.

Inference vs prediction in CUS

Most inference and prediction models describe steady situations, without interventions or changes. Beyond them, there are two further dimensions:

  • Causal inference: we want to understand the effect of a change in a variable on another variable, usually where effects are small
  • Integrative modeling: use principled models to predict outcomes and significant effects not yet seen.

From [3]

Spatial inference

Spatial inference is the process of understanding the relationship between spatially distributed variables.

Key concepts:

  • Spatial autocorrelation: the value of a variable in one location is correlated with the variable’s value in nearby locations.

  • Spatial heterogeneity: the relationship between variables can change over space or time.

  • Modifiable areal unit problem (MAUP): the relationship between variables can change depending on the spatial scale at which they are measured.

Our inference models should incorporate those concepts.

Methods for Spatial Inference

Methods:

  • Exploratory Spatial Data Analysis (ESDA): we can use spatial statistics to understand the spatial distribution of our variables, and to identify spatial patterns.

    • We showed some of these methods in the last lecture, such as Moran’s I, LISA, and SKATER.
  • Spatial regression or machine learning: we can use spatial regression models to understand the relationship between variables and how they change over space.

    • We will show some of these methods in this lecture.

Inference through Spatial Regression

Regression models are a powerful tool to understand the relationship between variables. In most papers/research, we see the following equation:

\[ y = \alpha + \beta X + \varepsilon \]

Where \(y\) is the dependent variable (vector), \(X\) are the independent variables (matrix), and \(\varepsilon\) is the error term (vector). Each of the components of \(y\) is \(y_i\) where \(i\) refers to a geographical unit, like a city, a neighborhood, CBG, etc.

  • The goal is to estimate the coefficients \(\beta\) that best explain the relationship between \(y_i\) and \(X_i\).
  • Methods like OLS, GLM, etc., are used to estimate these coefficients.
  • One of the key assumptions of these models is that the error term \(\varepsilon_i\) is independent and identically distributed (i.i.d).
  • In most situations, though, \(y_i\), \(X_i\) and \(\varepsilon_i\) are spatially correlated. This means that the value of \(y_i\) in a location is correlated with the value of \(y_j\) in a nearby location.

Spatial Weight Matrix

How can we incorporate spatial correlation in our regression models? Since the values of variables in one place are related to those in nearby places, we can use the relationships between neighbors.

  • To do that, we are going to define a spatial weight matrix \(W\) that defines the relationship between locations.

  • If location \(i\) is a neighbor of location \(j\), then \(w_{ij} \neq 0\) (typically \(w_{ij} = 1\)); otherwise, \(w_{ij} = 0\).

  • Most of the time, we use “contiguity” to define neighbors. For example, here are the Rook, Bishop, and Queen contiguity matrices for a 3x3 grid:

Contiguity-based spatial matrix
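
As a minimal sketch, a contiguity-based weight matrix can be built in R with the spdep package. Here, `tracts` is a hypothetical sf polygon layer of geographical units:

library(sf)
library(spdep)

nb_queen <- poly2nb(tracts, queen = TRUE)    # Queen contiguity: shared edge or vertex
nb_rook  <- poly2nb(tracts, queen = FALSE)   # Rook contiguity: shared edge only
lw <- nb2listw(nb_queen, style = "W")        # row-standardized weights (each row sums to 1)
summary(lw)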

Spatial Weight Matrix

  • Another way is to define neighbors based on distance. For example, we can define a weight matrix \(W\) where

\[w_{ij} = 1/d_{ij}\] where \(d_{ij}\) is the distance between locations \(i\) and \(j\), or use a kernel function to define the relationship between locations.

\[ w_{ij} = f(d_{ij}) \]

  • The choice of the weight matrix \(W\) is a model choice. It can have a big impact on the results of our model, and thus, it has to be chosen carefully and based on our knowledge of the data.
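
As a sketch of distance-based weights (again using the hypothetical `tracts` layer), spdep can build neighbor lists from centroid distances and attach inverse-distance weights:

library(sf)
library(spdep)

coords <- st_coordinates(st_centroid(tracts))       # centroids of the hypothetical polygons
nb     <- dnearneigh(coords, d1 = 0, d2 = 5000)     # neighbors within 5,000 map units
dists  <- nbdists(nb, coords)                       # distances d_ij to each neighbor
w_inv  <- lapply(dists, function(d) 1 / d)          # w_ij = 1 / d_ij
lw_d   <- nb2listw(nb, glist = w_inv, style = "W")  # row-standardized inverse-distance weights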

Spatial Lag Models

Once we have defined the weight matrix \(W\), we can incorporate it into our regression model. The most common way to do this is to use a spatial lag model (SLM):

\[ y = \alpha + \rho W y + \beta X + \varepsilon \] Where \(y\) is the dependent variable, \(X\) the independent variables, \(\varepsilon\) the error term, and \(\alpha\) a constant. The term \(\rho W y\) is the spatial lag term: the weighted average of the dependent variable in the neighboring locations.

  • The parameter \(\rho\) is the spatial autoregressive parameter, which measures the strength of the spatial autocorrelation in the dependent variable.

  • This model assumes that a spatial process affects the dependent variable. For example, we might think that the income in one area will affect the income in nearby areas (spillover). SLM is a principled way to include those potential spillovers.
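
A minimal sketch of an SLM fit with the spatialreg package, assuming a hypothetical dataset `tracts` with income, education, and parks columns and the weights `lw` from the earlier sketches; impacts() separates direct effects from spatial spillovers:

library(spatialreg)

slm <- lagsarlm(income ~ education + parks, data = tracts, listw = lw)
summary(slm)               # rho: strength of the spatial autocorrelation in y
impacts(slm, listw = lw)   # direct vs. indirect (spillover) effects of each X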

Spatial Lag Models

  • One of the main problems with SLM is that it assumes that the relationship between the dependent and independent variables is the same in all locations. This assumption is known as spatial homogeneity.

  • Also, SLM assumes that the spatial relationship is linear. This might not be the case in practice, as the relationship between variables can be non-linear. However, this can be addressed by using a non-linear spatial lag model.

  • More importantly, SLM does not allow us to interpret \(\beta\) as the effect of the independent variable \(X_i\) on the dependent variable \(y_i\) at location \(i\). This is because the spatial lag term \(\rho W y\) is a function of the dependent variable \(y\) itself. Thus, \(X_j\) at neighboring locations will affect \(y_i\) at location \(i\). (see practical)

Spatial Error Models

We can use a spatial error model (SEM) to address this issue. In SEM, we assume that the spatial process is affecting the error term \(\varepsilon\):

\[ y = \alpha + \beta X + u \] \[ u = \lambda W u + \varepsilon \] Where \(u\) is the error term, and \(\lambda\) is the spatial autoregressive parameter that measures the strength of the spatial autocorrelation in the error term.

  • In this model, we assume that there is a spatial process that affects the error term but does not directly affect the dependent variable.

  • This allows us to interpret \(\beta\) as the effect of the independent variable \(X_i\) on the dependent variable \(y_i\) at location \(i\).
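
A minimal sketch of the corresponding SEM fit, with the same hypothetical data and weights as above:

library(spatialreg)

sem <- errorsarlm(income ~ education + parks, data = tracts, listw = lw)
summary(sem)   # lambda: strength of the spatial autocorrelation in the error term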

Other Spatial Models

There are many other spatial regression models that we can use to incorporate spatial correlation in our models; see [4] [5]. Some of the most common ones are:

  • SLX: Spatial lag of X model, where the spatial structure is introduced in the independent (exogenous) variables.

  • SDM: Spatial Durbin Model, where the spatial structure is introduced in the dependent and independent variables.

Relationship between spatial regression models
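
As a sketch, an SLX specification can be fit with spatialreg::lmSLX, which augments OLS with the spatially lagged covariates \(WX\) (same hypothetical data and weights as before):

library(spatialreg)

slx <- lmSLX(income ~ education + parks, data = tracts, listw = lw)
summary(slx)   # the lagged (WX) terms capture the effect of neighbors' covariates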

Spatial Heterogeneous Models

The models considered so far are homogeneous; that is, they assume that the relationship between variables is the same in all locations. But we know this might not be the case in spatial data. One way to address this issue is to use a spatial heterogeneity model. In these models, we assume that the relationship between variables can change over space. Remember our OLS linear model:

\[y_i = \alpha + \sum_k \beta_k x_{ik} + \varepsilon_i\] There are two ways to incorporate spatial heterogeneity:

  • Spatial Fixed Effects models: Allowing geographical variations of \(\alpha\).

\[ y_i = a_{C(i)} + \sum_k \beta_k x_{ik} + \varepsilon_i \] where \(C(i)\) is the geographical unit to which location \(i\) belongs.

Spatial Fixed Effects

  • At the core of spatial fixed effects is the idea that there are geographical effects on the dependent variable that we should consider.
  • For example, consider income: it is very different across neighborhoods, counties, and states in the US. Thus, a model explaining it should take into account those differences.
  • If you think about it, a spatial fixed effects model is a way to “normalize” income by geographical unit: once those unit-level differences are removed, the effect of the predictors is assumed to be the same everywhere.

\[ y_i - a_{C(i)} = \alpha + \sum_k \beta_k x_{ik} + \varepsilon_i \]
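
A minimal sketch of a spatial fixed effects model in base R, assuming the hypothetical `tracts` data has a `county` column identifying \(C(i)\):

fe <- lm(income ~ education + parks + factor(county), data = tracts)
summary(fe)    # each county gets its own intercept a_C(i); beta is estimated within counties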

Geographically Weighted Regression

  • Another way to allow for spatial heterogeneity is to let the coefficients \(\beta_k\) change over space. This is typically known as a spatial random effects model or Geographically Weighted Regression (GWR).

\[ y_i = \alpha + \sum_k \beta_{k}(lat_i, lon_i) x_{ik} + \varepsilon_i \]

  • In this model, we are allowing the coefficients \(\beta_k\) to change over space. This allows us to capture the spatial heterogeneity in the relationship between variables.

  • The coefficients \(\beta_k\) are estimated by locally weighted regressions, moving a local search window from one regression point to the next.

GWR with fixed spatial kernel. Source [6]

Geographically Weighted Regression

Thus, we need to define a kernel function that determines the relationship between locations. The most common kernel functions are Gaussian, exponential, and uniform, and they can be either fixed or adaptive.

  • Fixed kernel: the same bandwidth is used at every location. This is the simplest choice, but it might not be the best one if the density of observations or the relationship between variables changes over space.
  • Adaptive kernel: the bandwidth changes from location to location (for example, so that each local regression uses the same number of neighbors). This captures spatial heterogeneity better, but it is more computationally expensive.

For example, here are the weights for a fixed Gaussian kernel

\[ w_{ij} = \exp\left(-\frac{d_{ij}^2}{2h^2}\right) \] GWR methods find the best bandwidth \(h\) that minimizes the error of the model.
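
A minimal GWR sketch with the spgwr package, assuming a hypothetical SpatialPointsDataFrame `spdf` with the same variables as before; gwr.sel() searches for the bandwidth \(h\) by cross-validation:

library(spgwr)

bw  <- gwr.sel(income ~ education + parks, data = spdf, gweight = gwr.Gauss)
fit <- gwr(income ~ education + parks, data = spdf, bandwidth = bw, gweight = gwr.Gauss)
fit$SDF    # local coefficients beta_k for every regression point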

Geographically Weighted Regression

  • GWR is computationally expensive, as it requires estimating the coefficients \(\beta_k\) for each location. This can be a problem if we have a large number of locations.

  • GWR is also sensitive to the choice of kernel function and bandwidth, which can greatly impact the model results.

  • GWR is particularly affected by problems like local multicollinearity, edge effects, etc.

  • GWR is difficult to interpret and compare to other models.

  • For these reasons, GWR is generally recommended as an exploratory technique rather than a confirmatory one.

require(spgwr)
Loading required package: spgwr
Loading required package: sp
Loading required package: spData
To access larger datasets in this package, install the spDataLarge
package with: `install.packages('spDataLarge',
repos='https://nowosad.github.io/drat/', type='source')`
NOTE: This package does not constitute approval of GWR
as a method of spatial analysis; see example(gwr)

Software for Spatial Inference

Software:

  • In R, we can use the spatialreg package to estimate spatial regression models.
  • For GWR models, we can use the spgwr package.
  • In Python, we can use the pysal ecosystem (e.g., spreg for spatial regression models and mgwr for GWR models).

We will use those packages in our practical session.

Machine Learning in CUS

Machine learning is a powerful tool in CUS. Again, we need to be aware of the spatial properties of our data when using machine learning models. There are three areas of machine learning which are affected by spatial data:

  • Feature Engineering: we can use spatial relationships between variables and geography to create new features to help us build better models.

  • Model Training: since data is spatially correlated, we have to use spatial cross-validation to prevent data leakage and overfitting.

  • Model Selection: Most models are based on statistical tests that assume the data is independent and identically distributed. However, this is not the case in spatial data, so we have to use spatial tests to select the best model.

Feature engineering in CUS

  • Feature Engineering: we can use spatial relationships between variables and geography to create new features to help us build better models.

    • For example, we can use the distance to the nearest park as a new feature in our model to predict the price of a house (map matching)

    • We can use spatial relationships between variables to create new features to help us build better models. For example, we can aggregate income data at the neighborhood level to create a new feature that can help us predict the price of a house (map synthesis)

From [7]
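
As a sketch of map matching and map synthesis with the sf package, assuming hypothetical layers `houses` (points), `parks` (polygons), and `neighborhoods` (polygons with an avg_income column):

library(sf)

# map matching: distance from each house to its nearest park
idx <- st_nearest_feature(houses, parks)
houses$dist_park <- st_distance(houses, parks[idx, ], by_element = TRUE)

# map synthesis: attach the average neighborhood income to each house
houses <- st_join(houses, neighborhoods["avg_income"])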

Model Training in CUS

  • Model Training: since data is spatially correlated, we must use spatial cross-validation to prevent data leakage and overfitting.

    • In spatial cross-validation, we split the data into training and testing sets so that nearby, spatially dependent observations do not end up on both sides of the split. This prevents data leakage and overfitting.

    • For example, we can use spatial k-fold cross-validation, where the folds are defined by spatial blocks or clusters rather than by random sampling.

Cross-validation in CUS

There are many methods to do spatial cross-validation, like:

  • CV by geographical area: using neighborhoods, counties, states, or rectangular or hexagonal regions to split the data.
  • CV by spatial distance: using the distance between data points to split the data.
  • CV by spatial clustering: using spatial clustering to split the data into groups.

The choice of the method depends on the spatial properties of the data and the research question.

Hyperparameter tuning using spatial CV. From Geocomputation with R
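
Packages such as spatialsample, blockCV, or mlr3spatiotempcv implement these strategies; a minimal base-R sketch using clustering-based folds over a hypothetical `houses` point layer looks like this:

library(sf)

coords <- st_coordinates(houses)
set.seed(1)
houses$fold <- kmeans(coords, centers = 5)$cluster   # 5 spatially compact folds

for (k in 1:5) {
  train <- houses[houses$fold != k, ]
  test  <- houses[houses$fold == k, ]
  # fit on `train`, evaluate on `test`: nearby points stay on the same side of the split
}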

Cross-validation in CUS

In general, training models on spatial data without considering its spatial properties can lead to over-optimistic and biased results. For example, a large Random Forest model was trained and validated to predict aboveground biomass (AGB) in Africa with \(R^2 = 0.53\).

From [8]

However, when spatial cross-validation was used, the \(R^2\) dropped to 0.14. See [8]

Model Selection in CUS

  • Model Selection. Finally, we need to incorporate spatial statistical tests to validate and choose our models.

    • For example, we can use spatial tests like Moran’s I, LISA, etc., to validate our hypotheses about the spatial independence of variables and residuals.

    • We also have to split the data into training and testing sets so that the spatial relationships between the data points are preserved. This prevents data leakage and overfitting.

    • We must shift the hypothesis tests in those models from “statistical significance” to “spatial significance.”
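
For example, a minimal sketch of a spatial diagnostic on regression residuals with spdep, reusing the hypothetical data and weights from the earlier sketches:

library(spdep)

ols <- lm(income ~ education + parks, data = tracts)
lm.morantest(ols, listw = lw)   # Moran's I test for spatial autocorrelation in the residuals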

Network models in CUS

Example: Community finding algorithms in spatial networks. As we saw in the previous lectures, social networks have a strong spatial component. In particular, the weights of the edges in the network can be related to the spatial distance between the nodes.

Community finding algorithms try to find partitions that maximize Modularity, a measure of the quality of the partition. Modularity is defined as:

\[ Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j) \]

where \(A_{ij}\) is the weight of the edge between nodes \(i\) and \(j\), \(k_i\) is the degree of node \(i\), \(m\) is the total weight of the edges in the network, and \(\delta(c_i, c_j)\) is the Kronecker delta function that is 1 if nodes \(i\) and \(j\) are in the same community, and zero otherwise.

Note that modularity depends on a null model: here, the configuration model term \(k_i k_j / 2m\).

Network models in CUS

Using that definition, most community algorithms find contiguous spatial communities. Here are the communities that Expert et al. [9] found in Belgium using the social network of calls between areas:

Network models in CUS

However, we know that the social relationships between areas are stronger when the areas are closer. Thus, we can use as a null model a gravity-like model that incorporates the spatial distance between areas:

\[ P_{ij} = N_i N_j f(d_{ij}) \]

and modify the modularity accordingly:

\[ Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - P_{ij} \right) \delta(c_i, c_j) \] Using this new definition of modularity, Expert et al. [9] found a different partition of the Belgian areas, one more closely related to the linguistic separation of the country.
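
A minimal sketch of this spatially corrected modularity in base R, assuming hypothetical inputs: A (symmetric matrix of call weights between areas), N (area sizes), D (distance matrix between areas), comm (community label of each area), and an illustrative exponential deterrence function \(f(d)\):

spatial_modularity <- function(A, N, D, comm, d0 = 10) {
  f    <- exp(-D / d0)                 # one possible choice for the deterrence function f(d)
  P    <- outer(N, N) * f              # gravity-like null model P_ij = N_i N_j f(d_ij)
  P    <- P * sum(A) / sum(P)          # rescale the null to the same total weight as A
  m    <- sum(A) / 2
  same <- outer(comm, comm, "==")      # Kronecker delta(c_i, c_j)
  sum((A - P) * same) / (2 * m)
}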

Conclusions

  • Consider the spatial properties of the data when making inferences and using machine learning in CUS.
  • If you do not, your models will be biased and over-optimistic.
  • Use spatial regression models to incorporate spatial correlation in your models. At least check if your results from simple OLS change significantly.
  • Use spatial cross-validation to prevent data leakage and overfitting in your machine-learning models.
  • Use spatial tests to validate and choose your models.
  • Use spatial relationships between variables and geography to create new features to help you build better models.
  • Most spatial inference and machine learning models are computationally expensive, so be careful when using them.

Further reading

  • Integrating explanation and prediction in computational Social Science [3]
  • Spatial Data Analysis, a Handbook by Tobias Rüttenauer [4]
  • An Introduction to Spatial Econometrics by James LeSage [5]
  • Geocomputation with R by Robin Lovelace et al. link
  • Geographic Data Science with Python by Rey, Arribas-Bel, and Wolf link
  • Spatial Machine Learning by Kopczewska [10]
  • Geospatial Analysis, book by Smith, Goodchild, Longley [11]
  • Becoming a Spatial Data Scientist. E-book by the Carto Company. link

References

[1]
C. Fan, Y. Yang, and A. Mostafavi, “Neural embeddings of urban big data reveal spatial structures in cities,” Humanities and Social Sciences Communications, vol. 11, no. 1, pp. 1–15, Mar. 2024, doi: 10.1057/s41599-024-02917-6.
[2]
B. García Bulle Bueno et al., “Effect of mobile food environments on fast food visits,” Nature Communications, vol. 15, no. 1, p. 2291, Mar. 2024, doi: 10.1038/s41467-024-46425-2.
[3]
J. M. Hofman et al., “Integrating explanation and prediction in computational social science,” Nature, vol. 595, no. 7866, pp. 181–188, Jul. 2021, doi: 10.1038/s41586-021-03659-0.
[4]
T. Rüttenauer, “Spatial Data Analysis.” arXiv, Feb. 2024. doi: 10.48550/arXiv.2402.09895.
[5]
J. P. LeSage, “What Regional Scientists Need to Know About Spatial Econometrics.” Rochester, NY, Jan. 2014. doi: 10.2139/ssrn.2420725.
[6]
A. S. Fotheringham, C. Brunsdon, and M. Charlton, Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. John Wiley & Sons, 2003.
[7]
X. Sun, W. Xu, H. Jiang, and Q. Wang, “A deep multitask learning approach for air quality prediction,” Annals of Operations Research, vol. 303, no. 1, pp. 51–79, Aug. 2021, doi: 10.1007/s10479-020-03734-1.
[8]
P. Ploton et al., “Spatial validation reveals poor predictive performance of large-scale ecological mapping models,” Nature Communications, vol. 11, no. 1, p. 4540, Sep. 2020, doi: 10.1038/s41467-020-18321-y.
[9]
P. Expert, T. S. Evans, V. D. Blondel, and R. Lambiotte, “Uncovering space-independent communities in spatial networks,” Proceedings of the National Academy of Sciences, vol. 108, no. 19, pp. 7663–7668, May 2011, doi: 10.1073/pnas.1018962108.
[10]
K. Kopczewska, “Spatial machine learning: New opportunities for regional science,” The Annals of Regional Science, vol. 68, no. 3, pp. 713–755, Jun. 2022, doi: 10.1007/s00168-021-01101-x.
[11]
M. J. D. Smith, M. F. Goodchild, and P. Longley, Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools. Troubador Publishing Ltd, 2007.