Network Science Institute | Northeastern University
NETS 7983 Computational Urban Science
2025-03-31
This week:
Introduction to Urban Data - Mobile Phone Data as a tool for understanding urban systems
Mobile phone data is a powerful tool for understanding urban systems. It provides detailed information about the behavior and mobility of individuals, which can be used to study a wide range of urban phenomena, including transportation, land use, social interactions, and economic activity.
Mobile phone data has several advantages over traditional data sources, such as surveys and censuses:
Three main types of mobile phone data are used in urban science:
Call Detail Records (CDRs): These are records of calls, texts, and data usage made by mobile phone users. They include information about the time, duration, and location of the communication.
Location-based Services (LBS) data: This data comes from apps that use users’ geolocation to provide a service. It includes information about the user’s location, the time of the location, and the type of location.
Bluetooth and Wi-Fi data: This data comes from mobile phones’ Bluetooth and Wi-Fi signals. It includes information about the user’s proximity to other devices and the time of the proximity.
These three types of data have very different spatial and temporal scales.
Mobile phone data is widely used today, especially after COVID-19, by different groups of organizations:
Academia: Researchers use mobile phone data to study urban systems, develop models of human behavior, and test social science theories.
Government: Policymakers use mobile phone data to inform public policy, monitor the impact of interventions, and plan for emergencies. For example, the CDC tracks the spread of diseases, the DOT monitors traffic congestion, and National Statistical Offices use it to complement traditional census data.
Industry: Companies use mobile phone data for marketing, advertising, and product development.
Nonprofits: NGOs use mobile phone data to understand the needs of vulnerable populations, target resources, and evaluate the impact of their programs. For example, the UN and UNICEF constantly use mobile phone data to understand the effects of natural disasters, refugee movements, and the spread of diseases.
Mobile phone data is used in urban science to study a wide range of urban phenomena, including:
Transportation: For example, CDR or LBS data can be used to study travel patterns, traffic congestion, and public transportation usage. [1] [2]
Land use: CDR and LBS can be used to study the distribution of land uses, the density of buildings, and the use of public spaces. [3] [4] [5]
Social interactions: CDR data, like calls or texts, can be used to study social networks, the spread of information, and the dynamics of social groups. [3] [6] [7]
Economic activity: LBS data can be used to study the location of businesses, the flow of goods and services, and the impact of economic policies. [8] [9] [10]
Public health: LBS data and Bluetooth data can be used to study the spread of diseases, the effectiveness of public health interventions, and the impact of environmental factors on health. [11] [12] [13] [14]
Natural disasters: Mobile phone data like CDR or LBS can be used to study the impact of natural disasters on urban systems, the effectiveness of emergency response, and the resilience of cities. [15] [16] [17]
Mobile phone data is a typical example of secondary data. Private companies usually collect data for marketing, billing, network optimization, and other purposes. The data is then sold to third parties, who use it for research, policy, or commercial purposes.
It is important to understand how the data is collected and processed, as this can affect the quality and reliability of the results. Some of the main challenges of using mobile phone data for urban science include:
Drifting: Mobile phone data is subject to changes in how it is collected, processed, and stored. This can lead to inconsistencies and errors in the data, which can affect the validity and reliability of the results.
Privacy: Mobile phone data is highly sensitive and can be used to infer personal information about individuals. This raises concerns about privacy, data security, and the potential for data misuse.
Bias: Mobile phone data is not representative of the general population and is subject to various biases, such as demographic, geographical, temporal, and behavioral biases. This can affect the validity and generalizability of the results.
Data accessibility and Data processing: Mobile phone data is expensive and difficult to access. It is typically stored in large databases and requires specialized analytical tools and techniques. This can create barriers to entry for researchers, policymakers, and the public.
Call Detail Records (CDRs) are records of calls, texts, and data usage made by mobile phone users. They include information about the time, duration, and location of the communication. Here is an example of that data
Note that each row is a communication event. XDR (Extended Detail Records) also contains data events, such as data usage and app usage.
This means that the spatial accuracy of CDR data is typically around 100 meters in urban areas, but much coarser in rural areas. This is one of CDR data’s main limitations in studying small urban spaces.
CDR temporal resolution: CDR data typically have a temporal resolution of minutes or hours. This means that we can track individuals’ movements over time and study their behavior in detail.
Population coverage of CDR: CDR data typically covers a large fraction of the population, since most people own a mobile phone. Generating CDRs requires only basic mobile network coverage, not a data connection (which typically needs 3G or 4G). Around 97% of the global population (and 90% in the least developed countries) has mobile network coverage, whereas only 55% of devices in the least developed countries are data-enabled. This means that we can study the behavior of a large number of individuals and make inferences about the general population.
Mobile coverage rate worldwide, from Statista
There are 3 types of information we can get from CDR data:
Social interaction: CDR data contains highly detailed information about when and where people communicate with each other. This information can be used to study social networks, the spread of information, and the dynamics of social groups.
Mobility: CDR data contains information about the user’s location at the time of the communication. This information can be used to reconstruct the mobility patterns of individuals and areas.
Data usage: XDR data contains information about the user’s data usage, including the type of application and the location where the data is consumed. This information can be used to study the digital behavior of users.
This wealth of information made CDRs a powerful tool for understanding social and urban systems.
For example, using CDR data to reconstruct the social network of people, Onnela and collaborators [3] studied the Granovetter hypothesis that weak ties are more important for the spread of information than strong ties at a societal level.
From [3]
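The step from raw CDR rows to a social network can be sketched as follows. This is a minimal illustration, not the pipeline used in [3]: the rows in `calls` are invented, tie strength is assumed to be total call duration, and weak versus strong ties are compared with the neighborhood overlap, a measure commonly used in this line of work.

```python
from collections import defaultdict

# Hypothetical CDR rows: (caller_id, callee_id, duration_seconds).
calls = [
    ("a", "b", 120), ("b", "a", 60), ("a", "c", 30),
    ("b", "c", 45), ("c", "d", 300), ("d", "e", 20),
]

def build_network(rows):
    """Aggregate call events into an undirected weighted network."""
    weight = defaultdict(int)      # total call duration per pair (assumed tie strength)
    neighbors = defaultdict(set)   # adjacency sets
    for u, v, duration in rows:
        if u == v:
            continue               # ignore self-calls
        edge = tuple(sorted((u, v)))
        weight[edge] += duration
        neighbors[u].add(v)
        neighbors[v].add(u)
    return dict(weight), dict(neighbors)

def overlap(neighbors, u, v):
    """Shared neighbors of u and v divided by all their other neighbors."""
    common = len(neighbors[u] & neighbors[v])
    total = (len(neighbors[u]) - 1) + (len(neighbors[v]) - 1) - common
    return common / total if total > 0 else 0.0

weight, neighbors = build_network(calls)
```

With real data, one would compare `overlap` across ties of different `weight` to see whether weak ties tend to bridge otherwise separate groups.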
Marta Gonzalez, Cesar Hidalgo, Laszlo Barabasi, and collaborators used CDR data to study the mobility patterns of individuals in urban areas [18]. They also found that people tend to move in predictable patterns, with a few locations accounting for most of their time and activity [19].
From [19]
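The idea that a few locations account for most of a person's activity can be checked with a very small computation. The sketch below is only an illustration: the sequence `towers` is invented, and counting observations per tower is an assumed proxy for time spent.

```python
from collections import Counter

def top_k_share(tower_sequence, k=2):
    """Fraction of observations that fall in the k most visited locations."""
    counts = Counter(tower_sequence)
    top = sum(n for _, n in counts.most_common(k))
    return top / len(tower_sequence)

# Hypothetical sequence of tower labels observed for one user.
towers = ["home", "work", "home", "gym", "work",
          "home", "home", "cafe", "work", "home"]
share = top_k_share(towers, k=2)   # share of events at the two top locations
```

In this toy sequence the two most visited locations account for 8 of the 10 observations.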
Dynamic Population Mapping Using Mobile Phone Data [20]
Traffic by app to show the inequality of data usage by different demographic groups [21]
This kind of data was used in the NetMob 2023 Data Challenge; see [22].
CDR data is prone to most of the problems of secondary data; see Lecture 1-2.
Incomplete: Most CDR data used in analyses come from a single mobile phone operator, which means that they may not represent the general population. Thus, we are missing social connections with a large fraction of the population.
Drifting: Users often change their mobile phone operator (churn). In developed countries, the churn rate is around 3% per month.
Bias: Although penetration of phones and smartphones is high in developed countries, mobile phone users are not representative of the general population in other geographies. For example, in developing countries, mobile phone users tend to be younger, wealthier, and more urban than the general population.
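To see why churn matters for longitudinal studies, the drifting point above can be turned into a short calculation. It assumes, as a simplification, that the 3% monthly churn is constant, that it applies independently each month, and that users who leave do not come back.

```python
def retained_fraction(monthly_churn, months):
    """Share of the initial users still with the operator after `months`,
    assuming a constant, independent monthly churn rate."""
    return (1.0 - monthly_churn) ** months

after_one_year = retained_fraction(0.03, 12)   # 0.97 ** 12
```

Under these assumptions roughly 69% of the initial panel is still observable after one year, so almost a third of the users in a one-year CDR study would drop out of the data.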
Here are some reviews to understand the value of CDR data:
Location-based Services data come from apps that use users’ geolocation to provide a service. For example, maps, place recommendations, weather, ride-sharing, and shopping apps collect users’ locations.
Apps collect those locations at different times. While an app is in use, it typically collects locations in the foreground, but many apps also collect them in the background to minimize response time.
From RST
Some apps sell location data to third parties. Aggregator companies like Cuebiq, Safegraph, Placer.ai, and others collect locations from different apps and curate and aggregate them to obtain detailed user trajectories in urban areas.
The Location Data Industry: Collectors, Buyers, Sellers, and Aggregators, from The Markup
This raw trajectory data is processed and combined with census data and Points-of-Interest (POI) datasets to produce secondary datasets:
Here is what the raw data looks like. Each row corresponds to a different location (ping) at a specific time for a particular device (id). Sometimes, companies have metadata about the device, such as the device type, the app that generated the data, or the demographics of users of that device.
LBS data comes from smartphones’ A-GPS technology. This technology uses the GPS signal to locate the device and cell towers to triangulate its position. Because of this, the accuracy of the location is not perfect and can vary from a few meters to 100 meters. LBS data typically reports a circle (the center as latitude and longitude, the radius as accuracy) where the user is most likely to be.
A-GPS technology, from Wikipedia
Because it uses cell towers and satellite signals, the location data can be affected by shadowing of buildings (urban canyon problem) or indoor localization problems.
From GEO-awesome
We can reconstruct an individual's mobility from the raw data by ordering the pings in time. Note that each point has an accuracy (radius) around it.
Raw data is typically processed to extract user activity and mobility information. Many techniques, including clustering, trajectory segmentation, and activity recognition, have been proposed for processing LBS data.
This research is typically known as trajectory data mining, and there is extensive literature about it; see [25] for a review.
The first step in processing this data is to clean it. This involves removing duplicate locations, filtering out locations with low accuracy, and removing outliers.
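A minimal sketch of this cleaning step, using only the standard library. The thresholds (`max_accuracy_m`, `max_speed_ms`), the tuple layout of a ping, and the speed-based definition of an outlier are assumptions made for illustration, and the sample pings are invented.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6_371_000.0  # mean Earth radius in meters
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))

def clean_pings(pings, max_accuracy_m=100.0, max_speed_ms=70.0):
    """pings: list of (timestamp_s, lat, lon, accuracy_m) for one device."""
    # 1. Drop exact duplicates and order the pings in time.
    ordered = sorted(set(pings))
    # 2. Drop pings whose reported accuracy radius is too large.
    ordered = [p for p in ordered if p[3] <= max_accuracy_m]
    # 3. Drop pings that imply an implausible speed from the last kept ping.
    kept = []
    for p in ordered:
        if kept:
            dt = p[0] - kept[-1][0]
            dist = haversine_m(kept[-1][1], kept[-1][2], p[1], p[2])
            if dt <= 0 or dist / dt > max_speed_ms:
                continue
        kept.append(p)
    return kept

# Hypothetical pings: one duplicate, one low-accuracy ping, one spatial jump.
raw = [
    (0, 42.3400, -71.0900, 10), (0, 42.3400, -71.0900, 10),
    (60, 42.3401, -71.0901, 500), (120, 42.3402, -71.0902, 15),
    (180, 43.3400, -71.0900, 12), (240, 42.3403, -71.0903, 20),
]
cleaned = clean_pings(raw)
```

Because the speed check is anchored on the last kept ping, a bad first ping would cause the following good pings to be dropped; production pipelines usually handle that case explicitly.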
The next step in processing those trajectories is to extract information about visits and trips. This can be done by clustering the locations in space and time. Many methods exist based on ideas similar to DBSCAN (or HDBSCAN). Examples include the Hariharan and Toyama algorithm [26] and Infostop [27].
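The space-time clustering idea can be sketched with a simple stay-point detector. This is a simplified illustration in the spirit of those methods, not an implementation of [26] or [27]: it groups consecutive pings that remain within `roam_m` of an anchor ping and keeps groups lasting at least `min_duration_s`. Both thresholds and the sample pings are assumptions.

```python
from math import radians, cos, sqrt

def approx_dist_m(lat1, lon1, lat2, lon2):
    """Equirectangular distance in meters; adequate at city scale."""
    x = radians(lon2 - lon1) * cos(radians((lat1 + lat2) / 2))
    y = radians(lat2 - lat1)
    return 6_371_000.0 * sqrt(x * x + y * y)

def detect_stays(pings, roam_m=200.0, min_duration_s=600):
    """pings: time-ordered list of (timestamp_s, lat, lon).
    Returns a list of stays as (start_s, end_s, mean_lat, mean_lon)."""
    stays, i, n = [], 0, len(pings)
    while i < n:
        j = i + 1
        # Extend the group while pings stay within roam_m of the anchor ping.
        while j < n and approx_dist_m(pings[i][1], pings[i][2],
                                      pings[j][1], pings[j][2]) <= roam_m:
            j += 1
        group = pings[i:j]
        if group[-1][0] - group[0][0] >= min_duration_s:
            lat = sum(p[1] for p in group) / len(group)
            lon = sum(p[2] for p in group) / len(group)
            stays.append((group[0][0], group[-1][0], lat, lon))
            i = j      # continue after the stay
        else:
            i += 1     # no stay here: slide the anchor forward
    return stays

# Hypothetical trajectory: a stay, a trip, then a second stay.
track = [
    (0, 42.3400, -71.0900), (300, 42.3401, -71.0901), (900, 42.3400, -71.0899),
    (1200, 42.3500, -71.0800), (1500, 42.3600, -71.0700),
    (1800, 42.3700, -71.0600), (2400, 42.3701, -71.0601), (3000, 42.3700, -71.0600),
]
stays = detect_stays(track)
```

The pings that fall between two consecutive stays form a candidate trip, which is the input for the trip and mode detection described next.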
At the same time, we can detect the trips between those visits. Using map-matching, public transportation schedules, and information about the trips, we can detect the mode of transportation.
We can also detect users’ home and work locations using the most common visits at night and during working hours. Using information about the area where people live, we can also assign some demographic traits to the device users (only at the census area level). That demographic assignment can also be used to correct population and demographic biases.
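The home and work heuristic can be sketched as a frequency count over detected visits. The hour windows (`NIGHT_HOURS`, `WORK_HOURS`), the visit tuple layout, and the rule that work must differ from home are assumptions for illustration; real pipelines often add minimum-evidence thresholds.

```python
from collections import Counter

NIGHT_HOURS = set(range(0, 6)) | set(range(22, 24))   # assumed night window
WORK_HOURS = set(range(9, 17))                        # assumed working hours

def most_common_place(visits, hours):
    """visits: list of (place_id, hour_of_day, is_weekday). Returns the place
    seen most often in the given hours, or None if there are no such visits."""
    counts = Counter(place for place, hour, _ in visits if hour in hours)
    return counts.most_common(1)[0][0] if counts else None

def home_and_work(visits):
    """Home: most common place at night. Work: most common weekday daytime
    place other than home."""
    home = most_common_place(visits, NIGHT_HOURS)
    weekday = [v for v in visits if v[2] and v[0] != home]
    work = most_common_place(weekday, WORK_HOURS)
    return home, work

# Hypothetical visits for one device.
visits = [
    ("A", 23, True), ("A", 2, True), ("B", 10, True), ("B", 14, True),
    ("A", 3, False), ("C", 11, True), ("A", 12, False),
]
home, work = home_and_work(visits)
```

Excluding the home place from the work candidates means people who work from home get no work location, which is a deliberate simplification here.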
Finally, we can use an external dataset of Points-of-Interest (POIs) or urban polygons to understand the type of places users visit. This process is called visit attribution, and it depends on the quality of the POI dataset, the accuracy of the visit detection algorithm, the density of POIs, etc. See more information about it, its limitations, and challenges in [28] and in this SafeGraph white paper.
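A very simple form of visit attribution assigns each detected stay to the nearest POI within a distance threshold. The sketch below makes that assumption explicit; the POIs, the 50-meter threshold `max_dist_m`, and the nearest-centroid rule are illustrative choices, and methods used in practice also account for POI footprints, opening hours, and location accuracy.

```python
from math import radians, cos, sqrt

def approx_dist_m(lat1, lon1, lat2, lon2):
    """Equirectangular distance in meters; adequate at city scale."""
    x = radians(lon2 - lon1) * cos(radians((lat1 + lat2) / 2))
    y = radians(lat2 - lat1)
    return 6_371_000.0 * sqrt(x * x + y * y)

def attribute_visit(stay_lat, stay_lon, pois, max_dist_m=50.0):
    """pois: list of (name, category, lat, lon). Returns the nearest POI
    within max_dist_m, or None when no POI is close enough."""
    best, best_d = None, max_dist_m
    for poi in pois:
        d = approx_dist_m(stay_lat, stay_lon, poi[2], poi[3])
        if d <= best_d:
            best, best_d = poi, d
    return best

# Hypothetical POI dataset.
pois = [
    ("Cafe X", "coffee", 42.3400, -71.0900),
    ("Gym Y", "fitness", 42.3403, -71.0900),
]
match = attribute_visit(42.34005, -71.0900, pois)
```

In dense areas several POIs can fall inside the accuracy circle of a single stay, which is one reason attribution quality depends on POI density.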
Apart from the obvious uses of LBS data for location-based services, there are many other applications in urban science, primarily related to understanding human mobility and behavior. Some of the most common applications include:
This, together with the possibility of breaking down this data by demographics, time of the day, day of the week, etc., makes LBS data a powerful tool for understanding urban systems.
Use of LBS data to understand the diversity of people visiting places or urban spaces. E.g., the “Atlas of Inequality” investigates the (income) diversity of the people visiting different places.
From Atlas of Inequality
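One possible way to quantify the income diversity of a place is to measure how far the visitors' income-quartile shares are from an even split. The function below is a hypothetical sketch of such a measure, not necessarily the one used by the Atlas of Inequality; the quartile grouping and the normalization to the range 0 to 1 are assumptions.

```python
def income_evenness_gap(quartile_shares):
    """quartile_shares: share of visits from each of 4 income quartiles
    (must sum to 1). Returns 0 for a perfectly even mix and 1 when all
    visitors come from a single quartile."""
    if len(quartile_shares) != 4 or abs(sum(quartile_shares) - 1.0) > 1e-9:
        raise ValueError("expected four shares that sum to 1")
    # Maximum of the sum of absolute deviations is 1.5, hence the 2/3 factor.
    return (2.0 / 3.0) * sum(abs(s - 0.25) for s in quartile_shares)

mixed_place = income_evenness_gap([0.25, 0.25, 0.25, 0.25])
segregated_place = income_evenness_gap([1.0, 0.0, 0.0, 0.0])
```

The quartile shares would come from the home-area demographic assignment of the devices that visit each place.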
Use mobility data to detect contact matrices between individuals by type of place
From [12]
Detect vehicular travel flow between urban areas
From [1]
LBS Data is prone to most of the problems of secondary data, see Lecture 1-2
Different types of biases in LBS data
Bias: There are many techniques to alleviate some of those biases. Some of these methods are:
Pre-stratification or panel definition: since the sample of users in LBS data is not random, we can use external data to correct those biases to select a more representative sample of users. Typically, we would like those users to be equally distributed across geographical areas, demographics, and behaviors. This is only possible if we have access to individual trajectories.
Post-stratification or weighting: if we cannot correct the biases in the sample, we can use the external data to weight the data to make it more representative. This is possible even for aggregate data. For example, if we know a particular outcome variable \(m_d\) for each demographic stratum \(d\) and the penetration rate of our LBS data in that stratum, we can derive a weight \(w_d\) for each stratum (for instance, its share of the population, which corrects for over- or under-sampled strata) and combine the stratum-level outcomes:
\[ \hat m = \sum_d w_d m_d \]
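The weighted estimate can be sketched numerically. The code reads \(w_d\) as the population share of stratum \(d\), that is \(N_d / N\) from a census; this reading is an assumption, and it is equivalent to weighting each device by the inverse of the penetration rate in its stratum. All numbers are invented.

```python
def naive_mean(sample_means, device_counts):
    """Unweighted estimate: every device in the sample counts the same."""
    total = sum(device_counts[d] for d in sample_means)
    return sum(device_counts[d] / total * m for d, m in sample_means.items())

def post_stratified_mean(sample_means, population_counts):
    """sum_d w_d * m_d with w_d = N_d / N (assumed population-share weights)."""
    total = sum(population_counts[d] for d in sample_means)
    return sum(population_counts[d] / total * m for d, m in sample_means.items())

# Hypothetical strata: outcome m_d, devices in the sample, census population.
m = {"low": 2.0, "mid": 3.0, "high": 5.0}
devices = {"low": 100, "mid": 300, "high": 600}
population = {"low": 4000, "mid": 4000, "high": 2000}

naive = naive_mean(m, devices)                   # dominated by the "high" stratum
weighted = post_stratified_mean(m, population)   # reweighted to census shares
```

Here the high-income stratum is over-represented in the sample, so the naive estimate (4.1) is pulled upward relative to the post-stratified one (3.0).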
Bias: One way to correct for potential biases is to use external ground truth data to validate and reweight our data. For example:
Comparison of the official average attendance to major professional sports games (NFL, NHL, and NBA) with estimations from LBS data
In [30], voting roll data validated the number of visits to polling places, allowing for a more detailed comparison of demographic groups.
Unacast has similarly validated LBS data against student seasonality for specific brands in US college towns and against worker counts at manufacturing facilities in the US.
The same technique applies to flows or trips. Using the Department of Transportation (DoT) data, [1] compared the results from LBS data to the number of trips in the DoT data.
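A comparison against ground truth usually reports how well the two series agree and what factor scales one onto the other. The sketch below uses a Pearson correlation and a no-intercept least-squares factor as assumed metrics; it is not the procedure of [1], and the numbers are invented and exactly proportional so the result is easy to read.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def scaling_factor(estimates, truth):
    """Least-squares factor k (no intercept) such that truth is about k * estimates."""
    return sum(e * t for e, t in zip(estimates, truth)) / sum(e * e for e in estimates)

# Hypothetical trips per origin-destination pair: LBS sample vs official counts.
lbs_trips = [10, 20, 30, 40]
official_trips = [100, 200, 300, 400]

r = pearson_r(lbs_trips, official_trips)
k = scaling_factor(lbs_trips, official_trips)
```

A high correlation with a stable factor `k` suggests the sample tracks the official flows up to a constant expansion factor; systematic deviations by area or group point to the biases discussed above.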
Privacy:
LBS data is highly sensitive. It can be used to infer personal information about individuals, such as home and work locations, daily routines, and social interactions.
This information can be used to track individuals, target them with ads, or discriminate against them. There are many techniques to protect the privacy of LBS data, such as differential privacy or k-anonymity. For example:
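One simple safeguard in the spirit of k-anonymity is small-count suppression: publish aggregate counts only for cells that contain at least k distinct devices. The sketch below illustrates that rule only. It is not a full k-anonymity or differential privacy mechanism, and the record layout and the threshold `k` are assumptions.

```python
from collections import defaultdict

def suppressed_counts(records, k=10):
    """records: iterable of (device_id, area_id, hour). Returns the number of
    distinct devices per (area_id, hour), keeping only cells with >= k devices."""
    devices = defaultdict(set)
    for device_id, area_id, hour in records:
        devices[(area_id, hour)].add(device_id)
    return {cell: len(ids) for cell, ids in devices.items() if len(ids) >= k}

# Hypothetical records: five devices in one cell, a single device in another.
records = [(f"d{i}", "tract_1", 9) for i in range(5)]
records += [("d0", "tract_1", 9), ("d5", "tract_2", 9)]

published = suppressed_counts(records, k=3)
```

The cell with a single device is dropped from the published table, and repeated pings from the same device are counted once.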
Data access: LBS data is expensive and difficult to access. It is often sold by aggregators to third parties, who may not have the expertise to analyze it. This creates unequal access for researchers, policymakers, and the public.
Data processing: LBS data is complex and requires specialized analytical tools and techniques. It is often stored in large databases and requires significant computational resources. There are some libraries to process that kind of data:
CUS 2025, ©SUNLab group socialurban.net/CUS