1. What is spatial data
Difference to “normal” data
Spatial data formats
Representation
2. What is a GIS
Data to wisdom
GI systems
3. GIS as a research method
How to lie with maps (and data)
Point patterns
Autocorrelation
Spatial modelling
Networks
Earth observation
Associate Professor in Spatial Data Science and Visualization at CASA, UCL
Lead MSc modules in:
Research:
Big data for allocating funding
Spatial data is just like normal data except it has an extra “geometry column” often referred to as “geom”
Spatial data is typically an empty shape or object:
The magic happens when we join non spatial data to spatial data
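A minimal sketch of that join in plain Python. The area codes, WKT geometries and population figures are hypothetical and only illustrate the idea: the spatial table holds an identifier and a geometry, the attribute table holds the values we want to map, and a shared key links the two.

```python
# Spatial table: an identifier plus a "geom" column (WKT strings, made up for illustration).
spatial = [
    {"area_code": "A01", "geom": "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))"},
    {"area_code": "A02", "geom": "POLYGON ((1 0, 2 0, 2 1, 1 1, 1 0))"},
    {"area_code": "A03", "geom": "POLYGON ((2 0, 3 0, 3 1, 2 1, 2 0))"},
]

# Non-spatial table, e.g. rows read from a spreadsheet (hypothetical values).
attributes = [
    {"area_code": "A01", "population": 1200},
    {"area_code": "A02", "population": 850},
]


def left_join(spatial_rows, attribute_rows, key):
    """Attach attribute columns to each spatial row; unmatched shapes get None."""
    lookup = {row[key]: row for row in attribute_rows}
    attribute_columns = sorted({c for row in attribute_rows for c in row} - {key})
    joined = []
    for shape in spatial_rows:
        match = lookup.get(shape[key], {})
        merged = dict(shape)  # keep the identifier and the geometry
        for column in attribute_columns:
            merged[column] = match.get(column)
        joined.append(merged)
    return joined


joined = left_join(spatial, attributes, key="area_code")
for row in joined:
    print(row["area_code"], row["population"])
```

Area A03 has a shape but no matching attribute row, so it keeps its geometry and gets an empty value. This is the usual reason to check join keys before mapping.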
e.g. here is an Excel file that has data on:
“While we teach our students the benefits of visualization, answering the specific hypothesis-driven questions did not require plotting the data. We found that very often, the students driven by specific hypotheses skipped this simple step towards a broader exploration of the data. In fact, overall, students without a specific hypothesis were almost five times more likely to discover the gorilla when analyzing this dataset”
Yanai and Lercher (2020)
Each column also has a data type
Geographic Coordinate Reference System
Projected Coordinate Reference System
Mercator:
Gall-Peters projection:
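To make the difference between the two projections concrete, here is a small sketch of the spherical forms of both. It assumes a perfectly spherical Earth with a mean radius of 6,371 km, which is a simplification of what GIS software does with an ellipsoid. The helper `band_height` is a name introduced here for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumption: spherical Earth with a mean radius


def mercator(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """Spherical Mercator: preserves angles, inflates areas towards the poles."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = radius * lam
    y = radius * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y


def gall_peters(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """Gall-Peters: cylindrical equal-area with standard parallels at 45 degrees."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = radius * lam / math.sqrt(2)
    y = radius * math.sqrt(2) * math.sin(phi)
    return x, y


def band_height(project, lat_south, lat_north):
    """Projected north-south extent of a latitude band (hypothetical helper)."""
    return project(0, lat_north)[1] - project(0, lat_south)[1]


# Compare a 10-degree band near the pole with one at the equator.
for name, project in [("Mercator", mercator), ("Gall-Peters", gall_peters)]:
    equatorial = band_height(project, 0, 10)
    polar = band_height(project, 60, 70)
    print(name, "high-latitude band / equatorial band:", round(polar / equatorial, 2))
```

Mercator stretches the high-latitude band and Gall-Peters squashes it. Neither is "right": each keeps one property (shape or area) at the expense of the other.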
South UP?
We need to separate Geographic Information / Spatial Data SCIENCE from the tools we use to carry it out:
GI / SD Science is facilitated by the information systems (technologies – hardware, software + human interaction) we use to:
Developed by Roger Tomlinson at UCL in the 1970s
From 1982, ESRI cornered the GIS market in terms of data storage, tools for Query / Processing / Manipulation / Analysis and Display / Visualisation
The Graphical User Interface – GUI – allowed non-specialists to carry out GI Science for the first time
“Everyone does need to learn to code. It is no longer sufficient for a GI Scientist to just work with a standard GIS interface: menus, buttons and black boxes.” Brunsdon and Comber (2020)
A comparison of 4 GWR approaches found that all produced the same result except the method in ArcGIS from ESRI, which did not reveal its code.
Data management / storage facilitated through connection to a range of file formats (almost any you can think of)
Plus a huge host of software packages for reading, writing and converting data held within these files into a format R can handle
Electoral college
Democrats got more votes, but lost
Every 10 years electoral districts are re-drawn (“redistricting”). Thomas Hofeller (Republican) = PACK and CRACK
“Redistricting is democracy at work” - Tom Hofeller
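A toy illustration of pack and crack, using 25 hypothetical voters split into five districts. The same votes are counted under two different maps. The vote strings and the party labels "D" and "R" are invented for this sketch and do not represent any real election.

```python
from collections import Counter


def seats(districts):
    """Count the districts won by each party under simple plurality."""
    winners = Counter()
    for votes in districts:
        winning_party = Counter(votes).most_common(1)[0][0]
        winners[winning_party] += 1
    return winners


def total_votes(districts):
    """Count every individual vote across all districts."""
    return Counter(vote for votes in districts for vote in votes)


# Each string is one district, each character is one voter (hypothetical data).
# Map 1 PACKS five D voters into one district and CRACKS the rest across four.
packed_and_cracked = ["DDDDD", "DDRRR", "DDRRR", "DDRRR", "DDRRR"]
# Map 2 spreads the same voters so that D holds narrow majorities.
alternative_map = ["DDDRR", "DDDRR", "DDDRR", "DDDRR", "DRRRR"]

for name, plan in [("pack and crack", packed_and_cracked), ("alternative", alternative_map)]:
    print(name, dict(total_votes(plan)), dict(seats(plan)))
```

The vote totals are identical in both maps (13 for D, 12 for R), but the number of districts each party wins flips. Only the boundaries changed.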
Urban Systems: Cities [or functions within cities] that can be considered linked together [there is a relationship between them]
+
Urban Science: Urban issues and problems
=
Smart Cities: networks and services are made more efficient with the use of digital solutions for the benefit of its inhabitants and business.
Source: Smart Cities, European Commission
GIS can be applied to any field as long as there is some spatial data:
Health
Agriculture
Sustainability
Forestry…
Participatory GIS
Monitoring (social, economic, resource) changes
Environmental management / environmental impact assessments
What is a forest? Conceptual vagueness of geographic concepts. Bennett (2001)
Are these points distributed in a random way or is there some sort of pattern (uniform or clustered)?
We should expect randomness but randomness conforms to known probability distributions
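One common way to test this is quadrat analysis: overlay a grid, count the points in each cell, and compare the variance of the counts with their mean. Under complete spatial randomness the counts follow a Poisson distribution, where variance and mean are equal, so the ratio is close to 1. The sketch below uses made-up points on a unit square; the function names are introduced here for illustration.

```python
import random
from statistics import mean, variance


def quadrat_counts(points, n_cells, extent=1.0):
    """Count points in each cell of an n_cells x n_cells grid over a square."""
    counts = [[0] * n_cells for _ in range(n_cells)]
    cell = extent / n_cells
    for x, y in points:
        col = min(int(x / cell), n_cells - 1)
        row = min(int(y / cell), n_cells - 1)
        counts[row][col] += 1
    return [count for row in counts for count in row]


def variance_mean_ratio(counts):
    """About 1 = random, below 1 = uniform (dispersed), above 1 = clustered."""
    return variance(counts) / mean(counts)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_cells = 5

# Hypothetical patterns: one point per cell, one tight cluster, and random scatter.
uniform = [((i + 0.5) / n_cells, (j + 0.5) / n_cells)
           for i in range(n_cells) for j in range(n_cells)]
clustered = [(rng.uniform(0.05, 0.15), rng.uniform(0.05, 0.15)) for _ in range(25)]
scattered = [(rng.random(), rng.random()) for _ in range(200)]

vmr_uniform = variance_mean_ratio(quadrat_counts(uniform, n_cells))
vmr_clustered = variance_mean_ratio(quadrat_counts(clustered, n_cells))
vmr_scattered = variance_mean_ratio(quadrat_counts(scattered, n_cells))
print(vmr_uniform, vmr_clustered, round(vmr_scattered, 2))
```

The result depends on the size of the grid cells, which is one reason distance-based methods are often used alongside quadrat counts.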
How (dis)similar are the values assigned to geographic units across geographic space?
Something in the water: the mythology of Snow’s map of cholera. Source: Kenneth Field
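Global Moran's I is a standard summary of this. The sketch below computes it for values on a small regular grid, assuming binary rook contiguity (cells sharing an edge are neighbours). The grids are hypothetical and the helper names are introduced here; real analyses would build the weights from actual polygon boundaries.

```python
def rook_neighbours(n_rows, n_cols):
    """Binary rook-contiguity neighbours for a regular grid, keyed by cell index."""
    neighbours = {}
    for r in range(n_rows):
        for c in range(n_cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    adjacent.append(rr * n_cols + cc)
            neighbours[r * n_cols + c] = adjacent
    return neighbours


def morans_i(values, neighbours):
    """Global Moran's I with binary weights (1 for neighbours, 0 otherwise)."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    cross_products = sum(deviations[i] * deviations[j]
                         for i, adjacent in neighbours.items() for j in adjacent)
    total_weight = sum(len(adjacent) for adjacent in neighbours.values())
    sum_squares = sum(d * d for d in deviations)
    return (n / total_weight) * (cross_products / sum_squares)


size = 4
weights = rook_neighbours(size, size)

# Hypothetical surfaces: a checkerboard (neighbours always differ) and
# a grid split into a high half and a low half (neighbours mostly agree).
checkerboard = [(r + c) % 2 for r in range(size) for c in range(size)]
two_halves = [1 if c < size // 2 else 0 for r in range(size) for c in range(size)]

i_checkerboard = morans_i(checkerboard, weights)
i_two_halves = morans_i(two_halves, weights)
print(round(i_checkerboard, 3), round(i_two_halves, 3))
```

Values near +1 mean similar values sit next to each other, values near -1 mean neighbours are dissimilar, and values near 0 suggest no spatial pattern.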
What is fixed?
Point Pattern Analysis
Properties fixed, but space (location - x,y) can vary
Spatial Autocorrelation
Space is fixed, but properties (values) can vary
Question: Are the points clustered or are they random?
If we identify clusters, what are the socio-economic characteristics?
What are the factors that might lead to variation in Average GCSE point scores across the city?
Dependent variable labelled y
Residuals are the vertical distances between the dots and the line
\(y_i = \beta_0 + \beta_1x_i + \epsilon_i\)
\(\beta_0\) is the intercept (the value of \(y\) when \(x = 0\) - somewhere around 370 on the graph above);
\(\beta_1\) is sometimes referred to as the ‘slope’ parameter and is simply the change in the value of \(y\) for a 1 unit change in the value of \(x\) (the slope of the blue line) - around -40.
\(\epsilon_i\) is a random error term; the residuals sum to 0 when we add up all of the vertical differences between the blue line and the observed points
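A short sketch of how \(\beta_0\) and \(\beta_1\) are estimated by ordinary least squares, and a check that the fitted residuals sum to zero. The x and y values are hypothetical, chosen only to sit near an intercept of 370 and a slope of -40; they are not the GCSE data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = beta_0 + beta_1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_1 = s_xy / s_xx            # slope: change in y per unit change in x
    beta_0 = y_bar - beta_1 * x_bar  # intercept: value of y when x = 0
    return beta_0, beta_1


# Hypothetical observations scattered around y = 370 - 40x.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [355.0, 327.0, 306.0, 296.0, 268.0, 248.0]

beta_0, beta_1 = fit_line(x, y)
residuals = [yi - (beta_0 + beta_1 * xi) for xi, yi in zip(x, y)]

print("intercept:", round(beta_0, 1))
print("slope:", round(beta_1, 1))
print("sum of residuals:", round(sum(residuals), 6))
```

The residuals cancel out overall, but that says nothing about where the positive and negative ones are located on the map.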
Here we haven’t accounted for the influence of a neighbour…
In our model the residuals (observed - predicted) should show no pattern (or autocorrelation)…in other words they should be random!
At this point we have several options…
Global - single regression models
lag model - we can include the value of neighboring GCSE values in the model
error model - we treat the errors as a nuisance but we must account for them.
Local model
Geographically weighted regression - we do lots of local regression models and get coefficients that vary over the study area.
Machine Learning versions = geographically weighted random forests…
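The local idea can be sketched with a heavily simplified geographically weighted regression: one predictor, a Gaussian distance kernel and a fixed bandwidth, all of which are assumptions made for this illustration. The data are hypothetical (ten locations along a line, with a relationship that changes halfway), and this is not how any particular GWR package is implemented.

```python
import math


def weighted_fit(x, y, w):
    """Weighted least squares for y = beta_0 + beta_1 * x."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    beta_1 = s_xy / s_xx
    return y_bar - beta_1 * x_bar, beta_1


def gwr(coords, x, y, bandwidth):
    """Fit one regression per location, weighting nearby observations more."""
    results = []
    for cx, cy in coords:
        weights = [math.exp(-0.5 * (math.hypot(cx - px, cy - py) / bandwidth) ** 2)
                   for px, py in coords]
        results.append(weighted_fit(x, y, weights))
    return results


# Hypothetical study area: slope is -1 in the west (first five) and -3 in the east.
coords = [(float(i), 0.0) for i in range(10)]
x = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]
y = [10 - xi if i < 5 else 10 - 3 * xi for i, xi in enumerate(x)]

global_intercept, global_slope = weighted_fit(x, y, [1.0] * len(x))
local = gwr(coords, x, y, bandwidth=1.0)

print("global slope:", round(global_slope, 2))
print("local slopes:", [round(slope, 2) for _, slope in local])
```

The single global slope averages over two different relationships, while the local slopes recover each one. The choice of bandwidth controls how local the local models are.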
You are in charge of a supermarket chain and you want to open a new store
You work for a delivery company and want to locate a warehouse
In the event of an emergency call, where is the nearest fire station / ambulance / police car?
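The emergency question is a shortest-path problem on a network. A minimal sketch using Dijkstra's algorithm follows, assuming a tiny made-up road network with travel times in minutes and two hypothetical stations. Roads are treated as two-way, so the time from incident to station equals the time from station to incident.

```python
import heapq


def undirected(edges):
    """Build an adjacency dictionary from (node, node, minutes) tuples."""
    graph = {}
    for a, b, minutes in edges:
        graph.setdefault(a, {})[b] = minutes
        graph.setdefault(b, {})[a] = minutes
    return graph


def shortest_times(graph, source):
    """Dijkstra's algorithm: quickest travel time from source to every node."""
    times = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        elapsed, node = heapq.heappop(queue)
        if elapsed > times.get(node, float("inf")):
            continue  # stale queue entry, a quicker route was already found
        for neighbour, cost in graph.get(node, {}).items():
            candidate = elapsed + cost
            if candidate < times.get(neighbour, float("inf")):
                times[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return times


def nearest_facility(graph, incident, facilities):
    """Return the facility with the lowest network travel time to the incident."""
    times = shortest_times(graph, incident)
    reachable = {f: times[f] for f in facilities if f in times}
    best = min(reachable, key=reachable.get)
    return best, reachable[best]


# Hypothetical road network (travel time in minutes).
roads = undirected([
    ("incident", "junction_a", 2),
    ("incident", "junction_b", 5),
    ("junction_a", "junction_b", 1),
    ("junction_a", "station_north", 7),
    ("junction_b", "station_south", 3),
])

station, minutes = nearest_facility(roads, "incident", ["station_north", "station_south"])
print(station, minutes)
```

The quickest route to the southern station goes through junction_a, even though the direct road to junction_b exists. Straight-line distance would not reveal that.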
Inputs:
What are you solving for?
Minimize Impedance: locate warehouses, because it can reduce the overall transportation costs of delivering goods to outlets
Maximize Coverage: fire stations - must arrive to demand point within response time
Maximize Capacitated Coverage: hospitals - serve demand without going over capacity
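The first two objectives can give different answers for the same data. The sketch below picks a single site by brute force, assuming straight-line distance in place of network cost and a made-up set of demand points and candidate sites. It does not cover the capacitated case.

```python
import math

# Hypothetical demand points: (location, demand weight).
demand = {
    "town_1": ((0.0, 0.0), 5),
    "town_2": ((1.0, 0.0), 5),
    "village_1": ((10.0, 0.0), 4),
    "village_2": ((12.0, 0.0), 4),
    "village_3": ((14.0, 0.0), 4),
}

# Hypothetical candidate sites.
candidates = {"west": (0.5, 0.0), "east": (10.0, 0.0), "far_east": (12.0, 0.0)}


def total_impedance(site, demand_points):
    """Demand-weighted distance from one site to every demand point."""
    return sum(weight * math.dist(site, location)
               for location, weight in demand_points.values())


def covered_demand(site, demand_points, cutoff):
    """Total demand lying within the cutoff distance of the site."""
    return sum(weight for location, weight in demand_points.values()
               if math.dist(site, location) <= cutoff)


def minimize_impedance(demand_points, sites):
    return min(sites, key=lambda name: total_impedance(sites[name], demand_points))


def maximize_coverage(demand_points, sites, cutoff):
    return max(sites, key=lambda name: covered_demand(sites[name], demand_points, cutoff))


best_for_cost = minimize_impedance(demand, candidates)
best_for_coverage = maximize_coverage(demand, candidates, cutoff=1.0)
print("minimize impedance:", best_for_cost)
print("maximize coverage:", best_for_coverage)
```

The warehouse-style objective favours the site that is central to all demand, while the fire-station-style objective favours the site that reaches the most demand inside the response threshold.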
‘Food deserts’ are areas where people are likely to pay a higher cost for their weekly food shopping and have to shop in more expensive small convenience stores with a limited stock of good value fresh products.
The E-food Desert Index (EFDI) is a composite index which measures accessibility to groceries. Source: Trust for London
Instead of using only the network, we can now combine public transport and elevation, use timetables, and set parameters such as maximum waiting time using R5R
R5R example. Source: Duncan Smith
Not all countries (e.g. developing countries) will have data on transport, networks and facilities.
Question: The World Bank wanted to determine how many people in rural Tanzania had access to a water point within 30 minutes.
Data:
When Leonardo Brito became chief of police at the Police Specialized in Crimes Against the Environment (DEMA) in Brazil’s Amapá state, he noticed that the department hardly ever investigated environmental crimes
Two employees, two vehicles, a boat and a drone, which collects only 20 minutes of footage at a time, to patrol an area of forest the size of Nepal.
Brito said that since they started using the app, Amapá’s environmental police have been able to detect 5,000 areas of deforestation in the state, both legal and illegal. He adds that every day he sees new locations to add to the ever-growing list.
Trying to clear small patches to avoid detection!
Ran 4 scenarios:
Big geospatial data include datasets that are too large to be processed using traditional GIS tools
Source: GIS Harvard
Raster
Landsat satellite data: 400 scenes of Earth a day, revisiting each location every 16 days
Vector
New York City Taxi and Limousine Commission (TLC) all records from Yellow and Green Cabs
OpenStreetMap
We are moving from row based storage to column based
About 50x faster than a .csv
It groups our data.
For example, a row group size of 2 puts all the data from rows 1 and 2 next to each other, then row 3 onwards starts the next group = row GROUPS or PARTITIONS
If we have large data this means we can skip groups we don’t need
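A toy imitation of that idea in plain Python, not the Parquet format itself. Rows are split into groups of 2, each group stores its values column by column, and each column keeps its minimum and maximum. A query can then check those statistics and skip any group that cannot contain a match. The taxi fares are hypothetical.

```python
def write_row_groups(rows, group_size):
    """Split rows into groups, store each group by column, and record min/max."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(values), max(values)) for name, values in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def query_greater_than(groups, column, threshold):
    """Return matching values and how many groups actually had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        _, highest = group["stats"][column]
        if highest <= threshold:
            continue  # nothing in this group can match, so skip it entirely
        groups_read += 1
        matches.extend(v for v in group["columns"][column] if v > threshold)
    return matches, groups_read


# Hypothetical trip records.
trips = [
    {"trip_id": 1, "fare": 5},
    {"trip_id": 2, "fare": 7},
    {"trip_id": 3, "fare": 6},
    {"trip_id": 4, "fare": 8},
    {"trip_id": 5, "fare": 30},
    {"trip_id": 6, "fare": 42},
]

groups = write_row_groups(trips, group_size=2)
expensive, groups_read = query_greater_than(groups, "fare", threshold=20)
print(expensive, "read", groups_read, "of", len(groups), "groups")
```

Only one of the three groups is read here. On a file with millions of rows, skipping groups in this way is a large part of the speed-up over scanning a .csv line by line.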
In the R for Data Science book a 9GB .csv
is queried in
Database management system
Columnar data
No installation
Convert our Parquet file to DuckDB and back again!
Regarding performance, parquet is 717 times faster than the same query on a csv file, and duckdb is 2808 times faster.
Source: Christophe Nicault
All (parquet and DuckDB) make use of dplyr
select(), filter(), group_by()
= direct integration with R
Currently the support for spatial data is very limited
sfarrow - can load and query the data but can’t do any analysis!
5 million random points
Despite all these tools we must start with the basics.
Often this is in Quantum GIS (free) or ArcMap ($)
It is essential to use data to inform decisions…BUT we must develop a critical awareness of:
Almost any data can be spatial
We must recognize that:
Scientists must have a say in the future of cities, McPhearson 2016
GIS as a research method • Andy MacLachlan