GIS as a research method

Andy MacLachlan
The Bartlett Centre for Advanced Spatial Analysis, UCL

Talk overview

1. What is spatial data

  • Difference to “normal” data

  • Spatial data formats

  • Representation


2. What is a GIS

  • Data to wisdom

  • GI systems

3. GIS as a research method

  • How to lie with maps (and data)

  • Point patterns

  • Autocorrelation

  • Spatial modelling

  • Networks

  • Earth observation

Who am I

What is spatial data?

Spatial data is just like normal data except it has an extra “geometry column” often referred to as “geom”

Spatial data

Spatial data is typically an empty shape or object:

  • Grid cells / hexagons
  • Polygons
  • Lines
  • Points (e.g. restaurants)


The magic happens when we join non-spatial data to spatial data
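A minimal sketch of such a join in plain Python. The ward names, coordinates and step counts are invented for illustration, and `attribute_join` is a hypothetical helper, not a function from any GIS package:

```python
# Spatial layer: each feature has an id and a geometry (the "geom" column).
# Ward names, coordinates and values are made up for illustration.
wards = [
    {"ward": "A", "geom": [(0, 0), (1, 0), (1, 1), (0, 1)]},
    {"ward": "B", "geom": [(1, 0), (2, 0), (2, 1), (1, 1)]},
    {"ward": "C", "geom": [(2, 0), (3, 0), (3, 1), (2, 1)]},
]

# Non-spatial table, e.g. rows read from a spreadsheet.
steps = [
    {"ward": "A", "mean_steps": 7200},
    {"ward": "B", "mean_steps": 5400},
]


def attribute_join(features, table, key):
    """Left join: keep every feature, attach matching table columns."""
    lookup = {row[key]: row for row in table}
    joined = []
    for feature in features:
        match = lookup.get(feature[key], {})
        extra = {k: v for k, v in match.items() if k != key}
        joined.append({**feature, **extra})
    return joined


joined = attribute_join(wards, steps, "ward")
for row in joined:
    # Ward C has no match, so its value prints as None.
    print(row["ward"], row.get("mean_steps"))
```

Every shape is kept, and the shapes with a matching row gain a value that can then be mapped.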

If someone gives you some (spatial) data, what is the first thing you might …

e.g. here is an Excel file that has data on:

  • BMI
  • steps taken per day
  • location

Plot it!

“While we teach our students the benefits of visualization, answering the specific hypothesis-driven questions did not require plotting the data. We found that very often, the students driven by specific hypotheses skipped this simple step towards a broader exploration of the data. In fact, overall, students without a specific hypothesis were almost five times more likely to discover the gorilla when analyzing this dataset”

Yanai and Lercher (2020)

Types of spatial data: raster (‘grid’) vs vector

Vector

Raster

Features - vector data

Each column also has a data type

  • Text
  • Numeric?

Representing the World

  • The earth is a 3D sphere (well, almost). It’s wider than it is tall
  • In order to locate a point on the surface of a sphere, we need a set of coordinates
  • Coordinates will tell us how near to the top or bottom of the sphere we are, or how far around
  • But where do we start?

Representing the World

Geographic Coordinate Reference System

  • treats the globe as if it was a sphere divided into 360 equal parts called degrees


Projected Coordinate Reference System

  • flat, two-dimensional plane (through projecting a spheroid onto a 2D surface) giving it constant lengths, angles and areas

Coordinate reference systems

Representing the World

What is wrong with this map?

Can you make an accurate map


Representing the World

Mercator:

  • European imperialist attitude
  • Size = POWER, ethnic bias
  • Are you a sailor?

Gall-Peters projection:

  • Right size of countries
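The size distortion behind this debate can be put into numbers. A minimal sketch, assuming a perfectly spherical Earth; the radius constant and function names are illustrative and not from the slides:

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean radius; assumes a perfect sphere


def mercator(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """Project longitude/latitude (degrees) to spherical Mercator x, y (metres)."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = radius * lam
    y = radius * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y


def area_inflation(lat_deg):
    """Factor by which Mercator exaggerates areas at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2


for lat in (0, 30, 60):
    print(lat, round(area_inflation(lat), 2))
```

On this projection, areas at 60° latitude are drawn about four times too large relative to the equator, which is why high-latitude countries look so big.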

South UP?

  • Top = important
  • First known map 1154 Arab geographer Muhammad al-Idrisi

What is a GIS

We need to separate Geographic Information / Spatial Data SCIENCE from the tools we use to carry it out:


GI / SD Science is facilitated by the information systems (technologies – hardware, software + human interaction) we use to:

  • Store (input)
  • Manipulate / Process
  • Distribute
  • Analyse
  • Retrieve (output)
  • Data / Information

We do GI Science on the GI Systems

Data vs information

The Canada GIS

Developed by Roger Tomlinson for the Canadian government from the 1960s; Tomlinson later completed his PhD at UCL in the 1970s

ESRI

  • From 1982, ESRI cornered the GIS market in terms of data storage, tools for Query / Processing / Manipulation / Analysis and Display / Visualisation

  • The Graphical User Interface – GUI – allowed non-specialists to carry out GI Science for the first time

QGIS (Quantum GIS)

  • Under development since 2002
  • Free & Open Source – everything is on GitHub!
  • Probably best GUI GIS aside from Arc
  • Connects to PostGIS database very effectively
  • Slick maps with nice default features
  • Large library of plugins for analysis

Programming + GIS

“Everyone does need to learn to code. It is no longer sufficient for a GI Scientists to just work with a standard GIS interface: menus, buttons and black boxes.” Brunsdon and Comber (2020)

A comparison of four GWR (geographically weighted regression) implementations found that all produced the same result except the one in ArcGIS, for which ESRI did not reveal the code.

R + Python

  • Data management/ storage facilitated through connection to a range of file formats (almost any you can think of)

  • Plus a huge host of software packages for reading, writing and converting data held within these files into a format R can handle

R + Python

How to lie with / manipulate maps and outcomes

What is wrong with this?

What is wrong with this?

Issues

  • Polygon size does not equal population
  • Did all the people in each county vote for 1 party? NO.

Electoral college

  • Each state gets at least three votes
  • Remaining divided by population
  • BONUS to smaller states
  • 1 vote in Wyoming = 193,000 people
  • 1 vote in California = over 700,000 people

Democrats got more votes, but lost

A people based map?

Who has made our boundary data?

Who has manipulated our boundary data?

Who has made our boundary data?

Redlining

  • 1930s – the US Home Owners’ Loan Corporation (HOLC) aimed to prevent missed payments…and drew residential security maps based on race
    • People abandon areas
    • Can’t refinance
    • Less property tax for services
    • Social equity issues remain
    • 1968 Fair Housing Act

Los Angeles Redlining

Who has made our boundary data?

Gerrymandering

Every 10 years electoral districts are redrawn (“redistricting”) – Thomas Hofeller (Republican) = PACK and CRACK

  • PACK = put all the Democratic voters in 1 district
  • CRACK = sprinkle them out so they never have a majority
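A toy illustration of pack and crack. The vote counts and district plans below are invented, and parties are simply labelled D and R; the point is only that the same votes can yield different seat totals depending on where the lines are drawn:

```python
def seats_won(plan):
    """Count districts won by each party; plan is a list of (d_votes, r_votes)."""
    d_seats = sum(1 for d, r in plan if d > r)
    r_seats = sum(1 for d, r in plan if r > d)
    return d_seats, r_seats


def vote_totals(plan):
    """Total votes for each party across all districts."""
    return sum(d for d, _ in plan), sum(r for _, r in plan)


# Hypothetical state: 50 voters, 5 districts of 10 voters each.
plan_even = [(6, 4), (6, 4), (6, 4), (6, 4), (4, 6)]

# Same voters, new boundaries: D voters PACKED into two districts
# and CRACKED across the other three.
plan_gerrymandered = [(10, 0), (9, 1), (3, 7), (3, 7), (3, 7)]

for name, plan in (("even", plan_even), ("gerrymandered", plan_gerrymandered)):
    print(name, "votes:", vote_totals(plan), "seats:", seats_won(plan))
```

In both plans party D has 28 of the 50 votes, but it wins four seats under the first plan and only two under the second.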

Gerrymandering

“Redistricting is democracy at work” - Tom Hofeller

Urban systems science

Urban Systems: Cities [or functions within cities] that can be considered linked together [there is a relationship between them]

+



Urban Science: Urban issues and problems

=



Smart Cities: networks and services are made more efficient with the use of digital solutions for the benefit of its inhabitants and business.

Source: Smart Cities, European Commission

How smart are cities?


GIS applications

GIS can be applied to any field as long as there is some spatial data:

  • Health

  • Agriculture

  • Sustainability

  • Forestry…

  • Participatory GIS

  • Monitoring (social, economic, resource) changes

  • Environmental management / environmental impact assessments

    • e.g. wind turbine?

Vague concepts?

What is a forest? Conceptual vagueness of geographic concepts. Bennett (2001)

Problematic

Point / continuous data

Questions we can ask / set


Points

Are these points distributed in a random way or is there some sort of pattern (uniform or clustered)?

We should expect randomness but randomness conforms to known probability distributions
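One way to make this concrete is a nearest neighbour ratio: compare the observed mean distance to the nearest neighbour with the value expected under complete spatial randomness, \(1 / (2\sqrt{n / A})\). A minimal sketch on simulated points in a unit square, ignoring edge effects; the function name and the three test patterns are illustrative assumptions:

```python
import math
import random


def nearest_neighbour_ratio(points, area):
    """Observed mean nearest-neighbour distance divided by the value
    expected under complete spatial randomness (edge effects ignored)."""
    n = len(points)
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    observed = total / n
    expected = 0.5 / math.sqrt(n / area)
    return observed / expected


rng = random.Random(42)

# Three simulated patterns (the clustered one is treated as if its
# study area were still the unit square).
random_pts = [(rng.random(), rng.random()) for _ in range(200)]
clustered_pts = [
    (0.5 + rng.gauss(0, 0.02), 0.5 + rng.gauss(0, 0.02)) for _ in range(200)
]
uniform_pts = [(x / 10 + 0.05, y / 10 + 0.05) for x in range(10) for y in range(10)]

for name, pts in (
    ("random", random_pts),
    ("clustered", clustered_pts),
    ("uniform", uniform_pts),
):
    print(name, round(nearest_neighbour_ratio(pts, area=1.0), 2))
```

A ratio near 1 is consistent with randomness, well below 1 suggests clustering, and well above 1 suggests a uniform (regular) pattern.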


Spatially continuous observations (e.g. values of polygons)

How (dis)similar are the values assigned to geographic units across geographic space?

What patterns show

Something in the water: the mythology of Snow’s map of cholera. Source: Kenneth Field

Spatial Epidemiology: Lung Cancer

  • Similar methods that we will use today - incidences of lung cancer relative to the physical environment - are they clustered and where?…leads to why
  • Are the locations of lung cancer similar?
  • Does the incinerator have any influence?

Spatial Epidemiology: Mortality

  • Similar values might suggest there is something more going on
  • Some sort of spatial influence
  • Note the difference between spatially continuous and point data

Quantifying Spatial Patterns

What is fixed?

Point Pattern Analysis

  • Properties are fixed (e.g. binary - present or not)
  • Discrete objects - present or not, binary, yes or no.
  • Examples: fly tipping, stop and search, blue plaques, pharmacies

Properties fixed, but space (location - x,y) can vary

Spatial Autocorrelation

  • Space (e.g. the location of the spatial units - wards, boroughs etc) is fixed
  • The values of the spatial units vary
  • Where the values are similar we say they exhibit Spatial Autocorrelation

Space is fixed, but properties (values) can vary
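A common summary statistic for this is Moran's I. A minimal sketch on a small grid of values, assuming rook contiguity (cells sharing an edge are neighbours) with equal weights; the grids and function name are illustrative:

```python
def morans_i(grid):
    """Moran's I for a grid of values, with rook neighbours weighted 1."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    num = 0.0
    weight_sum = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    num += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_sum += 1
    return (n / weight_sum) * (num / denom)


# Similar values side by side: positive spatial autocorrelation.
clustered = [[1, 1, 0, 0] for _ in range(4)]

# Every neighbour differs: negative spatial autocorrelation.
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

print(round(morans_i(clustered), 3))
print(round(morans_i(checkerboard), 3))
```

Values near +1 mean similar values sit next to each other, values near -1 mean neighbours are dissimilar, and values near 0 suggest no spatial pattern.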

Examples

  • Question: Are the points clustered or are they random?

  • If we identify clusters what are the socio-economic characteristics?

Examples

  • Question: Are the values similar between certain wards?

Spatial modelling

What are the factors that might lead to variation in Average GCSE point scores across the city?

Spatial modelling

What are the factors that might lead to variation in Average GCSE point scores across the city?

  • Dependent variable labelled y

  • Residuals are the vertical distances between the dots and the line

  • \(y_i = \beta_0 + \beta_1x_i + \epsilon_i\)

  • \(\beta_0\) is the intercept (the value of \(y\) when \(x = 0\) - somewhere around 370 on the graph above);

  • \(\beta_1\) is sometimes referred to as the ‘slope’ parameter and is simply the change in the value of \(y\) for a 1 unit change in the value of \(x\) (the slope of the blue line) - around -40.

  • \(\epsilon_i\) is a random error term. The residuals (the vertical differences between each point and the blue line) sum to 0
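The terms above can be checked with a few lines of code. A minimal sketch of ordinary least squares for one predictor; the five data points are invented so that the fit lands near the values read off the graph, and they are not the real GCSE data:

```python
def fit_line(x, y):
    """Ordinary least squares estimates of the intercept and slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Hypothetical observations (x_i, y_i), for illustration only.
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [352, 328, 311, 290, 269]

beta0, beta1 = fit_line(x, y)
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]

print("intercept:", round(beta0, 1))
print("slope:", round(beta1, 1))
print("sum of residuals:", round(sum(residuals), 6))
```

With these made-up numbers the intercept is 371.2 and the slope is -40.8, and the residuals cancel out to zero (up to floating-point rounding).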

Spatial regression

  • Here we haven’t accounted for the influence of a neighbour…

  • In our model the residuals (observed - predicted) should show no pattern (or autocorrelation)…in other words they should be random!

Spatial regression

At this point we have several options…

Global - single regression models

  1. lag model - we can include the GCSE values of neighbouring units in the model

  2. error model - we treat the errors as a nuisance but we must account for them.

Local model

  1. Geographically weighted regression - we do lots of local regression models and get coefficients that vary over the study area.

  2. Machine Learning versions = geographically weighted random forests…
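In one common textbook notation (an assumption here, since the slides do not give the formulas), with \(W\) a spatial weights matrix that records which units are neighbours, the two global models listed above can be written as:

\[ y = \rho W y + X\beta + \epsilon \quad \text{(lag model)} \]

\[ y = X\beta + u, \quad u = \lambda W u + \epsilon \quad \text{(error model)} \]

Here \(\rho\) measures how strongly neighbouring values of \(y\) feed into each observation, and \(\lambda\) measures how strongly the errors of neighbours are related. Geographically weighted regression instead lets the coefficients change with location \((u_i, v_i)\):

\[ y_i = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\,x_i + \epsilon_i \]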

Networks / location allocation

You are in charge of a supermarket chain and you want to open a new store

You work for a delivery company and want to locate a warehouse

In the event of an emergency call, where is the nearest fire station / ambulance / police car?

Networks / location allocation

Inputs:

  • Street network (info on roads, speeds, turn restrictions)
    • Open Street Map (free, Volunteered Geographic Information data)
    • Ordnance Survey Master Map (UK national mapping agency)
  • Demand points
    • Often where people live (e.g. houses or census data)
  • Facilities
    • The resource (e.g. fire station, hospital)

Solver type…

What are you solving for?

  • Minimize Impedance: locate warehouses, because it can reduce the overall transportation costs of delivering goods to outlets

  • Maximize Coverage: fire stations - must arrive at the demand point within the response time

  • Maximize Capacitated Coverage: hospitals - serve demand without going over capacity
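To show how the objective changes the answer, here is a minimal brute-force sketch of the first two solver types. The travel times, demand weights and site names are invented, and the functions are illustrative rather than any GIS package's solver:

```python
from itertools import combinations

# travel_time[candidate_site][demand_point], in minutes (invented values).
travel_time = {
    "site_a": [2, 4, 9, 12],
    "site_b": [8, 3, 2, 7],
    "site_c": [13, 9, 4, 1],
}
demand_weight = [100, 50, 80, 120]  # e.g. households at each demand point


def total_impedance(sites):
    """Weighted travel time when each demand point uses its nearest chosen site."""
    return sum(
        w * min(travel_time[s][i] for s in sites)
        for i, w in enumerate(demand_weight)
    )


def covered_demand(sites, max_minutes):
    """Demand that at least one chosen site can reach within max_minutes."""
    return sum(
        w
        for i, w in enumerate(demand_weight)
        if min(travel_time[s][i] for s in sites) <= max_minutes
    )


def minimise_impedance(p):
    """Try every combination of p sites and keep the cheapest (brute force)."""
    return min(combinations(travel_time, p), key=total_impedance)


def maximise_coverage(p, max_minutes):
    """Try every combination of p sites and keep the one covering most demand."""
    return max(
        combinations(travel_time, p),
        key=lambda sites: covered_demand(sites, max_minutes),
    )


best_one = minimise_impedance(1)
best_two = minimise_impedance(2)
print(best_one, total_impedance(best_one))
print(best_two, total_impedance(best_two))
print(maximise_coverage(1, max_minutes=5))
```

Brute force only works for a handful of candidate sites; real solvers use heuristics, and the travel times come from the street network rather than a hand-typed table.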

Great…however…

‘Food deserts’ are areas where people are likely to pay a higher cost for their weekly food shopping and have to shop in more expensive small convenience stores with a limited stock of good value fresh products.

The E-food Desert Index (EFDI) is a composite index which measures accessibility to groceries. Source: Trust for London

More powerful analysis

Instead of just using the street network, we can now combine public transport, elevation and timetables, and set parameters such as maximum waiting time, using R5R

R5R example. Source: Duncan Smith

Cost distance analysis

Not all countries (e.g. developing countries) will have data on transport, networks and facilities.

Question: The World Bank wanted to determine how many people in rural Tanzania had access to a water point within 30 minutes.

Data:

  • Landcover from Global Human Settlement
  • Water points from World Bank
  • Friction surface from Malaria Atlas (the cost to travel across the cell)
  • 100m population data from WorldPop
  • Tanzania outline from GADM
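The idea behind cost distance can be sketched on a tiny grid: accumulate the friction of each cell crossed, starting from the water points, then add up the population in cells reachable within the time limit. This is a simplified illustration using Dijkstra's algorithm with 4-connected moves; the grids, the step-cost rule and the function name are assumptions, not the World Bank workflow:

```python
import heapq


def cost_distance(friction, sources):
    """Least accumulated travel cost from any source cell to every cell.

    friction[r][c] is the cost (here, minutes) of crossing one cell. Moving
    between 4-connected neighbours costs the mean of the two cells' friction.
    """
    rows, cols = len(friction), len(friction[0])
    cost = [[float("inf")] * cols for _ in range(rows)]
    heap = []
    for r, c in sources:
        cost[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > cost[r][c]:
            continue  # stale queue entry, a cheaper route was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                step = (friction[r][c] + friction[rr][cc]) / 2
                if d + step < cost[rr][cc]:
                    cost[rr][cc] = d + step
                    heapq.heappush(heap, (d + step, rr, cc))
    return cost


# Invented 3 x 3 example: minutes to cross each cell, and people per cell.
friction = [
    [10, 10, 40],
    [10, 40, 40],
    [10, 10, 10],
]
population = [
    [50, 20, 5],
    [30, 0, 10],
    [40, 25, 60],
]
water_points = [(0, 0)]

minutes = cost_distance(friction, water_points)
within_30 = sum(
    population[r][c]
    for r in range(3)
    for c in range(3)
    if minutes[r][c] <= 30
)
total = sum(sum(row) for row in population)
print(within_30, "of", total, "people are within 30 minutes of a water point")
```

In this toy grid 165 of 240 people are within 30 minutes; the slow cells in the top right cut off people who are close in a straight line.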

Cost distance analysis

Earth Observation 1

  • When Leonardo Brito became chief of police at the Police Specialized in Crimes Against the Environment (DEMA) in Brazil’s Amapá state, he noticed that the department hardly ever investigated environmental crimes

  • Two employees, two vehicles, a boat and a drone, which collects only 20 minutes of footage at a time, to patrol an area of forest the size of Nepal.

Earth Observation 1

Brito said that since they started using the app, Amapá’s environmental police have been able to detect 5,000 areas of deforestation in the state, both legal and illegal. He adds that every day he sees new locations to add to the ever-growing list.

Trying to clear small patches to avoid detection!

Earth Observation 2

Fremantle Woolstore, Western Australia

An example… urban heat island (UHI)


Data

Models

Scenarios

Ran 4 scenarios:

  1. Original (existing) development (from satellite imagery)
  2. Proposed redevelopment as in the plan
  3. Proposed redevelopment removing trees
  4. Proposed redevelopment with trees covering the hottest pixels

Policy

The future: Big Data

Big data

Big geospatial data include datasets that are too large to be processed using traditional GIS tools

Source: GIS Harvard

Why are they large?

Raster

  • Landsat satellite data: 400 scenes of Earth a day, revisiting each location every 16 days

    • Each scene is about 1GB
    • We would use Google Earth Engine for this - not considered here

Vector

  • New York City Taxi and Limousine Commission (TLC) all records from Yellow and Green Cabs

    • 150GB, 1.2 billion records
  • Open Street Map

    • 1764.5GB when uncompressed

What can we do about it?

Parquet files

  • We are moving from row based storage to column based

  • About 50x faster than a .csv

  • It groups our data.

    • For example, with a row group size of 2, all the data from rows 1 and 2 is stored next to each other, then a new group starts at row 3 = row GROUPS (or PARTITIONS)

    • If we have large data this means we can skip groups we don’t need
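A conceptual sketch of the two ideas above, column-based storage and skippable row groups, in plain Python. It mimics the idea only: it is not the real Parquet file layout or any library's API, and the taxi-style records are invented:

```python
# Row-based storage: one record after another (like a .csv).
rows = [
    {"trip_id": 1, "fare": 7.5},
    {"trip_id": 2, "fare": 12.0},
    {"trip_id": 3, "fare": 30.0},
    {"trip_id": 4, "fare": 41.5},
]


def to_row_groups(rows, group_size):
    """Split rows into groups and store each group column by column,
    keeping min/max statistics per column for each group."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def fares_above(groups, threshold):
    """Read only the 'fare' column, skipping groups whose max is too small."""
    hits, groups_read = [], 0
    for g in groups:
        if g["stats"]["fare"][1] <= threshold:
            continue  # the whole group is skipped without reading its values
        groups_read += 1
        hits.extend(f for f in g["columns"]["fare"] if f > threshold)
    return hits, groups_read


groups = to_row_groups(rows, group_size=2)
hits, groups_read = fares_above(groups, 25)
print(hits, "found by reading", groups_read, "of", len(groups), "groups")
```

The query touches one column in one group and never looks at `trip_id` or at the first group, which is where the speed-up on large files comes from.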

Demystifying the Parquet File Format

New York City Taxi and Limousine Commission (TLC) all records from Yellow and Green Cabs

Concepts

  • You may come across Arrow - this is an in-memory format, Parquet is a storage format

 

  • In the R for Data Science book a 9GB .csv is queried in

    • 11 seconds for standard code
    • 0.063 seconds using a parquet file! 100x faster

We can go faster!


DuckDB

  • Database management system

  • Columnar data

  • No installation

  • Convert our Parquet file to DuckDB and back again!

to_duckdb() 
to_arrow()

Regarding performance, parquet is 717 times faster than the same query on a csv file, and duckdb is 2808 times faster.

Source: Christophe Nicault

Notes

  • All (parquet and DuckDB) make use of dplyr! select(), filter(), group_by() = direct integration with R

  • Currently the support for spatial data is very limited

  • sfarrow - can load and query the data but can’t do any analysis!

Postgres

Postgres = object-relational database

DVD Rental Model

PostgreSQL has a PostGIS extension

This allows the “geometry” column and spatial queries

Making random points in polygons

5 million random points

  • QGIS = 226 seconds
  • PostGIS = 18 seconds

Source: Why should you care about PostGIS? — A gentle introduction to spatial databases
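For intuition about what this task involves, here is a minimal sketch using rejection sampling: draw points in the polygon's bounding box and keep those that pass a point-in-polygon test. This is one simple approach, not a description of how QGIS or PostGIS implement it; the L-shaped polygon and function names are invented:

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray to the right crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_points_in_polygon(polygon, count, seed=0):
    """Rejection sampling within the polygon's bounding box."""
    rng = random.Random(seed)
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < count:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points


# Hypothetical L-shaped polygon, vertices listed in order.
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 3), (0, 3)]
points = random_points_in_polygon(l_shape, 1000)
print(len(points), "points generated")
```

Every candidate point needs a geometry test, so at millions of points the efficiency of the engine doing those tests matters a great deal.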

PostGIS


Starting

  • Despite all these tools we must start with the basics.

  • Often this is in Quantum GIS (free) or ArcMap ($)

Conclusion

  • It is essential to use data to inform decisions…BUT we must develop a critical awareness of:

    • How the data has been created
    • How the boundary data has been created
    • What the agenda was for collecting the data
  • Almost any data can be spatial

  • We must recognize that:

    • Data is a snapshot / sample of the population
    • Analysis attempts to model the world - it is never perfect.

Scientists must have a say in the future of cities, McPhearson 2016