Pedagogic challenges 2: urban systems science

Andy MacLachlan
The Bartlett Centre for Advanced Spatial Analysis, UCL

Talk overview


Discipline background

  • Part 1: Urban systems science?

  • Part 2: Spatial data


Pedagogic challenges

  • Part 3: Data contamination / manipulation

  • Part 4: Big data

  • Part 5: Reproducibility

  • Part 6: Teaching criticality, data bias, reproducibility

Who am I

Part 1: Urban systems science?

What do we mean

Urban systems

A set of towns and cities [or functions within cities] that can be considered linked together by various forms of social and economic interaction

Source: Oxford reference

Systems thinking

Methods aimed at studying a system through its collective behavioral features

Source: Cristiano et al. 2020

Tools for Systems Thinkers: The 6 Fundamental Concepts of Systems Thinking

Science of cities

The science of cities – using evidence to understand how cities work – is forever expanding

Source: UK Government

Urban science

Urban science is an interdisciplinary field that studies diverse urban issues and problems

Source: Wikipedia

Urban systems science


Urban Systems: Cities [or functions within cities] that can be considered linked together [there is a relationship between them]

+



Urban Science: Urban issues and problems

=



Smart Cities: networks and services are made more efficient with the use of digital solutions for the benefit of its inhabitants and business.

Source: Smart Cities, European Commission

Urban systems science approach:

Updated from Grolemund & Wickham's classic R4DS schematic, envisioned by Dr. Julia Lowndes for her 2019 useR! keynote talk and illustrated by Allison Horst. Source: Allison Horst data science and stats illustrations

The same as regular data science but with spatial data

An example: Urban Heat Island (UHI) effect

Fremantle Woolstore, Western Australia

An example: UHI

Ran 4 scenarios:

  1. Original (existing) development (from satellite imagery)
  2. Proposed redevelopment as in the plan
  3. Proposed redevelopment removing trees
  4. Proposed redevelopment with trees covering the hottest pixels

How smart are cities?


Part 2: Spatial data

What is spatial data?

  • The earth is a 3D sphere (well, almost). It’s wider than it is tall
  • In order to locate a point on the surface of a sphere, we need a set of coordinates
  • Coordinates will tell us how near to the top or bottom of the sphere we are, or how far around
  • But where do we start?

What is spatial data 2?

Geographic Coordinate Reference System

  • treats the globe as if it was a sphere divided into 360 equal parts called degrees
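
One consequence of measuring position in degrees is that a degree is not a fixed distance on the ground. A minimal sketch in Python, assuming a perfectly spherical Earth with a mean radius of 6,371 km (both the radius and the function name are assumptions made for illustration):

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the real Earth is not a perfect sphere


def km_per_degree_longitude(latitude_deg: float) -> float:
    """Ground distance covered by one degree of longitude at a given latitude."""
    # The circle of constant latitude shrinks with the cosine of the latitude
    circumference = 2 * math.pi * EARTH_RADIUS_KM * math.cos(math.radians(latitude_deg))
    return circumference / 360


for lat in (0, 30, 60):
    print(f"latitude {lat:>2}: {km_per_degree_longitude(lat):6.1f} km per degree of longitude")
```

Under this assumption, a degree of longitude at 60 degrees latitude covers half the ground distance it covers at the equator, which is why distances and areas are hard to measure directly in degrees.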


Projected Coordinate Reference System

  • flat, two-dimensional plane (through projecting a spheroid onto a 2D surface) giving it constant lengths, angles and areas
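
As a rough illustration of what "projecting" means, the sketch below converts longitude and latitude in degrees into x and y in metres using the spherical Web Mercator formulas. Web Mercator is only one projection, chosen here as an assumption for illustration; like any projection it distorts some properties, and the function name is hypothetical.

```python
import math

SPHERE_RADIUS_M = 6378137.0  # sphere radius used by spherical Web Mercator


def to_web_mercator(lon_deg: float, lat_deg: float) -> tuple:
    """Project geographic coordinates (degrees) onto a flat plane (metres)."""
    x = SPHERE_RADIUS_M * math.radians(lon_deg)
    y = SPHERE_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Approximate coordinates for central London, used only as an example input
print(to_web_mercator(-0.1276, 51.5072))
```

The output is a pair of planar coordinates, so distances between nearby points can be computed with ordinary 2D geometry rather than on a sphere.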

Coordinate reference systems

Simply

Spatial data is just like normal data except it has an extra “geometry column”
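
A minimal sketch of that idea in plain Python. The `Place` class and the rounded coordinates are assumptions made for illustration; real tools store richer geometries such as lines and polygons.

```python
from dataclasses import dataclass


@dataclass
class Place:
    name: str        # ordinary attribute column
    country: str     # ordinary attribute column
    geometry: tuple  # the extra "geometry column": (longitude, latitude) in degrees


# Approximate, rounded coordinates for illustration only
places = [
    Place("London", "UK", (-0.13, 51.51)),
    Place("Perth", "Australia", (115.86, -31.95)),
    Place("New York", "USA", (-74.01, 40.71)),
]

# An ordinary attribute query, exactly as with non-spatial data
in_uk = [p.name for p in places if p.country == "UK"]

# A spatial query uses the geometry column: which places are south of the equator?
southern = [p.name for p in places if p.geometry[1] < 0]

print(in_uk, southern)
```

Everything except the last query would work on any table; only the geometry column makes the data spatial.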

Pedagogic challenges

Part 3: Data contamination / manipulation?

Data contamination / manipulation


Ok, what about geographic data

Who has made our boundary data?

Who has manipulated our boundary data?

Who has made our boundary data?

Redlining

  • 1930s: the US Home Owners’ Loan Corporation, aiming to prevent missed payments, drew “residential security maps” based on race
    • People abandon areas
    • Can’t refinance
    • Less property tax for services
    • Social equity issues remain
    • 1968 Fair Housing Act

Los Angeles Redlining

Who has made our boundary data?

Gerrymandering

Every 10 years electoral districts are redrawn (“redistricting”). Thomas Hofeller (Republican) = PACK and CRACK

  • PACK = concentrate the Democratic voters in 1 district
  • CRACK = spread them across districts so they never have a majority
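
A toy sketch of how packing and cracking change the result without changing a single vote. The 15 voters, the party labels "D" and "R", and the district layouts are all hypothetical.

```python
from collections import Counter


def seats_won(districts):
    """Count how many districts each party wins by simple majority."""
    winners = Counter()
    for voters in districts:
        party, _ = Counter(voters).most_common(1)[0]
        winners[party] += 1
    return dict(winners)


# 15 hypothetical voters: 9 support party "D", 6 support party "R"
even_map = ["DDDRR", "DDDRR", "DDDRR"]
packed_and_cracked = [
    "DDDDD",  # PACK: five D voters concentrated in one district
    "DDRRR",  # CRACK: the remaining D voters are split...
    "DDRRR",  # ...so they fall short of a majority in both districts
]

print(seats_won(even_map))
print(seats_won(packed_and_cracked))
```

Both maps contain the same 9 "D" and 6 "R" voters; only the boundaries differ, yet the party with 40% of the votes wins two of the three districts in the second map.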

Gerrymandering

“Redistricting is democracy at work” - Tom Hofeller

Pedagogic challenges

Part 4: Big Data

Big data

Big geospatial data include datasets that are too large to be processed using traditional GIS tools

Source: GIS Harvard

Why are they large?

Raster

  • Landsat satellite data: 400 scenes of Earth a day, revisiting each location every 16 days

    • Each scene is about 1GB
    • We’d use Google Earth Engine for this - not considered here
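
A quick back-of-the-envelope calculation shows how fast this adds up. It uses only the two figures above and assumes decimal units (1 TB = 1000 GB):

```python
SCENES_PER_DAY = 400  # from the slide
GB_PER_SCENE = 1      # "about 1GB" per scene, from the slide

daily_gb = SCENES_PER_DAY * GB_PER_SCENE
yearly_tb = daily_gb * 365 / 1000  # assuming decimal units, 1 TB = 1000 GB

print(f"{daily_gb} GB per day, roughly {yearly_tb:.0f} TB per year")
```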

Vector

  • New York City Taxi and Limousine Commission (TLC) all records from Yellow and Green Cabs

    • 150GB, 1.2 billion records
  • OpenStreetMap

    • 1764.5GB when uncompressed

What can we do about it?

Parquet files

  • We are moving from row based storage to column based

  • About 50x faster than a .csv

  • It groups our data.

    • For example, a row group size of 2 puts all the data from rows 1 and 2 next to each other, then a new group starts at row 3. These are GROUPS (or PARTITIONS)

    • If we have large data this means we can skip groups we don’t need
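
The idea of skipping groups can be sketched in plain Python. This is a toy model, not the real Parquet format: the function names, the in-memory layout and the taxi-style records are all assumptions made for illustration. Each group stores its values column by column together with the minimum and maximum of each column, and a query reads a group only if those statistics say it could contain a match.

```python
def build_row_groups(rows, group_size):
    """Split rows into groups; store each group column by column with min/max stats."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(values), max(values)) for name, values in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def select_where_equals(groups, column, target, wanted):
    """Return `wanted` values where `column == target`, skipping groups that cannot match."""
    results, groups_read = [], 0
    for group in groups:
        low, high = group["stats"][column]
        if not (low <= target <= high):
            continue  # the stats show no row in this group can match, so skip it
        groups_read += 1
        for i, value in enumerate(group["columns"][column]):
            if value == target:
                results.append(group["columns"][wanted][i])
    return results, groups_read


# Hypothetical taxi-style records, sorted by day
rows = [
    {"day": 1, "fare": 12.5},
    {"day": 1, "fare": 8.0},
    {"day": 2, "fare": 30.0},
    {"day": 2, "fare": 7.5},
    {"day": 3, "fare": 15.0},
    {"day": 3, "fare": 9.0},
]
groups = build_row_groups(rows, group_size=2)
fares, groups_read = select_where_equals(groups, "day", 3, "fare")
print(fares, f"read {groups_read} of {len(groups)} groups")
```

Skipping pays off most when the data is sorted or partitioned on the column being filtered, as it is here.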

Demystifying the Parquet File Format

New York City Taxi and Limousine Commission (TLC) all records from Yellow and Green Cabs

Concepts

  • You may come across Arrow - this is an in-memory format, Parquet is a storage format

 

  • In the R for Data Science book a 9GB .csv is queried in

    • 11 seconds for standard code
    • 0.063 seconds using a parquet file! Over 100x faster

We can go faster!


DuckDB

  • Database management system

  • Columnar data

  • No installation

  • Convert our Parquet file to DuckDB and back again!

to_duckdb() 
to_arrow()

Regarding performance, parquet is 717 times faster than the same query on a csv file, and duckdb is 2808 times faster.

Source: Christophe Nicault

Notes

  • All (parquet and DuckDB) make use of dplyr! select(), filter(), group_by() = direct integration with R

  • Currently the support for spatial data is very limited

  • sfarrow - can load and query the data but can’t do any analysis!

Postgres

Postgres = object-relational database

DVD Rental Model

PostgreSQL has a PostGIS extension

This allows the “geometry” column and spatial queries

Making random points in polygons

5 million random points

  • QGIS = 226 seconds
  • PostGIS = 18 seconds

Source: Why should you care about PostGIS? — A gentle introduction to spatial databases
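
To make the task itself concrete, here is one simple way to generate random points inside a polygon: draw points uniformly in the bounding box and keep those that pass a ray-casting point-in-polygon test. This is a sketch for illustration only and is not claimed to be how QGIS or PostGIS implement it; the function names and the L-shaped polygon are hypothetical.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray casting: toggle `inside` each time a ray going right from (x, y) crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge spans the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_points_in_polygon(polygon, count, seed=0):
    """Rejection sampling: draw points in the bounding box, keep those inside the polygon."""
    rng = random.Random(seed)
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < count:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points


# Hypothetical L-shaped polygon; the square with x > 2 and y > 2 lies outside it
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
points = random_points_in_polygon(l_shape, count=1000)
print(len(points), points[0])
```

Every candidate point is tested against every polygon edge, so the cost grows quickly with millions of points and detailed boundaries, which is where optimised spatial tools earn their keep.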

PostGIS


Starting

  • Despite all these tools we must start with the basics.

  • Often this is in Quantum GIS (free) or ArcMap($)

  • We will be exploring QGIS in the workshop later

Pedagogic challenges

Part 5: Reproducibility

What led me here?

  • Lecture with Carl Howe

2017: “90% of the data in the world today has been created in the last two years alone, at 2.5 quintillion bytes of data a day!” - IBM

Ok, what about geographic data

A shifting landscape

Paper: Opening practice: supporting reproducibility and critical spatial data science
  • A comparison of Geographically Weighted Regression (GWR) across:
    • 4 open software packages
    • 2 black box / commercial implementations

All of the implementations were tested with the same input data.


They all gave the same results except the ESRI/ArcGIS implementation (Li 2018)


and although ESRI provide help for the GWR tools, the actual coding is closed—the underlying code is not revealed

Part 6: Teaching criticality, data bias, reproducibility


  1. Lead by example

    • 1b. Listen to alumni / employers

    • 1c. Learn by doing

  2. Don’t assess it, make it mandatory for the assessment*

1. Lead by example

  • Traditional labs were distributed as PDFs, Word documents and PowerPoints.

  • Used ArcGIS 💰

1. Lead by example

1b. Listen to Alumni / employers

1c. Design and outputs

Learning happens by doing


Weekly homework that we dedicate time to discussing

  • Week 1-5 tasks
  • Week 6-9 practice exam

1c. Design and output

Part 1: GIS tools…subject based learning

You need to calculate the average percentage of science students (in all grades) per county meeting the required standards
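
The non-spatial core of this task is a group-by and an average. A minimal sketch in Python, where the records, county names and field names are all hypothetical:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: one row per county and grade
records = [
    {"county": "A", "grade": 5, "percent_met": 60.0},
    {"county": "A", "grade": 8, "percent_met": 70.0},
    {"county": "B", "grade": 5, "percent_met": 40.0},
    {"county": "B", "grade": 8, "percent_met": 50.0},
    {"county": "B", "grade": 10, "percent_met": 60.0},
]

# Group the percentages by county, across all grades
by_county = defaultdict(list)
for row in records:
    by_county[row["county"]].append(row["percent_met"])

# Average within each county
average_by_county = {county: mean(values) for county, values in by_county.items()}
print(average_by_county)
```

The GIS step is then joining these per-county averages to the county boundaries so they can be mapped.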

Part 2: GIS analysis… problem based learning

Each practical answers a question….

What are the factors that might lead to variation in Average GCSE point scores across the city?

What are we assessing?


Can students apply the tools / methods with different scenarios and data?


Can students critique the process?

2. Make it mandatory for the assessment

Part 2: GIS analysis, example practice question

New York City wishes to conduct a study that aims to prevent people being evicted by understanding possible related factors. You have been enlisted as a consultant and tasked with conducting an analysis of their data from 2020.

Data:

2. Make it mandatory for the assessment

DISCUSS

  • How were the evictions recorded

  • Why were there limited evictions during 2020, then a sudden peak? - COVID ban on evictions

  • How can identifying spatially related factors to evictions be useful

  • Are there certain areas that have higher evictions than others - why might this be?

  • What assumptions does the data make

  • What assumptions does the model make

2. Make it mandatory for the assessment

Students

  • Click the URL, which generates a new repository

  • Staff can see their work and when they make edits (commit / push)

Conclusion

  • It is essential to use data to inform decisions…BUT we must develop a critical awareness of:

    • How the data has been created
    • How the boundary data has been created
    • What the agenda was for collecting the data
  • In addition we must recognize that:

    • Data is a snapshot / sample of the population
    • Analysis attempts to model the world - it is never perfect.

Scientists must have a say in the future of cities, McPhearson 2016