All Collections
Direct Database Access (BigQuery)
Converting BigQuery Results to Shapefile Format
Converting BigQuery Results to Shapefile Format

Step-by-step instructions for how to create shapefiles from BigQuery results

J
Written by John Buckholz
Updated over a week ago

Replica offers direct database access to Places data via Google’s BigQuery (BQ). You can learn more about what we offer via BQ access here. BQ offers a flexible way for customers to analyze our trip and population datasets. Some customers extend the power of BQ by connecting to our datasets with Python. Accessing BQ via Python unlocks the ability to export large query results. It can also facilitate analysis of our data alongside external datasets.

This article documents a method to export Replica trip or population query results to shapefile or or other geospatial file formats.

When would I want to export table data to shapefile format?

Replica provides geographic information (e.g. trip start points) in Well-Known-Text (WKT) format. This standard allows many desktop softwares to interpret a written set of coordinates as a geometric object. Although it is typically a straightforward process to visualize tabular data in a .csv file containing WKT, you may prefer the ease-of-use of the industry standard shapefile format. Replica offers the ability to download shapefile formats in both Places and Trends.

The purpose of this article is to offer the same flexibility to customers who are using BQ for custom aggregations, large queries, or analyses involving Replica outputs and third-party datasets. The article includes an example query that generates custom tract-level statistics for the synthetic population of Indiana, using the Fall 2021 season for the Great Lakes megaregion, as well as a demonstration of the technique to convert that tabular output into a shapefile.

What do I need to connect to BQ with Python to create a shapefile?

This workflow and example have the following requirements:

  • A working installation of Python that includes the following libraries:

  • Replica Places Direct BQ Access. You can request access to BQ via this link. We typically grant access within 5 business days.

  • Familiarity with the tables and associated schema available to you via BQ access. A user guide can be found here.

Example use case: Custom population portrait

In this example, we’ll generate tract-level portraits of educational attainment for the Fall 2021 Great Lakes season. Specifically, we’ll use basic math functions to calculate the proportion of:

  1. Adults between ages of 25 and 65 who have an advanced degree; and

  2. The share of those adults who are classified as:

    1. Low-income (individual income < $35,000);

    2. Medium-income (individual income between $35,000 and $75,000); and

    3. High-income (individual income > $75,000)

This query highlights the power of BQ to create on-the-fly summarizations of data contained in one or more columns. The output of the query includes several columns that might be used to symbolize a thematic choropleth map.

The image below depicts the shapefile loaded into QGIS and symbolized to display the percentage of residents of each tract who have an advanced degree and whose annual individual income exceeds $100,000.

Example map of Indiana population metrics

Steps to execute workflow

  1. Customize the BQ SHAPEFILE python script reproduced below (and available to download here)

    1. If desired, designate a location where the “output” dataframe should be saved and add language to save the dataframe to that location. The “to_file” function can generate other file types, depending on available drivers.

    2. Modify the query as desired.

  2. Save the script to your hard drive

  3. Execute the script

Considerations

Because a Python interpreter is designed to identify and debug Python errors, it can be difficult to diagnose errors that may be present in a SQL query contained within a Python script. This query includes nested functions (e.g. “COUNTIF” within a “ROUND” function). It may be easier to track and validate necessary changes to syntax within the BQ UI before copying a modified query into this Python script.

Note that this query utilizes the “SAFE_DIVIDE,” function, which will return a NULL value if an error occurs. “Safe” functions like this one are often useful for calculated fields that are generated by performing operations on columns that may contain NULL or 0 values.

Finally, note the use of the Block Group FIPS code to filter quickly to residents of Indiana. The nesting nature of FIPS codes makes them a helpful way to quickly filter queries by Census-defined geographies.

# Python Script - BQ SHAPEFILE

import geopandas as gpd
import pandas_gbq
import shapely.wkt

query = '''
WITH counts AS (
SELECT
COUNT(person_id) AS people,
ROUND(COUNTIF(education = "advanced_degree" AND
individual_income < 35000) /
COUNT(person_id), 4) AS low_inc_adv,
ROUND(COUNTIF(education = "advanced_degree" AND
individual_income BETWEEN 35000 AND 75000) /
COUNT(person_id), 4) AS med_inc_adv,
ROUND(COUNTIF(education = "advanced_degree" AND
individual_income > 75000) /
COUNT(person_id), 4) AS high_inc_adv,
LEFT(BLOCKGROUP, 11) AS tract
FROM replica-customer.great_lakes.great_lakes_2021_Q4_population
WHERE age BETWEEN 25 AND 65
AND LEFT(BLOCKGROUP, 2) = "18"
GROUP BY tract)

SELECT
counts.*,
ROUND(low_inc_adv + med_inc_adv + high_inc_adv, 4) AS total_adv,
ROUND(SAFE_DIVIDE(high_inc_adv, low_inc_adv + med_inc_adv
+ high_inc_adv), 4) AS high_inc_prob,
geom AS geometry
FROM counts
JOIN replica-customer.Geos.trct
ON counts.tract = trct.raw_id
ORDER BY tract
'''

# execute the query above and convert the result to a pandas dataframe
output = pandas_gbq.read_gbq(query, project_id='replica-customer')

# convert the WKT geometry representation to a shapely geometry
geometry = output['geometry'].map(shapely.wkt.loads)

# remove the original geometry column from the dataframe
output = output.drop('geometry', axis=1)

# create a GeoPandas geodataframe from the original dataframe, using
# the shapely geometry created above
gdf = gpd.GeoDataFrame(output, crs='EPSG:4326', geometry=geometry)

# now that the geodataframe is created, delete the original dataframe to conserve memory
del output

# convert the geodataframe to an ESRI shapefile. This is the default file format
# for the geopandas "to_file" function. Other file formats can be created using this
# method by specifying the appropriate file path and driver.

# modify the path to indicate the location where the file should be saved on your hard drive

gdf.to_file('/path.shp')

Did this answer your question?