How to Connect to BigQuery with Python

Step-by-step instructions for how to connect to Replica's BigQuery tables using Python

Written by Lauren Massey

Connecting to BigQuery with Python

Replica offers direct database access to Places data via Google’s BigQuery (BQ). You can learn more about what we offer via BQ access here. Popular uses of BQ include generating link-based origin-destination matrices; performing flexible queries; summarizing trip and person characteristics; and comparing trip or people metrics across geographies. When the outputs of a query are relatively small and when the inputs to a query are all contained within existing customer-facing tables, customers are able to execute queries and export the results to their hard drive or to their own Google Cloud account.

Why would I want or need to connect remotely?

Those with BQ access are supplied with the “Data Viewer” permissions profile. This level of permissions allows you to execute queries against existing tables. It does not allow you to create new tables or to modify existing tables. In addition, the BQ GUI imposes size limits on query result exports.

These limitations can potentially interfere with workflows that require large queries. For instance, a query that returns detailed attribute information about 50,000 trips will exceed the maximum size for a local download. Fragmenting long query results into exportable files is a potentially laborious and error-prone process.

The limitations also affect workflows that involve comparing Replica outputs to external data. Common examples include network gate analysis and association of trip and/or people locations with custom geometries. Because Data Viewers cannot create new tables or modify existing tables, there is no self-contained solution within the BQ GUI that allows you to “join” query results against custom lists or tables. It may be possible to circumvent this restriction by manually entering lists of stableEdgeIds or WKT geometries as query parameters, but this workaround is cumbersome, prone to error, and may run up against other limits on the number of parameters or total characters that a query can contain.

In these cases, it is often preferable to connect remotely to BQ. Using Python (or R) to access BQ allows you to save large tables to your local hard drive or other external-to-BQ storage, and it enables direct comparison of BQ data to data stored in external-to-BQ files. This guide contains example data and workflows for the two use cases mentioned above (a minimal sketch of the basic remote-access pattern follows this list):

  • Gate Analysis

  • Custom Geographies
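As a preview of the basic remote-access pattern, the minimal sketch below runs a query from Python and saves the full result to a local file, which is exactly the kind of export the BQ GUI limits would block for large results. The project, table, and column names are taken from the CUSTOM GEOGRAPHIES example later in this guide; substitute tables available to your account, and treat the output path as a placeholder.

# Minimal sketch: run a query remotely and save the full result locally.
# Table, project, and column names are from the example later in this guide;
# replace them with tables available to your account.
import pandas_gbq

query = '''
SELECT
    activity_id,
    origin_bgrp_lng,
    origin_bgrp_lat,
    destination_bgrp_lng,
    destination_bgrp_lat
FROM `replica-customer.north_atlantic.north_atlantic_2021_Q4_thursday_trip`
LIMIT 50000
'''

# This prompts a browser-based Google authentication on first run
df = pandas_gbq.read_gbq(query, project_id='replica-customer')

# Save the dataframe to your hard drive; the path is a placeholder
df.to_csv('trip_sample.csv', index=False)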

What do I need to connect to BQ with Python?

This workflow and example have the following requirements:

  • A working installation of Python that includes the following libraries:

    • pandas-gbq (used in the example script to authenticate, run queries, and return results as a pandas dataframe);

    • GeoPandas [only required for queries involving geospatial resources, such as shapefiles, GeoJSON files, or .csv files containing Well-Known Text (WKT)]; and

    • Shapely [like GeoPandas, only required for queries involving geospatial resources]

  • Replica Places Direct BQ Access. You can request access to BQ via this link. We typically grant access within 5 business days.

  • Familiarity with the tables and associated schema available to you via BQ access. A user guide can be found here.
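Once access has been granted, a quick way to confirm that your Python environment and authentication are working, and to see which tables you can query, is to list the tables in a dataset via BigQuery's INFORMATION_SCHEMA. The sketch below assumes the north_atlantic dataset used in the example later in this guide; substitute the dataset(s) available to your account.

# Connectivity check: list the tables in a dataset you have access to.
# The dataset name is an assumption based on the example later in this guide.
import pandas_gbq

tables = pandas_gbq.read_gbq(
    'SELECT table_name FROM `replica-customer.north_atlantic.INFORMATION_SCHEMA.TABLES`',
    project_id='replica-customer'
)
print(tables)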

Example use case: Custom geographies

In this example, we’ll compare trip start and end points from the North Atlantic Fall ‘21 season to a custom geography representing U.S. Census Bureau County Subdivisions in the New York portion of the New York City Metropolitan Area. An interactive map of this custom geography, which is pictured below, can be found here. Replicate this example workflow by downloading the GeoJSON containing County Subdivision geometries from here.

In this case, we’ll benefit from the fact that although Data Viewers cannot create permanent BQ tables, they can create temporary tables (which persist only for the length of the BQ session). We will therefore load our GeoJSON from a local file as a geodataframe and create a temporary table representing the contents of that GeoJSON (after applying some light transformations to the data). We will join the trip table against that temporary table to obtain results that:

  1. Only include trips that both start and end in one of the custom polygons;

  2. Include columns based on the custom geography contained in the GeoJSON to describe the start and end points of those trips; and

  3. Summarize the data to provide a count of trips associated with each OD pair.

The below script will create a pandas dataframe that contains the results of this query. The script can be modified to save this dataframe to your local computer or to a file system where you have write permissions, and the query contained within the script can be modified as needed.
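For example, a line like the following could be appended to the end of the script, assuming output is the dataframe the script creates; the path is only a placeholder.

# Hypothetical addition to the end of the CUSTOM GEOGRAPHIES script:
# save the results dataframe to a local CSV (the path is a placeholder)
output.to_csv('/Path/custom_geo_od_counts.csv', index=False)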

Steps to execute workflow

  1. Customize the CUSTOM GEOGRAPHIES python script reproduced below (and available to download here)

    1. If desired, designate a location where the “output” dataframe should be saved and add code to save the dataframe to that location

    2. Make any other modifications. For instance, the query might count trips by OD zone and primary mode

  2. Save the script to your hard drive

  3. Execute the script

    1. If running from an IDE:

      1. Modify line 15 to point to the path where you’ve saved the example GeoJSON

      2. Comment out line 18

      3. Modify the query as desired

      4. Add code to save the dataframe to a permanent file, if desired

    2. If running from Terminal or Command Prompt:

      1. Comment out line 15

      2. Modify the query as desired

      3. Add code to save the dataframe to a permanent file

      4. Include the path where you’ve saved the example GeoJSON as a command parameter
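For example, assuming the script has been saved as custom_geographies.py (the filename is arbitrary) and the example GeoJSON sits at the placeholder path below, the call from Terminal or Command Prompt would look like:

python custom_geographies.py /Path/ny_metro_ctycsub.geojson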

Considerations

Because the Python interpreter reports Python errors rather than SQL errors, it can be difficult to diagnose problems in a SQL query contained within a Python script. Especially if your query is more complex than the example query, it may be helpful to “test” the main query (Lines 60-69 in the example CUSTOM GEOGRAPHIES script) in the BQ GUI by replacing the custom table with one of the available customer-facing tables.

Also note the comment about “SAFE” functions in the CUSTOM GEOGRAPHIES script. BQ will reject geometries with relatively minor errors, including duplicated vertices and self-overlapping polygons. You should check the output of the script to ensure that all custom geometries are represented. The simplification and buffering applied to geometries in Line 25 serve two useful functions in this respect.

First, they offer a solution that generally overcomes geometry deficiencies with minimal distortion to the underlying shapes. Second, they limit the length of the query by reducing the number of coordinate points that must be described. Depending on the number and complexity of polygons in your custom geography, it may be possible to adjust or even eliminate these parameters. Please note that the simplification process may introduce incidental inconsistencies between results derived from Places Explorer and BQ.
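One way to perform the QAQC check recommended above is to compare the geography names in the input geodataframe with the names that appear in the query results. The sketch below assumes the gdf and output variables from the CUSTOM GEOGRAPHIES script and a name column called name (matching the temp table schema); a geography flagged as missing either had its geometry nulled by the SAFE function or simply had no qualifying trips.

# QAQC sketch: confirm every custom geography appears in the query results.
# Assumes the "gdf" and "output" variables from the CUSTOM GEOGRAPHIES script.
input_names = set(gdf['name'])
output_names = set(output['origin_geo']) | set(output['destination_geo'])
missing = input_names - output_names
if missing:
    print('Geographies absent from the results:', sorted(missing))
else:
    print('All custom geographies are represented in the results.')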

Example output when CUSTOM GEOGRAPHIES script is run using Visual Studio Code


# Python Script - CUSTOM GEOGRAPHIES

import pandas_gbq
import geopandas as gpd
from shapely.geometry import MultiPolygon
import shapely.wkt

# optional, include if running script from Terminal or Command Prompt
import sys

# function to serialize geometry to text string
def to_wkt(x):
    x = shapely.wkt.dumps(x)
    return x

# Use this line if running from IDE
infile = '/Path/ny_metro_ctycsub.geojson'

# use this line if running from Terminal or Command Prompt
infile = sys.argv[1]

gdf = gpd.read_file(infile)

# coerce each geometry to MultiPolygon so every row shares one geometry type
gdf['geometry'] = gdf['geometry'].apply(lambda g: g if isinstance(g, MultiPolygon) else MultiPolygon([g]))

# simplify geometries to avoid a query with 1M+ characters
gdf['geometry'] = gdf['geometry'].simplify(0.05).buffer(0)

# return geometry to WKT
gdf['geometry'] = gdf['geometry'].apply(to_wkt)

# convert each row into a tuple containing name and geometry.
# This step is required to load the data into a temporary table in BQ
rows = str([tuple(r) for r in gdf.values]).strip('[]')

# create a string representation of the query. We use the "rows"
# variable created above as a parameter to populate the temporary table
# called "geo_table". Finally, note that the query uses a "SAFE"
# function. If there are geometry errors in the input custom geography,
# the temp table will have rows with NULL geometry. This means that
# trips will not be associated with these geometries. The "SAFE"
# function prevents queries from failing completely due to geometry
# errors. As part of the QAQC process, check to ensure that all
# geometries are represented in the output table.

query = '''
CREATE TEMP TABLE geo_table (
    name STRING,
    geometry STRING
);

INSERT geo_table (name, geometry) VALUES {};

CREATE TEMP TABLE query_geo AS
SELECT
    name,
    SAFE.ST_GEOGFROMTEXT(geometry) AS geometry
FROM geo_table;

SELECT
    COUNT(activity_id) AS trips,
    o.name AS origin_geo,
    d.name AS destination_geo
FROM `replica-customer.north_atlantic.north_atlantic_2021_Q4_thursday_trip`
JOIN query_geo AS o
    ON ST_COVERS(o.geometry, ST_GEOGPOINT(origin_bgrp_lng, origin_bgrp_lat))
JOIN query_geo AS d
    ON ST_COVERS(d.geometry, ST_GEOGPOINT(destination_bgrp_lng, destination_bgrp_lat))
GROUP BY origin_geo, destination_geo
'''.format(rows)

# execute the query. This will prompt a browser authentication. Be sure
# to sign in with the Google-based address used to obtain BQ access
output = pandas_gbq.read_gbq(query, project_id='replica-customer')
