Google Cloud's data warehousing solution, BigQuery, is quickly becoming a popular choice amongst data engineers, myself included. Compared to AWS Redshift and Azure Synapse it's significantly cheaper for a variety of use cases because BQ is serverless and charges for the data your queries scan and for storage.

I opted to use BigQuery for a project in which I needed to distribute analytics reports and frequently changing source data to a variety of stakeholders. Because the stakeholders' data shared similar schemas, I aggregated it all together, so I also needed a way to automatically create these reports and deliver only the appropriate data depending on each stakeholder's identity.

To start, I moved the aggregated datasets into GCP using the Storage Transfer Service, since the data was sitting in S3 as partitioned parquet files. The UI was intuitive, and within the hour all of my data had been copied from S3 to a GCS bucket.

Next, I needed to create tables within my BigQuery project. BQ offers a couple of different table types, including External and Native tables. External tables function similarly to AWS's Athena service, where the blob storage files are queried directly in the bucket. Native tables import the data into Google's proprietary columnar format and store it on their Colossus file system. Because the data changes frequently I figured External tables would be best, but while the data was easy to keep up to date, the queries took too long even with external hive partitioning configured.
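
So I created Native tables instead by loading the parquet files from GCS. A minimal sketch of such a load with the Python SDK, assuming a hypothetical bucket path and table name:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical GCS path and destination table
gcs_uri = "gs://my-bucket/stakeholder_data/*.parquet"
destination = client.dataset("unfiltered").table("table")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    # overwrite on each refresh since the source data changes frequently
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Load the parquet files into a Native BigQuery table
load_job = client.load_table_from_uri(gcs_uri, destination, job_config=job_config)
load_job.result()  # wait for the load to finish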

After creating all of my Native tables in BigQuery I discovered I could solve my permissioning issue with Authorized Views. According to the documentation,

An authorized view allows you to share query results with particular users and groups without giving them access to the underlying tables

I decided to split my data into three different BQ datasets (plus an optional fourth, public, covered at the end):

  • unfiltered - Raw data imported from GCS
  • filtered - Contains authorized views that hide privileged data
  • access_control - Contains the permissions table
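
For reference, a quick sketch of creating those datasets with the Python SDK (the location is an assumption):

from google.cloud import bigquery

client = bigquery.Client()

# Create the three datasets; the US location is an assumption
for name in ["unfiltered", "filtered", "access_control"]:
    dataset = bigquery.Dataset(client.dataset(name))
    dataset.location = "US"
    client.create_dataset(dataset, exists_ok=True)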

If a user has access to both a view and the underlying tables the view refers to, then we don't have to authorize the view. When we don't want the user to have access to the underlying data, but we do want them to have access to the result set returned by the view, we must use authorized views.

Because I needed to return only the data belonging to the user making the query, I had to use authorized views. However, I also needed to implement my own row level security within the view.

To return the appropriate data for a particular user based on their department and team, we need to create a permissions table in the access_control dataset. The schema is simple:

{'department': STRING, 'team': STRING, 'user': STRING}
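
A sketch of creating that table with the Python SDK, matching the schema above:

from google.cloud import bigquery

client = bigquery.Client()

# Permissions table mapping a user's email to their department and team
table = bigquery.Table(
    client.dataset("access_control").table("permissions"),
    schema=[
        bigquery.SchemaField("department", "STRING"),
        bigquery.SchemaField("team", "STRING"),
        bigquery.SchemaField("user", "STRING"),  # the user's email address
    ],
)
table = client.create_table(table)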

In our unfiltered dataset, the tables should have columns matching the user attributes stored in access_control. The SQL to return only the rows that match the querying user's department and team is then:

SELECT unfiltered.*
FROM project.unfiltered.table unfiltered
    INNER JOIN (
        SELECT department AS dp, team AS tm
        FROM access_control.permissions
        WHERE SESSION_USER() = user
    ) perm
    ON unfiltered.department = perm.dp AND unfiltered.team = perm.tm;

SESSION_USER() is a BigQuery function that returns the email of the user making the query.
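
For illustration, a permission row could be inserted like so (the email, department, and team values are made up):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical row: jane@example.com may query Sales / EMEA data
client.query(
    "INSERT INTO access_control.permissions (department, team, user) "
    "VALUES ('Sales', 'EMEA', 'jane@example.com')"
).result()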

To create the view in the filtered dataset using the Python SDK:

from google.cloud import bigquery

client = bigquery.Client()

table_name = "table"  # hypothetical name of the source table and the view
dataset_ref = client.dataset("filtered")

view = bigquery.Table(dataset_ref.table(table_name))
sql_template = """
SELECT unfiltered.*
FROM project.unfiltered.{} unfiltered
    INNER JOIN (
        SELECT department AS dp, team AS tm
        FROM access_control.permissions
        WHERE SESSION_USER() = user
    ) perm
    ON unfiltered.department = perm.dp AND unfiltered.team = perm.tm
"""
view.view_query = sql_template.format(table_name)
view = client.create_table(view)

Once the view is created you'll be able to query it like a normal table. However, in order to prevent your end users from accessing the underlying tables while allowing them access to the view, you'll need to authorize the view with the source dataset. The following Python function will do just that:

from google.api_core.exceptions import BadRequest
from google.cloud import bigquery

client = bigquery.Client()

def authorize_view(view, source_dataset_name):
    try:
        # Append the view to the source dataset's access entries
        source_dataset = client.get_dataset(client.dataset(source_dataset_name))
        access_entries = source_dataset.access_entries
        access_entries.append(
            bigquery.AccessEntry(None, "view", view.reference.to_api_repr()),
        )
        source_dataset.access_entries = access_entries
        source_dataset = client.update_dataset(
            source_dataset, ["access_entries"]
        )
    except BadRequest:
        # Authorization already exists
        return

Here, view is the view we just created and source_dataset_name is the name of the dataset the view needs to be authorized with. Now, the view we just created will be able to query the unfiltered tables even if the querying user does not have access to those tables.
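
For example, since the view reads from both the unfiltered tables and the permissions table, it presumably needs to be authorized with both of those datasets:

# Authorize the filtered view with every dataset it reads on the user's behalf
authorize_view(view, "unfiltered")
authorize_view(view, "access_control")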

To share the filtered dataset with your end users you need to grant them the roles/bigquery.dataViewer role on that dataset. I suggest creating a group to make permissioning more manageable. However, when you grant a user only the Data Viewer role you'll find they're unable to query the tables in the dataset, and that's intentional: BigQuery bills queries to the GCP project from which the query job is run, and Data Viewer on the dataset alone doesn't let a user run (and therefore bill) query jobs in your project. Instead, they can pin the dataset in BQ, open BQ from within their own project, and query it from there. If you do need your end users to bill queries to your project, simply grant them the roles/bigquery.user role on your project. Note they will still be unable to access other datasets/tables unless you grant them access.
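
Granting the group access can be done in the console or with the SDK; here's a sketch of the latter with a hypothetical stakeholders@example.com group (at the dataset level, the legacy READER role corresponds to Data Viewer):

from google.cloud import bigquery

client = bigquery.Client()

# Grant the end-user group read access to the filtered dataset
filtered = client.get_dataset(client.dataset("filtered"))
entries = list(filtered.access_entries)
entries.append(
    bigquery.AccessEntry("READER", "groupByEmail", "stakeholders@example.com")
)
filtered.access_entries = entries
client.update_dataset(filtered, ["access_entries"])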

The final, optional step is to create a view on top of the view we created in order to hide the SQL permissioning logic from the end user. To do this we could create a new dataset, public, that we'd share with the user instead of filtered. The SQL query would simply be:

SELECT * FROM project.filtered.table;

to retrieve all of the data. However, it is very important to note that simply authorizing this new view with the filtered dataset is not sufficient. We must also authorize it with the source dataset(s), which would be unfiltered in our case.
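
Putting that together, here's a sketch that reuses the authorize_view helper from above (dataset and table names are the same hypothetical ones as before):

# Create the optional public dataset and a pass-through view inside it
public_ref = client.dataset("public")
client.create_dataset(bigquery.Dataset(public_ref), exists_ok=True)

public_view = bigquery.Table(public_ref.table("table"))
public_view.view_query = "SELECT * FROM project.filtered.table"
public_view = client.create_table(public_view)

# Authorizing with filtered alone is not enough; the view must also be
# authorized with the dataset(s) holding the underlying data
# (and presumably access_control, which the filtered view reads as well)
authorize_view(public_view, "filtered")
authorize_view(public_view, "unfiltered")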