BigQuery User-Defined-Function for mapping tables

In this post I am trying to explain, how to use a User-Defined-Function (“UDF”) to give Analyst access to small mapping and dimension tables without using a SQL-JOIN, thus making the procedure more reliable and standardized.

Let’s say you have two tables:

  1. One with purchase events tracked by the web analytics tool of your choice. Among other things the schema includes a campaign-ID, which can be mapped to campaign details.
  2. A second table “campaign-data” that consists of distinct campaign-IDs with the corresponding campaign, source and medium.

The goal is to give reliable access to both table while minimizing the risk of poorly written JOINs or even differently written ones for different SQL-operations. You can avoid that by creating a UDF that takes the JOIN-column as an argument and returns the complete row with the corresponding field as a STRUCT. This way, you can complement your event data (table 1) with campaign information reliably, without performing a JOIN.
In addition to that you ensure, that the campaign mapping is always performed correctly as long as you supply the function to everyone.
But there is one thing you need to keep in mind:
This only works for mapping/dimension tables that consist of distinct rows only. Duplicate rows create a syntax error in BigQuery and prevent your whole query from running successfully.

Let’s take a closer look on the SQL-code:

-- Create UDF for campaign data table
CREATE OR REPLACE FUNCTION dataset.function_name(c_id STRING)
RETURNS STRING
AS (
  SELECT AS STRUCT
    *
  FROM
    project.dataset.campaign-data
  WHERE
    campaign_id = c_id
  LIMIT
    1
) OPTIONS (
  description = 'DESCRIPTION'
);

-- Apply function on event data
SELECT
  * EXCEPT(campaign_id),
  project.dataset.function_name(campaign_id) AS campaign_data
FROM
  project.dataset.eventdatatable

I hope this gave you an idea on a neat alternative to JOINs in BigQuery – at least for a specific set of cases.