So recently I’ve been working a lot with dbt, so I thought I’d go through the process of setting up a dbt project and dbtCloud account with Google BigQuery (I’m more familiar with Snowflake, but I’ve already used up the free Snowflake credits for an individual account!).
What is dbt?
dbt describes itself as “a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation”. In a nutshell (cue the Austin Powers “in a nutshell” gif):
dbt is a tool that lets us do the transform part of the ELT pipeline within our data warehouse, creating reusable, version-controlled analytics code through a combination of SQL and Jinja. I spent the first few months of my role as a data scientist wondering what it actually did, and it wasn’t until I got my hands dirty and properly got involved that I came to understand the power of the tool, especially when you throw dbtCloud into the mix as well. I’d recommend reading this for a better explanation.
Getting Started
For reference, all of the code can be found in my GitHub repo here.
The first few steps in setting up a dbtCloud project are fairly well documented in the dbt docs. All we do (after signing up for a free developer account on dbtCloud) is:
- Create a new project in dbtCloud.
- Add the credentials for our data warehouse. For this example I’m using BigQuery. The only trouble I had at this stage was working out how to export the JSON key for a dbt-user service account within BigQuery - it turns out it’s fairly simple if you’re familiar with Google Cloud Platform - however if you’re not, it may take a little while to find…
- Link the dbtCloud project up with a git repo.
For more detailed instructions, just check out the docs.
And now we’re ready to start…
dbt_project.yml
After running dbt init in our dbt project directory, we’ll have the folder structure set up to start building our models. For this project I’ll be using the NCAA public dataset in BigQuery, which “contains data about NCAA Basketball games, teams, and players. Game data covers play-by-play and box scores back to 2009, as well as final scores back to 1996. Additional data about wins and losses goes back to the 1894-5 season in some teams’ cases.”
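For anyone who hasn’t run it before, dbt init scaffolds the standard dbt folder layout - roughly the structure below, although the exact folder names vary a little between dbt versions, and the staging and marts sub-folders are ones we’ll add ourselves for this project:
ncaa_basketball
├── dbt_project.yml
├── models
│   ├── staging
│   └── marts
├── macros
├── seeds
├── snapshots
├── tests
└── analyses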
The dbt_project.yml file is where we’ll add our project structure and configurations at an overall project level. Part of this will involve applying different configs to different folders within our dbt project. Using this as a guide to structuring the project, we’ll start by looking at the different schemas we’ll create within our project.
models:
  ncaa_basketball:
    +materialized: view
    staging:
      +schema: staging
    marts:
      +materialized: table
      core:
        +schema: core
      intermediate:
        +schema: intermediate
So, going over the yml code above:
- ncaa_basketball is the overall name of our project. In BigQuery we have the hierarchy of a project (which we’d normally class as a database), which has multiple datasets (more commonly classed as schemas), and each dataset contains tables and views.
- Although the default materialization in dbt is a view, we explicitly state that the default materialization for all models in our project will be +materialized: view. This can then be configured at a folder level further down, or even per model (see the sketch after this list).
- To start with, our datasets will follow a simple structure. We’ll have one dataset/schema that contains all of our staging models - dbt will automatically generate this as <target_schema>_<custom_schema> within BigQuery, so with our target dataset named ncaa_basketball the staging dataset will be ncaa_basketball_staging. The staging models will be the only models in the whole project that directly reference the raw NCAA tables within BigQuery. All other models which sit on top of this layer will reference the models within the staging layer.
- Above the staging layer sits the marts layer. These models will contain the tables which will subsequently be used for analysis and insight, and they’re configured to be materialized as tables. If we want quick query response times when providing data to the business, tables will be quicker than views, but the trade-off is the extra storage needed to hold the table.
- Within the marts we have our core schema, consisting of our top-level analysis, and an intermediate schema which will contain some intermediate transformations between the staging tables and the analytical models.
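For completeness, the same materialization and schema settings can also be applied inside an individual model file with a config block rather than in dbt_project.yml, and the model-level setting wins if the two disagree. A minimal sketch (the file name and select are purely illustrative):
-- models/marts/core/example_model.sql (hypothetical file)
{{ config(materialized='table', schema='core') }}

select 1 as placeholder_column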
It’s also useful to add the dbt-utils package to our packages.yml file, as it contains some handy macros.
packages:
- package: dbt-labs/dbt_utils
version: 0.8.2
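After adding this, running dbt deps pulls the package down into the project so its macros can be used. As a quick taste of what it offers, the surrogate_key macro hashes a list of columns into a single key - a rough sketch below, with the model and column names purely illustrative:
select
    {{ dbt_utils.surrogate_key(['game_id', 'player_id']) }} as player_game_key,
    count(*) as rows_per_key
from {{ ref('stg_players_games_sr') }}
group by 1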
Staging Tables
The first models we need to make are our staging tables, so within our staging folder we need a few things:
- A source file, which we’ll call src_ncaa_basketball.yml, a sample of which can be seen below. In this file we’re telling our project where to look for the raw dataset, adding descriptions of each source for documentation, adding each schema we’ll be reading from, and then adding the tests we want to run on each of our sources (I’ll cover testing and source freshness in a separate post).
version: 2

sources:
  - name: ncaa_basketball
    database: bigquery-public-data
    description: NCAA basketball data from BigQuery public datasets
    schema: ncaa_basketball
    tables:
      - name: mascots
        description: NCAA team mascot data
        columns:
          - name: id
            description: Primary key for mascots
            tests:
              - unique
              - not_null
      - name: mbb_games_sr
        description: Team level box scores
        columns:
          - name: game_id
            description: Primary key for each game
            tests:
              - unique
              - not_null
      - name: mbb_historical_teams_games
        description: Final scores for men's basketball games, starting with the 1996-97 season. Each game is included twice, with one entry per team.
      - name: mbb_historical_teams_seasons
        description: Season record information for Men's Basketball, starting with the 1894-95 season, with one entry per team per season.
      ...
- A folder-level yml file, for this folder labelled stg_ncaa_basketball.yml. This is where we can apply configuration to individual models within this directory. So if we wanted one of the tables to sit within its own schema in BigQuery, we could use the following:
  version: 2
  models:
    - name: stg_mascots
      description: NCAA basketball team mascots
      config:
        schema: mascot_schema
- A model corresponding to each of the raw tables. For this example I’ve kept the models relatively simple and pulled the columns through exactly as they are in the source tables - some of the tables have over 150 columns, so I didn’t fancy typing out the 90 or so columns from each table I might actually use. Each of these staging models looks something like this:
with box_scores as (

    select * from {{ source('ncaa_basketball', 'mbb_games_sr') }}

)

select * from box_scores
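On that note, if we did want to cut those 150+ column tables down without typing every column name out, dbt-utils also has a star macro that generates the column list for us, minus anything we want to exclude - a rough sketch, with the excluded column names just examples:
with box_scores as (

    select
        {{ dbt_utils.star(from=source('ncaa_basketball', 'mbb_games_sr'), except=['venue_address', 'venue_zip']) }}
    from {{ source('ncaa_basketball', 'mbb_games_sr') }}

)

select * from box_scores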
Creating Insight
Now we have our staging tables sorted, we can look at how we can use dbt to create some transformations. The first example was a transformation to look at how many minutes of basketball each player played in each season.
Let’s dive straight in with the code:
{% set seasons = dbt_utils.get_column_values(ref('stg_players_games_sr'), 'season') %}
with player_summary as (
select
player_id,
full_name,
team_name,
{% for year in seasons | sort %}
sum(
case when season = {{ year }}
then minutes_int64
else 0
end
) as season_{{ year }}_minutes
{%- if not loop.last -%},{%- endif -%}
{%- endfor %}
from {{ ref('stg_players_games_sr') }}
group by 1,2,3
)
select * from player_summary
So what’s happening here?
- {% set seasons = dbt_utils.get_column_values(ref('stg_players_games_sr'), 'season') %} - using a dbt-utils macro, we can get all of the unique values within a column from one of our models; in this instance, all of the unique seasons from our stg_players_games_sr table.
- The player_summary CTE - in this CTE we’re using the season values from the seasons variable to loop through each season and create a case statement to sum the minutes played - essentially we’re just pivoting the data.
dbt-utils has a built-in pivot macro which I did try initially; however, it wouldn’t accept an integer value to pivot on. In this instance the season is an integer and the pivot macro was expecting a string in order to create categories. I’m sure there’d be a workaround by modifying the macro, but this method seemed to work just as well. The compiled SQL code (which you can inspect in the target/compiled folder after running dbt compile) looks like this:
with player_summary as (
select
player_id,
full_name,
team_name,
sum(
case when season = 2013
then minutes_int64
else 0
end
) as season_2013_minutes,
sum(
case when season = 2014
then minutes_int64
else 0
end
) as season_2014_minutes,
sum(
case when season = 2015
then minutes_int64
else 0
end
) as season_2015_minutes,
sum(
case when season = 2016
then minutes_int64
else 0
end
) as season_2016_minutes,
sum(
case when season = 2017
then minutes_int64
else 0
end
) as season_2017_minutes
from `nth-platform-338121`.`ncaa_basketball_staging`.`stg_players_games_sr`
group by 1,2,3
)
select * from player_summary
As we can see from the code, dbt has saved us a lot of time and effort in writing the multiple case statements. Not only that, but when we get another season in our database, we don’t have to remember to go back and modify the query - when we run our model again, the get_column_values macro will pick up the new season and generate the SQL with the updated number of seasons.
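So when next season’s games land in the raw data, picking them up is just a case of running the model again on whatever schedule we use - something along the lines of the command below, where player_season_minutes is only a stand-in for whatever this model file is actually called:
dbt run --select player_season_minutes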
For the second example, we’ll look at some rolling averages. In this instance we’ll calculate a 5-game rolling average and a 28-day rolling average for the two-point and three-point shot percentages. For this we’ll need BigQuery’s window functions, the syntax for which we’ll see in the query below. For these values we need to sum up both the number of two- and three-point shots made and the number attempted over the relevant window, then use these to calculate the average percentage over the window period.
{% set point_cols = ["two_points_made", "two_points_att", "three_points_made", "three_points_att"] %}
/* Step 1 */
with game_summary as (
select
name as team_name,
scheduled_date,
two_points_made,
two_points_att,
three_points_made,
three_points_att
from {{ ref('stg_box_scores_v2') }}
where minutes is not null
),
/* Step 2 */
five_game_summary as (
select
team_name,
scheduled_date,
{% for col in point_cols %}
sum({{ col }})
over (
partition by team_name
order by scheduled_date
rows between 4 preceding and current row
) as {{ col }}_sum,
sum({{ col }})
over (
partition by team_name
order by unix_date(scheduled_date)
range between 28 preceding and current row
) as {{ col }}_sum_28{% if not loop.last %},{% endif %}
{% endfor %}
from game_summary
)
/* Step 3 */
select
team_name,
scheduled_date,
    round(safe_divide(two_points_made_sum, two_points_att_sum) * 100, 2) as two_point_5_game_pct,
    round(safe_divide(three_points_made_sum, three_points_att_sum) * 100, 2) as three_point_pct_5_game_pct,
    round(safe_divide(two_points_made_sum_28, two_points_att_sum_28) * 100, 2) as two_point_28_day_pct,
    round(safe_divide(three_points_made_sum_28, three_points_att_sum_28) * 100, 2) as three_point_pct_28_day_pct
from five_game_summary
order by 1,2
- Step 1 - pull through the relevant columns from the staging table.
- Step 2 - iterate over the columns set at the top of the model and, for each column, sum the values over the relevant window. For the 5-game window we use the rows between 4 preceding and current row syntax (we know the previous 4 rows will contain match data as we stipulated minutes is not null when reading from the raw table). For the 28-day rolling window we use the range between 28 preceding and current row syntax, ordering by unix_date(scheduled_date) so the range is measured in days.
- Step 3 - use the calculated sums to produce rolling two-point and three-point scoring percentages (using safe_divide to avoid division-by-zero errors).
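As with the first model, it’s worth running dbt compile and eyeballing the SQL it generates before trusting the numbers - for the 5-game window, the loop expands each column into something along these lines (just one column shown here as an excerpt):
sum(two_points_made)
    over (
        partition by team_name
        order by scheduled_date
        rows between 4 preceding and current row
    ) as two_points_made_sum,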
And here’s what the output looks like:
team_name | scheduled_date | two_point_5_game_pct | three_point_pct_5_game_pct | two_point_28_day_pct | three_point_pct_28_day_pct
--- | --- | --- | --- | --- | ---
49ers | 2013-12-18 | 44.08 | 31.94 | 43.64 | 31.13
49ers | 2013-12-20 | 47.08 | 27.50 | 44.08 | 29.63
49ers | 2013-12-30 | 42.99 | 26.51 | 42.99 | 26.51
49ers | 2014-01-04 | 42.11 | 27.66 | 42.11 | 27.66
49ers | 2014-11-15 | 42.86 | 35.24 | 51.28 | 59.09
Conclusion
So there it is, the first post on dbt - a brief worked example of getting a project set up with dbtCloud and BigQuery. In the coming weeks’ posts we’ll look at creating some different models and introduce tests and documentation. Until then…