dbt, NCAA and Other Assorted Acronyms

Grant Beasley · March 27, 2022

So recently I’ve been working a lot with dbt, so I thought I’d go through the process of setting up a dbt project and dbtCloud account with Google BigQuery (I’m more familiar with Snowflake, however I’ve already used the free Snowflake credits for an individual account!).

What is dbt?

dbt describes itself as “a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation”. In a nutshell (cue the Austin Powers “in a nutshell” gif):

Austin Powers "Help! I'm in a Nutshell!"

dbt is a tool that lets us do the transform part of the ELT pipeline within our data warehouse, creating reusable, version-controlled analytics code through a combination of SQL and Jinja. I spent the first few months of my role as a data scientist wondering what it actually did, and it wasn’t until I got my hands dirty and properly got involved that I came to understand the power of the tool, especially when you throw dbtCloud into the mix as well. I’d recommend reading this for a better explanation.
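
To give a flavour of what that looks like in practice, a dbt model is just a SQL file with a sprinkling of Jinja. The model below is a minimal, made-up sketch (the model and column names are purely illustrative and not part of this project):

-- orders.sql: a hypothetical dbt model
-- {{ ref('stg_orders') }} points at another model in the project, which is how
-- dbt builds its dependency graph and works out the order to run things in
{{ config(materialized='view') }}

select
    order_id,
    customer_id,
    sum(amount) as total_amount
from {{ ref('stg_orders') }}
group by 1, 2

When we run dbt run, dbt compiles the Jinja, swaps the ref for the fully qualified table name and wraps the SQL in the DDL needed to build the view.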

Getting Started

For reference, all of the code can be found in my GitHub repo here.

The first few steps in setting up a dbtCloud project are fairly well documented in the dbt docs. All we do (after signing up for a free developer account on dbtCloud) is:

  1. Create a new project in dbtCloud.
  2. Add the credentials to our data warehouse. For this example I’m using BigQuery. The only trouble I had with this stage was trying to find how to export the JSON key for a dbt-user account within BigQuery - turns out it’s fairly simple if you’re familiar with Google Cloud Platform - however if you’re not it may take a little while to find…
  3. Link the dbtCloud project up with a git repo.

For more detailed instructions, just check out the docs.

And now we’re ready to start…

dbt_project.yml

After running dbt init in our dbt project directory, we’ll have the folder structure set up to start building our models. For this project I’ll be using the NCAA public dataset in BigQuery, which “contains data about NCAA Basketball games, teams, and players. Game data covers play-by-play and box scores back to 2009, as well as final scores back to 1996. Additional data about wins and losses goes back to the 1894-5 season in some teams’ cases.”

The dbt_project.yml file is where we’ll add our project structure and configurations at an overall project level. Part of this will involve applying different configs to different folders within our dbt project. Using this as a guide to structuring the project, we’ll start by looking at the different schemas we’ll create within our project.

models:
  ncaa_basketball:
    +materialized: view
    staging:
      +schema: staging
    marts:
      +materialized: table
      core:
        +schema: core
        intermediate:
          +schema: intermediate

So going over the yml code above:

  • ncaa_basketball is the overall name of our project - in BigQuery we have the hierarchy of a project (which we’d normally class as a database), which has multiple datasets (more commonly classed as schemas), and each dataset contains tables and views.
  • Although the default materialization in dbt is a view, we explicitly state that the default materialization for all models in our project will be +materialized: view. This can then be configured on a folder level further down.
  • To start with, our datasets will follow a simple structure. We’ll have one dataset/schema that contains all of our staging models - dbt will automatically generate the dataset name as <target_schema>_<custom_schema> within BigQuery (dbt’s default schema-naming macro is shown just after this list), so our staging dataset will be ncaa_basketball_staging. The staging models will be the only models in the whole project that directly reference the raw NCAA tables within BigQuery. All other models which sit on top of this layer will reference the models within the staging layer.
  • Above the staging layer sits the marts layer. These models will contain the tables which will subsequently be used for analysis and insight, and they’re configured to be materialized as tables. If we want quick query response times when providing data to the business, tables will be quicker to query than views, but the trade-off is the extra storage needed to hold the table.
  • Within the marts we will have our core schema consisting of our top level analysis and an intermediate schema which will contain some intermediate transformations between the staging tables and the analytical models.
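
As a reference for where the <target_schema>_<custom_schema> naming comes from, here is a lightly condensed copy of dbt’s built-in generate_schema_name macro (you only need to touch this if you want to override the default behaviour by adding your own version under macros/):

{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}

With the config above, the staging models pick up the staging custom schema and land in the ncaa_basketball_staging dataset.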

It’s also useful to add the dbt-utils package to our packages.yml file as it contains some useful macros (running dbt deps will install it).

packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.2

Staging Tables

The first models we need to make are our staging tables, so within our staging folder we need a few things:

  1. A source file which we’ll call src_ncaa_basketball.yml, a sample of which can be seen below. In this file we’re telling our project where to look for the raw dataset, adding descriptions about each source for documentation, listing each table we’ll be reading from, and then adding the tests we want to run on each of our sources (I’ll cover testing and source freshness in a separate post).

version: 2
sources:
  - name: ncaa_basketball
    database: bigquery-public-data
    description: NCAA basketball data from BigQuery public datasets
    schema: ncaa_basketball
    tables:
      - name: mascots
        description: NCAA team mascot data
        columns:
          - name: id
            description: Primary key for mascots
            tests:
              - unique
              - not_null
      - name: mbb_games_sr
        description: Team level box scores
        columns:
          - name: game_id
            description: Primary key for each game
            tests:
              - unique
              - not_null
      - name: mbb_historical_teams_games
        description: Final scores for men's basketball games, starting with the 1996-97 season. Each game is included twice, with one entry per team.
      - name: mbb_historical_teams_seasons
        description: Season record information for Men's Basketball, starting with the 1894-95 season. Each game is included twice, with one entry per team.
...
  2. A folder level yml file, for this folder it’s labelled stg_ncaa_basketball.yml. This is where we can apply configuration to individual models within this directory. So if we wanted one of the tables to sit within its own schema in BigQuery we could use the following:
    version: 2
    models:
      - name: stg_mascots
        description: NCAA basketball team mascots
        config:
          schema: mascot_schema
    
  3. A model corresponding to each of the raw tables. For this example I’ve kept the models relatively simple and pulled them in exactly as they are in the source tables. Some of the tables have over 150 columns, so I didn’t fancy typing out the 90 columns from each table I might actually use. Each of these staging models looked something like this:
    with box_scores as (
        select * from {{ source('ncaa_basketball', 'mbb_games_sr') }}
    )

    select * from box_scores
    

Creating Insight

Now we have our staging tables sorted, we can look at how we can use dbt to create some transformations. The first example was a transformation to look at how many minutes of basketball each player played in each season.

Let’s dive straight in with the code:

{% set seasons = dbt_utils.get_column_values(ref('stg_players_games_sr'), 'season') %}

with player_summary as (
    select
        player_id,
        full_name,
        team_name,
        {% for year in seasons | sort %}
            sum(
                case when season = {{ year }}
                  then minutes_int64
                else 0
                end
            ) as season_{{ year }}_minutes
            {%- if not loop.last -%},{%- endif -%}
        {%- endfor %}
        from {{ ref('stg_players_games_sr') }}
        group by 1,2,3
)

select * from player_summary

So what’s happening here?

  1. {% set seasons = dbt_utils.get_column_values(ref('stg_players_games_sr'), 'season') %} - using a dbt-utils macro, we can get all of the unique values within a column from one of our models. In this instance all of the unique seasons from our stg_players_games_sr table.
  2. The player_summary CTE - in this CTE, we’re using the season values from the seasons variable to loop through each season and create a case statement to sum the minutes played - essentially we’re just pivoting the data.

dbt-utils has a built-in pivot macro which I did try initially, however it wouldn’t accept an integer value to pivot on. In this instance the season is an integer, and the pivot macro was expecting a string in order to create the categories. I’m sure there’d be a workaround by modifying the macro, but this method seemed to work just as well (a sketch of the pivot attempt is shown after the compiled SQL below). The compiled SQL code looks like this:

with player_summary as (
    select
        player_id,
        full_name,
        team_name,

            sum(
                case when season = 2013
                  then minutes_int64
                else 0
                end
            ) as season_2013_minutes,
            sum(
                case when season = 2014
                  then minutes_int64
                else 0
                end
            ) as season_2014_minutes,
            sum(
                case when season = 2015
                  then minutes_int64
                else 0
                end
            ) as season_2015_minutes,
            sum(
                case when season = 2016
                  then minutes_int64
                else 0
                end
            ) as season_2016_minutes,
            sum(
                case when season = 2017
                  then minutes_int64
                else 0
                end
            ) as season_2017_minutes
        from `nth-platform-338121`.`ncaa_basketball_staging`.`stg_players_games_sr`
        group by 1,2,3
)

select * from player_summary

As we can see from the code, dbt has saved us the time and effort of writing out multiple case statements. Not only that, but when another season appears in our database we don’t have to remember to go back and modify the query. When we run our model again, the get_column_values macro will pick up the new season and generate the SQL with the updated set of columns.
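
For reference, the pivot-based version would have looked something like the sketch below (the arguments follow the dbt_utils.pivot signature, and the prefix/suffix values are only chosen to mirror the column names above). Because the macro wraps each pivot value in quotes, the comparison compiles to season = '2013' against an integer column, which is where it fell over for this dataset.

-- A sketch of the dbt_utils.pivot approach rather than the hand-rolled loop;
-- the macro quotes each value of the pivot column as a string, so it doesn't
-- play nicely with an integer season column
select
    player_id,
    full_name,
    team_name,
    {{ dbt_utils.pivot(
        'season',
        dbt_utils.get_column_values(ref('stg_players_games_sr'), 'season'),
        agg='sum',
        then_value='minutes_int64',
        prefix='season_',
        suffix='_minutes'
    ) }}
from {{ ref('stg_players_games_sr') }}
group by 1, 2, 3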

For the second example, we’ll look at some rolling averages. In this instance we’ll look at a 5 game rolling average and a 28 day rolling average for the two point and three point shot percentages. For this we’ll need BigQuery’s window functions, the syntax for which is in the query below. For these values we need to sum both the number of two and three point shots made and the number attempted over the relevant window, and then use these sums to calculate the percentage over the window period.

{% set point_cols = ["two_points_made", "two_points_att", "three_points_made", "three_points_att"] %}

/* Step 1 */
with game_summary as (
    select
        name as team_name,
        scheduled_date,
        two_points_made,
        two_points_att,
        three_points_made,
        three_points_att
    from {{ ref('stg_box_scores_v2') }}
    where minutes is not null
),

/* Step 2 */
five_game_summary as (
    select
        team_name,
        scheduled_date,
        {% for col in point_cols %}
          sum({{ col }})
          over (
              partition by team_name
              order by scheduled_date
              rows between 4 preceding and current row
          ) as {{ col }}_sum,
          sum({{ col }})
          over (
              partition by team_name
              order by unix_date(scheduled_date)
              range between 28 preceding and current row
          ) as {{ col }}_sum_28{% if not loop.last %},{% endif %}
        {% endfor %}

    from game_summary
)

/* Step 3 */
select
  team_name,
  scheduled_date,
  round(safe_divide(two_points_made_sum, two_points_att_sum) * 100, 2) as two_point_5_game_pct,
  round(safe_divide(three_points_made_sum, three_points_att_sum) * 100, 2) as three_point_pct_5_game_pct,
  round(safe_divide(two_points_made_sum_28, two_points_att_sum_28) * 100, 2) as two_point_28_day_pct,
  round(safe_divide(three_points_made_sum_28, three_points_att_sum_28) * 100, 2) as three_point_pct_28_day_pct
from five_game_summary
order by 1,2
  • Step 1 - pull through the relevant columns from the staging tables.
  • Step 2 - Iterate over the columns set at the top of the model and, for each column, sum the value over the relevant window. For the 5 game window we use the rows between 4 preceding and current row syntax (we know the previous 4 rows will contain match data as we stipulated minutes is not null from the raw table). For the 28 day rolling window we use the range between 28 preceding and current row syntax. A trimmed example of what this compiles to is shown just after this list.
  • Step 3 - use the calculated sums to produce a rolling average two and three point scoring percentage (using safe_divide to avoid division by zero errors).
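
To make the difference between the two window types concrete, here’s a trimmed sketch of what the loop compiles to for a single column (two_points_made); the real compiled model repeats this for all four columns in point_cols:

select
    team_name,
    scheduled_date,
    -- last 5 games, counted in rows
    sum(two_points_made) over (
        partition by team_name
        order by scheduled_date
        rows between 4 preceding and current row
    ) as two_points_made_sum,
    -- last 28 days, counted in days since the epoch
    sum(two_points_made) over (
        partition by team_name
        order by unix_date(scheduled_date)
        range between 28 preceding and current row
    ) as two_points_made_sum_28
from game_summary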

And here’s what the output looks like:

team_name  scheduled_date  two_point_5_game_pct  three_point_pct_5_game_pct  two_point_28_day_pct  three_point_pct_28_day_pct
49ers      2013-12-18      44.08                 31.94                       43.64                 31.13
49ers      2013-12-20      47.08                 27.50                       44.08                 29.63
49ers      2013-12-30      42.99                 26.51                       42.99                 26.51
49ers      2014-01-04      42.11                 27.66                       42.11                 27.66
49ers      2014-11-15      42.86                 35.24                       51.28                 59.09

Conclusion

So there it is, the first post on dbt: a brief worked example of getting a project set up with dbtCloud and BigQuery. In the coming weeks’ posts we’ll look at creating some different models and introduce tests and documentation. Until then…
