The Grooviest Prog Band with PySpark and Spotify - Part 1

I’ve recently started working with Databricks and PySpark, and it’s been quite an interesting journey familiarising myself with some of the benefits and challenges of working with distributed data. To get acquainted with the syntax and some of its idiosyncrasies, let’s dive in and start working with the Spotify API…

Read More

Causal Inference and the dreaded OVB

I’ve recently been working through an excellent resource on causal inference from Matheus Facure. Machine learning models are great at predicting things (well, in some cases…*cough* Zillow *cough*). You throw some data at them, make some predictions and try to find the model which gives you the best accuracy (or whatever measure it is you’re trying to optimize). Some models can explain why they make their predictions; others are black boxes which can be incredibly difficult to interpret, which in turn risks a lack of trust in the predictions. However, consider a machine learning model trained on airline prices…it would see that ticket sales are relatively low when prices are low (during term time, for example) and that ticket sales go up as prices go up (during school holidays). A naive model may suggest increasing prices to increase sales…
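To make the airline example concrete, here is a minimal simulation sketch. All of the numbers are made up for illustration (the holiday share, the price uplift, the true price effect of -0.5), and `simulate_ticket_sales` and `ols` are hypothetical helper names, not anything from the resource mentioned above. School holidays push up both price and demand, so a regression that leaves the holiday indicator out picks up a positive price slope even though the simulated effect of price is negative.

```python
import numpy as np


def simulate_ticket_sales(n=5000, seed=0):
    """Simulate prices and sales where school holidays drive both (assumed numbers)."""
    rng = np.random.default_rng(seed)
    # 1.0 during school holidays, 0.0 during term time
    holiday = rng.binomial(1, 0.3, size=n).astype(float)
    # Prices are higher during holidays
    price = 100 + 50 * holiday + rng.normal(0, 10, size=n)
    # True price effect is -0.5, but holidays add a lot of demand
    sales = 200 - 0.5 * price + 80 * holiday + rng.normal(0, 5, size=n)
    return holiday, price, sales


def ols(y, *regressors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


holiday, price, sales = simulate_ticket_sales()

# Omitting the holiday indicator: the price slope absorbs the holiday effect
naive_slope = ols(sales, price)[1]

# Controlling for the holiday indicator: the slope lands near the simulated -0.5
adjusted_slope = ols(sales, price, holiday)[1]

print(f"price slope without holiday control: {naive_slope:.2f}")
print(f"price slope with holiday control:    {adjusted_slope:.2f}")
```

The sign flip between the two slopes is the omitted variable bias in action: the naive fit is answering "what do sales look like when prices are high?" rather than "what happens to sales if we raise prices?".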

Read More

dbt, NCAA and Other Assorted Acronyms

So recently I’ve been working a lot with dbt, so I thought I’d go through the process of setting up a dbt project and a dbt Cloud account with Google BigQuery (I’m more familiar with Snowflake, but I’ve already used up the free Snowflake credits for an individual account!).

Read More