
050: ELT with DBT

Data Build Tool (DBT) lets you easily create and manage ELT pipelines that automatically build and populate a target analytical model.


The weekly mood

Thanksgiving is over and I wish I could integrate an Advent calendar into my blog, but I should have had that idea earlier. So let's put it off to next year... perhaps.

Talking about ideas, I am still puzzled about how to help my organisation become data-driven. Is it OK to struggle with a basic Data Warehouse implementation, or is it a shame? As a matter of fact, we need to do something, and we need to do it now. See also: Building An Analytics-Centric Organisation.

Today we'll have a look at a tool that is quite popular for populating dimensional data models: DBT.


What is DBT

Data Build Tool (DBT) is an open-source CLI written in Python that is used to manage, document, test, compile, package and run compute queries on a Data Warehouse (DW) or a Data Lake (DL). DBT is developed by a Philadelphia-based startup called Fishtown Analytics, with revenue coming from solid investors, consulting services and a managed service offering called DBT Cloud, which centralizes all operations of the SQL pipeline life-cycle (e.g. management, scheduling and monitoring).



DBT generally relies on a third-party data loading tool (such as Fivetran or StitchData) to bring the data in, although it can also ingest CSV files (so-called "Seeds"). Its descriptive language compiles into SQL queries that can leverage specific features and functions of the underlying supported systems, as well as user-defined functions (UDF) via Jinja macros.
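As a small illustration of what such a Jinja macro could look like, a reusable expression can be defined under the macros folder and then called from any model. This is only a sketch; the macro name and columns are hypothetical and not part of any standard package:

-- macros/cents_to_dollars.sql (hypothetical example macro)
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100)::numeric(16, 2)
{% endmacro %}

-- usage inside a model, with a hypothetical column:
-- select {{ cents_to_dollars('amount_cents') }} as amount from payments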

In my opinion, one of the strengths of DBT is the ability to share modelling assets on the central repository https://hub.getdbt.com. Especially for mid-sized businesses which might not be able to on-board new data sources such as Salesforce and Google Ads into their DW every month, what could be more convenient than just downloading a re-usable package including the definition of the operational source, a certified analytical target and the corresponding transformation?

To some extent, DBT could become a reference tool and a one-stop shop, just like Docker did 6 years ago. Although DBT currently seems to be missing private repository or marketplace features on the hub, I actually expect these to be added in the near future. Last but not least, the DBT reference documentation is of very good quality, and there are a number of shared best practices worth reading, like for example this article written with passion by the creators, or this usage guideline and cheat sheet from the GitLab development team.


Hands-on

To start with, let's set up the DBT client using the Python package manager:
$ python3 -m venv ~/.venv/dbt-env
$ source ~/.venv/dbt-env/bin/activate
$ pip install dbt
You will find additional steps per operating system here. This is what made the installation work on my Ubuntu 20 machine:
$ sudo apt-get install git libpq-dev python-dev python3-pip python3-venv libicu-dev
$ sudo apt-get remove python-cffi
$ pip install --upgrade cffi
$ pip install cryptography==1.7.2
Once this is done, we can check the DBT version and the out-of-the-box plugins.

$ dbt --version
installed version: 0.18.1
   latest version: 0.18.1

Up to date!

Plugins:
  - snowflake: 0.18.1
  - postgres: 0.18.1
  - redshift: 0.18.1
  - bigquery: 0.18.1

Visit this page in case you are missing an adapter plugin, and this one in case you consider creating your own connector.


Project

Let's locally create a new DBT project as part of a cloned GitHub repo.

$ cd <myGitHubRepos>/data-eng
$ dbt init myProject
Running with dbt=0.18.1
Creating dbt configuration folder at /home/tncad/.dbt
With sample profiles.yml for redshift

Your new dbt project "myProject" was created! If this is your first time
using dbt, you'll need to set up your profiles.yml file (we've created a
sample file for you to connect to redshift) -- this file will tell dbt
how to connect to your database. You can find this file by running:

  xdg-open /home/tncad/.dbt
...
$ mv myProject 050-dbt

As we can see, a project skeleton has been created under our repo, as well as a profiles.yml under our user home folder. Also note that I've renamed the project folder since I don't plan to have multiple projects here.


Profile

The profile is a YAML file which allows you to configure your database connection settings, e.g.

<profile-name>:
  target: <target-name>
  outputs:
    <target-name>:
      type: <bigquery | postgres | redshift | snowflake | other>
      schema: <schema_identifier>
      threads: <natural_number>

Multiple "profile" sections can be defined as part of the file; the one in use is referenced from the dbt_project.yml file (e.g. default). Profiles are typically segregated by source or by project.
A profile can define multiple "target" sections, which should be used to distinguish environments (e.g. dev, test, prod); the one in use is referenced by the target property.
In order to avoid messing up production, it is actually recommended to use a different user, or a different version of the profile, referenced via the DBT_PROFILES_DIR environment variable.
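For example, assuming a target named prod is defined in the profile, a production run could then be selected explicitly on the command line:

$ dbt run --target prod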
 
The default profile settings are for Redshift, so the first thing you need to do is go to the list of supported databases, copy the snippet dedicated to your system (in my case Snowflake), and adjust the connection settings. Since credentials need to be defined in clear text inside the profile, you might alternatively prefer to load them from environment variables, using the following Jinja macro expressions:

user: "{{ env_var('DBT_USER') }}"
pass: "{{ env_var('DBT_PASSWORD') }}"
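For reference, a Snowflake profile filled in this way might look roughly as follows; this is a sketch reusing the object names of my test setup, to be adjusted to your account:

my_snowflake_profile:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: tncad
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: testrole
      database: testdb
      warehouse: testwh
      schema: testsc
      threads: 4
      client_session_keep_alive: False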

Here is how to test the connection:

$ dbt debug
...
Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  account: tncad
  user: me@tncad.org
  database: testdb
  schema: testsc
  warehouse: testwh
  role: testrole
  client_session_keep_alive: False
  Connection test: OK connection ok

Now that the connection works, we are good to go with the central piece of DBT which is modeling.


Models

A DBT model is a collection of SQL files defining the creation and transformation of physical objects such as views and tables. Each SQL file should contain a SELECT statement, which DBT injects into a CREATE VIEW AS statement by default, so that a new database view is created based on the selected columns. Note that each SQL file produces exactly one object inside the schema defined in the profile, named after the file (without extension).

select <columns_to_create> from <sql_input>
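To make this concrete, a minimal model file could look like the sketch below; the file name and columns are hypothetical, and a real project would typically declare the input table as a DBT source rather than hard-coding it:

-- models/my_first_view.sql (illustrative sketch)
select
    _id,
    msg,
    ts as created
from testdb.public.message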

You can overwrite the default model configuration (of creating views in the current schema) from the dbt_project.yml settings...

models:
  my_new_project:
    +materialized: table
    +schema: testsc2
 
...or using another Jinja macro expression as part of the SQL file.

{{ config(
    materialized="table",
    schema="testsc2"
) }}
Note that a WITH clause is recommended for table materializations, so that rows can be created along with the table object.

with <alias> as (
  select <columns_to_use> from <objects_to_use>
)
select <columns_to_create> from <alias>

Or, in case you just want to create a table with no data, you would do:

with <alias> as (
  select null as <col1>, null as <col2>, ...
)
select * from <alias> where <col1> is not null

As long as you already have some input data for creating models, the queries will typically return correct column types, so you don't need to care about them within the model definition. Otherwise, you need to specify them by casting columns, for example through one of the syntaxes below:

with <alias> as (
  select
    to_number(null)  as id,
    null::timestamp  as created
)
select * from <alias> where id is not null

However, I didn't find any way to create integrity constraints. Interestingly, in my particular project I see the responsibility of provisioning and populating staging tables on the ingestion side (Talend Studio), not on the modelling side (DBT). In that context, DBT would just re-use a pre-defined source schema and its data types for creating the target schema. Still, some significant effort is required for creating the models, since DBT does not come with a feature to conveniently auto-import the source schema.

Other forms of materializations exist, such as incremental (models that only insert or update records changed since the last time dbt was run) and ephemeral (models that are not persisted in the database, but interpolated by dbt into the models that depend on them).
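As a sketch of the incremental case, the is_incremental() Jinja block lets you restrict the query to new rows on subsequent runs; the ref() function used here to reference another model is covered further below, and the column names are hypothetical:

{{ config(materialized='incremental', unique_key='_id') }}

select _id, msg, created
from {{ ref('stg_message') }}
{% if is_incremental() %}
  -- only pick up rows newer than what is already in the target table
  where created > (select max(created) from {{ this }})
{% endif %}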

In addition to the physical model, you can define logical properties inside further YAML files placed under the models folder. Properties may apply at different levels of the definition (an example follows after the list):
  • name should match a SQL file name
  • description ...
  • docs controls the visibility of the model when auto-generating documentation (see further below)
  • meta holds information such as owner, model_maturity, contains_pii
  • quote controls whether or not column names should be quoted
  • tests define integrity validations, e.g. not_null, unique, accepted_values, relationships, severity
  • tags are user-defined values which can be used as selectors in commands, e.g. --model tag:pii
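For illustration, a minimal properties file combining some of these levels might look like this sketch (descriptions and meta values are made up):

# models/properties.yml (illustrative sketch)
version: 2

models:
  - name: stg_message
    description: Staging table holding raw Rocketchat messages
    meta:
      owner: data-eng
      contains_pii: true
    columns:
      - name: _id
        description: Message identifier
        tests:
          - not_null
          - unique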
As you can imagine, you simply create as many property files as you have tables. Once you are ready with your model, you can run and test it.

$ dbt run
$ dbt test

This basically creates your target schema, then, if applicable, provisions it with data and executes the integrity tests. In case you need to troubleshoot, you can find the generated SQL statements within the target folder, and the logs within the logs folder.
 
There are a number of other interesting features and commands:
  • Analyses are custom SQL files which do not fit in the models folder, such as queries run for purposes other than model creation and transformation.
  • Compile is a command for exporting a "model" or an "analysis" to an executable format.
  • Snapshots is a command for implementing Slowly Changing Dimensions type 2 (SCD2); a sketch follows below.
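A snapshot definition lives under the snapshots folder; a minimal timestamp-based sketch could look like this (the unique key and updated_at column are hypothetical):

-- snapshots/message_snapshot.sql (illustrative sketch)
{% snapshot message_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

select * from {{ ref('stg_message') }}

{% endsnapshot %}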

Use-case

In a previous post, we used Talend Studio's ETL capability to extract data from the source application Rocketchat, running on a MongoDB database, and to load this data into a Snowflake staging table MESSAGE. Then we used Talend Studio's ELT capability to populate a pre-defined Star schema. Let us now try to achieve this last exercise using DBT.

To start with, it is a good idea to create one folder per stage underneath your models folder, e.g. 

$ tree models
models
├── properties.yml
├── staging
│   └── stg_message.sql
├── starschema
│   ├── dim_time.sql
│   ├── dim_user.sql
│   └── fact_message.sql
dbt_project.yml

$ tail -5 dbt_project.yml
models:
  my_new_project:
      # Applies to all files under models/example/
      example:
          materialized: view

$ cat >> dbt_project.yml << EOF
      materialized: table
      staging:
          +pre-hook: "create schema if not exists {{ this }}"
          +schema: staging
      starschema:
          +pre-hook: "create schema if not exists {{ this }}"
          +schema: starschema
          +post-hook: "grant select on {{ this }} to group reporter"
EOF

Note that the schema specialization extends the schema name defined in the profile, and that each model, i.e. each SQL file, must have a unique name. This allows you to reference a model independently of its schema. You can still achieve identical table names in different schemas (e.g. staging.message and starschema.message) using the alias model configuration, for example via a Jinja macro expression, since it typically applies to one model only.

{{ config(alias='testsc_starschema.message') }}

Now, since the Star schema builds on the Staging tables, any SQL file in the starschema folder is likely to reference another SQL file from the staging folder, which can again be achieved using a Jinja macro expression:

SELECT COUNT(_id) from {{ ref('stg_message') }}
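As an example, a dimension model in the starschema folder could be sketched as follows; the column names are hypothetical and depend on the message structure actually loaded into staging:

-- models/starschema/dim_user.sql (illustrative sketch)
with src as (
    select distinct u_id, u_username
    from {{ ref('stg_message') }}
)
select
    u_id       as user_id,
    u_username as user_name
from src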

It looks like I found a bug with the schema configuration: DBT creates the custom schemas but fails at creating the tables.

1 of 4 START table model testsc_staging.stg_message.... [RUN]
1 of 4 ERROR creating table model testsc_staging.stg_message [ERROR in 3.22s]
2 of 4 SKIP relation testsc_starschema.dim_time........ [SKIP]
3 of 4 SKIP relation testsc_starschema.dim_user........ [SKIP]
4 of 4 SKIP relation testsc_starschema.fact_message.... [SKIP]

Finished running 4 table models in 22.25s.

Completed with 1 error and 0 warnings:

Database Error in model stg_message (models/staging/stg_message.sql)
  002003 (42S02): SQL compilation error:
  Table 'TESTDB.TESTSC_STAGING.STG_MESSAGE' does not exist or not authorized.
  compiled SQL at target/compiled/my_new_project/models/staging/stg_message.sql
It works as expected when running the queries by hand, or when skipping the schema configuration:

1 of 4 START table model testsc.stg_message...... [RUN]
1 of 4 OK created table model testsc.stg_message. [SUCCESS 1 in 4.16s]
2 of 4 START table model testsc.dim_time......... [RUN]
2 of 4 OK created table model testsc.dim_time.... [SUCCESS 1 in 4.60s]
3 of 4 START table model testsc.dim_user......... [RUN]
3 of 4 OK created table model testsc.dim_user.... [SUCCESS 1 in 4.56s]
4 of 4 START table model testsc.fact_message..... [RUN]
4 of 4 OK created table model testsc.fact_message [SUCCESS 1 in 4.59s]


Documentation

Now, if we want to make this development, including the descriptions, transparent to our stakeholders in the form of a reference documentation, we run:

$ dbt docs generate

The easiest way to consume the documentation is to start the dbt web server locally on your machine.

$ dbt docs serve


You can obviously upload the content to an AWS S3 bucket configured to serve static web pages, or deploy the dbt docs server somewhere else, for example as a container.
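For instance, the generated static site could be pushed to such a bucket with the AWS CLI; this is a sketch assuming an existing bucket configured for static website hosting, and the bucket name is hypothetical:

$ dbt docs generate
$ aws s3 sync target/ s3://my-dbt-docs-bucket/ \
    --exclude "*" \
    --include "index.html" --include "manifest.json" --include "catalog.json"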

A very cool feature of the documentation is the green button in the bottom-right corner of the screen, which visually represents your overall data lineage (at table level). Clicking on a table highlights its paths, and you can even move the nodes around. Also, the filters in the bottom toolbar can be handy when your schema includes a lot of tables.



Packaging

Another capability of DBT is the concept of packages. At a high level, DBT packages are collections of assets in a format that can be shared for re-use. For example, the dbt_utils dependency comes with a number of commonly used macros that you can import locally as follows:

$ cat << EOF > packages.yml
packages:
  - package: fishtown-analytics/dbt_utils
    version: 0.6.2
EOF
$ dbt deps
Installing fishtown-analytics/dbt_utils@0.6.2
  Installed from version 0.6.2

This creates a folder dbt_modules containing the fetched assets and their basic dependencies from the central module repository. The same process can be applied to fetch more elaborate dependencies, for example pre-defined models (we'll come to this in the next section).
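Once installed, the package's macros can be called from any model with the package name as prefix. For example, assuming this version of dbt_utils ships the surrogate_key macro (and with hypothetical columns), a key for the fact table could be derived like this:

select
    {{ dbt_utils.surrogate_key(['u_id', 'created']) }} as message_key,
    *
from {{ ref('stg_message') }}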

DBT is fully text-based, so you can obviously commit your changes to the source repository of your choice, for example using a Git client or an IDE. Moreover, you can delegate DBT commands to an automation server agent for Continuous Integration (CI), as demonstrated in this video around the example of a Snowflake cost usage analysis deployed and operated using GitHub Actions.
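As a rough sketch of what such a CI setup could look like with GitHub Actions (the workflow name, secrets and the committed profiles folder are my own assumptions, not taken from the video):

# .github/workflows/dbt.yml (illustrative sketch)
name: dbt-ci
on: [push]
jobs:
  dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - run: pip install dbt==0.18.1
      - run: dbt deps && dbt run && dbt test
        env:
          # hypothetical: a CI-specific profiles.yml committed under ./ci
          DBT_PROFILES_DIR: ./ci
          DBT_USER: ${{ secrets.DBT_USER }}
          DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}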


Take-away

Pros: Data Operations
In its free and open-source version, the DBT CLI is an awesome command-line tool which perfectly supports pipeline automation, sharing and re-usability.
In its commercial offering, DBT Cloud is a managed service with a nice web UI, as well as additional governance and operations features. However, I must say that I haven't taken the time to evaluate it so far.

Cons: Design and Mapping
Compared to the approach of Data Integration tools, DBT lacks schema fetch-and-propagate capabilities, as well as interactive syntax checking and SQL preview features. Compared to imperative approaches to Database Change Management (DCM) such as Flyway and Snowchange, it lacks support for a metadata store and schema change evolution.

