dbt package for monitoring of dbt run, test, sources and models

techindicium/elementary-dbt-monitoring

dbt Monitoring

This repository is in an experimental stage. It is NOT ready for production yet.

Changes are being made!

This package lets you monitor the quality, dependencies, volume, schema, and freshness of the data in your dbt project, providing information that helps you improve your data pipeline.

🏃 Quickstart

New to dbt packages? Read more about them here.

Before creating a branch

Before you start, decide whether your change to this repository is a release (breaking changes), a feature (new functionality), or a patch (bug fixes). Then name your branch accordingly:

  • release/<branch-name>
  • feature/<branch-name>
  • patch/<branch-name>
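
The naming convention above can be checked before a branch is created. The following is a minimal sketch; the function name is_valid_branch_name is hypothetical and not part of this repository.

```shell
# Hypothetical helper: returns 0 when a branch name follows the
# release/, feature/ or patch/ convention, and 1 otherwise.
is_valid_branch_name() {
  case "$1" in
    # "?*" requires at least one character after the prefix.
    release/?*|feature/?*|patch/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage: create the branch only if its name is valid.
# is_valid_branch_name "feature/add-model" && git checkout -b "feature/add-model"
```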

Requirements

dbt version

  • dbt version >= 1.0.0

dbt_utils package. Read more about it here.

  • dbt-labs/dbt_utils version: >=0.9.0 and <1.2.0

elementary package. Read more about it here.

  • elementary-data/elementary version: 0.7.1

Installing the elementary package and creating the first tables for dbt monitoring

Installing the elementary package

  1. Include this package in your packages.yml file.
packages:
  - package: elementary-data/elementary
    version: 0.7.1
  2. Run dbt deps to install the package.

Configuring the elementary package models

  1. The package's models can be configured in your dbt_project.yml by specifying the package under models.
models:
    elementary:
        +schema: 'elementary'
  2. Run dbt run -m elementary to build the package inside your dbt project.

This command creates tables that are empty at first. They are then populated with the results of each dbt run, dbt test, and dbt build executed in the project.

Installing the elementary CLI

To generate reports from the data collected by the elementary package, install the Elementary monitoring module (the CLI). Run the pip command that matches your data platform:

pip install 'elementary-data[snowflake]'
pip install 'elementary-data[bigquery]'
pip install 'elementary-data[redshift]'
pip install 'elementary-data[databricks]'

To connect, Elementary needs a connection profile in a file named profiles.yml. The CLI uses this profile to connect to the data warehouse and find the dbt package tables.

The easiest way to generate the profile is to run the following command within the dbt project where you deployed the elementary dbt package:

dbt run-operation elementary.generate_elementary_cli_profile

Copy the output, fill in the missing fields and add the profile to your profiles.yml.

Profile name: elementary
Schema name: The schema of elementary models, default is <your_dbt_project_schema>_elementary

Installing the elementary-dbt-monitoring package and configuring its models

Installing the elementary-dbt-monitoring package

  1. Include this package in your packages.yml file and specify the version you want to install.
packages:
  - git: https://github.com/techindicium/elementary-dbt-monitoring # insert git SSH URL
    # revision: v0.1.0 (example, if a specific version is needed)
  2. Run dbt deps to install the package.

Configuring the package models

The package's models can be configured in your dbt_project.yml by specifying the package under models and setting the start date of the dbt monitoring data under vars.

models:
    elementary_dbt_monitoring:
        staging:
            materialized: ephemeral
        marts:
            materialized: table

...

vars:
    elementary_dbt_monitoring:
        dbt_monitoring_start_date: cast('2022-08-01' as date)

To ensure the package runs correctly, you must declare an environment variable named ELEMENTARY_SOURCE_SCHEMA. This variable allows dbt to locate the source tables that feed all the models. The schema you define here must match the schema where the tables are being created by Elementary.

By setting this as an environment variable, you gain flexibility to adjust the schema as environments change, such as when switching between development, QA, or production environments. This setup ensures that dbt can always find the correct source tables, regardless of which environment you're working in.

Imagine that the Elementary package is creating the tables in the elementary schema, as we recommended above. You need to set the environment variable ELEMENTARY_SOURCE_SCHEMA to elementary so that dbt knows where to find the source tables for the models.

In Bash, you can set it like this:

export ELEMENTARY_SOURCE_SCHEMA="elementary"

Or in CMD (Windows):

set ELEMENTARY_SOURCE_SCHEMA=elementary
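
Because the models depend on this variable, it can help to fail fast when it is missing. Below is a minimal Bash sketch; check_source_schema is a hypothetical helper name, and the dbt command in the usage comment assumes the package is selected by the name elementary_dbt_monitoring used in dbt_project.yml above.

```shell
# Hypothetical helper: stop early if ELEMENTARY_SOURCE_SCHEMA is unset or empty.
check_source_schema() {
  if [ -z "${ELEMENTARY_SOURCE_SCHEMA:-}" ]; then
    echo "ELEMENTARY_SOURCE_SCHEMA is not set" >&2
    return 1
  fi
  echo "Using source schema: ${ELEMENTARY_SOURCE_SCHEMA}"
}

# Example usage in a job script (assumed selector name):
# check_source_schema && dbt run -m elementary_dbt_monitoring
```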

Deduplication of staging models

We identified an issue involving duplicate IDs in some staging models. These staging models feed into dimension models, and the duplicates were causing inconsistencies in the data pipeline.

The duplication of IDs in the staging models was propagating to both dimension and fact models, leading to potential inaccuracies in downstream processes. These duplications can negatively impact various use cases of this package, such as data analysis, reporting, and dashboards in data visualization tools, resulting in misleading insights.

In the latest update of this package, we therefore added a step that deduplicates these models. The deduplication is based on the ID and the generated_at timestamp, keeping only the most recent row for each ID. The following SQL criterion was applied:

qualify row_number() over (
    partition by model_id -- source_id or test_id
    order by generated_at desc
) = 1

Recommendations

We strongly recommend that you run this package separately from your production jobs. This prevents package-related issues from affecting them.

One issue we have observed with the package is related to its installation with the dbt deps command.

A possible solution is to install the package only when the job that runs it is triggered. This way, even with the package separated from your production jobs, the dbt deps command that installs all the packages in your project does not break anything in production.

One way to do this is to keep packages.yml free of elementary_dbt_monitoring and add a second YAML file to your dbt project folder, for example packages_monitoring.yml. The original packages.yml can then look like this:

packages:
  - package: dbt-labs/codegen
    version: 0.12.1
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  ## Docs: https://docs.elementary-data.com
  - package: elementary-data/elementary
    version: 0.14.1
  - package: calogica/dbt_expectations
    version: 0.10.4

The monitoring package yaml can be like this:

  # Elementary dbt Monitoring
  - git: https://github.com/techindicium/elementary-dbt-monitoring
    revision: v2.1.0

And in your monitoring job you can run the following bash command before the dbt deps:

cat packages_monitoring.yml >> packages.yml
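
A plain append adds the fragment again if the job runs twice on the same checkout, and it can produce invalid YAML when packages.yml does not end with a newline. The sketch below guards against both cases; append_monitoring_package is a hypothetical helper name, and the file names are the ones assumed in this section.

```shell
# Hypothetical helper: append the monitoring fragment to packages.yml
# only once, adding a trailing newline first when one is missing.
append_monitoring_package() {
  base="${1:-packages.yml}"
  extra="${2:-packages_monitoring.yml}"
  # Skip if the monitoring repository is already listed.
  if grep -q "elementary-dbt-monitoring" "$base"; then
    return 0
  fi
  # tail -c 1 prints the last byte; the result is empty when it is a newline.
  if [ -n "$(tail -c 1 "$base")" ]; then
    printf '\n' >> "$base"
  fi
  cat "$extra" >> "$base"
}

# Example usage in the monitoring job:
# append_monitoring_package && dbt deps
```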

New releases

Want a new release (major/minor/patch)?

  1. Push your modifications to main.
  2. Create the tag you want, for example: git tag v1.0.1
  3. Push the tag with git push origin tag v1.0.1, or with git push --tags (warning: this pushes all of your local tags).
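
The tag examples in this README follow a v<major>.<minor>.<patch> pattern. Assuming that pattern is the intended convention, a small check before tagging could look like the sketch below; is_release_tag is a hypothetical helper name.

```shell
# Hypothetical helper: accept only tags shaped like v<major>.<minor>.<patch>.
is_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Example usage: create and push a single tag only if its name is valid.
# tag="v1.0.1"
# is_release_tag "$tag" && git tag "$tag" && git push origin tag "$tag"
```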