ELT Data Stack

This is a demo project (self-contained with docker-compose) that showcases an ELT process: data is loaded into a warehouse, processed from the raw layer into semantic and analytic layers with DBT, and then presented in a Superset dashboard.

Target setup

1 - Prerequisites

This demo was developed on macOS Monterey v12.5; you may encounter DB host issues with docker-compose if you run it on Windows (for instance, you can change the DB host to host.docker.internal in Apache Superset).
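For reference, a Superset SQLAlchemy URI pointing back at the host machine would look something like the line below (the database name is a placeholder; use whatever init_db.sql creates, with the postgres/postgres credentials used elsewhere in this demo):

postgresql://postgres:postgres@host.docker.internal:5432/<database_name>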

2 - Quickstart

make build-all-images

make up  # Start DB, Airflow, Superset services

You can now access the following services (I recommend using incognito mode, because logging into Superset will otherwise log you out of Airflow):

Airflow (running on localhost:8080)

username: airflow, password: airflow

Now you should be able to see the transactions_elt DAG:

Manually trigger the DAG

The test task will fail because one test does not pass.
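If you prefer the command line over the UI, the DAG can also be triggered from inside the Airflow container, along these lines (the service name depends on your docker-compose file):

docker-compose exec airflow-webserver airflow dags trigger transactions_elt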

Now you can open Superset (running on localhost:8088)

username: admin, password: admin

Then manually import the example_dashboard_export.zip (this should be done programmatically at init time in the container, but a manual import will do for this demo). Postgres DB password: postgres
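If you want to script the import instead, Superset ships a CLI command for it; something along these lines should work from inside the Superset container (exact flags vary between Superset versions, and the path must point to wherever the zip is mounted):

docker-compose exec superset superset import-dashboards -p example_dashboard_export.zip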

You can now see the "Monitoring Dashboard", which contains one graph displaying the accounts that withdrew more than 1 billion USD over the period. Note that very little formatting has been done, as that was not the purpose of this demo.

3 - Project file structure

.
├── init_db.sql            // SQL query run when the Postgres DB is initialized
├── docker-compose.yaml
├── dbt
│   ├── profiles.yml
│   ├── tests
│   │   ├── .gitkeep
│   │   ├── ensure_withdrawal_or_deposit.sql
│   │   └── value_date_greater_than_transaction_date.sql
│   ├── models
│   │   └── staging
│   │       ├── schema.yml
│   │       ├── source.yml
│   │       └── transactions.sql
│   ├── macros             // custom macro to override the schema name
│   └── dbt_project.yml
├── Makefile               // makes terminal interactions much faster
├── README.md
├── .gitignore
├── data_ingestion         // contains the Python ingestion script
│   ├── requirements.in
│   ├── requirements.txt
│   ├── Dockerfile
│   ├── populate_csv_into_db.py
│   └── data
├── superset               // custom Superset app
├── docs                   // diagrams and screenshots
└── airflow                // custom Airflow app

4 - ELT

Dataset

The raw dataset (sourced from Kaggle) is located in transactions_data.csv. It represents synthetic bank transactions. An extract is presented below:

Extract & Load

DBT is a compiler and runner; it doesn't handle raw data loading, so we use a Python script to load the CSV data. Note that a PostgresOperator in Airflow would do for the data ingestion, but I didn't want to ingest the full CSV and PostgresOperator isn't great for custom CSV imports. Another workaround for the demo would be DBT "seed", but that is only recommended for static data, whilst the withdrawal history is dynamic data.

The Python script reads the CSV, removes duplicates and drops the last column, since we don't want to store data that carries no information.
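A minimal sketch of what the ingestion step does is shown below (the connection string, column handling and table/schema names are assumptions for illustration; the real implementation is data_ingestion/populate_csv_into_db.py):

import pandas as pd
from sqlalchemy import create_engine

# Connection details are assumptions for illustration; the demo's real settings
# live in data_ingestion/populate_csv_into_db.py and docker-compose.yaml.
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")

def populate_csv_into_db(csv_path: str) -> None:
    df = pd.read_csv(csv_path)
    df = df.drop_duplicates()   # remove duplicated rows
    df = df.iloc[:, :-1]        # drop the last column, which carries no information
    # Write into the raw layer; the DBT staging model reads from this schema.
    df.to_sql("transactions", engine, schema="raw", if_exists="replace", index=False)

if __name__ == "__main__":
    populate_csv_into_db("data/transactions_data.csv")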

Transform

DBT reads data from the raw schema, parses and formats it (cf. transactions.sql), and finally writes it into the staging schema. An extract of the staging transactions table is displayed below:
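As a rough illustration, a staging model of this kind typically looks like the following (column names are assumptions; the real model is dbt/models/staging/transactions.sql):

-- Sketch of a staging model: reads the raw table declared in source.yml,
-- casts/renames columns, and materializes the result in the staging schema.
select
    cast(transaction_date as date)     as transaction_date,
    cast(value_date as date)           as value_date,
    cast(withdrawal_amount as numeric) as withdrawal_amount,
    cast(deposit_amount as numeric)    as deposit_amount
from {{ source('raw', 'transactions') }}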

Tests

  • DBT tests are located under ./dbt/tests/
  • The Python script should also have unit/functional tests, but they have been omitted to keep the demo simple. You can refer to my other repo financial-data-api if you want more details on Python tests.

You will note that the Airflow task test_transformed_data fails. I leave it failing to showcase its utility: if you look at the logs you will notice that not all value dates are greater than the transaction date, which means some transactions are backdated.
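For reference, a DBT singular test of this kind simply selects the offending rows and fails if the query returns anything (this is a sketch with assumed column names; the real test is dbt/tests/value_date_greater_than_transaction_date.sql):

-- A DBT singular test passes when the query returns zero rows.
select *
from {{ ref('transactions') }}
where value_date < transaction_date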

5 - Data Visualization

Note that the Superset service may take 4-5 minutes to spin up.

6 - Improvements

  • Improve the data visualization with storytelling
  • Dependency injection for test/dev/staging/prod environments
  • Airflow DAGs and config should be included in the Docker image (volumes are mounted here to allow fast iteration)
  • The Superset dashboard should be imported programmatically at init time

7 - Resources:

8 - Useful commands

Check DBT DB connection status

dbt debug

Generate DBT documentation

dbt docs generate
dbt docs serve

Dry-run DBT

dbt compile

Run DBT tests

dbt test
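
Run DBT models (builds everything under dbt/models/)

dbt run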
