This project has been developed as part of the 2023 Data Engineering Zoomcamp. The goal of the project is to implement a data pipeline for NYC's Citi Bike data. It's a batch pipeline which extracts data from the NYC Citi Bike dataset and stores the raw data in Google Cloud Storage and Google BigQuery. The stored data in BigQuery is then transformed using dbt, and the transformed dataset is used by Google Looker Studio to develop visualizations for analytics purposes.
Citi Bike is NYC's official bike share program, designed to give residents and visitors a fun, affordable and convenient alternative to walking, taxis, buses and subways. Citi Bike believes that biking is the best way to see NYC! It's a quick and affordable way to get all around the city, and it even allows you to sightsee along the way. The project answers the questions below and helps bikers explore NYC.
- Where do Citi Bikers ride?
- Which stations are most popular?
- What days of the week are most rides taken on?
- What is the total number of trips?
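To illustrate the kind of aggregation behind these questions, here is a minimal plain-Python sketch (with made-up sample rows; in the real project these answers come from BigQuery and dbt) counting rides per start station:

```python
from collections import Counter

# Hypothetical sample trip records; the real data comes from the Citi Bike dataset.
trips = [
    {"start_station_name": "W 21 St & 6 Ave", "started_at": "2023-01-02"},
    {"start_station_name": "West St & Chambers St", "started_at": "2023-01-02"},
    {"start_station_name": "W 21 St & 6 Ave", "started_at": "2023-01-03"},
]

# "Which stations are most popular?" -> count trips per start station.
station_counts = Counter(t["start_station_name"] for t in trips)
print(station_counts.most_common(1))
```

The same group-and-count shape, expressed in SQL, is what the dbt models run against the warehouse.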
You can find the report here
The following technologies are used to implement this pipeline:
- Cloud: Google Cloud Platform
- Data Lake: Google Cloud Storage
- Data warehouse: Google BigQuery
- Terraform: Infrastructure as Code (IaC) - creates the project configuration for GCP, bypassing the cloud GUI.
- Workflow orchestration: Prefect
- Data Transformation: dbt
- Data Visualisation: Google Looker Studio
Clone the Git repo to your system:
git clone <your-repo-url>
Install the necessary packages/prerequisites for the project with the following command:
pip install -r requirements.txt
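The authoritative list of packages lives in the repo's requirements.txt; as a rough sketch, it will contain entries along these lines (package names are the standard PyPI ones, versions omitted, actual contents may differ):

```text
pandas
prefect
prefect-gcp
google-cloud-storage
google-cloud-bigquery
```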
Next you need to set up your Google Cloud environment
- Create a Google Cloud Platform project, if you do not already have one (https://console.cloud.google.com/cloud-resource-manager)
- Configure Identity and Access Management (IAM) for the service account, granting it the following privileges:
- BigQuery Admin
- Storage Admin
- Storage Object Admin
- Download the JSON credentials file and save it somewhere you'll remember; this is your JSON key.
- Install the Google Cloud SDK
- Configure the environment variable to point to your GCP key (https://cloud.google.com/docs/authentication/application-default-credentials#GAC) and authenticate using the following commands:
export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
gcloud auth application-default login
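As a quick sanity check before running the pipeline, a small stdlib-only Python sketch (the function name is ours, not part of the project) can verify that the variable points at a readable service-account key:

```python
import json
import os

def credentials_look_valid() -> bool:
    """Rough check (not real authentication) that GOOGLE_APPLICATION_CREDENTIALS
    points at a service-account JSON key file."""
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path or not os.path.isfile(path):
        return False
    with open(path) as f:
        key = json.load(f)
    # Downloaded service-account keys carry these fields.
    return key.get("type") == "service_account" and "private_key" in key
```

If this returns False, re-check the export above before moving on to Terraform.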
- Set up the infrastructure of the project using Terraform
If you do not have Terraform installed, you can install it from here and then add it to your PATH
Once downloaded, navigate to the terraform folder:
cd terraform/
then run the following commands to create your project infrastructure:
terraform init
terraform plan -var="project=<your-gcp-project-id>"
terraform apply -var="project=<your-gcp-project-id>"
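The repo's Terraform configuration defines the GCS bucket and BigQuery dataset; a minimal sketch of what such a configuration looks like (resource names, region and dataset id here are illustrative, not necessarily the repo's actual values):

```hcl
provider "google" {
  project = var.project
  region  = "us-east1" # illustrative region
}

# Data lake bucket (bucket names must be globally unique)
resource "google_storage_bucket" "data_lake" {
  name          = "${var.project}-citibike-data-lake"
  location      = "US"
  force_destroy = true
}

# BigQuery dataset for the warehouse
resource "google_bigquery_dataset" "citibike" {
  dataset_id = "citibike_data"
  location   = "US"
}
```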
- Run the Python code in the Prefect folder
You installed the required Python packages in step 1, so Prefect should already be installed with them. Confirm the Prefect installation with the following command:
prefect --version
You can start the Prefect server so that you can access the UI using the command below:
prefect orion start
Access the UI at: http://127.0.0.1:4200/
Then update the blocks so that they are registered to your credentials for GCS and BigQuery. This can be done under the Blocks option in the UI.
You can keep the blocks under the same names as in the code or change them. If you do change them, make sure to update the code to reference the new block names.
Go back to the terminal and run:
cd prefect/
then run:
python citibike_data_pipeline.py
The Python script will then store the Citi Bike data both in your GCS bucket and in BigQuery
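Conceptually, the script's transform step comes down to parsing trip timestamps and deriving the fields the analytics questions need. A stdlib-only sketch of that idea (the column names follow the public Citi Bike trip-data schema; the actual script's implementation details may differ):

```python
import csv
import io
from datetime import datetime

# Hypothetical CSV snippet in the shape of the Citi Bike trip data.
raw = io.StringIO(
    "ride_id,started_at,ended_at\n"
    "A1,2023-01-02 08:00:00,2023-01-02 08:15:00\n"
    "B2,2023-01-07 12:00:00,2023-01-07 12:30:00\n"
)

rows = []
for rec in csv.DictReader(raw):
    started = datetime.strptime(rec["started_at"], "%Y-%m-%d %H:%M:%S")
    ended = datetime.strptime(rec["ended_at"], "%Y-%m-%d %H:%M:%S")
    rows.append(
        {
            "ride_id": rec["ride_id"],
            # Feeds the "what days of the week" question downstream.
            "day_of_week": started.strftime("%A"),
            "duration_min": (ended - started).seconds / 60,
        }
    )

print(rows)
```

In the pipeline itself, this kind of enrichment happens at scale with pandas/Prefect before (or dbt after) the data lands in BigQuery.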
- Running the dbt flow
- Create a dbt account and log in to dbt Cloud here
- Once logged in, clone the repo for use
- In the CLI at the bottom, run the following command:
dbt run
- This will run all the models and create the final dataset called "fact_citibike"
- On a successful run, the lineage of fact_citibike looks as below:
- Visualization
- You can now utilize the fact_citibike dataset within Looker Studio for visualizations.
- You can find the report here