Project-based Learning - Data Engineering

This repository is for building healthcare data engineering skills through practical projects and exercises. The data comes from Synthea, a synthetic clinical data simulator that generates realistic, but not real, patient data. The goal is hands-on experience with Python and SQL, along with a range of tools and technologies commonly used in data engineering.

Tech Stack

Programming Languages

  • Python
  • SQL

Technologies and Tools

  • Docker
  • Terraform
  • PostgreSQL
  • Google Cloud Platform (GCP)
  • Mage (alternative to Airflow)
  • BigQuery
  • dbt (data build tool)
  • Apache Spark (Python & SQL)
  • Kafka
  • Faust
  • KSQL
  • ksqlDB
  • Make

Modules

  • Module 1: Containerization and Infrastructure as Code (IaC)
    • Docker
    • Terraform
    • GCP
  • Module 2: Workflow Orchestration
    • Data Lake
    • Mage
    • Airflow
  • Module 3: Data Warehouse
    • Data Warehouse
    • BigQuery
  • Module 4: Analytics Engineering
    • ELT vs. ETL
    • dbt
    • Testing (unit & integration testing)
  • Module 5: Batch Processing
    • Apache Spark (Python & SQL)
  • Module 6: Streaming
    • Kafka
    • Faust
    • KSQL
    • ksqlDB
    • Exposure to examples with Java & Scala

Workshops

  • Workshop 1: Data Ingestion
  • Workshop 2: Stream Processing with SQL
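
As a rough illustration of the Python-plus-SQL style of exercise this repository targets, here is a minimal ingestion sketch. It is not taken from the workshops themselves. It assumes a small patient CSV with `Id`, `BIRTHDATE`, and `GENDER` columns (hypothetical sample rows, loosely modeled on a Synthea-style export) and uses the standard-library `sqlite3` module as a stand-in for PostgreSQL. The function and table names are illustrative.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows. The column names are an assumption for this
# sketch, not a guaranteed match for any particular Synthea export.
SAMPLE_CSV = """Id,BIRTHDATE,GENDER
p-001,1980-04-12,F
p-002,1975-09-30,M
p-003,2001-01-05,F
"""


def ingest_patients(csv_text, conn):
    """Load patient rows from CSV text into a `patients` table.

    Returns the number of rows read from the CSV.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patients ("
        "id TEXT PRIMARY KEY, birthdate TEXT, gender TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["Id"], r["BIRTHDATE"], r["GENDER"]) for r in reader]
    # INSERT OR REPLACE keeps the load idempotent if the same file is re-run.
    conn.executemany("INSERT OR REPLACE INTO patients VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def count_by_gender(conn):
    """Return a {gender: count} mapping computed in SQL."""
    cur = conn.execute(
        "SELECT gender, COUNT(*) FROM patients GROUP BY gender ORDER BY gender"
    )
    return dict(cur.fetchall())


# In-memory database used here only so the sketch runs without setup.
conn = sqlite3.connect(":memory:")
inserted = ingest_patients(SAMPLE_CSV, conn)
counts = count_by_gender(conn)
print(inserted, counts)
```

Swapping the in-memory SQLite connection for a PostgreSQL one would require a separate driver and a different parameter placeholder style, so treat this only as a starting shape for an ingestion step.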