This repository contains Airflow pipelines (DAGs) that derive blockchain data from a dedicated node via Ethereum ETL into a Postgres database with a fairly flat structure.
- clickhouse-eth-data - connection to ClickHouse for raw data and DBT models
- google-cloud - connection to Google Cloud Storage (for transferring datasets between databases)
- pg-prod-scoring-api - connection to the Postgres database where the showcase API data is hosted
- slack_notifications - connection to Slack (for notifications)
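The sketch below shows how one of these connection IDs might be resolved from inside a task. The connection IDs come from the list above; the function name and the printed fields are purely illustrative assumptions.

```python
# Minimal sketch: resolving an Airflow connection by its ID inside a task.
# The connection ID "clickhouse-eth-data" is taken from the list above;
# everything else here is illustrative only.
from airflow.hooks.base import BaseHook


def describe_clickhouse_connection() -> None:
    # Fetch the ClickHouse connection used for raw data and DBT models.
    conn = BaseHook.get_connection("clickhouse-eth-data")
    print(conn.host, conn.port, conn.schema)
```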
- /dags - the main package containing definitions of our airflow dags
- /common - common modules
- /deployments/local - contains docker-compose file for local debugging
- /dbt_clickhouse - contains DBT models
- analytics: Contains DAGs related to analytics tasks, such as generating reports, data analysis, or running machine learning models.
- api: Includes DAGs related to API integration tasks, such as fetching and processing data from external APIs.
- attributes: Contains DAGs focused on extracting and processing attribute-related data, such as wallet attributes or labels.
- audiences: Holds DAGs related to audience-related tasks, such as creating and updating audience segments.
- chains/ethereum_etl: Contains DAGs specifically related to Ethereum ETL tasks, handling the extraction, transformation, and loading of Ethereum blockchain data.
- control: Contains DAGs that serve as control mechanisms or orchestrators for other DAGs.
- dbt: Holds DAGs related to DBT (Data Build Tool) tasks, responsible for transforming and modeling the extracted data.
- deanonimization: Includes DAGs related to de-anonymization tasks, linking anonymized data to specific individuals or entities.
- ens: Contains DAGs related to Ethereum Name Service (ENS) tasks, including data extraction, processing, and storage.
- erc_1155: Holds DAGs specifically related to ERC-1155 token tasks, including data extraction, processing, and storage.
- external: Includes DAGs that interact with external systems or services outside of the Airflow environment.
- ml: Contains DAGs related to machine learning tasks, such as training models and running data pipelines.
- nft_holders: Holds DAGs specifically related to NFT (Non-Fungible Token) holder tasks, including data extraction, processing, and storage.
- report: Contains DAGs related to generating and delivering reports based on the extracted data.
- snapshot: Holds DAGs related to creating snapshots of the blockchain data at specific points in time.
- tests: Includes DAGs specifically designed for testing purposes, such as testing data pipelines or validating data quality.
- top_collections: Contains DAGs related to tasks involving top collections or popular items within a collection.
- utility: Holds utility or helper DAGs providing common functions or tasks used across other DAGs.
- monodag.py: Represents a single DAG that encapsulates a specific workflow or task.
The Airflow DAGs in this repository implement the following pipeline (see the sketch after this list):

- Load raw data: The DAGs use the open-source Ethereum ETL tool (https://github.com/blockchain-etl/ethereum-etl) to download raw data from the Ethereum blockchain.
- Insert into ClickHouse: The downloaded raw data is then inserted into ClickHouse, a columnar database optimized for analytics.
- Business Analytics with DBT: The data stored in ClickHouse is further processed with DBT (https://github.com/dbt-labs/dbt-core) to calculate several business analytics entities; DBT provides a powerful toolkit for transforming and modeling data.
- API Integration: The resulting DBT tables and models are then used by an API, allowing users to access the derived analytics data.
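A minimal sketch of this pipeline shape, assuming BashOperator tasks that call the Ethereum ETL and DBT CLIs. The block range, file paths, table name, node URI variable, and dbt project directory are illustrative assumptions, not values from this repository.

```python
# Illustrative pipeline sketch: Ethereum ETL -> ClickHouse -> DBT.
# Paths, table names, and parameters below are assumptions for the example.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ethereum_etl_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # 1. Load raw data with the Ethereum ETL CLI.
    export_blocks = BashOperator(
        task_id="export_blocks_and_transactions",
        bash_command=(
            "ethereumetl export_blocks_and_transactions "
            "--start-block {{ params.start_block }} --end-block {{ params.end_block }} "
            "--provider-uri $ETH_NODE_URI "
            "--blocks-output /tmp/blocks.csv --transactions-output /tmp/transactions.csv"
        ),
        params={"start_block": 0, "end_block": 99},
    )

    # 2. Insert the exported CSV into ClickHouse (hypothetical eth_raw.blocks table).
    load_clickhouse = BashOperator(
        task_id="load_into_clickhouse",
        bash_command=(
            "clickhouse-client --query "
            "'INSERT INTO eth_raw.blocks FORMAT CSVWithNames' < /tmp/blocks.csv"
        ),
    )

    # 3. Build business-analytics models with DBT on top of the raw tables.
    run_dbt = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/airflow/dbt_clickhouse",
    )

    # 4. The resulting DBT models are read by the showcase API directly,
    #    so no separate Airflow task is shown for that step.
    export_blocks >> load_clickhouse >> run_dbt
```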
- `make run-local` - runs a local Airflow instance with CeleryExecutor and all production connections (Postgres, ClickHouse, Google Cloud)
- `make down` - stops the local Airflow instance and deletes all containers and volumes
- `make delete` - same as `make down` above, but additionally removes all images