rtdl - The Real-Time Data Lake ⚡️


rtdl is a universal real-time ingestion and pre-processing layer for every data lake – regardless of table format, OLAP layer, catalog, or cloud vendor. It is the easiest way to build and maintain real-time data lakes. You send rtdl a real-time data stream – often from a tool like Segment – and it builds you a real-time data lake on AWS S3, GCP Cloud Storage, and Azure Blob Storage.

You provide the data, rtdl builds your lake.

Stay up-to-date on rtdl via our website and blog, and learn how to use rtdl via our documentation.

v0.2.0 - Current status: what works and what doesn't

What works? 🚀

rtdl's initial feature set is built and working. You can use the API on port 80 to configure streams that ingest JSON sent to rtdl's endpoint on port 8080, process it into Parquet, and save the files to the destination configured in your stream. rtdl can write files locally, to HDFS, or to AWS S3, GCP Cloud Storage, and Azure Blob Storage, and you can query your data via Dremio's web UI at http://localhost:9047 (log in with username rtdl and password rtdl1234). rtdl supports writing in the Delta Lake table format as well as integration with the AWS Glue and Snowflake External Tables metadata catalogs.
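In practice, that means two HTTP endpoints (a minimal sketch assuming the default ports; stream.json and event.json are illustrative files whose contents are covered in the Quickstart below):

    # Configure a stream via the API service
    curl -X POST http://localhost:80/createStream -H 'Content-Type: application/json' -d @stream.json

    # Send events to the ingest service
    curl -X POST http://localhost:8080/ingest -H 'Content-Type: application/json' -d @event.json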

What's new? 💥

  • Upgrading to v0.2.0 requires following the steps in our upgrade guide.
  • Added Delta Lake support.
  • Switched to file-based configuration storage (removed dependency on PostgreSQL).

What doesn't work/what's next on the roadmap? 🚴🏼

  • Community contribution: Stateful Function for PII detection and masking.
  • Making AWS Glue, Snowflake External Tables, and Delta Lake support configurable on a per-stream basis.
  • git integration for stream configurations.
  • Research and implementation for Apache Hudi, Apache Iceberg, and Project Nessie.
  • Graphical user interface.
  • Dremio Cloud support.

Quickstart 🌱

Initialize and start rtdl

For more detailed instructions, see our Initialize rtdl docs.

  1. Run docker compose -f docker-compose.init.yml up -d.
    • Note: This configuration should be fault-tolerant, but if any containers or processes fail when running this, run docker compose -f docker-compose.init.yml down and retry.
  2. After the containers rtdl_rtdl-db-init, rtdl_dremio-init, and rtdl_redpanda-init exit and complete with EXITED (0), kill and delete the rtdl container set by running docker compose -f docker-compose.init.yml down.
  3. Run docker compose up -d every time thereafter.
    Note: Your memory setting in Docker must be at least 8GB. rtdl may become unstable if it is set lower.
    • Run docker compose down to stop.

Note #1: To start from scratch, run rm -rf storage/ from the rtdl root folder.
Note #2: If you experience file write issues that prevent the Dremio and/or Redpanda services from starting, add user: root to the Dremio and Redpanda service definitions in docker-compose.init.yml and docker-compose.yml. This issue has been encountered on Linux.
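Putting the steps above together, a typical first run looks like this (a sketch of the same sequence, nothing beyond the commands already described):

    # One-time initialization
    docker compose -f docker-compose.init.yml up -d
    docker compose -f docker-compose.init.yml ps -a   # wait until the *-init containers show EXITED (0)
    docker compose -f docker-compose.init.yml down

    # Every run after that
    docker compose up -d
    # ... use rtdl ...
    docker compose down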

Set up your storage buckets (in AWS) and your stream in rtdl

For more detailed setup instructions for your cloud provider, see our setup docs:

  1. Create a new S3 bucket.
  2. Create a new IAM user.
  3. Create a new IAM policy.
    • Use the permissions below, replacing <YOUR_BUCKET_NAME> with the name of the S3 bucket you created in step 1.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "ListAllBuckets",
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetBucketLocation",
                      "s3:ListAllMyBuckets"
                  ],
                  "Resource": [
                      "arn:aws:s3:::*"
                  ]
              },
              {
                  "Sid": "ListBucket",
                  "Effect": "Allow",
                  "Action": [
                      "s3:ListBucket"
                  ],
                  "Resource": [
                      "arn:aws:s3:::<YOUR_BUCKET_NAME>"
                  ]
              },
              {
                  "Sid": "ManageBucket",
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetObject",
                      "s3:PutObject",
                      "s3:PutObjectAcl",
                      "s3:DeleteObject"
                  ],
                  "Resource": [
                      "arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
                  ]
              }
          ]
      }
      
  4. Attach the policy created in step 3 to the IAM user created in step 2.
  5. Create access keys for your IAM user.
    • For more information, see Amazon's documentation.
    • Save the Access Key ID and Secret Access Key for use in configuring your stream in rtdl. (A consolidated AWS CLI sketch of steps 1-5 appears after this list.)
  6. Create a stream configuration record in rtdl.
    Send a POST request to the API at http://localhost:80/createStream.
    • Example createStream call body for creating a data lake on AWS S3.
      {
          "active": true,
          "message_type": "test-msg-aws",
          "file_store_type_id": 2,
          "region": "us-west-1",
          "bucket_name": "testBucketAWS",
          "folder_name": "testFolderAWS",
          "partition_time_id": 1,
          "compression_type_id": 1,
          "aws_access_key_id": "[aws_access_key_id]",
          "aws_secret_access_key": "[aws_secret_access_key]"
      }
      
    • Example createStream curl call for creating a data lake on AWS S3.
      curl --location --request POST 'http://localhost:80/createStream' \
      --header 'Content-Type: application/json' \
      --data-raw '{
          "active": true,
          "message_type": "test-msg-aws",
          "file_store_type_id": 2,
          "region": "us-west-1",
          "bucket_name": "testBucketAWS",
          "folder_name": "testFolderAWS",
          "partition_time_id": 1,
          "compression_type_id": 1,
          "aws_access_key_id": "[aws_access_key_id]",
          "aws_secret_access_key": "[aws_secret_access_key]"
      }'
      
    Note: A Postman collection with examples of all rtdl API calls can be found on GitHub at realtimedatalake/postman-rtdl-public.
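If you prefer the command line, steps 1-5 above can also be scripted with the AWS CLI. A sketch, assuming AWS CLI v2 configured with credentials that can manage S3 and IAM; rtdl-bucket, rtdl-user, and rtdl-s3-policy are illustrative names, and policy.json contains the policy document from step 3:

    # Step 1: create the bucket (us-west-1 shown; omit --create-bucket-configuration for us-east-1)
    aws s3api create-bucket --bucket rtdl-bucket --region us-west-1 \
        --create-bucket-configuration LocationConstraint=us-west-1

    # Step 2: create the IAM user
    aws iam create-user --user-name rtdl-user

    # Steps 3 and 4: create the policy and attach it to the user
    aws iam create-policy --policy-name rtdl-s3-policy --policy-document file://policy.json
    aws iam attach-user-policy --user-name rtdl-user \
        --policy-arn arn:aws:iam::<YOUR_ACCOUNT_ID>:policy/rtdl-s3-policy

    # Step 5: create access keys; save the AccessKeyId and SecretAccessKey from the output
    aws iam create-access-key --user-name rtdl-user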

Send data to rtdl

For more detailed instructions, see our Send data to rtdl docs.
All data should be sent to the ingest endpoint of the ingest service on port 8080, e.g. http://localhost:8080/ingest.

  • You can send any JSON payload that includes a stream_id, and rtdl will add it to your lake.
    {
        "stream_id":"837a8d07-cd06-4e17-bcd8-aef0b5e48d31",
        "name":"user1",
        "array":[1,2,3],
        "properties":{"age":20}
    }
    
    You can optionally include a message_type field to override the message_type specified when you created the stream. If message_type is absent from both the stream definition and the message itself, rtdl defaults to the message type rtdl_default.
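A matching curl sketch for sending the example payload above to the ingest endpoint (assuming the default port mapping; the stream_id is the illustrative value from the payload above):

    curl --location --request POST 'http://localhost:8080/ingest' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "stream_id":"837a8d07-cd06-4e17-bcd8-aef0b5e48d31",
        "name":"user1",
        "array":[1,2,3],
        "properties":{"age":20}
    }'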

Architecture 🏛

rtdl has a multi-service architecture that combines a new generation of open-source tools for processing and accessing your data with custom-built services that make them easier to work with. To learn more about rtdl's services and architecture, visit our Architecture docs.

License 🤝

MIT

Contributing 😎

Contributions are always welcome!
See our CONTRIBUTING guide for ways to get started. This project adheres to the rtdl code of conduct, a direct adaptation of the Contributor Covenant, version 2.1.

Appreciation 🙏

Authors ✍🏽