DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.
DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
- Feature-rich SQL support and DataFrame API
- Blazingly fast, vectorized, multi-threaded, streaming execution engine.
- Native support for Parquet, CSV, JSON, and Avro file formats. Support
for custom file formats and non file datasources via the
TableProvider
trait. - Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL, other query languages, custom plan and execution nodes, optimizer passes, and more.
- Streaming, asynchronous IO directly from popular object stores, including AWS S3,
Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
ObjectStore
trait. - Excellent Documentation and a welcoming community.
- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations, automatic join reordering, expression coercion, and more.
- Permissive Apache 2.0 License, Apache Software Foundation governance
- Written in Rust, a modern system language with development productivity similar to Java or Golang, the performance of C++, and loved by programmers everywhere.
DataFusion can be used without modification as an embedded SQL engine or can be customized and used as a foundation for building new systems. Here are some examples of systems built using DataFusion:
- Specialized Analytical Database systems such as CeresDB and more general spark like system such a Ballista.
- New query language engines such as prql-query and accelerators such as VegaFusion
- Research platform for new Database Systems, such as Flock
- SQL support to another library, such as dask sql
- Streaming data platforms such as Synnada
- Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as qv
- A faster Spark runtime replacement (blaze-rs)
By using DataFusion, the projects are freed to focus on their specific features, and avoid reimplementing general (but still necessary) features such as an expression representation, standard optimizations, execution plans, file format support, etc.
- High Performance: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
- Easy to Connect: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
- Easy to Embed: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
- High Quality: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
Here is a comparison with similar projects that may help understand when DataFusion might be be suitable and unsuitable for your needs:
-
DuckDB is an open source, in process analytic database. Like DataFusion, it supports very fast execution, both from its custom file format and directly from parquet files. Unlike DataFusion, it is written in C/C++ and it is primarily used directly by users as a serverless database and query system rather than as a library for building such database systems.
-
Polars: Polars is one of the fastest DataFrame libraries at the time of writing. Like DataFusion, it is also written in Rust and uses the Apache Arrow memory model, but unlike DataFusion it does not provide SQL nor as many extension points.
-
Facebook Velox is an execution engine. Like DataFusion, Velox aims to provide a reusable foundation for building database-like systems. Unlike DataFusion, it is written in C/C++ and does not include a SQL frontend or planning /optimization framework.
-
DataBend is a complete database system. Like DataFusion it is also written in Rust and utilizes the Apache Arrow memory model, but unlike DataFusion it targets end-users rather than developers of other database systems.
There are a number of community projects that extend DataFusion or provide integrations with other systems.
Here are some of the projects known to use DataFusion:
- Ballista Distributed SQL Query Engine
- Blaze Spark accelerator with DataFusion at its core
- CeresDB Distributed Time-Series Database
- Cloudfuse Buzz
- CnosDB Open Source Distributed Time Series Database
- Cube Store
- Dask SQL Distributed SQL query engine in Python
- datafusion-tui Text UI for DataFusion
- delta-rs Native Rust implementation of Delta Lake
- Flock
- Greptime DB Open Source & Cloud Native Distributed Time Series Database
- InfluxDB IOx Time Series Database
- Kamu Planet-scale streaming data pipeline
- Parseable Log storage and observability platform
- qv Quickly view your data
- ROAPI
- Seafowl CDN-friendly analytical database
- Synnada Streaming-first framework for data products
- Tensorbase
- VegaFusion Server-side acceleration for the Vega visualization grammar
Please see the example usage in the user guide and the datafusion-examples crate for more information on how to use DataFusion.
Please see Roadmap for information of where the project is headed.
There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.
- (July 2022): DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: recording and slides
- (March 2021): The DataFusion architecture is described in Query Engine Design and the Rust-Based DataFusion in Apache Arrow: recording (DataFusion content starts ~ 15 minutes in) and slides
- (February 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: recording
Please see User Guide for more information about DataFusion.
Please see Contributor Guide for information about contributing to DataFusion.