Is Zarr an OLAP Database? #290
Comments
I'm not too familiar with the OLAP conceptualization, but I tend to think of the N-dimensional array API as a special case of a table, where 1 column contains numeric values, and another column contains N-tuples (the indices of the data), which are used to index the values. Array indexing can then be expressed as querying the values based on the index, and many other array operations can be expressed as transformations of the results of such queries. If we want to extend the analogy beyond the API, most N-dimensional array libraries store array data in contiguous buffers, which would make the array a "column-oriented table".

With this in mind, Zarr can be described as a "columnar database" with a few performance / storage optimizations for large columns, but it's not a database that competes with something like DuckDB, since Zarr doesn't have good support for variable-length types.

All that being said, if the world of databases is a superset of the world of N-dimensional arrays, then it's almost certainly the case that we can use tools from database theory / software to advance Zarr.
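To make the "array as a table" idea concrete, here is a minimal sketch using only the Python standard library. It is an illustration of the analogy, not how Zarr or any array library is implemented. The table name `arr`, the column names, and the choice of one column per index dimension (rather than a single N-tuple column) are assumptions made for the example.

```python
import itertools
import sqlite3

# A small 2 x 3 array stored as nested lists (a stand-in for an N-dimensional array).
array = [[10, 11, 12],
         [20, 21, 22]]
shape = (2, 3)
indices = list(itertools.product(*map(range, shape)))

# "Contiguous buffer": the values flattened in row-major (C) order.
buffer = [array[i][j] for i, j in indices]

# Table view: one column per index dimension plus one value column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE arr (i INTEGER, j INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO arr VALUES (?, ?, ?)",
    [(i, j, array[i][j]) for i, j in indices],
)

# array[1][2] expressed as a query on the index columns.
(point,) = conn.execute(
    "SELECT value FROM arr WHERE i = 1 AND j = 2"
).fetchone()

# array[0][1:3] expressed as a range query.
row_slice = [
    v for (v,) in conn.execute(
        "SELECT value FROM arr WHERE i = 0 AND j >= 1 AND j < 3 ORDER BY j"
    )
]

# In the flat buffer the index column is implicit: the position of an
# element is computed from its index instead of being stored.
offset = 1 * shape[1] + 2
```

Here `point` and `buffer[offset]` both recover `array[1][2]`, once through a query on explicit index columns and once through offset arithmetic on the contiguous buffer.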
IIUC, OLAP is a way to represent ND arrays in an RDBMS such that a range of common analytic queries are performant. These center around the snowflake or star DB schema pattern. But I'm a DB newcomer, so take this explanation with a grain of salt.

> I tend to think of the N-dimensional array API as a special case of a table, where 1 column contains numeric values, and another column contains N-tuples (the indices of the data), which are used to index the values. Array indexing can then be expressed as querying the values based on the index, and many other array operations can be expressed as transformations of the results of such queries. If we want to extend the analogy beyond the API, most N-dimensional array libraries store array data in contiguous buffers, which would make the array a "column-oriented table".

Funny you mention this! I'm exploring something similar here (in a read-only capacity): https://github.com/alxmrs/xarray-sql

> since Zarr doesn't have good support for variable-length types.

Can you explain this a bit more? Would this be like string or varchar support?

> it's almost certainly the case that we can use tools from database theory / software to advance Zarr.

Top of mind for me here is streaming reads and writes. I think some sort of Rosetta Stone with features in the Postgres ecosystem would really highlight potential new areas of development.
(Reply to Davis Bennett's comment of Mon, Mar 18, 2024 at 3:16 PM, quoted in part above.)
Zarr is designed for numeric types that are a fixed size, e.g. uint8; there's some effort towards supporting variable-length strings in Zarr, but it's not a common use case. Compared to the types supported by a typical relational database, the numeric types Zarr focuses on are a tiny subset; e.g., look at the types supported by Postgres.
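A rough sketch of why fixed-size types are the easy case, using only the Python standard library. This does not reproduce Zarr's actual chunk encoding. The offsets-array layout for strings is just one common way to store variable-length values, and the helper names `read_fixed` and `read_variable` are made up for the example.

```python
import struct

# Fixed-size elements: four uint8 values occupy exactly four bytes.
fixed = struct.pack("4B", 7, 42, 99, 255)
itemsize = struct.calcsize("B")  # 1 byte per element


def read_fixed(buf, k):
    """Element k starts at byte k * itemsize; no lookup table is needed."""
    return struct.unpack_from("B", buf, k * itemsize)[0]


# Variable-length elements: strings with different byte lengths.
words = ["zarr", "olap", "n-dimensional"]
encoded = [w.encode("utf-8") for w in words]
data = b"".join(encoded)

# An extra offsets array is needed to find where each element starts and ends.
offsets = [0]
for e in encoded:
    offsets.append(offsets[-1] + len(e))


def read_variable(buf, offs, k):
    """Element k spans bytes offs[k]..offs[k + 1] of the data buffer."""
    return buf[offs[k]:offs[k + 1]].decode("utf-8")
```

With fixed-size elements, random access is pure arithmetic on the index. With variable-length elements, every read goes through a second structure that has to be stored and kept consistent with the data.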
I've been doing some background research for a project I've been working on. I came across the definition of an OLAP DB and OLAP Cube, and I can't help but see the similarities to Zarr.
https://en.wikipedia.org/wiki/OLAP_cube
Consider the operations section of that Wikipedia page, which lists slice, dice, drill down/up, roll-up, and pivot.
This makes me wonder if Zarr could be (mis)used as a traditional DB, say, to handle analytics and business use cases. Furthermore, maybe the literature around OLAP DBs could inspire improvements to Zarr as a format.
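As a hedged illustration of the similarity, the OLAP cube operations named in that article (slice, dice, roll-up, pivot) map onto ordinary array indexing and reductions. The sketch below uses plain nested lists. The "sales" cube, its axis labels, and its values are all made up for the example.

```python
# Hypothetical 3-D "sales" cube with axes (product, region, quarter).
products = ["apples", "pears"]
regions = ["north", "south"]
quarters = ["Q1", "Q2", "Q3"]

# Dense cube as nested lists: cube[p][r][q]. The values are made up.
cube = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]

n_products, n_regions = len(products), len(regions)

# Slice: fix one dimension to a single value (quarter == "Q1"),
# leaving a 2-D product x region array.
q1 = quarters.index("Q1")
slice_q1 = [[cube[p][r][q1] for r in range(n_regions)]
            for p in range(n_products)]

# Dice: restrict several dimensions to sub-ranges (north only, Q2..Q3).
dice = [[cube[p][r][1:3] for r in range(0, 1)]
        for p in range(n_products)]

# Roll-up: aggregate a dimension away (sum over quarters).
rollup = [[sum(cube[p][r]) for r in range(n_regions)]
          for p in range(n_products)]

# Pivot: reorder the axes (swap product and region).
pivot = [[cube[p][r] for p in range(n_products)]
         for r in range(n_regions)]
```

Drill-down is the inverse of roll-up: going from `rollup` back to the per-quarter values in `cube`, which is only possible because the finer-grained data is still stored.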
xref: alxmrs/xarray-sql#47