Is Zarr an OLAP Database? #290

alxmrs · 2024-03-18T07:55:17Z

I've been doing some background research for a project I've been working on. I came across the definition of an OLAP DB and OLAP Cube, and I can't help but see the similarities to Zarr.

https://en.wikipedia.org/wiki/OLAP_cube

Consider the operations section of this Wikipedia page:

slicing & dicing is just like array indexing and slicing.
roll-ups seem like the overviews Zarr extension considered in GeoZarr and other bioinformatics projects.
drill down/up doesn't seem to have a mapping to Zarr -- maybe xarray-datatree is most similar?
pivots are either queries on Zarr (say, with Xarray) or else rechunk operations to support faster queries.
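As a rough sketch of the mapping above, here is how the cube operations might look on a plain in-memory array. This uses NumPy rather than Zarr itself, and the cube shape and axis meanings (time, region, product) are made up for illustration. Treating a pivot as an axis reordering is one possible reading, not the only one.

```python
import numpy as np

# Hypothetical cube: 4 time steps x 3 regions x 2 products.
cube = np.arange(24).reshape(4, 3, 2)

# Slice: fix one dimension to a single value (time step 0).
slice_t0 = cube[0]                 # shape (3, 2)

# Dice: pick sub-ranges along several dimensions at once.
dice = cube[1:3, 0:2, :]           # shape (2, 2, 2)

# Roll-up: aggregate one dimension away (total over time).
rollup = cube.sum(axis=0)          # shape (3, 2)

# Pivot (one interpretation): reorder the axes to view the
# same data from a different orientation.
pivot = cube.transpose(2, 1, 0)    # shape (2, 3, 4)
```

On a chunked store, the pivot would more likely correspond to a rechunk than to a free view, which is the trade-off the last item hints at.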

This makes me wonder if Zarr could be (mis)used as a traditional DB, say, to handle analytics and business use cases. Furthermore, maybe the literature around OLAP DBs could inspire improvements to Zarr as a format.

xref: alxmrs/xarray-sql#47

d-v-b · 2024-03-18T09:46:31Z

I'm not too familiar with the OLAP conceptualization, but I tend to think of the N-dimensional array API as a special case of a table, where 1 column contains numeric values, and another column contains N-tuples (the indices of the data), which are used to index the values. Array indexing can then be expressed as querying the values based on the index, and many other array operations can be expressed as transformations of the results of such queries. If we want to extend the analogy beyond the API, most N-dimensional array libraries store array data in contiguous buffers, which would make the array a "column-oriented table".

With this in mind, then Zarr can be described as a "columnar database" with a few performance / storage optimizations for large columns, but it's not a database that competes with something like DuckDB, since Zarr doesn't have good support for variable-length types.

All that being said, if the world of databases is a superset of the world of N-dimensional arrays, then it's almost certainly the case that we can use tools from database theory / software to advance Zarr.
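A minimal sketch of the "array as a table" view described above, using only the Python standard library. The table name `arr`, the column names, and the sample values are hypothetical. As a simplification, the index tuple is spread over one column per dimension (`i`, `j`) instead of being stored as a single tuple-valued column.

```python
import itertools
import sqlite3

# Hypothetical 2 x 3 array, stored as nested lists.
array = [[10, 11, 12],
         [20, 21, 22]]

# Flatten the array into rows of (i, j, value).
rows = [(i, j, array[i][j])
        for i, j in itertools.product(range(2), range(3))]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE arr (i INTEGER, j INTEGER, value INTEGER)")
conn.executemany("INSERT INTO arr VALUES (?, ?, ?)", rows)

# Array indexing expressed as a query on the index columns.
# This is the table equivalent of array[1][0:2].
query = """
    SELECT value FROM arr
    WHERE i = 1 AND j >= 0 AND j < 2
    ORDER BY i, j
"""
result = [value for (value,) in conn.execute(query)]
```

Here `result` holds the same values as `array[1][0:2]`. The explicit `ORDER BY` matters because a table has no inherent row order, while an array does.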

alxmrs · 2024-03-18T11:11:11Z

IIUC, OLAP is a way to represent ND arrays in an RDBMS s.t. a range of common analytic queries are performant. These center around the snowflake or star DB schema pattern. But, I'm a DB newcomer, so take this explanation with a grain of salt.

I tend to think of the N-dimensional array API as a special case of a table, where 1 column contains numeric values, and another column contains N-tuples (the indices of the data), which are used to index the values. Array indexing can then be expressed as querying the values based on the index, and many other array operations can be expressed as transformations of the results of such queries. If we want to extend the analogy beyond the API, most N-dimensional array libraries store array data in contiguous buffers, which would make the array a "column-oriented table".

Funny you mention this! I'm exploring something similar here (in a read-only capacity): https://github.com/alxmrs/xarray-sql

since Zarr doesn't have good support for variable-length types.

Can you explain this a bit more? Would this be like string or varchar support?

it's almost certainly the case that we can use tools from database theory / software to advance Zarr.

Top of mind for me here is streaming reads and writes. I think some sort of rosetta stone with features in the Postgres ecosystem would really highlight potential new areas of development.


d-v-b · 2024-03-18T12:04:16Z

since Zarr doesn't have good support for variable-length types.

Can you explain this a bit more? Would this be like string or varchar support?

Zarr is designed for fixed-size numeric types, e.g. uint8; there's some effort towards supporting variable-length strings in Zarr, but it's not a common use case. Compared to the types supported by a typical relational database, the numeric types Zarr focuses on are a tiny subset -- e.g., look at the types supported by Postgres.
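To illustrate why fixed-size types are the easy case, here is a generic sketch using only the standard library. It is not Zarr's actual on-disk layout or its string encoding, just the general idea: with a fixed item size, the byte position of element `i` is simple arithmetic, while variable-length values need a separate offsets table. The sample values and helper names are hypothetical.

```python
import struct

# Fixed-size case: little-endian uint16 values, 2 bytes each.
values = [7, 300, 65535]
itemsize = struct.calcsize("<H")
buf = struct.pack("<3H", *values)

def read_fixed(buf, i):
    # The byte offset of element i is just i * itemsize.
    (value,) = struct.unpack_from("<H", buf, i * itemsize)
    return value

# Variable-length case: UTF-8 strings of different byte lengths.
strings = ["a", "zarr", "olap cube"]
encoded = [s.encode("utf-8") for s in strings]
data = b"".join(encoded)

# Element positions cannot be computed from i alone, so we keep
# an offsets table with one more entry than there are strings.
offsets = [0]
for chunk in encoded:
    offsets.append(offsets[-1] + len(chunk))

def read_variable(data, offsets, i):
    # Look up the start and end byte positions for element i.
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")
```

For example, `read_fixed(buf, 1)` needs no lookup at all, whereas `read_variable(data, offsets, 2)` has to consult `offsets` first.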
