-
FWIW, I'm still interested in this. If it would be helpful to move the discussion along, I'd be happy to prototype something in Python, Rust, or R.
-
Thanks @psadil: I'd be interested in seeing a prototype that we can benchmark against the existing implementation. There will probably be a tradeoff between the performance boost you might see here and the added complexity and dependencies, so it would be good to quantify and assess that.
-
Sorry that this has taken so long. I put together a demo here: https://github.com/psadil/trx-parquet. The repo contains a notebook, main.ipynb, that walks through the prototype: it describes the prototype format and presents timings for reading and writing large Parquet files. The class definition shows what working with the format in Python could look like. In the prototype, the data have been consolidated into a single file, and an effort was made to avoid creating temporary files when converting between formats (e.g., Parquet includes compression, so columns with compressed data can be worked with directly). No effort was made to optimize the datatypes or the on-disk representation (e.g., the notebook shows the positions stored in more than one datatype).
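To make the core idea concrete, here is a minimal sketch of storing flattened streamline points in a single compressed Parquet file with pyarrow. The column names and grouping scheme are illustrative assumptions, not the trx-parquet schema itself:

```python
# Illustrative sketch only: a tractogram as one compressed Parquet file,
# with flattened point coordinates plus a streamline id for grouping.
# Column names and dtypes are assumptions, not the trx-parquet schema.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n_points = 10
table = pa.table({
    "streamline_id": np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2], dtype=np.uint32),
    "x": np.random.rand(n_points).astype(np.float32),
    "y": np.random.rand(n_points).astype(np.float32),
    "z": np.random.rand(n_points).astype(np.float32),
})

# Compression is built into the format, so no temporary uncompressed
# files are needed when writing.
pq.write_table(table, "tractogram.parquet", compression="zstd")

# Column projection: read only x without touching y or z.
xs = pq.read_table("tractogram.parquet", columns=["x"])
```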
-
Next, I can work on repeating the timings mentioned in the 2022 abstract. Is the code for those benchmarks available?
-
Came back to this discussion as I just read this article, which shows some of the benefits of Apache Arrow. We also experienced some of this in our work on dipy/dipy#2593. @frheault: is there any chance you could share the benchmark code mentioned above (from the 2022 OHBM abstract) in a GitHub repository, where we could experiment with some of these formats? Otherwise, I think a benchmarking repo with code and/or ideas that we could test across different implementations would be quite helpful.
-
Very interesting topic. I like this point a lot.
-
I had to fix a bug (see #81) and then re-generate some of the data here. Sorry for the time it took; I had a hard time finding the problem. The link provided contains all the files, even though they can also be regenerated.
-
Thanks for all the hard work on this format! This is a follow-up to a question from the presentation given in the Open Science Room at OHBM (2023). If this is not the right place for the follow-up, please feel free to move it.
I’m speaking as someone who is mainly a user of tractography data, but also as someone with a decent amount of experience in the data science ecosystem. From that perspective, parts of the development of the TRX format look like they are duplicating work already done for modern tabular formats. I wonder about trying to leverage those tools, particularly Apache Arrow (and the corresponding file format, Apache Parquet) and DuckDB.
For example, the dps and dpp subdirectories look very similar to a standard Arrow table, with the files contained in the folders corresponding to columns. Arrow/Parquet has solid support for working with data on disk, allows control over the datatype of each column, supports metadata, and is implemented in several languages (e.g., C++, Rust, Java, Julia, Go). In addition, the format brings several niceties: packages exist in higher-level languages for performing analyses (py-polars, ibis, R, MATLAB, JavaScript), most of the high-level packages are built for efficient multi-threading, there is CUDA support, and the Arrow/Parquet format is actively used and tested by a wide community.
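As a rough illustration of that correspondence, a hypothetical helper could gather a dps/ directory (one per-streamline array file per scalar) into a single Arrow table, one column per file. The file layout and .npy extension here are assumptions made for the sake of the sketch:

```python
# Hypothetical sketch: map a TRX dps/ subdirectory onto one Arrow table,
# with each per-streamline data file becoming a named column.
from pathlib import Path
import numpy as np
import pyarrow as pa

def dps_to_arrow(dps_dir: Path) -> pa.Table:
    # e.g. dps/commit_weights.npy -> column "commit_weights"
    columns = {f.stem: np.load(f) for f in sorted(dps_dir.glob("*.npy"))}
    return pa.table(columns)

# table = dps_to_arrow(Path("example.trx/dps"))
# Arrow preserves each column's dtype and can carry key/value metadata
# via table.replace_schema_metadata({...}).
```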
Using something like Arrow adds a dependency, which is a cost. However, I wonder if this may be worthwhile, especially if it could make it simpler to develop and maintain lower-level implementations of the TRX format, like the C++ one.
There are different ways this use of a modern tabular format could look. One approach could replace the TRX subfolders with individual Parquet files. Another might link information across the metadata and subfolders through a DuckDB database (a rough sketch follows below). In any case, I wonder what the initial impressions of this suggestion are, and what kinds of demonstrations or evidence the community might find informative.
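For instance, assuming the per-streamline scalars were stored as a Parquet file (the file and column names below are invented for this demo), DuckDB could filter them in place without loading the whole tractogram into memory:

```python
# Hypothetical sketch: DuckDB scanning a Parquet file of per-streamline
# scalars in place. 'dps.parquet' and its columns are made up for the demo.
import duckdb

con = duckdb.connect()
result = con.execute(
    """
    SELECT streamline_id, AVG(fa) AS mean_fa
    FROM 'dps.parquet'
    GROUP BY streamline_id
    HAVING AVG(fa) > 0.3
    """
).fetch_arrow_table()  # results come back as an Arrow table
```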
Thanks!