-
FWIW, I'm still interested in this. If it would be helpful to move the discussion along, I'd be happy to prototype something in Python, Rust, or R.
-
Thanks @psadil: I'd be interested in seeing a prototype that we can benchmark against the existing implementation. There will probably be a tradeoff between the performance boost you might see here and the added complexity and dependencies, so it would be good to quantify and assess that.
-
Sorry that this has taken so long. I put together a demo here: https://github.com/psadil/trx-parquet. The repo contains a notebook, main.ipynb, that walks through the prototype: it describes the prototype format and presents timings for reading and writing large Parquet files. The class definition shows what working with the format in Python could look like. In the prototype, the data have been consolidated into a single file, and an effort was made to avoid creating temporary files when converting between formats (e.g., Parquet includes compression, so columns with compressed data can be worked with directly). No effort was made to optimize the datatypes or the on-disk representation (e.g., the notebook shows the positions stored in more than one datatype).
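To make the core idea concrete, here is a minimal sketch of storing flattened streamline points in a single compressed Parquet file with pyarrow. The column names and grouping scheme are illustrative assumptions, not the trx-parquet schema itself:

```python
# Illustrative sketch only: a tractogram as one compressed Parquet file,
# with flattened point coordinates plus a streamline id for grouping.
# Column names and dtypes are assumptions, not the trx-parquet schema.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n_points = 10
table = pa.table({
    "streamline_id": np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2], dtype=np.uint32),
    "x": np.random.rand(n_points).astype(np.float32),
    "y": np.random.rand(n_points).astype(np.float32),
    "z": np.random.rand(n_points).astype(np.float32),
})

# Compression is built into the format, so no temporary uncompressed
# files are needed when writing.
pq.write_table(table, "tractogram.parquet", compression="zstd")

# Column projection: read only x without touching y or z.
xs = pq.read_table("tractogram.parquet", columns=["x"])
```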
-
Next, I can work on repeating the timings mentioned in the 2022 abstract. Is the code for those benchmarks available?
-
Came back to this discussion as I just read this article, which shows some of the benefits of Apache Arrow. We also experienced some of this in our work on dipy/dipy#2593. @frheault: is there any chance you could share the benchmark code mentioned above (from the 2022 OHBM abstract) in a GitHub repository, where we could experiment with some of these formats? Otherwise, I think a benchmarking repo with code and/or ideas that we could test across different implementations would be quite helpful.
-
Very interesting topic. I like this point a lot.
-
I had to fix a bug (see #81) and then re-generate some of the data here. Sorry for the time it took; I had a hard time finding the problem. The link provided contains all the files, even though they can also be regenerated.
-
Thanks for all the hard work on this format! This is a follow-up to a question from the presentation given in the Open Science Room at OHBM (2023). If this is not the right place for the follow-up, please feel free to move it.
I’m speaking as someone who is mainly a user of tractography data, but also as someone with a decent amount of experience in the data science ecosystem. From that perspective, parts of the development of the TRX format look like they are duplicating work already done for modern tabular formats. I wonder about trying to leverage those tools, particularly Apache Arrow (and the corresponding file format, Apache Parquet) and DuckDB.
For example, the dps and dpp subdirectories look very similar to a standard Arrow table, with the files contained in the folders corresponding to columns. Arrow/Parquet has solid support for working with data on disk, allows control over the datatype of each column, supports metadata, and is implemented in several languages (e.g., C++, Rust, Java, Julia, Go). In addition, the format brings several niceties: packages exist in higher-level languages for performing analyses (py-polars, ibis, R, MATLAB, JavaScript), most of the high-level packages are built for efficient multi-threading, there is CUDA support, and the Arrow/Parquet format is actively used and tested by a wide community.
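As a rough illustration of that correspondence, a hypothetical helper could gather a dps/ directory (one per-streamline array file per scalar) into a single Arrow table, one column per file. The file layout and .npy extension here are assumptions made for the sake of the sketch:

```python
# Hypothetical sketch: map a TRX dps/ subdirectory onto one Arrow table,
# with each per-streamline data file becoming a named column.
from pathlib import Path
import numpy as np
import pyarrow as pa

def dps_to_arrow(dps_dir: Path) -> pa.Table:
    # e.g. dps/commit_weights.npy -> column "commit_weights"
    columns = {f.stem: np.load(f) for f in sorted(dps_dir.glob("*.npy"))}
    return pa.table(columns)

# table = dps_to_arrow(Path("example.trx/dps"))
# Arrow preserves each column's dtype and can carry key/value metadata
# via table.replace_schema_metadata({...}).
```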
Using something like Arrow adds a dependency, which is a cost. However, I wonder if this may be worthwhile, especially if it could make it simpler to develop and maintain lower-level implementations of the TRX format, like the C++ one.
There are different ways this use of a modern tabular format could look. One approach could replace the TRX subfolders with individual Parquet files. Another might link information across the metadata and subfolders through a DuckDB database (a rough sketch follows below). In any case, I wonder what the initial impressions of this suggestion are, and what kinds of demonstrations or evidence the community might find informative.
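For instance, assuming the per-streamline scalars were stored as a Parquet file (the file and column names below are invented for this demo), DuckDB could filter them in place without loading the whole tractogram into memory:

```python
# Hypothetical sketch: DuckDB scanning a Parquet file of per-streamline
# scalars in place. 'dps.parquet' and its columns are made up for the demo.
import duckdb

con = duckdb.connect()
result = con.execute(
    """
    SELECT streamline_id, AVG(fa) AS mean_fa
    FROM 'dps.parquet'
    GROUP BY streamline_id
    HAVING AVG(fa) > 0.3
    """
).fetch_arrow_table()  # results come back as an Arrow table
```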
Thanks!