
Add 10k-v2 and stock_simulated for testing and benchmarking Rust library #4

Closed · wants to merge 1 commit

Conversation

@alecmocatta

When the Rust parquet library was imported from http://github.com/sunchao/parquet-rs, the benchmarks and their required data files, 10k-v2.parquet and stock_simulated.parquet, weren't copied across.

This adds them so that updated benchmarks can be included in apache/arrow#3461.

@wesm (Member) commented Feb 1, 2019

These files are a bit big. Can you generate some data files using pyarrow or similar as part of the benchmarking process?

@alecmocatta (Author)

Good idea. Is it possible though that the generated files could change as pyarrow changes over time? I'd like to ensure the files are bit-for-bit identical over time, for consistent benchmark results.

@wesm (Member) commented Feb 1, 2019

You could pin the pyarrow version used to generate them.
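One way to pin the generator's environment is a requirements file for the data-generation step (the version numbers below are placeholders, not the versions actually in use at the time):

```
# requirements.txt for the benchmark data-generation step
# (versions are illustrative; pin whatever was validated)
pyarrow==0.12.0
numpy==1.16.1
```

With the versions fixed, regenerating the files in CI should yield consistent data for benchmarking.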

@alecmocatta (Author)

Thanks. I'm not familiar with how to do that but I'll take a look once the aforementioned PR is merged. I removed the benchmarks from it so this PR isn't needed immediately.

@wesm (Member) commented Feb 13, 2019

I have a lot on my plate. @andygrove or @sunchao, could you take ownership of this? I don't like the idea of checking 1 MB files into source control, particularly for benchmarking, so please don't do that without checking with me first.

@andygrove (Member) commented Feb 13, 2019 via email

I'm about to embark on some benchmarking with large files too so I face similar challenges.

Generating test data as part of the build seems like the way to go in order to catch performance regressions in the core libraries.

I am also maintaining my own repo where I have benchmarks against popular public data sets (and I'm just relying on wget or curl to download them).
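The wget/curl approach can be sketched as a small download-and-cache helper, so repeated benchmark runs don't re-fetch the data (the function name, URL, and path below are hypothetical, not part of any existing harness):

```python
# Sketch of a download-and-cache step for public benchmark datasets;
# fetch() is a hypothetical helper, not an existing API.
import os
import urllib.request

def fetch(url: str, dest: str) -> str:
    """Download url to dest unless a cached copy already exists."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest
```

This mirrors relying on wget or curl, but skips the network entirely when the file is already on disk.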
