CSV-row filtering when reading #503
Comments
As an additional comment: maybe a better place for this to live is Tables.jl, but I did not find a function there that takes a table and returns a lazy, filtered wrapper around it.
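For what it's worth, a minimal sketch of such a lazy wrapper built only from generic machinery (the helper name `lazy_filter_rows` is hypothetical, not part of Tables.jl, and the result is just a lazily filtered row iterator, not a full Tables.jl table):

```julia
using Tables

# Hypothetical helper (not part of Tables.jl): lazily filter the rows of any
# Tables.jl-compatible source; nothing is materialized until the result is iterated.
lazy_filter_rows(pred, table) = Iterators.filter(pred, Tables.rows(table))

# Illustrative usage: collect only the matching rows from a CSV.File
# matching = collect(lazy_filter_rows(r -> r.a > 0, CSV.File("data.csv")))
```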
I think that should all work with Query.jl already? If you use the row iteration of CSV.jl, you can pipe that into a Query.jl filter.
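For reference, a minimal sketch of that suggested pipeline (the file name and column `a` are illustrative; whether it works may depend on package versions, and the next comment reports it failing):

```julia
using CSV, DataFrames, Query

# Iterate rows from the CSV file, filter them lazily with Query.jl's @filter,
# and only materialize the surviving rows into a DataFrame.
df = CSV.File("data.csv") |>
    @filter(_.a > 0) |>
    DataFrame
```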
The approach you recommend unfortunately does not work on my machine. If …
I think you would have to use …
This solves (partially) the last use case in the simplest scenarios, thank you. To make the core of my question explicit: I would like … to work with type inference.
Could the same logic be applied to the columns? I think it would be nice to be able to specify particular columns and rows to read in. See #154.
Would this make CSV.jl lighter on memory allocations in situations where we only want a small percentage of the entire file? I also think it would be cool to have the simple use cases from the original post, as well as the ability to specify a list of data types that we want to read in, ignoring columns of other types.
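For the column side, recent CSV.jl versions do expose keywords for this (a sketch, assuming a CSV.jl version with `select`/`drop` and `limit`; file and column names are illustrative). Note that selection is by name or index, not by detected type:

```julia
using CSV

# Read only the listed columns and at most the first 1_000 data rows.
f = CSV.File("data.csv"; select = [:id, :score], limit = 1_000)

# `select`/`drop` also accept a predicate of (column_index, column_name),
# which covers "keep only columns whose names match some rule".
g = CSV.File("data.csv"; drop = (i, name) -> startswith(String(name), "tmp_"))
```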
The biggest question, if we were to support row filtering in `CSV.File`, is …

I don't think this would be too hard to support, but I do have to say that I don't think it will really help the memory footprint. We preallocate based on how many rows we expect to find in the file, so filtering while parsing just uses up less of the pre-allocated memory. We can obviously …

I've been slogging through a fairly gnarly internals rewrite for the past two weeks to help the memory footprint situation, so maybe with those changes we won't see as many problems. I'm hoping to get a PR up in the next few days.
Excellent, thank you for all the work. If filtering will not help reduce the memory footprint, then this can be closed (@quinnj, please reopen if you feel it is better to keep tracking this).
It would be good if `CSV.File` accepted a predicate argument that takes the row number and the row itself, and allows the rows being read to be filtered based on that predicate.

Simple use cases: …

Why it is needed: if CSV.jl is optimized to be less memory-hungry, then such filtering would allow ingesting very large files by reading only a fraction of their contents.

As an additional option, one can imagine that this would allow piping a CSV file in and writing it back out as a filtered CSV (or JSON) without materializing the whole data set in memory (this is a relevant and frequent use case; note that in general it cannot be done with typical UNIX line-processing utilities, as a single CSV record can span multiple lines).
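To make the proposal concrete, a purely hypothetical sketch of what such a keyword might look like (the `filter` keyword and its signature are not part of CSV.jl; the file name, column `country`, and value "PL" are illustrative):

```julia
using CSV

# Hypothetical keyword (not implemented): `filter` receives the row number and
# the parsed row, and only rows for which it returns `true` are kept.
f = CSV.File("big.csv";
    filter = (i, row) -> i <= 1_000_000 && row.country == "PL")

# Hypothetical streaming variant for the CSV-in / filtered-CSV-out use case,
# never materializing the full table in memory:
CSV.write("filtered.csv",
    CSV.Rows("big.csv"; filter = (i, row) -> row.country == "PL"))
```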
@quinnj - do you think it would be doable?