CSV-row filtering when reading #503

Closed
bkamins opened this issue Sep 27, 2019 · 8 comments

Comments

@bkamins
Member

bkamins commented Sep 27, 2019

It would be good if CSV.File accepted a predicate argument that takes the row number and the row itself, and filters the rows being read based on that predicate.

Simple use cases:

  • store only even rows
  • store only a random sample of 5% of the ingested rows
  • store only rows that have a certain value in a certain column (e.g. in ML: select only the rows that should go to the training data set)

Why it is needed: if CSV.jl is optimized to be less memory-hungry, then such filtering would allow ingesting very large files by reading only a fraction of their contents.
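
A hypothetical sketch of what such a keyword might look like for the use cases above (the keyword name filter, its signature, and the column name split are all made up for illustration, not an existing or promised CSV.jl API):

using CSV

# Keep only the even rows:
CSV.File("data.csv"; filter = (i, row) -> iseven(i))

# Keep a ~5% random sample of the ingested rows:
CSV.File("data.csv"; filter = (i, row) -> rand() < 0.05)

# Keep only rows with a certain value in a certain column:
CSV.File("data.csv"; filter = (i, row) -> row.split == "train")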

As an additional option, one can imagine that this would allow piping a CSV file in and then writing it back out as a filtered CSV (or JSON) without materializing the whole data set in memory (this is a relevant and frequent use case; note that in general this cannot be done by typical UNIX line-processing utilities, as a single record of a CSV file can span multiple lines).

@quinnj - do you think it would be doable?

@bkamins
Member Author

bkamins commented Sep 27, 2019

As an additional comment: maybe a better place for this to live is Tables.jl, but I did not find a function there that takes a table and returns a lazy filtered wrapper of that table.

@davidanthoff

I think that should all work with Query.jl already? If you use the row iteration of CSV.jl and pipe that into a @filter from Query.jl, that should give you a completely lazy streaming story. If you then pipe the result of the @filter clause into CSV.save (or CSVFiles.save), then I think you have something that is entirely streaming without ever materializing the entire file in memory.
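
For reference, a minimal sketch of that pipeline with the Queryverse packages (assuming CSVFiles.jl and Query.jl are installed; the column name Age and the threshold are made up):

using CSVFiles, Query

# Load lazily, filter lazily, and write the result back out without
# materializing the whole table in memory.
load("aaa.csv") |> @filter(_.Age > 30) |> save("filtered.csv")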

@bkamins
Member Author

bkamins commented Sep 27, 2019

Unfortunately, the approach you recommend does not work on my machine. If aaa.txt is a really large file, I get:

julia> f = CSV.File("aaa.txt", use_mmap=false)
ERROR: SystemError: mmap: The operation completed successfully.

julia> f = CSV.File("aaa.txt", use_mmap=true)
ERROR: SystemError: mmap: The operation completed successfully.

@davidanthoff

I think you would have to use CSV.Rows.
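
A minimal sketch of that suggestion: CSV.Rows iterates the file one row at a time instead of materializing whole columns, so a filter can be applied manually while iterating (the even-row predicate is just the example from the original post):

using CSV

for (i, row) in enumerate(CSV.Rows("aaa.txt"))
    iseven(i) || continue      # keep only the even rows
    # process `row` here, e.g. write it out or push selected fields into a vector
end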

@bkamins
Member Author

bkamins commented Sep 28, 2019

This partially solves the last use case in the simplest scenarios. Thank you.

To make the core of my question explicit, I would like:

CSV.read("aaa.txt", limit=10)

to work with the type inference that CSV.read provides, even if aaa.txt is very large (and ideally limit could also accept a predicate as proposed above, not only an integer).
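
A workaround sketch for this today, assuming CSV.Rows is acceptable: stream only the first rows instead of reading the whole file. Note that CSV.Rows does not perform the full type inference CSV.read does; values come back as strings unless types is supplied.

using CSV

# Materialize only the first 10 rows of a potentially huge file:
first10 = collect(Iterators.take(CSV.Rows("aaa.txt"), 10))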

@anandijain

Could the same logic be applied to the columns?

I think it would be nice to be able to specify particular columns and rows to read in.

See #154
I like the piping functionality, but using read would be much more intuitive, imo.

df = CSV.read("df.csv", use_headers=[:A, :B])

Would this make CSV.jl lighter on memory allocations in situations where we only want a small percentage of the entire file?

I also think it would be cool to have the simple use cases in the original post, as well as the ability to specify a list of data types that we want to read in, ignoring columns without those data types.

@quinnj
Member

quinnj commented Oct 5, 2019

The biggest question, if we were to support row filtering in CSV.File directly, is what exactly the filter function operates on: only the string values of the cells in a row, or the parsed values? It seems natural to allow the filter to operate on the parsed values, but then there's a wrinkle: the filter function will receive a row of parsed values with types as they've been inferred so far in the file. I.e. you could be in a situation where the filter function receives Column1 as Int64, but later on we encounter a Float64 value and the column is promoted to Float64, so from then on the filter function will get a Float64. It's not the end of the world, and I imagine users can account for that in their function; it's just a wrinkle.
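
To illustrate the wrinkle with the hypothetical filter keyword from above (not an existing CSV.jl option), a predicate written generically keeps working across the promotion, while one that assumes a concrete type does not:

# Works whether Column1 has been inferred as Int64 or later promoted to Float64:
CSV.File("data.csv"; filter = (i, row) -> row.Column1 > 0)

# Silently stops matching once the column is promoted, because 1.0 !== 1:
CSV.File("data.csv"; filter = (i, row) -> row.Column1 === 1)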

I don't think this would be too hard to support, but I do have to say that I don't think it will really help the memory footprint. We preallocate based on how many rows we expect to find in the file, so filtering while parsing just uses up less of the preallocated memory. We can obviously resize! after we're done, but as far as I know, that rarely actually frees up the memory that has been resized away.

I've been slogging through a fairly gnarly internals rewrite for the past two weeks to help the memory footprint situation, so maybe with those changes, we won't see as many problems. I'm hoping to get a PR up in the next few days.

@bkamins
Member Author

bkamins commented Oct 6, 2019

Excellent. Thank you for all the work.

If filtering will not help reduce the memory footprint, then this can be closed (@quinnj, please reopen if you feel it is better to keep tracking this).

@bkamins bkamins closed this as completed Oct 6, 2019