CSV-row filtering when reading #503

Closed
bkamins opened this issue Sep 27, 2019 · 8 comments

Comments

@bkamins
Member

bkamins commented Sep 27, 2019

It would be good if CSV.File accepted a predicate argument that takes the row number and the row itself, and filters the rows being read based on that predicate.

Simple use cases:

  • store only even rows
  • store only a random sample of 5% of the ingested rows
  • store only rows that have a certain value in a certain column (e.g. in ML: select only the rows that should go to the training data set)

Why it is needed: if CSV.jl is optimized to be less memory-hungry, then such filtering would allow ingesting very large files by reading only a fraction of their contents.
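
A hypothetical sketch of what such a keyword might look like for the use cases above (the keyword name filter, its signature, and the column name split are all made up for illustration, not an existing or promised CSV.jl API):

using CSV

# Keep only the even rows:
CSV.File("data.csv"; filter = (i, row) -> iseven(i))

# Keep a ~5% random sample of the ingested rows:
CSV.File("data.csv"; filter = (i, row) -> rand() < 0.05)

# Keep only rows with a certain value in a certain column:
CSV.File("data.csv"; filter = (i, row) -> row.split == "train")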

As an additional option, one can imagine that this would allow piping a CSV file in and then writing it back out as a filtered CSV (or JSON) without materializing the whole data set in memory (this is a relevant and frequent use case; note that in general this cannot be done by typical UNIX line-processing utilities, as a single record of a CSV file can span multiple lines).

@quinnj - do you think it would be doable?

@bkamins
Member Author

bkamins commented Sep 27, 2019

As an additional comment: maybe a better place for this to live is Tables.jl, but I did not find a function there that takes a table and returns a lazy filtered wrapper of that table.

@davidanthoff

I think that should all work with Query.jl already? If you use the row iteration of CSV.jl and pipe that into a @filter from Query.jl, that should give you a completely lazy streaming story. If you then pipe the result of the @filter clause into CSV.save (or CSVFiles.save), then I think you have something that is entirely streaming without ever materializing the entire file in memory.
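
For reference, a minimal sketch of that pipeline with the Queryverse packages (assuming CSVFiles.jl and Query.jl are installed; the column name Age and the threshold are made up):

using CSVFiles, Query

# Load lazily, filter lazily, and write the result back out without
# materializing the whole table in memory.
load("aaa.csv") |> @filter(_.Age > 30) |> save("filtered.csv")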

@bkamins
Member Author

bkamins commented Sep 27, 2019

Unfortunately, the approach you recommend does not work on my machine. If aaa.txt is a really large file, I get:

julia> f = CSV.File("aaa.txt", use_mmap=false)
ERROR: SystemError: mmap: The operation completed successfully.

julia> f = CSV.File("aaa.txt", use_mmap=true)
ERROR: SystemError: mmap: The operation completed successfully.

@davidanthoff

I think you would have to use CSV.Rows.
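
A minimal sketch of that suggestion: CSV.Rows iterates the file one row at a time instead of materializing whole columns, so a filter can be applied manually while iterating (the even-row predicate is just the example from the original post):

using CSV

for (i, row) in enumerate(CSV.Rows("aaa.txt"))
    iseven(i) || continue      # keep only the even rows
    # process `row` here, e.g. write it out or push selected fields into a vector
end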

@bkamins
Member Author

bkamins commented Sep 28, 2019

This partially solves the last use case in the simplest scenarios. Thank you.

To make the core of my question explicit, I would like:

CSV.read("aaa.txt", limit=10)

to work with the type inference that CSV.read provides, even if aaa.txt is very large (and ideally limit could also accept a predicate as proposed above, not only an integer).
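
A workaround sketch for this today, assuming CSV.Rows is acceptable: stream only the first rows instead of reading the whole file. Note that CSV.Rows does not perform the full type inference CSV.read does; values come back as strings unless types is supplied.

using CSV

# Materialize only the first 10 rows of a potentially huge file:
first10 = collect(Iterators.take(CSV.Rows("aaa.txt"), 10))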

@anandijain

Could the same logic be applied to the columns?

I think it would be nice to be able to specify particular columns and rows to read in.

See #154
I like the piping functionality, but using read would be much more intuitive, imo.

df = CSV.read("df.csv", use_headers=[:A, :B])

Would this make CSV.jl lighter on memory allocations in situations where we only want a small percentage of the entire file?

I also think it would be cool to have the simple use cases in the original post, as well as the ability to specify a list of data types that we want to read in, ignoring columns without those data types.

@quinnj
Member

quinnj commented Oct 5, 2019

The biggest question, if we were to support row filtering in CSV.File directly, is what exactly the filter function operates on: only the string values of the cells in a row, or the parsed values? It seems natural to allow the filter to operate on the parsed values, but then there's a wrinkle: the filter function will receive a row of parsed values with types as they've been inferred so far in the file. I.e. you could be in a situation where the filter function receives Column1 as Int64, but later on we encounter a Float64 value and the column is promoted to Float64, so from then on the filter function will get a Float64. It's not the end of the world, and I imagine users can account for that in their function; it's just a wrinkle.
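
To illustrate the wrinkle with the hypothetical filter keyword from above (not an existing CSV.jl option), a predicate written generically keeps working across the promotion, while one that assumes a concrete type does not:

# Works whether Column1 has been inferred as Int64 or later promoted to Float64:
CSV.File("data.csv"; filter = (i, row) -> row.Column1 > 0)

# Silently stops matching once the column is promoted, because 1.0 !== 1:
CSV.File("data.csv"; filter = (i, row) -> row.Column1 === 1)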

I don't think this would be too hard to support, but I do have to say that I don't think it will really help the memory footprint. We preallocate based on how many rows we expect to find in the file, so filtering while parsing just uses up less of the preallocated memory. We can obviously resize! after we're done, but as far as I know, that rarely actually frees up the memory that has been resized away.

I've been slogging through a fairly gnarly internals rewrite for the past two weeks to help the memory footprint situation, so maybe with those changes, we won't see as many problems. I'm hoping to get a PR up in the next few days.

@bkamins
Member Author

bkamins commented Oct 6, 2019

Excellent. Thank you for all the work.

If filtering will not help reduce the memory footprint, then this can be closed (@quinnj, please reopen if you feel it is better to keep tracking this).

@bkamins bkamins closed this as completed Oct 6, 2019