Specify the columns to read from file #154
Coming soon; stay tuned.

R's readr uses …
On master, using CSV.File, you can now get this by iterating a File object, like:

```julia
f = CSV.File(file)
for row in f
    println("a=$(row.a), b=$(row.b)")
end
```

This will only parse the …
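The lazy, row-at-a-time idea above can be illustrated in Python with the standard-library csv module (the data here is invented for the sketch; note that, unlike CSV.File, Python's reader still splits every field and merely never uses the unwanted ones):

```python
import csv
import io

# Stand-in for a real file with more columns than we care about.
data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

# DictReader yields one row at a time; downstream code touches only
# the "a" and "b" fields, so column "c" is simply ignored.
rows = [(row["a"], row["b"]) for row in csv.DictReader(data)]
for a, b in rows:
    print(f"a={a}, b={b}")
```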
Would it be possible to use that through a keyword argument in …
I agree, it sounds like a legitimate request to have a convenient way of getting a DataFrame (or another structure) with only the specified columns.
Nah, I don't think putting this in the function is the right abstraction. For example, I have a lazy Select transformation object that operates on any Tables.jl table and itself implements the interface. So, a user would be able to do:

```julia
df = CSV.File("cool_file.csv") |> select(:a, :b) |> DataFrame
```

to select just the … I'll leave this issue open until we decide on what to do w/ things like …
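The select-as-a-standalone-transformation design can be sketched outside Julia too. Here is a minimal Python analogue (select here is a hypothetical helper written for this sketch, not the TableOperations.jl API) that works on any iterable of row dicts, no matter where the rows came from:

```python
import csv
import io

def select(*cols):
    """Lazy column selection: returns a transformation that can be
    applied to any iterable of dict rows, not just CSV readers."""
    def transform(rows):
        for row in rows:
            yield {c: row[c] for c in cols}
    return transform

data = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

# Compose the reader with the selection, then materialize.
result = list(select("a", "b")(csv.DictReader(data)))
print(result)  # [{'a': '1', 'b': '2'}, {'a': '4', 'b': '5'}]
```

Because the transformation is independent of the reader, any row source composes with it, which is exactly the decoupling argued for above.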
Makes sense. Though if we keep the …

I like the lazy pipeline, but agree on the convenience if …

Definitely the pipeline is appealing (using Tables and the related querying framework), but I agree that it's good to have a keyword argument as a shorthand. I wonder if it can be somehow combined with …

The problem here is that trying to support …

When you say "internally", do you mean inside …

Yeah, but then we get this weird inversion of dependencies: we'd have to have CSV.jl depend on some "TableOperations.jl" package that provided …

I tend to agree with Jacob here. My philosophy is that composable tools are beneficial for both library writers and users, and this is a great example of that: .csv files are loaded lazily and the columns can be picked later.

Yes, having CSV.jl depend on another data package isn't ideal. Let's see how the ecosystem evolves first, then.
I just realized that the example I posted in the original issue description still is not working. If I try to do something like

```julia
f = CSV.File(file)
for row in f
    println("a=$(row.a), b=$(row.b)")
end
```

with a file … I get

```julia
f = CSV.File("test1.csv")
ERROR: BoundsError: attempt to access ""
  at index [1]
Stacktrace:
 [1] checkbounds at .\strings\basic.jl:193 [inlined]
 [2] codeunit at .\strings\string.jl:87 [inlined]
 [3] getindex at .\strings\string.jl:206 [inlined]
 [4] normalizename(::String) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\filedetection.jl:11
 [5] (::getfield(CSV, Symbol("##21#23")){Bool})(::Tuple{Int64,String}) at .\none:0
 [6] iterate at .\generator.jl:47 [inlined]
 [7] collect_to!(::Array{Symbol,1}, ::Base.Generator{Base.Iterators.Enumerate{Array{Union{Missing, String},1}},getfield(CSV, Symbol("##21#23")){Bool}}, ::Int64, ::Tuple{Int64,Int64}) at .\array.jl:656
 [8] collect_to_with_first! at .\array.jl:643 [inlined]
 [9] collect(::Base.Generator{Base.Iterators.Enumerate{Array{Union{Missing, String},1}},getfield(CSV, Symbol("##21#23")){Bool}}) at .\array.jl:624
 [10] _totuple at .\tuple.jl:261 [inlined]
 [11] Type at .\tuple.jl:243 [inlined]
 [12] datalayout(::Int64, ::Parsers.Delimited{false,Parsers.Quoted{Parsers.Strip{Parsers.Sentinel{typeof(Parsers.defaultparser),Parsers.Trie{0x00,false,missing,2,Tuple{}}}}},Parsers.Trie{0x00,false,missing,8,Tuple{Parsers.Trie{0x2c,true,missing,8,Tuple{}},Parsers.Trie{0x0a,true,missing,8,Tuple{}},Parsers.Trie{0x0d,true,missing,8,Tuple{Parsers.Trie{0x0a,true,missing,8,Tuple{}}}}}}}, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int64, ::Bool) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\filedetection.jl:123
 [13] #File#1(::Int64, ::Bool, ::Int64, ::Nothing, ::Int64, ::Nothing, ::Bool, ::Nothing, ::Bool, ::Array{String,1}, ::String, ::String, ::Bool, ::Char, ::Nothing, ::Nothing, ::Char, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Dict{Type,Type}, ::Symbol, ::Bool, ::Bool, ::Bool, ::Type, ::String) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\CSV.jl:288
 [14] CSV.File(::String) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\CSV.jl:263
 [15] top-level scope at none:0
```

So, selecting valid columns is only possible if all columns are valid?
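The trace bottoms out in normalizename choking on an empty header cell. One defensive strategy (a sketch of the general idea only, not CSV.jl's actual implementation) is to substitute positional placeholder names for blank headers before any per-name processing runs:

```python
def normalize_names(headers):
    """Replace blank or missing header cells with positional
    placeholder names so later per-name processing never sees ""."""
    return [h.strip() if h and h.strip() else f"Column{i + 1}"
            for i, h in enumerate(headers)]

print(normalize_names(["a", "b", "", ""]))
# ['a', 'b', 'Column3', 'Column4']
```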
…a manually iterating CSV.File or using Tables.select, but there was another issue where this file had an invalid column name that died while trying to normalize the name.
With CSV.jl current release, I can do the following:

```julia
julia> csv = """a,b,,
       0, 1, , comment
       12, 5, ,
       """
"a,b,,\n0, 1, , comment\n12, 5, ,\n"

julia> df = CSV.File(IOBuffer(csv)) |> Tables.select(:a, :b) |> DataFrame
2×2 DataFrame
│ Row │ a      │ b      │
│     │ Int64⍰ │ Int64⍰ │
├─────┼────────┼────────┤
│ 1   │ 0      │ 1      │
│ 2   │ 12     │ 5      │
```

Note that in my example, the header row looks like …
Great, thanks a lot @quinnj!
Then, can we use a "select=" keyword?
For anyone else here from the future:

```julia
df = CSV.File(filename) |> TableOperations.select(:a, :b) |> DataFrame
```

(Sorry for the noise, but I came here from google and I suspect others will too.)
I often have to deal with CSV files that were edited by someone in Excel, resulting in something like test.csv: …

When I try to read such a file (`CSV.read("test.csv")`) I get …

It would be nice to be able to specify the columns to read with a keyword like Pandas' `usecols`, to be able to easily avoid such issues.
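For reference, the pandas behavior being asked for looks like this (a minimal sketch; the file contents are invented to mimic the Excel-damaged case described above):

```python
import io

import pandas as pd

# Mimics a file where Excel left empty trailing columns behind.
csv_text = "a,b,,\n0,1,,comment\n12,5,,\n"

# usecols restricts the parse to the named columns, so the junk
# columns never reach the resulting frame.
df = pd.read_csv(io.StringIO(csv_text), usecols=["a", "b"])
print(df.shape)  # (2, 2)
```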