Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of an explicit header #316

Closed
bkamins opened this issue Oct 3, 2018 · 5 comments
Closed

Handling of an explicit header #316

bkamins opened this issue Oct 3, 2018 · 5 comments

Comments

@bkamins
Copy link
Member

bkamins commented Oct 3, 2018

Those two behaviors seem inconsistent (in one case we are not strict, while we are in the other):

julia> str1 = """a,b,c
       1,2,3,4
       5,6,7,8"""
"a,b,c\n1,2,3,4\n5,6,7,8"

julia> str2 = """1,2,3,4
       5,6,7,8"""
"1,2,3,4\n5,6,7,8"

julia> CSV.read(IOBuffer(str1))
2×3 DataFrames.DataFrame
│ Row │ a      │ b      │ c      │
│     │ Int64⍰ │ Int64⍰ │ Int64⍰ │
├─────┼────────┼────────┼────────┤
│ 1   │ 1      │ 2      │ 3      │
│ 2   │ 5      │ 6      │ 7      │

julia> CSV.read(IOBuffer(str2), header=["a","b","c"])
ERROR: ArgumentError: The length of provided header (3) doesn't match the number of columns at row 1 (4)
@quinnj
Copy link
Member

quinnj commented Oct 3, 2018

Kind of...........in the first case, it detected 3 columns, but the file has inconsistent rows that include additional columns. In the second case, we detect 4 columns and you only provide 3. The latter is a clear error; the former feels like ok behavior too since we generally try to not error on invalid rows.

@bkamins
Copy link
Member Author

bkamins commented Oct 3, 2018

But passing explicit header is kind of "I pass an additional row of data that is missing in the file" so detection of number of columns should include it.

The reason for this issue is that this is a way to force dropping columns (i.e. keep only first few columns) - and this is something that recently was requested.

In general we could provide a separate kwarg that could be used to specify what columns to keep (not sure if it has been discussed earlier).

@quinnj
Copy link
Member

quinnj commented Oct 3, 2018

Yes, see the discussion starting here: #154 (comment)

@bkamins
Copy link
Member Author

bkamins commented Oct 3, 2018

Nice. I guess then that you can close this PR if you feel that your interpretation of detection of number of columns is more appropriate in CSV.jl context.

@quinnj
Copy link
Member

quinnj commented Oct 3, 2018

Well, for the use-case of "select"ing columns, I do feel that having something external to CSV.File is the way to go: it's composable and there's no performance difference. Trying to do a select=... keyword argument would just be re-implementing the external functionality internally, which would be a shame for all those other table types out there. I'll try to clean up my select implementation this week. Maybe we can just have it live in CSV.jl for now until someone wants to start a proper TableOperations.jl package or something (we already have transform in CSV.jl anyway).

@quinnj quinnj closed this as completed Oct 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants