Add a property to skip the first N rows #4
Comments
This functionality cannot be implemented at the SerDe level, because a SerDe controls only the row format. The deserialize() method has no information about the current position in the file, so it cannot tell which rows to skip. To implement a linesToSkip feature, we would need a custom input format (perhaps extending TextInputFormat: http://hadoop.apache.org/docs/r1.2.1/api/index.html).
What about defining the CSV headers in the table's serdeproperties? The SerDe could then skip the line that matches the header spec:
As of now I need a step that removes the header from each CSV file, which is quite cumbersome: without this feature, one table and a SerDe are not enough to parse a CSV file. EDIT: I forgot that a SerDe always needs to output one row (unless it throws an exception, but that is rather ugly).
The problem is that it would do a string comparison for every row in the file, which is a performance killer. The best approach would be to extend RecordReader and skip the desired lines in the initialize() method, after calling the parent's method: if split.getStart() == 0, we read N dummy lines. We then extend TextInputFormat to use this new class. Then we could use it like this:
A good starting point: Not sure this SerDe package would be the best place to put this code.
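The skip-on-first-split idea described above can be sketched without any Hadoop dependency. This is only an illustration of the logic, not the real RecordReader API: the class name HeaderSkippingLineReader, the splitStart and linesToSkip parameters, and the readAll helper are all hypothetical stand-ins. In a real implementation the same check would sit in an initialize() override, using split.getStart() as the commenter suggests.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hadoop-free stand-in (hypothetical) for a RecordReader that drops header lines. */
final class HeaderSkippingLineReader {
    private final BufferedReader in;

    /**
     * @param splitStart  byte offset of this split in the file (assumed to mirror split.getStart())
     * @param linesToSkip number of leading lines to discard (hypothetical property)
     */
    HeaderSkippingLineReader(BufferedReader in, long splitStart, int linesToSkip) throws IOException {
        this.in = in;
        // Only the split that begins at offset 0 contains the header rows,
        // so this check runs once per split instead of once per row.
        if (splitStart == 0) {
            for (int i = 0; i < linesToSkip; i++) {
                if (in.readLine() == null) {
                    break; // file is shorter than the number of lines to skip
                }
            }
        }
    }

    /** Returns the next data line, or null at end of input. */
    String nextLine() throws IOException {
        return in.readLine();
    }

    /** Convenience helper for the demo: reads every remaining line of a "split". */
    static List<String> readAll(String text, long splitStart, int linesToSkip) {
        try {
            HeaderSkippingLineReader reader = new HeaderSkippingLineReader(
                    new BufferedReader(new StringReader(text)), splitStart, linesToSkip);
            List<String> lines = new ArrayList<>();
            for (String line = reader.nextLine(); line != null; line = reader.nextLine()) {
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // First split (offset 0): the title row is dropped.
        System.out.println(readAll("name,message\nalice,hi\nbob,yo\n", 0L, 1));
        // A later split (non-zero offset): nothing is skipped.
        System.out.println(readAll("carol,hey\ndave,sup\n", 4096L, 1));
    }
}
```

Because the offset check happens once when the reader is set up, the cost is independent of the number of rows, which is the advantage over comparing every row against a header spec.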
This is solved since Hive 0.13.0:
create external table testtable
( name string
, message string
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties ( 'separatorChar' = ','
, 'quoteChar' = '"'
, 'escapeChar' = '\\' )
stored as textfile
location '/path/to/testtable'
tblproperties ( 'skip.header.line.count' = '1'
, 'skip.footer.line.count' = '2' );
Many CSV files have a "title row" e.g. :
It would be nice to be able to set a "linesToSkip" property to skip the first N rows:
(I am so not set up to do any Java coding -- If I were, I'd do it myself and submit a patch.)