Add a property to skip the first N rows #4

Open
ccady opened this issue Mar 27, 2012 · 4 comments


@ccady

ccady commented Mar 27, 2012

Many CSV files have a "title row", e.g.:

id,lastname,firstname
"12323","Washington","George"
"12343","Lincoln","Abraham"

It would be nice to be able to set a "linesToSkip" property to skip the first N rows:

row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\",
    "linesToSkip" = "1"
)

(I am so not set up to do any Java coding -- If I were, I'd do it myself and submit a patch.)

@luisfmrosa

This functionality cannot be implemented at the SerDe level, because the SerDe controls only the row format. The deserialize() method receives no information about the current position in the file, so it cannot tell which rows are the first N.

To implement linesToSkip, we would need a custom input format (perhaps extending TextInputFormat: http://hadoop.apache.org/docs/r1.2.1/api/index.html).

@RickardCardell

What about defining the CSV headers in the table's serdeproperties? The serde could then skip any line that matches the header spec:

create table 
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\",
    "header" = "id,lastname,firstname"
)

As of now I need a step that removes the header from each CSV file, which is quite cumbersome: without this feature, one table and a serde are not enough to parse a CSV file.

EDIT: I forgot that a serde always needs to output a row (unless it throws an exception, but that is rather ugly).

@luisfmrosa

The problem is that this requires a string comparison for every row in the file, which would kill performance.

The best approach would be to extend RecordReader and skip the desired lines in the initialize() method, after calling the parent's method: if split.getStart() == 0, we read and discard N lines. Then we extend TextInputFormat to use this new class.

We could then use it like this:

SET mapred.skipNLines=3;

CREATE TABLE mytablewithheader(a string, b string, c string)
STORED AS INPUTFORMAT 'com.mypackage.extdtextinputformat.ExtTextInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

A good starting point:
http://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/
https://github.com/edwardcapriolo/DualInputFormat/blob/master/pom.xml

Not sure this serde package would be the best place to put this code.
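
To illustrate the split.getStart() == 0 idea without pulling in Hadoop, here is a minimal plain-Java sketch. It is only a model: the class SkipLinesSketch, the method readSplit and its parameters are hypothetical names, not Hadoop or Hive API, and the "file" is just an in-memory string. It assumes the line-reader convention that a split not starting at offset 0 discards its first (possibly partial) line, and that the header lines fit inside the first split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a line reader that skips N header lines.
// Not Hadoop code: a real version would extend RecordReader/TextInputFormat.
class SkipLinesSketch {

    // Returns the rows a reader responsible for bytes [start, start + length) would emit.
    static List<String> readSplit(String data, int start, int length, int linesToSkip) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Mid-file split: the first line belongs to the previous split, so discard it.
            pos = afterNextNewline(data, pos);
        } else {
            // Only the split at offset 0 contains the header, so only it skips lines.
            for (int i = 0; i < linesToSkip && pos < data.length(); i++) {
                pos = afterNextNewline(data, pos);
            }
        }
        List<String> rows = new ArrayList<>();
        // A line that starts at or before 'end' is owned by this split.
        while (pos <= end && pos < data.length()) {
            int next = afterNextNewline(data, pos);
            String line = data.substring(pos, next);
            if (line.endsWith("\n")) {
                line = line.substring(0, line.length() - 1);
            }
            rows.add(line);
            pos = next;
        }
        return rows;
    }

    // Index of the first character after the next '\n', or end of data if none.
    static int afterNextNewline(String data, int from) {
        int nl = data.indexOf('\n', from);
        return nl < 0 ? data.length() : nl + 1;
    }

    public static void main(String[] args) {
        String data = "id,lastname,firstname\n"
                + "\"12323\",\"Washington\",\"George\"\n"
                + "\"12343\",\"Lincoln\",\"Abraham\"\n";
        // Two splits of 40 bytes each; only the first one skips the title row.
        System.out.println(readSplit(data, 0, 40, 1));
        System.out.println(readSplit(data, 40, 40, 1));
    }
}
```

The point of the design is that the skip costs nothing per row: it happens once, at initialization, and only in the reader whose split begins at the start of the file.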

@sfr

sfr commented May 24, 2016

This has been solved since Hive 0.13.0, using table properties:

create external table testtable
( name    string
, message string
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties ( 'separatorChar' = ','
                     , 'quoteChar'     = '"'
                     , 'escapeChar'    = '\\' )
stored as textfile
location '/path/to/testtable'
tblproperties ( 'skip.header.line.count' = '1'
              , 'skip.footer.line.count' = '2' ); 
