Add support for reading columns as Apache Arrow arrays #79

andygrove · 2018-04-06T13:21:28Z

This is going to be easy. I have some ugly prototype code working already.

    let path = Path::new(&args[1]);
    let file = File::open(&path).unwrap();
    let parquet_reader = SerializedFileReader::new(file).unwrap();

    let row_group_reader = parquet_reader.get_row_group(0).unwrap();

    for i in 0..row_group_reader.num_columns() {
        match row_group_reader.get_column_reader(i) {
            Ok(ColumnReader::Int32ColumnReader(ref mut r)) => {
                // Allocate an aligned buffer large enough for one batch of i32 values.
                let batch_size = 1024;
                let sz = mem::size_of::<i32>();
                let p = memory::allocate_aligned((batch_size * sz) as i64).unwrap();
                let ptr_i32 = p as *mut i32;
                let mut buf = unsafe { slice::from_raw_parts_mut(ptr_i32, batch_size) };

                // Read up to one batch of values directly into the buffer.
                match r.read_batch(batch_size, None, None, &mut buf) {
                    Ok((count,_)) => {
                        let arrow_buffer = Buffer::from_raw_parts(ptr_i32, count as i32);
                        let arrow_array = Array::from(arrow_buffer);

                        match arrow_array.data() {
                            ArrayData::Int32(b) => {
                                println!("len: {}", b.len());
                                println!("data: {:?}", b.iter().collect::<Vec<i32>>());
                            },
                            _ => println!("wrong type")
                        }


                    },
                    _ => println!("error")
                }
            }
            _ => println!("column type not supported")
        }
    }

Now we need to come up with a real design and I probably need to add more helper methods to Arrow to make this easier.

andygrove · 2018-04-06T13:36:47Z

Here is a cleaned-up version:

                let mut builder: Builder<i32> = Builder::with_capacity(batch_size);
                match r.read_batch(batch_size, None, None, builder.slice_mut(0, batch_size)) {
                    Ok((count, _)) => {
                        // Only the first `count` slots were filled by the reader.
                        builder.set_len(count);
                        let arrow_array = Array::from(builder.finish());
                        match arrow_array.data() {
                            ArrayData::Int32(b) => {
                                println!("len: {}", b.len());
                                println!("data: {:?}", b.iter().collect::<Vec<i32>>());
                            },
                            _ => println!("wrong type")
                        }
                    },
                    _ => println!("error")
                }
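
For readers following along, the pattern in the snippet above (pre-size a buffer, let the reader fill a mutable slice, then trim to the number of values actually read) can be sketched with the standard library alone. This is only an illustration: `read_batch` here is a hypothetical stand-in for the Parquet column reader, and `Vec<i32>` stands in for the Arrow `Builder`; neither matches the real parquet-rs or Arrow signatures.

```rust
/// Hypothetical stand-in for a Parquet column reader: copies up to
/// `batch_size` values from `source` into `out` and returns the count written.
fn read_batch(source: &[i32], batch_size: usize, out: &mut [i32]) -> usize {
    let n = batch_size.min(out.len()).min(source.len());
    out[..n].copy_from_slice(&source[..n]);
    n
}

/// Reads one batch into a pre-sized buffer, then trims the buffer to the
/// number of values that were actually read.
fn read_into_buffer(source: &[i32], batch_size: usize) -> Vec<i32> {
    // Allocate all slots up front so the reader can write into a plain slice.
    let mut buf = vec![0i32; batch_size];
    let count = read_batch(source, batch_size, &mut buf);
    // Plays the role of `builder.set_len(count)` in the snippet above.
    buf.truncate(count);
    buf
}
```

Calling `read_into_buffer(&[1, 2, 3], 1024)` yields a three-element vector even though 1024 slots were allocated, which is the reason the length has to be set from `count` rather than from the batch size.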

andygrove · 2018-04-07T17:05:40Z

I've made good progress integrating parquet-rs with datafusion. I have examples like this working:

    let mut ctx = ExecutionContext::local();

    let df = ctx.load_parquet("test/data/uk_cities.parquet/part-00000-bdf0c245-d300-4b28-bfdd-0f1f9cb898c4-c000.snappy.parquet").unwrap();

    ctx.register("uk_cities", df);

    // define the SQL statement
    let sql = "SELECT lat, lng FROM uk_cities";

    // create a data frame
    let df = ctx.sql(&sql).unwrap();

    df.show(10);

So far it only works for int32 and f32 columns, though.

sunchao · 2018-04-07T19:24:42Z

Nice progress @andygrove! Really glad you can now read parquet using SQL.

sadikovi · 2018-04-08T01:37:16Z

Looks great! I am curious how Arrow handles/maps optional or repeated fields (when definition and/or repetition levels exist).
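
To make the question concrete, here is a minimal sketch of the simplest case only: a flat optional column, where the maximum definition level is 1. Parquet stores just the non-null values plus one definition level per row, while Arrow keeps one value slot per row alongside a validity bitmap. The function name is hypothetical, a `Vec<bool>` is used in place of a packed bitmap, and repeated or nested fields are not handled here.

```rust
/// Expands densely packed Parquet values into one slot per row and returns
/// `(values, validity)`. `max_def_level` is 1 for a flat optional column.
fn expand_optional(
    values: &[i32],
    def_levels: &[i16],
    max_def_level: i16,
) -> (Vec<i32>, Vec<bool>) {
    let mut out = Vec::with_capacity(def_levels.len());
    let mut validity = Vec::with_capacity(def_levels.len());
    let mut next = values.iter();
    for &level in def_levels {
        if level == max_def_level {
            // The value is defined: consume the next densely packed value.
            out.push(*next.next().expect("fewer values than defined rows"));
            validity.push(true);
        } else {
            // Null row: the slot still exists, but its content is a placeholder.
            out.push(0);
            validity.push(false);
        }
    }
    (out, validity)
}
```

For example, values `[10, 20]` with definition levels `[1, 0, 1, 0]` expand to four slots with the second and fourth marked as null.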

andygrove · 2018-04-08T16:27:57Z

I have merged the current Parquet support to master in DataFusion. I have added examples for both the DataFrame and SQL APIs.

https://github.com/datafusion-rs/datafusion-rs/tree/master/examples

This is very rough code and I won't be promoting the fact that Parquet support is there until this is a little more complete and tested.

liurenjie1024 · 2018-10-10T08:58:13Z

@andygrove @sunchao Are you guys working on this? I'm working on an implementation which takes the cpp version as a reference.

sunchao · 2018-10-10T09:27:41Z

@liurenjie1024 are you working on the arrow part or the parquet-rs part? Some more work in the arrow repo needs to be done so that parquet-rs can read into the arrow format. I'm working (slowly) on that part.

liurenjie1024 · 2018-10-10T11:16:45Z

I'm working on the arrow part. Yes, only part of the cpp version can be implemented. I'll work out an early version.

nevi-me · 2018-10-31T20:32:23Z

This is unrelated, but I've seen that CSV readers are being implemented in Arrow (I think Python, C++, and there's a Go one that's an open PR).

BurntSushi's rust-csv got me wondering whether it'd be possible to implement a native CSV to arrow reader, which I think would also extend nicely to parquet.

@sunchao @andygrove, do you know if there's been discussion of something like this? Also, would some codegen be required? It's something I'd love to try contributing to in the coming weeks.

sunchao · 2018-10-31T20:37:45Z

> BurntSushi's rust-csv got me wondering whether it'd be possible to implement a native CSV to arrow reader, which I think would also extend nicely to parquet.

Yes, it is certainly doable, though it needs to be implemented in the arrow repo. Some pieces may still be missing, and you're welcome to work on the arrow repo :)

Also, I'm not sure how this extends to parquet. Can you elaborate?

nevi-me · 2018-10-31T20:46:55Z

Yes, I've been following the Rust impl in Arrow. When I'm ready, I'll ask about it in the mailing list before opening a JIRA (didn't see one).

The extension to parquet was more in concept than anything; in the sense that if I can read a csv to arrow, I'd be able to get csv -> parquet working. I understand that csv is a "simpler" format due to its flat nature.

Also, I haven't checked whether datafusion-rs supports csv and parquet as sinks, so that one could run something like create table my_table<parquet> as select * from input<csv>.
A lot of my curiosity comes from having to watch paint dry at work while I use Spark, so I've been thinking about what I'd need in order to replace some of my workflow with Rust.
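
As a rough illustration of the csv -> arrow idea, the core step is turning row-oriented text into column-major, typed vectors. The sketch below uses only the standard library and makes several simplifying assumptions that a real reader would not: every column is numeric (`f64`), fields are comma-separated with no quoting or escaping, and the first line is a header. The function name is hypothetical and this is not how rust-csv or Arrow expose CSV reading.

```rust
/// Parses a header line plus numeric rows into column-major vectors.
/// Assumes comma-separated input with no quoting or escaping.
fn csv_to_columns(text: &str) -> Result<(Vec<String>, Vec<Vec<f64>>), String> {
    let mut lines = text.lines().filter(|l| !l.trim().is_empty());
    let header: Vec<String> = lines
        .next()
        .ok_or_else(|| "empty input".to_string())?
        .split(',')
        .map(|s| s.trim().to_string())
        .collect();

    // One growable vector per column, filled row by row.
    let mut columns: Vec<Vec<f64>> = vec![Vec::new(); header.len()];
    for (row, line) in lines.enumerate() {
        let fields: Vec<&str> = line.split(',').collect();
        if fields.len() != header.len() {
            return Err(format!(
                "row {} has {} fields, expected {}",
                row + 1,
                fields.len(),
                header.len()
            ));
        }
        for (col, field) in columns.iter_mut().zip(fields) {
            let value = field
                .trim()
                .parse::<f64>()
                .map_err(|e| format!("row {}: {}", row + 1, e))?;
            col.push(value);
        }
    }
    Ok((header, columns))
}
```

Once the data is in this columnar shape, each vector is the natural input for a typed array builder, and a columnar writer such as a Parquet one could consume the same vectors.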

sunchao · 2018-10-31T20:55:08Z

> Yes, I've been following the Rust impl in Arrow. When I'm ready, I'll ask about it in the mailing list before opening a JIRA (didn't see one).

Yes, feel free to create a JIRA :)

> The extension to parquet was more in concept than anything; in the sense that if I can read a csv to arrow, I'd be able to get csv -> parquet working. I understand that csv is a "simpler" format due to its flat nature.

I see. Yes, I think we have discussed creating a CSV -> parquet converter multiple times (cc @sadikovi). It would be convenient, and we should certainly do it.

liurenjie1024 · 2018-11-06T01:59:54Z

I'm working on an arrow reader implementation and have finished the first step, converting the parquet schema to an arrow schema, in this PR. Please help review it.
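
For context, the heart of a schema conversion step like this is a mapping from Parquet physical types to Arrow data types. The sketch below is not taken from the PR: both enums and the function are hypothetical, logical type annotations (for example, UTF8 strings or timestamps) are ignored, and types that need extra information are simply reported as unsupported.

```rust
/// Hypothetical mirror of the Parquet physical types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType {
    Boolean,
    Int32,
    Int64,
    Int96,
    Float,
    Double,
    ByteArray,
    FixedLenByteArray,
}

/// Hypothetical subset of Arrow data types used by this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowType {
    Boolean,
    Int32,
    Int64,
    Float32,
    Float64,
    Binary,
}

/// Maps a physical type to an Arrow type, or `None` when the sketch
/// does not cover it.
fn to_arrow_type(physical: PhysicalType) -> Option<ArrowType> {
    match physical {
        PhysicalType::Boolean => Some(ArrowType::Boolean),
        PhysicalType::Int32 => Some(ArrowType::Int32),
        PhysicalType::Int64 => Some(ArrowType::Int64),
        PhysicalType::Float => Some(ArrowType::Float32),
        PhysicalType::Double => Some(ArrowType::Float64),
        PhysicalType::ByteArray => Some(ArrowType::Binary),
        // These need a logical type or a length to pick an Arrow type.
        PhysicalType::Int96 | PhysicalType::FixedLenByteArray => None,
    }
}
```

Returning `Option` keeps unsupported types explicit, so a caller can decide whether to skip the column or fail the whole schema conversion.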

sunchao · 2018-11-07T18:53:54Z

The umbrella task is #186. Please watch that issue for progress on this matter.

sunchao added the new feature label May 5, 2018
