Add support for reading columns as Apache Arrow arrays #79
Here is a cleaned up version:

    let mut builder: Builder<i32> = Builder::with_capacity(batch_size);
    match r.read_batch(1024, None, None, builder.slice_mut(0, batch_size)) {
        Ok((count, _)) => {
            // only the first `count` slots were filled by the reader
            builder.set_len(count);
            let arrow_array = Array::from(builder.finish());
            match arrow_array.data() {
                ArrayData::Int32(b) => {
                    println!("len: {}", b.len());
                    println!("data: {:?}", b.iter().collect::<Vec<i32>>());
                },
                _ => println!("wrong type")
            }
        },
        Err(e) => println!("read error: {:?}", e)
    }
|
I've made good progress with integrating parquet-rs with DataFusion. I have examples like this working:

    let mut ctx = ExecutionContext::local();
    let df = ctx.load_parquet("test/data/uk_cities.parquet/part-00000-bdf0c245-d300-4b28-bfdd-0f1f9cb898c4-c000.snappy.parquet").unwrap();
    ctx.register("uk_cities", df);
    // define the SQL statement
    let sql = "SELECT lat, lng FROM uk_cities";
    // create a data frame
    let df = ctx.sql(&sql).unwrap();
    df.show(10);

So far it only works for int32 and f32 columns. |
nice progress @andygrove ! really glad you can read parquet now using SQL. |
Looks great! I am curious how Arrow handles/maps optional or repeated fields (when definition and/or repetition levels exist). |
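On the question of optional fields, here is a minimal sketch of the simplest case only: a flat OPTIONAL column, where the maximum definition level is 1 and a level below the maximum marks a null. It expands the dense Parquet values into one slot per row plus a validity mask, which is the shape an Arrow primitive array expects. The function name `to_nullable_i32` is hypothetical, and a plain `Vec<bool>` stands in for Arrow's packed validity bitmap. Repeated fields (which would need list offsets built from repetition levels) are not covered.

```rust
/// Expands a flat OPTIONAL i32 column into (values, validity).
/// `def_levels` has one entry per row; `values` holds only the non-null values.
/// Hypothetical helper for illustration, not part of parquet-rs or arrow.
fn to_nullable_i32(
    def_levels: &[i16],
    values: &[i32],
    max_def_level: i16,
) -> (Vec<i32>, Vec<bool>) {
    let mut data = Vec::with_capacity(def_levels.len());
    let mut validity = Vec::with_capacity(def_levels.len());
    let mut next_value = values.iter();
    for &level in def_levels {
        if level == max_def_level {
            // The value is defined: consume the next dense value.
            data.push(*next_value.next().expect("fewer values than defined rows"));
            validity.push(true);
        } else {
            // Null row: keep a placeholder so every row still has a slot.
            data.push(0);
            validity.push(false);
        }
    }
    (data, validity)
}
```

Calling it with definition levels `[1, 0, 1, 1, 0]` and values `[10, 20, 30]` yields five slots, with the second and fifth marked invalid.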
I have merged the current Parquet support to master in DataFusion. I have added examples for both DataFrame and SQL API. https://github.com/datafusion-rs/datafusion-rs/tree/master/examples This is very rough code and I won't be promoting the fact that Parquet support is there until this is a little more complete and tested. |
@andygrove @sunchao Are you guys working on this? I'm working on an implementation which takes the cpp version as a reference. |
@liurenjie1024 are you working on the arrow part or the parquet-rs part? Some more work needs to be done in the arrow repo so that parquet-rs can read into the arrow format. I'm working (slowly) on that part. |
I'm working on the arrow part. Yes, only part of the cpp version can be implemented. I'll work out an early version.
|
This is unrelated, but I've seen that CSV readers are being implemented in Arrow (Python and C++, I think, and there's a Go one in an open PR). BurntSushi's @sunchao @andygrove , do you know if there's been discussion of something like this? Also, would some codegen be required? It's something I'd love to try contributing to in the coming weeks. |
Yes, it is certainly doable, though it needs to be implemented in the arrow repo. Some pieces may still be missing, and you're welcome to work on the arrow repo :) Also, I'm not sure how this extends to parquet. Can you elaborate? |
Yes, I've been following the Rust impl in Arrow. When I'm ready, I'll ask about it in the mailing list before opening a JIRA (didn't see one). The extension to parquet was more in concept than anything; in the sense that if I can read a csv to arrow, I'd be able to get csv -> parquet working. I understand that csv is a "simpler" format due to its flat nature. Also, I haven't checked to see if |
Yes feel free to create a JIRA :)
I see. Yes, I think we have discussed creating a CSV -> parquet converter multiple times (cc @sadikovi ). It would be convenient, and we should certainly do it. |
I'm working on an arrow reader implementation and have finished the first step, converting the parquet schema to an arrow schema, in this PR. Please help review it. |
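As a rough illustration of what a schema conversion step involves (this is not the code from the PR), the sketch below maps a flat list of Parquet physical types to Arrow-like types. Both enums and both function names are hypothetical stand-ins defined locally, not the real parquet-rs or arrow types. Logical type annotations, nesting, and nullability are ignored, and types without an obvious counterpart here return an error.

```rust
// Hypothetical stand-ins for illustration; not the parquet-rs or arrow types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParquetPhysicalType { Boolean, Int32, Int64, Int96, Float, Double, ByteArray }

#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowType { Boolean, Int32, Int64, Float32, Float64, Binary }

/// Maps one physical type, or reports it as unsupported in this sketch.
fn to_arrow_type(t: ParquetPhysicalType) -> Result<ArrowType, String> {
    match t {
        ParquetPhysicalType::Boolean => Ok(ArrowType::Boolean),
        ParquetPhysicalType::Int32 => Ok(ArrowType::Int32),
        ParquetPhysicalType::Int64 => Ok(ArrowType::Int64),
        ParquetPhysicalType::Float => Ok(ArrowType::Float32),
        ParquetPhysicalType::Double => Ok(ArrowType::Float64),
        ParquetPhysicalType::ByteArray => Ok(ArrowType::Binary),
        // Assumption: INT96 is simply rejected here rather than mapped.
        ParquetPhysicalType::Int96 => Err(format!("unsupported physical type: {:?}", t)),
    }
}

/// Converts a flat schema of (column name, physical type) pairs.
/// Fails on the first column that cannot be mapped.
fn convert_schema(
    columns: &[(&str, ParquetPhysicalType)],
) -> Result<Vec<(String, ArrowType)>, String> {
    columns
        .iter()
        .map(|&(name, t)| to_arrow_type(t).map(|a| (name.to_string(), a)))
        .collect()
}
```

Returning a `Result` per column keeps unsupported types visible to the caller instead of silently dropping the column.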
The umbrella task is #186. Please watch that issue for progress on this matter. |
This is going to be easy. I have some ugly prototype code working already.
Now we need to come up with a real design and I probably need to add more helper methods to Arrow to make this easier.