to_ndjson #29
I found this issue stub while researching. (I have some example data but not a full reprex below; if follow-up seems useful to you I'll happily generate more detail.) BigQuery efficiently ingests nested/repeated data when it's structured as a single JSON object per newline-delimited row, without wrapping the row-object in an array. Each row in an R dataframe (with arbitrary list columns) becomes a single line terminated by `'\n'`.
Right now I'm making this work in an ugly way with
IMHO the best scenario for the BigQuery problem would be an option where each row/column becomes its own newline-delimited object rather than an element of an array. FWIW, jsonify has already saved me a ton of time on this application, so thank you! All yet-another-JSON-thingy self-deprecation aside, it's a great package. |
Thanks for the use-case. It's useful to see how this would be used. I haven't used it myself, but are you aware of the bigrquery library? |
Oh yeah. Beyond the BigQuery case I'm seeing a lot more ndjson applications in the wild on larger datasets. A few data providers I work with are pushing it heavily for large spatial cases. E.g. R hits the 2^31 character-vector length limit pretty quickly on big geojson files, and I suspect even |
Funny you should mention it, but I recently updated But I agree on the usefulness of ndjson and hope to implement this soon. |
First proof-of-concept:

```r
df <- data.frame( x = 1:5, y = letters[1:5] )
cat( jsonify:::rcpp_to_ndjson( df ) )
# {"x":1,"y":"a"}
# {"x":2,"y":"b"}
# {"x":3,"y":"c"}
# {"x":4,"y":"d"}
# {"x":5,"y":"e"}
```

```r
N <- 1e6
data <- data.frame(
  longitude = runif(N, -180, 180)
  , latitude = runif(N, -180, 180)
  , tooltip = sample(LETTERS, N, replace = TRUE)
  , stringsAsFactors = FALSE
)

microbenchmark::microbenchmark(
  json = {
    res <- jsonify:::rcpp_to_json( data )
  },
  ndjson = {
    res <- jsonify:::rcpp_to_ndjson( data )
  },
  times = 5
)
# Unit: seconds
#    expr      min       lq     mean   median       uq      max neval
#    json 1.314854 1.318062 1.351673 1.336339 1.372355 1.416754     5
#  ndjson 1.480732 1.481594 1.502879 1.483283 1.485288 1.583497     5

res <- jsonify:::rcpp_to_ndjson( data )
str( res )
# chr "{\"longitude\":18.95450374111533,\"latitude\":-76.69491718523205,\"tooltip\":\"X\"}\n{\"longitude\":138.2996058"| __truncated__

cat( substr( res, 1, 500 ))
# {"longitude":18.95450374111533,"latitude":-76.69491718523205,"tooltip":"X"}
# {"longitude":138.2996058370918,"latitude":96.73179904930294,"tooltip":"K"}
# {"longitude":150.90493150055409,"latitude":41.85694866813719,"tooltip":"B"}
# {"longitude":52.576808631420139,"latitude":120.26858397759497,"tooltip":"M"}
# {"longitude":72.25942776538432,"latitude":-58.31666300073266,"tooltip":"W"}
# {"longitude":-35.5781511683017,"latitude":-87.4365452863276,"tooltip":"N"}
# {"longitude":-138.20940038189293,"latitude":3
```
|
data.frame by column:

```r
df <- data.frame(
  x = 1:5
  , y = letters[1:5]
)

cat( jsonify:::rcpp_to_ndjson(df, FALSE, -1L, TRUE, TRUE, "column") )
# {"x":[1,2,3,4,5]}
# {"y":["a","b","c","d","e"]}
```
|
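The row/column distinction above can be illustrated without jsonify at all. The following is a hand-rolled base-R sketch (NOT jsonify's implementation, and `to_ndjson_sketch` is a made-up name) that builds the same two shapes for a flat data.frame of numbers and strings:

```r
# Illustration only: serialise a flat data.frame as ndjson, emitting either
# one JSON object per row or one JSON object per column. Handles only
# numeric and character/factor columns; no escaping of special characters.
to_ndjson_sketch <- function(df, by = "row") {
  quote_val <- function(v) {
    # Strings get quoted; numbers are written as-is.
    if (is.character(v) || is.factor(v)) paste0('"', as.character(v), '"') else as.character(v)
  }
  if (by == "row") {
    lines <- vapply(seq_len(nrow(df)), function(i) {
      fields <- vapply(names(df), function(nm) {
        paste0('"', nm, '":', quote_val(df[[nm]][i]))
      }, character(1))
      paste0("{", paste(fields, collapse = ","), "}")
    }, character(1))
  } else {
    lines <- vapply(names(df), function(nm) {
      paste0('{"', nm, '":[', paste(quote_val(df[[nm]]), collapse = ","), "]}")
    }, character(1))
  }
  paste0(paste(lines, collapse = "\n"), "\n")
}

df <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = FALSE)
cat(to_ndjson_sketch(df, by = "column"))
# {"x":[1,2,3,4,5]}
# {"y":["a","b","c","d","e"]}
```

Either way, the key property is one self-contained JSON value per line, which is what makes line-based chunking safe downstream.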
I'm testing this out now. I'm using some live data from our BigQuery pipe to test right now -- I'll try to sanitize it or generate some comparable noise and post the results this weekend. |
@sheffe what do you think should be the result of

```r
lst <- list(
  x = 1:5
  , y = list(
    a = letters[1:5]
    , b = data.frame(i = 10:15, j = 20:25)
  )
)

lst
# $x
# [1] 1 2 3 4 5
#
# $y
# $y$a
# [1] "a" "b" "c" "d" "e"
#
# $y$b
#    i  j
# 1 10 20
# 2 11 21
# 3 12 22
# 4 13 23
# 5 14 24
# 6 15 25
```

should this return 2 lines, for just `x` and `y`? |
Interesting question. The short answer is I think this should throw an error or at least a warning -- reasoning below.

My use-cases are pretty narrow but I think the principle is pretty general. It's hard (and I've never seen it supported) for BigQuery loaders or other tools I've encountered to parallel-load a single big JSON object, even when that object is an array of many semantically identical elements (e.g. they all describe rows, columns, or some constant nested object structure). It's trivially possible to parallelize ndjson that way by chunking the file into N-line batches and pushing them through whatever pipe is next. It doesn't need to be a DB loading pipe; it could just be sed/awk or some validation logic. Beyond parallelization, converting differently-sized JSONs to ndjson and then file splitting/merging is helpful for creating predictable chunk sizes around memory bottlenecks. Again that boils down to "some downstream pipe requires more flexible chunking of inputs." In the past I've used things like geojsplit or

With that in mind, I can pretty much only see a reason to use ndjson for a list type where the resulting JSON lines are similar pieces, and the motivation probably always boils down to caring about flexible chunking downstream. With dataframes, dfs containing list-cols, matrices, etc., the by-row or by-column translation to ndjson neatly enforces that similarity. If an input list could be converted into a dataframe, e.g. it was a dataframe with

The only use case I can come up with is exploratory analysis -- "I have no idea what's in my nested list, so I want to break it apart into pieces for easier grepping of components." But in that case, I usually try list flattening first at various max levels of nesting, so you get a long 2-col dataframe with key/value pairs, and write that by row.

```r
library(tidyverse)
library(jsonify)

set.seed(1234)

long_df <- tidyr::expand_grid(
  ABC = LETTERS,
  xyz = letters,
  num = 1:1000
) %>%
  dplyr::mutate(ftr1 = runif(n()),
                ftr2 = rpois(n(), 100))

nested_frame_level1 <- long_df %>%
  tidyr::nest(nest_level1 = c(num, ftr1, ftr2))

nested_frame_level2 <- nested_frame_level1 %>%
  tidyr::nest(nest_level2 = c(xyz, nest_level1))

writeLines(jsonify::to_ndjson(nested_frame_level1),
           "nestlvl1.jsonl")
writeLines(jsonify::to_ndjson(nested_frame_level2),
           "nestlvl2.jsonl")
```

I am starting to generate a lot of structures for BigQuery that look like this. |
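The chunking argument above is easy to act on once the data is ndjson. Here is a small base-R sketch (the helper name `split_ndjson` and file names are made up for illustration) that splits a .jsonl file into fixed-size line batches for parallel loading:

```r
# Split an ndjson file into batches of at most `n` lines each.
# Because every line is a self-contained JSON object, any line-based
# split is a valid chunking -- the property discussed above.
split_ndjson <- function(path, n, out_prefix = "chunk") {
  lines  <- readLines(path)
  groups <- split(lines, ceiling(seq_along(lines) / n))
  paths  <- sprintf("%s_%03d.jsonl", out_prefix, seq_along(groups))
  Map(writeLines, groups, paths)   # write each batch to its own file
  paths
}

# Example with a temporary file of 10 one-object lines
src <- tempfile(fileext = ".jsonl")
writeLines(sprintf('{"id":%d}', 1:10), src)
out <- split_ndjson(src, n = 4, out_prefix = tempfile())
length(out)  # 3
```

With `n = 4` and 10 input lines this yields three chunk files of 4, 4, and 2 lines, each of which can be pushed through the next pipe independently.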
On a slightly separate note -- I'm seeing equivalent speeds regardless of nesting complexity, and that's a wild improvement on where I've been so far homebrewing my own!

```r
microbenchmark::microbenchmark(
  to_ndjson(nested_frame_level1, unbox = TRUE),
  to_ndjson(nested_frame_level2, unbox = TRUE),
  times = 10
)
```
|
Thanks for the input. My reasoning is for cases where you legitimately have a list object of repeated elements, such as an `sfc`:

```r
df <- data.frame(
  ml_id = c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2)
  , l_id = c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2)
  , x = rnorm(15)
  , y = rnorm(15)
  , z = rnorm(15)
  , m = rnorm(15)
)

sfc <- sfheaders::sfc_polygon(
  obj = df
  , x = "x"
  , y = "y"
  , z = "z"
  , m = "m"
  , polygon_id = "ml_id"
)

as.data.frame( sfc )
# Error in as.data.frame.default(sfc) :
#   cannot coerce class ‘c("sfc_POLYGON", "sfc")’ to a data.frame

to_ndjson( sfc, digits = 2 )
# [[[-0.84,-0.91,-0.24,1.63],[0.15,1.72,-0.63,-1.44],[-1.35,-1.36,-0.54,-1.42],[-0.09,0.75,-1.45,0.23],[0.76,1.9,0.11,-0.18],[-0.22,-0.96,-0.63,-2.26],[0.68,-1.8,0.65,0.61],[-0.29,1.27,-0.7,0.39],[0.84,0.07,-1.72,-0.33],[-0.84,-0.91,-0.24,1.63]]]
# [[[-0.81,0.28,1.47,-1.04],[-1.93,-1.97,0.62,-1.3],[0.05,0.54,0.31,1.3],[0.75,-0.81,1.07,-1.15],[0.19,0.96,0.34,1.85],[0.29,0.45,1.11,-0.89],[-0.81,0.28,1.47,-1.04]]]
```
|
Yep, I like this decision a lot. Forcing non-recursive will absolutely stop me from dumb mistakes some day. |
I've now got #58 working over there if you want to give it a try? |
Just ran into this on |
As the error suggests, this is an issue with R's string-length limit, so writing a large JSON to a single string doesn't work. We'll need some sort of stream output (like jsonlite's) that writes the JSON line-by-line to a connection. |
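A minimal base-R sketch of that streaming idea (the helper name `stream_ndjson` and the placeholder serialiser are made up; this is not the eventual jsonify API): serialise and write one batch of rows at a time to an open connection, so the full output never has to exist as one length-limited R string.

```r
# Stream a data.frame to an ndjson file in row batches, so the complete
# serialised output never exists as a single R string.
# `serialise_rows` stands in for whatever turns rows into ndjson lines;
# the default here is a trivial placeholder for a one-column frame.
stream_ndjson <- function(df, path, batch = 1000L,
                          serialise_rows = function(d) {
                            sprintf('{"x":%d}', d$x)  # placeholder serialiser
                          }) {
  con <- file(path, open = "w")
  on.exit(close(con))
  starts <- seq(1L, nrow(df), by = batch)
  for (s in starts) {
    e <- min(s + batch - 1L, nrow(df))
    # Only one batch of serialised lines is held in memory at a time.
    writeLines(serialise_rows(df[s:e, , drop = FALSE]), con)
  }
  invisible(path)
}

out <- tempfile(fileext = ".jsonl")
stream_ndjson(data.frame(x = 1:2500), out, batch = 1000L)
length(readLines(out))  # 2500
```

The same pattern works for any connection (gzfile, pipe, socket), which is what makes it a reasonable escape hatch for the 2^31 limit.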
Made this comment on the state of these branches and where some of the confusion lies. Basically, branch |
List elements need to keep their name:

```r
lst <- list(x = 1)

to_json( lst )
# {"x":[1.0]}

to_ndjson( lst )
# [1.0]
## this should be {"x":[1.0]}
```
|
So this now gives

```r
lst <- list(
  x = 1:5
  , y = list(
    a = letters[1:5]
    , b = data.frame(i = 10:15, j = 20:25)
  )
)

to_ndjson( lst )
# {"x":[1,2,3,4,5]}
# {"y":{"a":["a","b","c","d","e"],"b":[{"i":10,"j":20},{"i":11,"j":21},{"i":12,"j":22},{"i":13,"j":23},{"i":14,"j":24},{"i":15,"j":25}]}}
```

which is slightly different to before, but I think this is right as it maintains list element names. |
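Whichever shape is chosen, a cheap downstream sanity check is to split the result on newlines and confirm every line looks like a self-contained object. A base-R sketch (a structural check only, not a parser; `ndjson_lines_ok` is a made-up name):

```r
# Check that every line of an ndjson string starts with "{" and has
# balanced braces, i.e. is plausibly a standalone JSON object.
# This is a cheap structural check, not a full parse -- braces inside
# string values would fool it.
ndjson_lines_ok <- function(x) {
  lines <- strsplit(x, "\n", fixed = TRUE)[[1]]
  balanced <- vapply(lines, function(l) {
    chars <- strsplit(l, "", fixed = TRUE)[[1]]
    startsWith(l, "{") && sum(chars == "{") == sum(chars == "}")
  }, logical(1))
  all(balanced)
}

ndjson_lines_ok('{"x":[1,2,3,4,5]}\n{"y":{"a":["a"]}}')  # TRUE
ndjson_lines_ok('[1.0]')                                  # FALSE
```

Notably, this check would have flagged the earlier unnamed-list output (`[1.0]`), since a bare array line is not an object.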