to_ndjson #29

Open · 4 tasks done
SymbolixAU opened this issue Jan 3, 2019 · 18 comments
Labels: enhancement (New feature or request)

@SymbolixAU (Collaborator) commented Jan 3, 2019

Add a to_ndjson() function, where each row / column becomes its own object in an array:


  • data.frame by-row & by-column
  • list elements
  • matrix by-row & by-column
  • remove trailing '\n'
@SymbolixAU self-assigned this Jan 3, 2019
@SymbolixAU added the enhancement and blue sky labels Feb 24, 2019
SymbolixAU pushed a commit that referenced this issue Nov 8, 2019
@sheffe commented Nov 22, 2019

I found this issue stub while researching jsonify to create ndjson for Google BigQuery inserts. If it's helpful to have another use-case from the wild in mind, here's the BigQuery edition. Most of this is from the documentation here.

(I have some example data but not a full reprex below. If followup seems useful to you I'll happily generate more detail.)

BigQuery efficiently ingests nested/repeated data when it's structured as one JSON object per newline-delimited row, without wrapping the row objects in an array. Each row of an R data.frame (with arbitrary list columns) becomes a single line wrapped in {...} rather than [{...}]. The example data from the BigQuery docs looks like this:

{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
{"id":"2","first_name":"Jane","last_name":"Doe","dob":"1980-10-16","addresses":[{"status":"current","address":"789 Any Avenue","city":"New York","state":"NY","zip":"33333","numberOfYears":"2"},{"status":"previous","address":"321 Main Street","city":"Hoboken","state":"NJ","zip":"44444","numberOfYears":"3"}]}

Right now I'm making this work in an ugly way with data.table, jsonify, and substr, like this:

library(data.table)
df_split <- split(setDT(df_raw, key = "unique_row_index"), by = "unique_row_index")
json_vec <- lapply(df_split, function(x) {
  y <- jsonify::to_json(x, unbox = TRUE, by = "row"); substr(y, 2L, nchar(y) - 1L)  # drop the wrapping [ and ]
})
fwrite(json_vec, quote = FALSE, sep = "\n")

IMHO the best outcome for the BigQuery problem would be an option in to_ndjson to take a data.frame and return a list or unescaped character vector of JSON elements, one per row.
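
As a stopgap sketch of that shape (assuming ndjson_string holds a single newline-delimited JSON string, as the proof-of-concept further down produces, and with a made-up output path):

json_lines <- strsplit(ndjson_string, "\n", fixed = TRUE)[[1]]  # one unescaped JSON element per row
writeLines(json_lines, "rows.ndjson")                           # hypothetical output file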

FWIW jsonify has already saved me a ton of time on this application, so thank you! All yet-another-JSON-thingy self-deprecation aside, it's a great package.

@dcooley (Collaborator) commented Nov 22, 2019

Thanks for the use-case. It's useful to see how this would be used. I haven't used it myself, but are you aware of the bigrquery library?

@sheffe commented Nov 22, 2019

Oh yeah, bigrquery is great for using BigQuery from an R client once a DB is up and running. The issues I'm confronting now are mostly around ingestion pipelines. It seems like many of BigQuery's optimizations rely on tables with nested and repeated field schemas that amount to pre-joining data (which was pretty unfamiliar to me, coming from Postgres land). BQ can ingest more than just row-structured data from ndjson, Parquet, and Avro files, but most of my data pipelines are in R, and ndjson is the friendliest solution.

Beyond the BigQuery case, I'm seeing a lot more ndjson applications in the wild on larger datasets. A few data providers I work with are pushing it heavily for large spatial cases. E.g. R hits the 2^31-1 character string length limit pretty quickly on big geojson files, and I suspect even geojsonsf and other heavily-optimized packages can't really work around that. Newline-delimited also means you can use fast CLIs like split to batch files to whatever size you like without any headaches. I'm becoming a big fan of the nd-format over regular (geo)json.

@dcooley (Collaborator) commented Nov 23, 2019

Funny you should mention it, but I recently updated geojsonsf to handle larger character vectors in this issue

But I agree on the usefulness of ndjson and hope to implement this soon.

dcooley added a commit that referenced this issue Jan 28, 2020
@dcooley (Collaborator) commented Jan 28, 2020

First proof-of-concept

df <- data.frame( x = 1:5, y = letters[1:5] )
cat( jsonify:::rcpp_to_ndjson( df ) )
{"x":1,"y":"a"}
{"x":2,"y":"b"}
{"x":3,"y":"c"}
{"x":4,"y":"d"}
{"x":5,"y":"e"}
N <- 1e6
data <- data.frame(
  longitude = runif(N, -180, 180),
  latitude = runif(N, -180, 180),
  tooltip = sample(LETTERS, N, replace = TRUE)
  , stringsAsFactors = F
)

microbenchmark::microbenchmark(
  json = {
    res <- jsonify:::rcpp_to_json( data )
  },
  ndjson = {
    res <- jsonify:::rcpp_to_ndjson( data )
  },
  times = 5
)

# Unit: seconds
# expr      min       lq     mean   median       uq      max neval
# json 1.314854 1.318062 1.351673 1.336339 1.372355 1.416754     5
# ndjson 1.480732 1.481594 1.502879 1.483283 1.485288 1.583497     5

res <- jsonify:::rcpp_to_ndjson( data )
str( res )
# chr "{\"longitude\":18.95450374111533,\"latitude\":-76.69491718523205,\"tooltip\":\"X\"}\n{\"longitude\":138.2996058"| __truncated__

cat( substr( res, 1, 500 ))
# {"longitude":18.95450374111533,"latitude":-76.69491718523205,"tooltip":"X"}
# {"longitude":138.2996058370918,"latitude":96.73179904930294,"tooltip":"K"}
# {"longitude":150.90493150055409,"latitude":41.85694866813719,"tooltip":"B"}
# {"longitude":52.576808631420139,"latitude":120.26858397759497,"tooltip":"M"}
# {"longitude":72.25942776538432,"latitude":-58.31666300073266,"tooltip":"W"}
# {"longitude":-35.5781511683017,"latitude":-87.4365452863276,"tooltip":"N"}
# {"longitude":-138.20940038189293,"latitude":3

@dcooley (Collaborator) commented Jan 28, 2020

data.frame by column

df <- data.frame(
  x = 1:5
  , y = letters[1:5]
)

cat( jsonify:::rcpp_to_ndjson(df, FALSE, -1L, TRUE, TRUE, "column") )
# {"x":[1,2,3,4,5]}
# {"y":["a","b","c","d","e"]}

dcooley added a commit that referenced this issue Jan 28, 2020
@sheffe commented Jan 29, 2020

I'm testing this out now, with remotes::install_github("SymbolixAU/jsonify", ref = "issue29"). So far everything's working great! Even on some large, deeply nested data.frames, dfs with geojson embedded as strings, mixed types, etc., speed and memory consumption are both really solid.

I'm using some live data from our BigQuery pipe to test right now -- I'll try to sanitize it or generate some comparable noise and post the results this weekend.

@dcooley removed the blue sky label Jan 30, 2020
@dcooley (Collaborator) commented Jan 30, 2020

@sheffe what do you think the result of to_ndjson() should be for nested lists:

lst <- list(
  x = 1:5
  , y = list(
    a = letters[1:5]
    , b = data.frame(i = 10:15, j = 20:25)
  )
)

lst

# $x
# [1] 1 2 3 4 5
# 
# $y
# $y$a
# [1] "a" "b" "c" "d" "e"
# 
# $y$b
#    i  j
# 1 10 20
# 2 11 21
# 3 12 22
# 4 13 23
# 5 14 24
# 6 15 25

Should this return two lines, just for x and y, or should the nested y be split into another two lines for a and b?

dcooley added a commit that referenced this issue Jan 30, 2020
@sheffe commented Jan 30, 2020

Interesting question. The short answer is I think this should throw an error or at least a warning -- reasoning below.

My use-cases are pretty narrow, but I think the principle is general. It's hard (and I've never seen it supported) for BigQuery loaders or the other tools I've encountered to parallel-load a single big JSON object, even when that object is an array of many semantically identical elements (e.g. they all describe rows, columns, or some constant nested object structure). It's trivially possible to parallelize ndjson that way, by chunking the file into N-line batches and pushing them through whatever pipe is next. It doesn't need to be a DB loading pipe; it could just be sed/awk or validation logic or whatever. Beyond parallelization, converting differently-sized JSONs to ndjson and then splitting/merging files helps create predictable chunk sizes around memory bottlenecks. Again, that boils down to 'some downstream pipe requires more flexible chunking of inputs.' In the past I've used things like geojsplit or jq for 're-batching' JSONs into similar-sized pieces, but it's a pain compared to using ndjson in the first place.
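
For illustration, a rough base-R sketch of that kind of N-line re-batching (the file names and batch size here are made up):

# split an existing ndjson file into fixed-size batches for parallel downstream steps
lines <- readLines("input.ndjson")  # hypothetical input file
batch_size <- 10000                 # hypothetical batch size
batches <- split(lines, ceiling(seq_along(lines) / batch_size))
for (i in seq_along(batches)) {
  writeLines(batches[[i]], sprintf("batch_%03d.ndjson", i))
}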

With that in mind... I can pretty much only see a reason to use ndjson for a list type where the resulting JSON lines are similar pieces, and the motivation probably always boils down to caring about flexible chunking downstream. With data.frames, dfs containing list columns, matrices, etc., the by-row or by-column translation to ndjson neatly enforces that similarity. If an input list could be converted into a data.frame, e.g. it was a data.frame with as.list() called on it, fine; if some form of tabularizing/as.data.frame-ing would fail, then I suspect the result wouldn't make sense as ndjson.

The only use case I can come up with is exploratory analysis -- 'I have no idea what's in my nested list, so I want to break it apart into pieces for easier grepping of components.' But in that case, I usually try list flattening first at various maximum levels of nesting, so you get a long two-column data.frame of key/value pairs, and then write that by row.

library(tidyverse)
library(jsonify)

set.seed(1234)

long_df <- tidyr::expand_grid(
  ABC = LETTERS,
  xyz = letters,
  num = 1:1000
) %>%
  dplyr::mutate(ftr1 = runif(n()),
                ftr2 = rpois(n(), 100))

nested_frame_level1 <- long_df %>%
  tidyr::nest(nest_level1 = c(num, ftr1, ftr2))

nested_frame_level2 <- nested_frame_level1 %>%
  tidyr::nest(nest_level2 = c(xyz, nest_level1))

writeLines(jsonify::to_ndjson(nested_frame_level1),
           "nestlvl1.jsonl")

writeLines(jsonify::to_ndjson(nested_frame_level2),
           "nestlvl2.jsonl")

I am starting to generate a lot of structures for BigQuery that look like nested_frame_level2 -- it amounts to embedding pre-joined data in nested/repeated fields, because data storage rounds to free and it speeds up queries by dramatically reducing total row counts. But even though it's a pretty nasty nested list, this sort of structure always comes back to 'data gets stuffed into a DB by row with arbitrary sub-pieces...'

@sheffe commented Jan 30, 2020

On a slightly separate note -- I'm seeing equivalent speeds regardless of nesting complexity, which is a wild improvement on my own homebrewed attempts so far!

microbenchmark::microbenchmark(
  to_ndjson(nested_frame_level1, unbox = TRUE),
  to_ndjson(nested_frame_level2, unbox = TRUE),
  times = 10
)

[Screenshot of microbenchmark results, 2020-01-30]

@dcooley (Collaborator) commented Jan 31, 2020

Thanks for the input.
I've decided I'm going to allow lists, non-recursively: each top-level element will be converted to a line of JSON.

My reasoning is for cases where you legitimately have a list object of repeated elements, such as an sfc object, which can't be easily coerced to a data.frame:

df <- data.frame(
  ml_id = c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2)
  , l_id = c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2)
  , x = rnorm(15)
  , y = rnorm(15)
  , z = rnorm(15)
  , m = rnorm(15)
)


sfc <- sfheaders::sfc_polygon(
  obj = df
  , x = "x"
  , y = "y"
  , z = "z"
  , m = "m"
  , polygon_id = "ml_id"
  )

as.data.frame( sfc )
# Error in as.data.frame.default(sfc) : 
#  cannot coerce class ‘c("sfc_POLYGON", "sfc")’ to a data.frame

to_ndjson( sfc, digits = 2 )
# [[[-0.84,-0.91,-0.24,1.63],[0.15,1.72,-0.63,-1.44],[-1.35,-1.36,-0.54,-1.42],[-0.09,0.75,-1.45,0.23],[0.76,1.9,0.11,-0.18],[-0.22,-0.96,-0.63,-2.26],[0.68,-1.8,0.65,0.61],[-0.29,1.27,-0.7,0.39],[0.84,0.07,-1.72,-0.33],[-0.84,-0.91,-0.24,1.63]]]
# [[[-0.81,0.28,1.47,-1.04],[-1.93,-1.97,0.62,-1.3],[0.05,0.54,0.31,1.3],[0.75,-0.81,1.07,-1.15],[0.19,0.96,0.34,1.85],[0.29,0.45,1.11,-0.89],[-0.81,0.28,1.47,-1.04]]]

dcooley added a commit that referenced this issue Jan 31, 2020
dcooley added a commit that referenced this issue Jan 31, 2020
@sheffe commented Jan 31, 2020

I've decided I'm going to allow lists, non-recursively. Each of the top-level elements will be converted to a line of JSON

Yep, I like this decision a lot. Forcing non-recursive will absolutely stop me from dumb mistakes some day.

@dcooley (Collaborator) commented Jan 31, 2020

I've now got #58 working over there, if you want to give it a try.

@sheffe commented Feb 4, 2020

Just ran into this on to_ndjson in a production pipe:
Error in rcpp_to_ndjson(x, unbox, digits, numeric_dates, factors_as_string, : R character strings are limited to 2^31-1 bytes
This was an extremely large write and (remarkably) the first issue I've had in ~6 TB worth of JSON writes since I installed this branch, across lots of different data types/sizes/nesting structures. So far it's all just working.

@sheffe mentioned this issue Feb 4, 2020
@dcooley (Collaborator) commented Feb 4, 2020

As the error suggests, this is a limitation of R strings (they're capped at 2^31-1 bytes), so writing a very large JSON to a single string doesn't work. We'll need to make some sort of stream output (like jsonlite) that writes the JSON line-by-line to a connection.
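
Until a streaming writer exists, one workaround sketch is to convert the data in row chunks and append each chunk's ndjson to an open connection (big_df, the chunk size, and the output path here are hypothetical):

con <- file("out.ndjson", open = "w")
chunk_size <- 1e6
idx <- split(seq_len(nrow(big_df)), ceiling(seq_len(nrow(big_df)) / chunk_size))
for (rows in idx) {
  # each chunk stays well under the 2^31-1 byte string limit
  writeLines(jsonify::to_ndjson(big_df[rows, , drop = FALSE]), con)
}
close(con)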

@dcooley (Collaborator) commented Feb 5, 2020

Made this comment on the state of these branches and where some of the confusion lies.

Basically, branch issue58 is ahead of issue29; it's where I cleaned up the final trailing \n.
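
For anyone following along, installing that branch presumably mirrors the issue29 install shown earlier, i.e. (assuming the same naming pattern):

remotes::install_github("SymbolixAU/jsonify", ref = "issue58")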

@dcooley (Collaborator) commented Feb 12, 2020

List elements need to keep their names:

lst <- list(x = 1)
to_json( lst )
# {"x":[1.0]}
to_ndjson( lst )
# [1.0]
## this should be {"x":[1.0]}

@dcooley (Collaborator) commented Feb 12, 2020

So this now gives:

lst <- list(
  x = 1:5
  , y = list(
    a = letters[1:5]
    , b = data.frame(i = 10:15, j = 20:25)
  )
)

to_ndjson( lst )
{"x":[1,2,3,4,5]}
{"y":{"a":["a","b","c","d","e"],"b":[{"i":10,"j":20},{"i":11,"j":21},{"i":12,"j":22},{"i":13,"j":23},{"i":14,"j":24},{"i":15,"j":25}]}}

which is slightly different from before, but I think this is right as it maintains the list element names.

dcooley added a commit that referenced this issue Feb 12, 2020
@SymbolixAU removed their assignment Jan 16, 2021