Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

from_ndjson #58

Open
3 of 5 tasks
SymbolixAU opened this issue Nov 8, 2019 · 9 comments
Open
3 of 5 tasks

from_ndjson #58

SymbolixAU opened this issue Nov 8, 2019 · 9 comments

Comments

@SymbolixAU
Copy link
Collaborator

SymbolixAU commented Nov 8, 2019

  • read StringVector
  • stream a file into a &Document
  • tests
  • test using a rapidjson::Domrather than streaming the string with << '[' << ndjson << ']', then std::replace( json.begin(), json.end(), '\n', ',')
  • update reverse imports (mapdeck, geojsonsf, spatialwidget) to use R_xlen_t values before release to CRAN.
dcooley added a commit that referenced this issue Jan 31, 2020
@dcooley dcooley mentioned this issue Jan 31, 2020
4 tasks
@dcooley
Copy link
Collaborator

dcooley commented Jan 31, 2020

on branch issue58 (which inculdes to_ndjson)

dcooley added a commit that referenced this issue Feb 2, 2020
@sheffe
Copy link

sheffe commented Feb 4, 2020

@dcooley I just installed this per your comment in #29 but something is erroring out quickly.

Pretty minimal reprex:

# remotes::install_github("SymbolixAU/jsonify", "issue58")
jsonify::from_ndjson('{"abc":123}')

fails with error:
Error in rcpp_from_ndjson(json, simplify, fill_na) : object 'json' not found

I think this one's pretty straightforward. There are some switches in argument names between ndjson and json at various points. This example seems to be erroring here inside ndjson_to_r.character, at the return line: return(rcpp_from_ndjson( json, simplify, fill_na )). This affects raw strings and filepaths, and connections wrapped in file("pathtofoo.ndjson").

Edit: just forked to make this tweak and ran into more issues. Specifically, calling from_ndjson with json switched to ndjson in that return statement:

  • works on the simple '{"abc":123}' case
  • works for readLines(foo, 1) where 'foo' is every valid ndjson file I have on my hard drive
  • fails for all multi-row cases eg (readLines(foo, 2)) with the error below

To reproduce from my fork,

jsonify::from_ndjson(ndjson = '{"abc":123}')
jsonify::from_ndjson(paste0(c('{"abc":123}', '{"def":456}', collapse = "\n")))

with results:

> from_ndjson(paste0(c('{"abc":123}', '{"def":456}', collapse = "\n")))
Error in rcpp_from_ndjson(ndjson, simplify, fill_na) : 
  Expecting a single string value: [type=character; extent=3].
In addition: Warning messages:
1: In if (is_url(ndjson)) { :
  the condition has length > 1 and only the first element will be used
2: In if (file.exists(ndjson)) { :
 
 Error in rcpp_from_ndjson(ndjson, simplify, fill_na) : 
  Expecting a single string value: [type=character; extent=3]. 

dcooley added a commit that referenced this issue Feb 4, 2020
dcooley added a commit that referenced this issue Feb 4, 2020
@dcooley
Copy link
Collaborator

dcooley commented Feb 4, 2020

I've also messed up my branches so I'm now going through and fixing a load of branch conflicts.

However, your paste0 statement is creating a vector of strings

paste0(c('{"abc":123}', '{"def":456}', collapse = "\n"))
[1] "{\"abc\":123}" "{\"def\":456}" "\n" 

Is this what you intended?

It looks like you may have the closing bracket in the wrong place?

paste0(c('{"abc":123}', '{"def":456}'), collapse = "\n")
[1] "{\"abc\":123}\n{\"def\":456}"

dcooley added a commit that referenced this issue Feb 4, 2020
@sheffe
Copy link

sheffe commented Feb 4, 2020

Yikes, not my finest moment on Github. You're correct, closing paren in the wrong spot, so this statement now works with the fixed argname json to ndjson described above:
jsonify::from_ndjson(paste0(c('{"abc":123}', '{"def":456}'), collapse = "\n"))

On larger test files that I've written with to_ndjson in the past few days, I'm still seeing errors in from_ndjson, but I can't reproduce these easily on testcases at the moment. The crux of the issue is that this works fine:

lapply(readLines("somefile.ndjson"),
       function(x) from_ndjson(ndjson = x))

but this fails:

from_ndjson("somefile.ndjson")

with Error in rcpp_read_ndjson_file(normalizePath(ndjson), get_download_mode(), : json parse error

I can't see why each individual line succeeds but parsing the entire file fails. When I try to reproduce the error using a simple case,

test_list <- from_ndjson(paste0(c('{"abc":123}', '{"def":456}'), collapse = "\n"))
writeLines(to_ndjson(test_list), "testlist.ndjson")
from_ndjson("testlist.ndjson")

this also works fine.

So I need to figure out the structural difference between these files to reproduce. Will write back when I do. (This time without dumb errors!)

@dcooley
Copy link
Collaborator

dcooley commented Feb 4, 2020

Can you check how the ndjson is separated, is it by \n, or perhaps \r\n?

@sheffe
Copy link

sheffe commented Feb 5, 2020

edit: clarified this and not a problem with 58, working as intended

Ha, I can finally reproduce. This was nasty. Still don't have a good idea why it's happening.

When I wrote the files throwing json parse error calling from_ndjson, I was running SymbolixAU/jsonify@issue29. For that branch, writeLines(to_ndjson(df), "file.ndjson") with the default sep="\n" causes a trailing empty line in "file.ndjson" that causes the parse error, but every individual line was valid ndjson. I re-wrote the exact same data today after installing SymbolixAU/jsonify@issue58_fix -- still using writeLines(..., sep="\n") and the same code -- and found that the second presumed-identical file didn't cause a parse error.

The two files -- 1 written with to_ndjson yesterday on issue29, and 2 today on issue58_fix -- pass an all.equal() test once they're in R, but they have different MD5 hashes and readLines() on the two files comes up with different vector lengths (off by 1). The only difference is the empty trailing line for the file written from jsonify@issue29.

I discovered this on a production file, which is a deeply nested dataframe with list-cols. I don't see why the deeply-nested part would cause this, but until I can get a more minimal reprex together I just included an anonymized variant of the data.

(Github won't let me upload .rds directly, so I guess unzip this first.)
testcase.rds.zip

remotes::install_github("SymbolixAU/jsonify", "issue29", force = TRUE)
testcase <- readRDS("testcase.rds")
writeLines(jsonify::to_ndjson(testcase), "issue29.ndjson")

remotes::install_github("SymbolixAU/jsonify", "issue58_fix", force = TRUE)
testcase <- readRDS("testcase.rds")
writeLines(jsonify::to_ndjson(testcase), "issue58.ndjson")

openssl::md5(file("issue29.ndjson"))
openssl::md5(file("issue58.ndjson"))

length(readLines("issue29.ndjson"))
length(readLines("issue58.ndjson"))

library(jsonify)
foo <- from_ndjson("issue58.ndjson")
bar <- from_ndjson("issue29.ndjson")
bar <- from_ndjson(paste0(readLines("issue58.ndjson", nrow(foo)), collapse = "\n"))
all.equal(foo, bar)

Seriously thought I was hallucinating halfway through this.

@dcooley
Copy link
Collaborator

dcooley commented Feb 12, 2020

from_ndjson on this list objet returns the elements one-level deeper. This probably isn't right.

  lst <- list(
    x = 1:5
    , y = list(
      a = letters[1:5]
      , b = data.frame(i = 10:15, j = 20:25)
    )
  )
  
from_ndjson( "{\"x\":[1,2,3,4,5]}\n{\"y\":{\"a\":[\"a\",\"b\",\"c\",\"d\",\"e\"],\"b\":[{\"i\":10,\"j\":20},{\"i\":11,\"j\":21},{\"i\":12,\"j\":22},{\"i\":13,\"j\":23},{\"i\":14,\"j\":24},{\"i\":15,\"j\":25}]}}" )

dcooley added a commit that referenced this issue Feb 12, 2020
@SymbolixAU
Copy link
Collaborator Author

^^ Actually, this is probably correct, because we are splitting the list up during the to_ndjson() phase, and each 'row' of JSON has to become a valid object. Therefore, to re-combine it we have to turn the whole lot into an array.

In the future, if anyone specifically asks or this is an issue, we may have to come up with rules for handling

{"x":[1,2,3,4,5]}
{"y":{"a":["a","b","c","d","e"],"b":[{"i":10,"j":20},{"i":12,"j":22}]}}

where we would need to remove the trailing }\n from the previous row, and the leading "{" from the next row

@SymbolixAU
Copy link
Collaborator Author

Going to CRAN because I need to update the package to set stringsAsFactors = FALSE by default ( in my tests ) to comply with R 4.0.0, but keeping open to test the one remaining check box.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants