Support POSIX compliant filenames in arc #145

brianmajor · 2023-03-22T18:58:36Z

Support POSIX compliant filenames in the arc VOSpace service. Tools such as vcp and the web portal fail if they encounter files will non supported characters, such as spaces.

The text was updated successfully, but these errors were encountered:

sefffal · 2023-03-22T19:02:41Z

Thanks @brianmajor.
Could you point me to the relevant code for the web storage interface and for the backend?

pdowler · 2023-04-04T22:20:59Z

Is there a proposed solution to this issue? We should discuss it here before making any changes.

As I recall, the issue is that in the VOSpace REST API filenames are part of a vos URI and the W3C URI specification (https://www.w3.org/Addressing/URL/uri-spec.html) means some characters cannot be used as-is in valid URIs. This is a very fundamental part of the VOSpace spec.

sefffal · 2023-04-04T22:26:33Z

Hi Patrick, could you clarify if the VOSpace spec itself (not the uri-spec) prevents the use of some ascii characters in filenames?
While I'm not yet familiar with the VOSpace spec, it's industry standard to "URL encode" arguments sent over POST, GET, etc to REST endpoints to prevent those sorts of issues.

pdowler · 2023-04-04T23:34:56Z

It is the URI spec that has reserved characters. In the above link there is language about encoding reserved characters to make valid URIs, so in principle that sounds like the right thing to do... the hard part will be in deciding where/when in the s/w we decode and encode. I haven't read it very carefully, so TBD.

I should mention that there is a feature branch (vos2) in this repo with a more or less complete re-write of code that does this... so depending on how quickly we want to get this issue resolved, we might do it there only or we might have to make the changes in master and port them to vos2. TBD

sefffal · 2023-04-06T14:46:52Z

Here is some documentation on encodeURI, a client side function for accomplishing this: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent

I'm not sure what language and framework are used on the backend but in my experience all web-frameworks have an equivalent decode function. Usually the encoded/decode is done at the REST boundary --- just before submitting the request and just after receiving it.

One of the nice things about the URL encoding scheme is that it is normally safe to decode a string that was never encoded. Previously safe URI characters are simply decoded to themselves. As a result, clients can be migrated gradually to encoding things safely.

pdowler · 2023-04-06T22:29:12Z

That last thing is not quite true. Somebody at some point decided that encoding space with + was good idea, so + needs to be encoded as %2B and decoded to +, but if you decode twice it gets turned into a space. Ouch. And unfortunately for us, astronomy has many object- and file-naming conventions that use a + sign, so we have to be extra careful to encode and decode exactly once.

My reading of the URI spec above talked about hex encoding (eg %2B) so maybe it doesn't have the extra + handling, but that means lots of decode functions out there will be "http query string" decoders but not safe URI decoders.

Less common, but % gets encoded as %25 so it keeps changing as you encode more and more and decode more and more until it is gone or the two chars right after it are not valid hex digits.

you can play with this easily at: https://meyerweb.com/eric/tools/dencoder/

sefffal · 2023-04-08T18:44:16Z

@pdowler the encode / decode should only happen exactly once at the REST API boundary?
Admittedly if people are already relying on + and % being preserved in query parameters then this would require a change to both client and server which could either be gated behind something like an “encoded=1” parameter or a simple version check.

Another alternative would be base64 encoding which is entirely free of special characters like +.

I’m happy to make do any workaround you suggest and submit a PR, please just point me in the right location.

pdowler · 2023-04-11T16:19:14Z

For URL params (eg the limit and uri params one can use with GET /nodes/{path}) there is already a decode by the REST API.

Other such uses that come to mind that are part of the standard:
/synctrans accepts a target URI

Custom endpoints we added to vault:
/pkg accepts vos URI(s)
/nopeprops -- I would have to look but iirc it accepts xml doc

So those are easy because the REST API already decodes params exactly once and if the client(s) don't encode exactly once that's a problem (bug) in the client.

The tricky bit for URIs with spaces is that the URIs are also written in XML documents and are parsed into a URI class in memory (java and python) and those classes enforce the URI spec (eg reject illegal chars). So the hard part of allowing space (eg) is retaining encoding whenever we go from string -> URI in code... seems complicated.

I think we should discuss with Adrian (lead dev on the cadc python and pyvo vospace tools) and we'll need to figure out (at least for the server side java code) how to do this kind of change along with the vos2 feature branch, which I expect will be finished and merged into master about June-ish. I'd also like to bring this up in the correct IVOA forum to see what other implementers have done and whether it's OK for some vospace services to allow/support such "illegal characters"... that I can do asap and then discuss with people at an up-coming meeting in May.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support POSIX compliant filenames in arc #145

Support POSIX compliant filenames in arc #145

brianmajor commented Mar 22, 2023

sefffal commented Mar 22, 2023

pdowler commented Apr 4, 2023

sefffal commented Apr 4, 2023

pdowler commented Apr 4, 2023

sefffal commented Apr 6, 2023

pdowler commented Apr 6, 2023

sefffal commented Apr 8, 2023

pdowler commented Apr 11, 2023

Support POSIX compliant filenames in arc #145

Support POSIX compliant filenames in arc #145

Comments

brianmajor commented Mar 22, 2023

sefffal commented Mar 22, 2023

pdowler commented Apr 4, 2023

sefffal commented Apr 4, 2023

pdowler commented Apr 4, 2023

sefffal commented Apr 6, 2023

pdowler commented Apr 6, 2023

sefffal commented Apr 8, 2023

pdowler commented Apr 11, 2023