Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support POSIX compliant filenames in arc #145

Open
brianmajor opened this issue Mar 22, 2023 · 8 comments
Open

Support POSIX compliant filenames in arc #145

brianmajor opened this issue Mar 22, 2023 · 8 comments

Comments

@brianmajor
Copy link
Member

Support POSIX compliant filenames in the arc VOSpace service. Tools such as vcp and the web portal fail if they encounter files will non supported characters, such as spaces.

@sefffal
Copy link

sefffal commented Mar 22, 2023

Thanks @brianmajor.
Could you point me to the relevant code for the web storage interface and for the backend?

@pdowler
Copy link
Member

pdowler commented Apr 4, 2023

Is there a proposed solution to this issue? We should discuss it here before making any changes.

As I recall, the issue is that in the VOSpace REST API filenames are part of a vos URI and the W3C URI specification (https://www.w3.org/Addressing/URL/uri-spec.html) means some characters cannot be used as-is in valid URIs. This is a very fundamental part of the VOSpace spec.

@sefffal
Copy link

sefffal commented Apr 4, 2023

Hi Patrick, could you clarify if the VOSpace spec itself (not the uri-spec) prevents the use of some ascii characters in filenames?
While I'm not yet familiar with the VOSpace spec, it's industry standard to "URL encode" arguments sent over POST, GET, etc to REST endpoints to prevent those sorts of issues.

@pdowler
Copy link
Member

pdowler commented Apr 4, 2023

It is the URI spec that has reserved characters. In the above link there is language about encoding reserved characters to make valid URIs, so in principle that sounds like the right thing to do... the hard part will be in deciding where/when in the s/w we decode and encode. I haven't read it very carefully, so TBD.

I should mention that there is a feature branch (vos2) in this repo with a more or less complete re-write of code that does this... so depending on how quickly we want to get this issue resolved, we might do it there only or we might have to make the changes in master and port them to vos2. TBD

@sefffal
Copy link

sefffal commented Apr 6, 2023

Here is some documentation on encodeURI, a client side function for accomplishing this: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent

I'm not sure what language and framework are used on the backend but in my experience all web-frameworks have an equivalent decode function. Usually the encoded/decode is done at the REST boundary --- just before submitting the request and just after receiving it.

One of the nice things about the URL encoding scheme is that it is normally safe to decode a string that was never encoded. Previously safe URI characters are simply decoded to themselves. As a result, clients can be migrated gradually to encoding things safely.

@pdowler
Copy link
Member

pdowler commented Apr 6, 2023

That last thing is not quite true. Somebody at some point decided that encoding space with + was good idea, so + needs to be encoded as %2B and decoded to +, but if you decode twice it gets turned into a space. Ouch. And unfortunately for us, astronomy has many object- and file-naming conventions that use a + sign, so we have to be extra careful to encode and decode exactly once.

My reading of the URI spec above talked about hex encoding (eg %2B) so maybe it doesn't have the extra + handling, but that means lots of decode functions out there will be "http query string" decoders but not safe URI decoders.

Less common, but % gets encoded as %25 so it keeps changing as you encode more and more and decode more and more until it is gone or the two chars right after it are not valid hex digits.

you can play with this easily at: https://meyerweb.com/eric/tools/dencoder/

@sefffal
Copy link

sefffal commented Apr 8, 2023

@pdowler the encode / decode should only happen exactly once at the REST API boundary?
Admittedly if people are already relying on + and % being preserved in query parameters then this would require a change to both client and server which could either be gated behind something like an “encoded=1” parameter or a simple version check.

Another alternative would be base64 encoding which is entirely free of special characters like +.

I’m happy to make do any workaround you suggest and submit a PR, please just point me in the right location.

@pdowler
Copy link
Member

pdowler commented Apr 11, 2023

For URL params (eg the limit and uri params one can use with GET /nodes/{path}) there is already a decode by the REST API.

Other such uses that come to mind that are part of the standard:
/synctrans accepts a target URI

Custom endpoints we added to vault:
/pkg accepts vos URI(s)
/nopeprops -- I would have to look but iirc it accepts xml doc

So those are easy because the REST API already decodes params exactly once and if the client(s) don't encode exactly once that's a problem (bug) in the client.

The tricky bit for URIs with spaces is that the URIs are also written in XML documents and are parsed into a URI class in memory (java and python) and those classes enforce the URI spec (eg reject illegal chars). So the hard part of allowing space (eg) is retaining encoding whenever we go from string -> URI in code... seems complicated.

I think we should discuss with Adrian (lead dev on the cadc python and pyvo vospace tools) and we'll need to figure out (at least for the server side java code) how to do this kind of change along with the vos2 feature branch, which I expect will be finished and merged into master about June-ish. I'd also like to bring this up in the correct IVOA forum to see what other implementers have done and whether it's OK for some vospace services to allow/support such "illegal characters"... that I can do asap and then discuss with people at an up-coming meeting in May.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants