Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recommended practice for adding reserved characters? #44

Closed
mahmoud opened this issue Oct 26, 2017 · 13 comments
Closed

Recommended practice for adding reserved characters? #44

mahmoud opened this issue Oct 26, 2017 · 13 comments

Comments

@mahmoud
Copy link
Member

mahmoud commented Oct 26, 2017

Per @markrwilliams comment here and a few others dotted around, we're facing a design gap in hyperlink's APIs.

To paraphrase:

url = URL()
url.add(u'param', u'#value').to_text()

Yields:

ValueError: one or more reserved delimiters &# present in query parameter value: u'#value'

This is due to a subtle shift in hyperlink's design compared to twisted.python.url. t.p.url would allow any string value in, whereas hyperlink prefers to store the "minimally-encoded" version. This is why a ValueError is raised from the code above.

Technically, this can be solved by making the code url.add(u'param', _encode_query_part(u'#value')). But hyperlink's primary goal is to handle encoding/decoding, does it really make sense to push that back on the user?

One solution Mark and I discussed would be to switch to decoding every value passed in. But what if someone were to pass in u'%23%' and actually intend for that to be their decoded value? And the API would be further complicated by the fact that the underlying decoding is generally unknown. UTF8, Latin-1, and plain old binary are all valid in percent-encoded URL parts. Autodecoding UTF8 might have better usability most of the time, but much like relying on Python 2's implicit encoding/decoding, the safety of the explicit _encode_*_part() is probably preferable.

It might occur to one that this entire problem bears some resemblance to the bytes/unicode split, as URL has URL.to_uri() and URL.to_iri(). There is some truth to this, but both IRIs and URIs are both URLs. Having two types imposes a sort of artificial split I'd like to avoid if possible, but we also don't have a good way to represent an already decoded IRI. This was causing an issue with double decoding on multiple .to_iri() calls (see #16).

Right now my best idea is to enable that technical solution above by exposing the various encoding and decoding functions as public APIs, since those may prove useful utilities for other contexts anyways. I'm sure there are better ideas, too, so I'm going to leave this issue open as a place for discussion on handling this quandary.

@glyph
Copy link
Collaborator

glyph commented Nov 12, 2017

IRIs vs. URLs is a red herring here. reserved characters in a URL are still reserved in an IRI; the decoding that happens is all reversible and idempotent.

The issue here is one about whether you are manipulating the text of the URL itself (which is what URL has always been; twisted.python.url just had a bug where it would let you put in some invalid characters), or whether you're manipulating a value stored in a URL, which is not visible as part of the URL. In other words: does % mean %, or does it mean "escape sequence beginning, please hold"? One of URL's invariants is that everything can round trip: this would not be true if everything were decoded, since /% would turn into /%25, or an error.

A quick sketch a two-type proposal:

DecodedURL(path="%&") -> DecodedURL(URL(path='%25%26'))
DecodedURL(URL(path='%25%26')).path = %&

I think I actually like this, perhaps with the caveat that this could be a .decoded or .unquoted attribute on URL itself, so we don't need to import an extra wrapper name.

I strongly dislike exposing a bunch of public APIs for encoding and decoding, since it's easy to screw that up, and you end up passing around context-free strings all the time, ending in the inevitable concatenation of some unquoted HTML, some quoted URL, some unquoted SQL, and etc.

@mahmoud
Copy link
Member Author

mahmoud commented Nov 13, 2017

Yeah, agree. I am probably going to expose the encoding/decoding functions as a sorta hazmat utility suite, because it has come up for me to need just the encoding for a part of the URL. But, yes, at this point, I mostly envision a decoded descriptor-style attribute that supports getting and setting "human" (fully-decoded) values, for all the reasons summarized above. :)

@glyph
Copy link
Collaborator

glyph commented Nov 13, 2017

I am probably going to expose the encoding/decoding functions as a sorta hazmat utility suite, because it has come up for me to need just the encoding for a part of the URL.

Please don't do this :-(. The minimal API is one of URL's primary features. I'd rather work around this with urllib.parse.quote and urllib.parse.unquote than add a bunch of unsightly, error-prone warts on the side of URL's API that have to be supported forever.

@mahmoud
Copy link
Member Author

mahmoud commented Nov 13, 2017

The functions don't go on the URL type? It's separating this out into a module. And they're worth exposing exactly because urllib's unqualified blanket approach leads to error-prone code.

@wsanchez
Copy link
Contributor

DecodedURL, is the URL I want, so I like that idea.

@glyph
Copy link
Collaborator

glyph commented Nov 16, 2017

The functions don't go on the URL type? It's separating this out into a module. And they're worth exposing exactly because urllib's unqualified blanket approach leads to error-prone code.

Everything in that region strikes me as a private implementation detail which I would not want exposed / supported; it's an end run around the integrity of the URL object.

But it looks like we have broad consensus around DecodedURL so let's just do that, and have this fight on a different issue :).

@mahmoud
Copy link
Member Author

mahmoud commented Dec 18, 2017

So I'm cruising along over here and it's looking like the straightforward approach will work fine. Each field needs individual handling, but that's ok because it lets us support conveniences like #48.

Also, in a greenfield environment I agree with @wsanchez that I wish URL was DecodedURL and URL was named EncodedURL, but here we are :)

@mahmoud mahmoud mentioned this issue Dec 31, 2017
@mahmoud
Copy link
Member Author

mahmoud commented Dec 31, 2017

Merry Christmas @glyph and @wsanchez, your DecodedURL is all but ready. It is somewhat tested and ready for review, here: #54

Along the way I went on a longish Unicode journey. Somewhat terse notes here: https://gist.github.com/mahmoud/7bc696254a738404bc281c270b169613

@glyph
Copy link
Collaborator

glyph commented Dec 31, 2017

🎄

@glyph
Copy link
Collaborator

glyph commented Jan 10, 2018

I believe this issue should now be closed?

@mahmoud
Copy link
Member Author

mahmoud commented Jan 10, 2018 via email

@glyph
Copy link
Collaborator

glyph commented Jan 10, 2018

Ah, I miss Launchpad's distinction between fix committed and fix released :)

@mahmoud
Copy link
Member Author

mahmoud commented Feb 26, 2018

18.0.0 released today! 🎉

@mahmoud mahmoud closed this as completed Feb 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants