Recommended practice for adding reserved characters? #44

mahmoud · 2017-10-26T17:45:40Z

Per @markrwilliams comment here and a few others dotted around, we're facing a design gap in hyperlink's APIs.

To paraphrase:

url = URL()
url.add(u'param', u'#value').to_text()

Yields:

ValueError: one or more reserved delimiters &# present in query parameter value: u'#value'

This is due to a subtle shift in hyperlink's design compared to twisted.python.url. t.p.url would allow any string value in, whereas hyperlink prefers to store the "minimally-encoded" version. This is why a ValueError is raised from the code above.

Technically, this can be solved by making the code url.add(u'param', _encode_query_part(u'#value')). But hyperlink's primary goal is to handle encoding/decoding, does it really make sense to push that back on the user?

One solution Mark and I discussed would be to switch to decoding every value passed in. But what if someone were to pass in u'%23%' and actually intend for that to be their decoded value? And the API would be further complicated by the fact that the underlying decoding is generally unknown. UTF8, Latin-1, and plain old binary are all valid in percent-encoded URL parts. Autodecoding UTF8 might have better usability most of the time, but much like relying on Python 2's implicit encoding/decoding, the safety of the explicit _encode_*_part() is probably preferable.

It might occur to one that this entire problem bears some resemblance to the bytes/unicode split, as URL has URL.to_uri() and URL.to_iri(). There is some truth to this, but both IRIs and URIs are both URLs. Having two types imposes a sort of artificial split I'd like to avoid if possible, but we also don't have a good way to represent an already decoded IRI. This was causing an issue with double decoding on multiple .to_iri() calls (see #16).

Right now my best idea is to enable that technical solution above by exposing the various encoding and decoding functions as public APIs, since those may prove useful utilities for other contexts anyways. I'm sure there are better ideas, too, so I'm going to leave this issue open as a place for discussion on handling this quandary.

The text was updated successfully, but these errors were encountered:

glyph · 2017-11-12T09:49:25Z

IRIs vs. URLs is a red herring here. reserved characters in a URL are still reserved in an IRI; the decoding that happens is all reversible and idempotent.

The issue here is one about whether you are manipulating the text of the URL itself (which is what URL has always been; twisted.python.url just had a bug where it would let you put in some invalid characters), or whether you're manipulating a value stored in a URL, which is not visible as part of the URL. In other words: does % mean %, or does it mean "escape sequence beginning, please hold"? One of URL's invariants is that everything can round trip: this would not be true if everything were decoded, since /% would turn into /%25, or an error.

A quick sketch a two-type proposal:

DecodedURL(path="%&") -> DecodedURL(URL(path='%25%26'))
DecodedURL(URL(path='%25%26')).path = %&

I think I actually like this, perhaps with the caveat that this could be a .decoded or .unquoted attribute on URL itself, so we don't need to import an extra wrapper name.

I strongly dislike exposing a bunch of public APIs for encoding and decoding, since it's easy to screw that up, and you end up passing around context-free strings all the time, ending in the inevitable concatenation of some unquoted HTML, some quoted URL, some unquoted SQL, and etc.

mahmoud · 2017-11-13T01:46:23Z

Yeah, agree. I am probably going to expose the encoding/decoding functions as a sorta hazmat utility suite, because it has come up for me to need just the encoding for a part of the URL. But, yes, at this point, I mostly envision a decoded descriptor-style attribute that supports getting and setting "human" (fully-decoded) values, for all the reasons summarized above. :)

glyph · 2017-11-13T18:52:09Z

I am probably going to expose the encoding/decoding functions as a sorta hazmat utility suite, because it has come up for me to need just the encoding for a part of the URL.

Please don't do this :-(. The minimal API is one of URL's primary features. I'd rather work around this with urllib.parse.quote and urllib.parse.unquote than add a bunch of unsightly, error-prone warts on the side of URL's API that have to be supported forever.

mahmoud · 2017-11-13T18:58:56Z

The functions don't go on the URL type? It's separating this out into a module. And they're worth exposing exactly because urllib's unqualified blanket approach leads to error-prone code.

wsanchez · 2017-11-13T23:23:52Z

DecodedURL, is the URL I want, so I like that idea.

glyph · 2017-11-16T19:25:10Z

The functions don't go on the URL type? It's separating this out into a module. And they're worth exposing exactly because urllib's unqualified blanket approach leads to error-prone code.

Everything in that region strikes me as a private implementation detail which I would not want exposed / supported; it's an end run around the integrity of the URL object.

But it looks like we have broad consensus around DecodedURL so let's just do that, and have this fight on a different issue :).

mahmoud · 2017-12-18T18:30:00Z

So I'm cruising along over here and it's looking like the straightforward approach will work fine. Each field needs individual handling, but that's ok because it lets us support conveniences like #48.

Also, in a greenfield environment I agree with @wsanchez that I wish URL was DecodedURL and URL was named EncodedURL, but here we are :)

mahmoud · 2017-12-31T08:28:24Z

Merry Christmas @glyph and @wsanchez, your DecodedURL is all but ready. It is somewhat tested and ready for review, here: #54

Along the way I went on a longish Unicode journey. Somewhat terse notes here: https://gist.github.com/mahmoud/7bc696254a738404bc281c270b169613

glyph · 2017-12-31T23:11:45Z

🎄

glyph · 2018-01-10T18:56:01Z

I believe this issue should now be closed?

mahmoud · 2018-01-10T19:51:21Z

Yes, I plan on closing these issues when the code + docs are released. I've come down with a bit of a bug myself, but I'll get around to it!

…

On Wed, Jan 10, 2018 at 10:56 AM, Glyph ***@***.***> wrote: I believe this issue should now be closed? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#44 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAH8kacBt4dZ3IO3rB8JfYz_BKL38o69ks5tJQfBgaJpZM4QH9N5> .

glyph · 2018-01-10T23:37:48Z

Ah, I miss Launchpad's distinction between fix committed and fix released :)

mahmoud · 2018-02-26T08:12:32Z

18.0.0 released today! 🎉

glyph mentioned this issue Nov 12, 2017

Two URLs that render as the same text but are not equal #6

Closed

mahmoud mentioned this issue Dec 16, 2017

idiomatic way to extend paths #48

Open

mahmoud mentioned this issue Dec 31, 2017

DecodedURL #54

Merged

mahmoud mentioned this issue Jan 11, 2018

Can twisted.python.url.URL be switched over to hyperlink.DecodedURL by default? #60

Closed

mahmoud closed this as completed Feb 26, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recommended practice for adding reserved characters? #44

Recommended practice for adding reserved characters? #44

mahmoud commented Oct 26, 2017 •

edited

Loading

glyph commented Nov 12, 2017 •

edited

Loading

mahmoud commented Nov 13, 2017

glyph commented Nov 13, 2017 •

edited

Loading

mahmoud commented Nov 13, 2017

wsanchez commented Nov 13, 2017

glyph commented Nov 16, 2017 •

edited

Loading

mahmoud commented Dec 18, 2017

mahmoud commented Dec 31, 2017

glyph commented Dec 31, 2017

glyph commented Jan 10, 2018

mahmoud commented Jan 10, 2018 via email

glyph commented Jan 10, 2018

mahmoud commented Feb 26, 2018

Recommended practice for adding reserved characters? #44

Recommended practice for adding reserved characters? #44

Comments

mahmoud commented Oct 26, 2017 • edited Loading

glyph commented Nov 12, 2017 • edited Loading

mahmoud commented Nov 13, 2017

glyph commented Nov 13, 2017 • edited Loading

mahmoud commented Nov 13, 2017

wsanchez commented Nov 13, 2017

glyph commented Nov 16, 2017 • edited Loading

mahmoud commented Dec 18, 2017

mahmoud commented Dec 31, 2017

glyph commented Dec 31, 2017

glyph commented Jan 10, 2018

mahmoud commented Jan 10, 2018 via email

glyph commented Jan 10, 2018

mahmoud commented Feb 26, 2018

mahmoud commented Oct 26, 2017 •

edited

Loading

glyph commented Nov 12, 2017 •

edited

Loading

glyph commented Nov 13, 2017 •

edited

Loading

glyph commented Nov 16, 2017 •

edited

Loading