The Searchable URL section in the CDXJ spec describes a greatly simplified algorithm compared to either the Java or Python implementations. I created patches to fix some bugs/differences between those two, but I'm wondering whether the simplification represented in this spec is the direction that things are headed in the future.
Some of the types of things that those implementations do (not all of which I agree with) include:
- removal of default port 80
- removal of leading `www.` (as well as `www1.`, `www2.`, etc) (multiple instances in the Java case, just one for Python)
- multiple percent decoding steps until the URL stops changing
- removal of session identifiers (CFID, JSESSIONID, etc)
- reordering of query parameters
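
To make the list above concrete, here is a minimal Python sketch of roughly what those steps look like when chained together. It is not the Java or Python implementation's actual algorithm, and its output is not the CDXJ searchable URL format; the names `canonicalize`, `fully_unquote`, and `SESSION_PARAMS` (and the contents of that set) are my own assumptions for illustration.

```python
import re
from urllib.parse import urlsplit, unquote, parse_qsl, urlencode

# Hypothetical set of session-identifier parameter names (compared lowercase).
# The real implementations each have their own rules for this.
SESSION_PARAMS = {"cfid", "cftoken", "jsessionid", "phpsessid", "sid"}


def fully_unquote(text, max_rounds=10):
    """Percent-decode repeatedly until the text stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            break
        text = decoded
    return text


def canonicalize(url):
    """Return a simplified lookup key for url (illustrative only)."""
    parts = urlsplit(url)

    # Lowercase host and strip one leading www., www1., www2., ...
    host = (parts.hostname or "").lower()
    host = re.sub(r"^www\d*\.", "", host)

    # Drop the default port 80 for http; keep any other explicit port.
    port = parts.port
    if port is None or (parts.scheme == "http" and port == 80):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # Decode the path until stable, then strip a ;jsessionid=... path parameter.
    path = fully_unquote(parts.path) or "/"
    path = re.sub(r";jsessionid=[^/?;]*", "", path, flags=re.IGNORECASE)

    # Remove session identifiers from the query and sort what remains.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    query.sort()
    encoded = urlencode(query)

    return netloc + path + ("?" + encoded if encoded else "")


print(canonicalize(
    "http://www2.Example.com:80/p%2561th;jsessionid=ABC?b=2&a=1&CFID=9"
))
```

Each step in the sketch is independent, so a weaker canonicalization (for example, one that keeps query parameter order) amounts to skipping the corresponding lines.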
I can see that different strengths of canonicalization can be appropriate for different use cases, but I'm curious to understand what went into the CDXJ choices.