Question: Searchable URL (SURT) algorithm differences for CDXJ #146

tfmorris · 2023-09-20T18:47:10Z

The Searchable URL section in the CDXJ spec describes a greatly simplified algorithm compared to either the Java or Python implementations. I created patches to fix some bugs/differences between those two, but I'm wondering whether the simplification represented in this spec is the direction that things are headed in the future.

Some of the types of things that those implementations do (not all of which I agree with) include:

removal of default port 80
removal of leading www. (as well as www1., www2., etc) (multiple instances in the Java case, just one for Python)
multiple percent decoding steps until the URL stops changing
removal of session identifiers (CFID, JSESSIONID, etc)
reordering of query parameters
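
To make the steps listed above concrete, here is a minimal Python sketch of that kind of canonicalization. It is not the code of either the Java or the Python implementation, and it is not what the CDXJ spec prescribes. The names `canonicalize`, `surt_key`, `unquote_until_stable`, and `SESSION_PARAMS`, the session parameter list, and the cap on decoding rounds are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit, unquote, quote, parse_qsl, urlencode

# Hypothetical, incomplete set of session parameter names (compared lowercased).
SESSION_PARAMS = {"jsessionid", "cfid", "cftoken", "phpsessid"}


def unquote_until_stable(text, max_rounds=10):
    """Percent-decode repeatedly until the text stops changing.

    max_rounds is an assumed safety cap, not taken from any implementation.
    """
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            break
        text = decoded
    return text


def canonicalize(url):
    """Apply the canonicalization steps from the list, in simplified form."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Strip one leading www., www1., www2., ... label (single pass only).
    host = re.sub(r"^www\d*\.", "", host)
    # Keep the port unless it is the default port 80 for http.
    port = parts.port
    if port is not None and not (parts.scheme == "http" and port == 80):
        host = f"{host}:{port}"
    # Decode the path until stable, then re-escape it once.
    path = quote(unquote_until_stable(parts.path), safe="/") or "/"
    # Drop session identifiers, then sort the remaining parameters.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    params.sort()
    query = urlencode(params)
    return f"{parts.scheme}://{host}{path}" + (f"?{query}" if query else "")


def surt_key(url):
    """Build a SURT-style key: reversed host labels, then ')', path, query."""
    canon = urlsplit(canonicalize(url))
    host, _, port = canon.netloc.partition(":")
    key = ",".join(reversed(host.split(".")))
    if port:
        key += f":{port}"
    key += ")" + canon.path
    if canon.query:
        key += "?" + canon.query
    return key
```

With this sketch, `surt_key("http://www2.Example.com:80/a%2520b?b=2&JSESSIONID=abc&a=1")` returns `com,example)/a%20b?a=1&b=2`. Each step could be switched on or off independently, which is one way to think about the different "strengths" of canonicalization.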

I can see that different strengths of canonicalization can be appropriate for different use cases, but I'm curious to understand what went into the CDXJ choices.
