Multi-word street names sometimes cause problems #6
I've been thinking about this issue over the past few days.

To minimise the search index size (and the amount of data needed by the browser), the geocoder uses many search indexes instead of one huge index. Each unique address is assigned to a search index based on the street name's first letter and metaphone (a phonetic algorithm). This allows fuzzy matching as long as you correctly enter the first letter of the street name.

Here's the problem: when parsing user input I try to find the street name in order to load the correct search index, but that isn't always accurate, and accuracy gets even worse when the street name contains multiple words.

A potential solution would be to assign each address to multiple indexes, one for each word of the street name and suburb, but I fear this would make the search indexes too large. To offset the size increase we could use finer-grained search indexes, but then we're potentially publishing hundreds of thousands of data files.

An idea I'm contemplating is to have two levels of search indexes:

1. For each letter of the alphabet, a list of words from street names and suburbs starting with that letter.
2. For each word, the data for every address containing that word.

Say a user entered "11 lily pily crt" and the intended address was "11 Lilly Pilly Court, Burpengary" (notice the typos). The geocoder would perform a fuzzy search against all words starting with "L", "P" and "C", which should find "Lilly", "Pilly" and many other candidate words. It would then download all addresses containing those words and run a fuzzy search against them.

The catch is that we'd have to download tens (maybe even hundreds) of files before we had enough data to work with. But what if the second level of search indexes (i.e. the data for each word) lived in a single tar archive, and the geocoder downloaded everything it needed at once using a multipart range request? I think I'm onto something here but I need to do more research.
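Roughly, the two-level lookup might look like the sketch below. The file names (`words-l.json`, `word-lilly.json`) and the `fuzzyMatch` helper are placeholders to illustrate the idea, not the actual index layout or matching algorithm.

```typescript
// Sketch of the proposed two-level lookup (hypothetical file layout and helpers).
// Level 1: per-letter word lists, e.g. words-l.json -> ["lilly", "lily", ...]
// Level 2: per-word address data, e.g. word-lilly.json -> [{ address, lat, lon }, ...]

interface Address {
  address: string;
  lat: number;
  lon: number;
}

// Placeholder fuzzy matcher; the real thing would use something like
// edit distance or metaphone comparison with a similarity threshold.
function fuzzyMatch(query: string, candidate: string): boolean {
  return candidate.startsWith(query.slice(0, 3)); // crude stand-in
}

async function geocode(input: string, baseUrl: string): Promise<Address[]> {
  // Split the query into words and drop anything that looks like a house number.
  const words = input
    .toLowerCase()
    .split(/\s+/)
    .filter(w => w.length > 0 && !/^\d+$/.test(w));

  // Level 1: for each word, load the word list for its first letter
  // and fuzzy-match against it to collect candidate words.
  const candidateWords = new Set<string>();
  for (const word of words) {
    const res = await fetch(`${baseUrl}/words-${word[0]}.json`);
    const wordList: string[] = await res.json();
    for (const candidate of wordList) {
      if (fuzzyMatch(word, candidate)) candidateWords.add(candidate);
    }
  }

  // Level 2: download the address data for every candidate word, then run
  // the final fuzzy search over the combined set.
  const addresses: Address[] = [];
  for (const word of candidateWords) {
    const res = await fetch(`${baseUrl}/word-${word}.json`);
    addresses.push(...((await res.json()) as Address[]));
  }
  return addresses; // final fuzzy ranking against the full input would happen here
}
```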
Some more thoughts after a bit of research...
If we can rely on HTTP/2 multiplexing (and it looks like all browsers now support it) then hundreds of small requests may not even be an issue.
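If the second-level files stay as individual objects, the fetching side could be as simple as firing the requests in parallel and letting HTTP/2 multiplex them over one connection. A minimal sketch (the URL pattern is an assumption):

```typescript
// Fetch the per-word address files in parallel. Over HTTP/2 they all share a
// single connection, so "hundreds of small requests" is mostly a question of
// server limits rather than connection overhead. URL pattern is hypothetical.
async function fetchWordFiles(baseUrl: string, words: string[]): Promise<unknown[]> {
  return Promise.all(
    words.map(async (word) => {
      const res = await fetch(`${baseUrl}/word-${word}.json`);
      if (!res.ok) throw new Error(`Failed to fetch data for "${word}": ${res.status}`);
      return res.json();
    })
  );
}
```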
This idea works but with two big caveats:

1. Limited server-side support. Digital Ocean Spaces works if you add 'Range' as an allowed header in the CORS settings. ABC via Akamai (e.g. abc.net.au/res) does not work.
2. Multipart responses can't use Content-Encoding. In other words, there's no automatic compression. To use compression we'd have to compress each file before adding it to the tar file, and decompress each part of the response in the browser via JavaScript. Zstandard is a good compression option if we wanted to take this approach - it's fast, compresses better than gzip, and the decompression algorithm weighs only 28KB.

But if we can rely on HTTP/2 multiplexing, then all this would be irrelevant anyway.
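For the tar-archive variant, the request is a single Range header listing multiple byte ranges, and the response comes back as multipart/byteranges. A sketch, assuming each entry's offset and length inside the archive is already known (e.g. from a small index file published alongside it; that index is an assumption, not something that exists today):

```typescript
// Request several members of the tar archive in one round trip using a
// multipart range request. Offsets/lengths would come from a hypothetical
// index file published alongside the archive.
interface TarEntry {
  word: string;
  offset: number; // byte offset of the file's data inside the archive
  length: number; // byte length of the file's data
}

async function fetchFromArchive(archiveUrl: string, entries: TarEntry[]): Promise<Response> {
  const ranges = entries
    .map(e => `${e.offset}-${e.offset + e.length - 1}`)
    .join(",");

  // Note: 'Range' must be listed as an allowed header in the bucket's CORS
  // settings (the Digital Ocean Spaces caveat mentioned above).
  const res = await fetch(archiveUrl, {
    headers: { Range: `bytes=${ranges}` },
  });

  // Expect 206 Partial Content with Content-Type: multipart/byteranges.
  // fetch() doesn't parse multipart bodies, so the parts would have to be
  // split on the boundary (taken from the Content-Type header) in JavaScript
  // and, if each file was pre-compressed, decompressed individually
  // (e.g. with a zstd build compiled for the browser).
  if (res.status !== 206) throw new Error(`Expected 206, got ${res.status}`);
  return res;
}
```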
The indexing and address parsing system doesn't do a good job of accommodating multi-word street names. For example: