
Multi-word street names sometimes cause problems #6

Open
drzax opened this issue Oct 27, 2022 · 2 comments

Comments

drzax (Member) commented Oct 27, 2022

The indexing and address parsing system doesn't do a good job of accommodating multi-word street names. For example:

  • 13 old pioneer crescent
  • 11 lilly pilly court, burpengary
  • 68 lake manchester road
  • 300 kelvin grove road
andrewkesper (Collaborator) commented:

I've been thinking about this issue over the past few days.

To minimise the search index size (and the amount of data needed by the browser) the geocoder uses many search indexes instead of one huge index. Each unique address is assigned to a search index based on the street name's first letter and metaphone (a phonetic algorithm). This allows fuzzy matching as long as you correctly enter the first letter of the street name.
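As a rough sketch, the assignment could look something like the following. The `metaphone()` stand-in and the index file naming are made up for illustration, not the actual build code:

```ts
// Sketch only: assign each address to an index file keyed by the street
// name's first letter plus its metaphone encoding.

// Crude placeholder for a real phonetic encoder (a proper metaphone
// library would be used in practice).
function metaphone(word: string): string {
  return word.toUpperCase().replace(/[^A-Z]/g, "").replace(/[AEIOU]/g, "").slice(0, 4);
}

interface Address {
  number: string;
  street: string; // e.g. "Kelvin Grove Road"
  suburb: string; // e.g. "Kelvin Grove"
}

function indexKey(address: Address): string {
  const street = address.street.toLowerCase();
  // One index file per (first letter, metaphone) pair, e.g. "k-KLVN.json"
  return `${street[0]}-${metaphone(street)}.json`;
}
```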

Here's the problem: when parsing user input I'm trying to find the street name in order to load the correct search index. But that parsing isn't always accurate, and accuracy gets worse when the street name contains multiple words.

A potential solution would be to assign each address to multiple indexes, one for each word of the street name and suburb. But I fear this would make the search indexes too large.

To offset the size increase perhaps we could use finer-grained search indexes. But then we're potentially publishing hundreds of thousands of data files.

An idea I'm contemplating is to have two levels of search indexes:

(1) For each letter of the alphabet, a list of words from street names and suburbs starting with that letter.

(2) For each word, the data for every single address containing that word.

Say a user entered "11 lily pily crt" and the intended address was "11 Lilly Pilly Court, Burpengary" (notice the typos). The geocoder would perform a fuzzy search against all words starting with "L", "P" and "C". It should find "Lilly", "Pilly" and many other candidate words.

The geocoder would then download all addresses containing those words and run a fuzzy search against those.
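A minimal sketch of that two-level lookup is below. The file names, URLs and the Levenshtein-based matcher are placeholders standing in for whatever fuzzy matching the geocoder actually uses:

```ts
// Minimal Levenshtein distance, standing in for a proper fuzzy matcher.
function editDistance(a: string, b: string): number {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      d[i][j] = Math.min(
        d[i - 1][j] + 1,
        d[i][j - 1] + 1,
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
  return d[a.length][b.length];
}

async function lookup(query: string): Promise<string[]> {
  const words = query.toLowerCase().split(/\s+/).filter((w) => !/^\d+$/.test(w));

  // Level 1: per-letter word lists, e.g. words/l.json -> ["lilly", "lake", ...]
  const candidateWords: string[] = [];
  for (const letter of new Set(words.map((w) => w[0]))) {
    const list: string[] = await (await fetch(`words/${letter}.json`)).json();
    for (const w of words.filter((q) => q[0] === letter))
      candidateWords.push(...list.filter((c) => editDistance(c, w) <= 2));
  }

  // Level 2: per-word address files, e.g. addresses/lilly.json
  const addresses = new Set<string>();
  for (const word of new Set(candidateWords)) {
    const entries: string[] = await (await fetch(`addresses/${word}.json`)).json();
    entries.forEach((a) => addresses.add(a));
  }

  // Final fuzzy pass against the full address strings.
  return [...addresses]
    .sort((a, b) => editDistance(a.toLowerCase(), query) - editDistance(b.toLowerCase(), query))
    .slice(0, 10);
}
```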

This would be a problem because we'd have to download tens (maybe even hundreds) of files before we had enough data to work with.

But what if the second level of search indexes (i.e. the data for each word) resided in a single tar archive, and the geocoder downloaded everything it needed at once using a multipart range request?
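Roughly, the request side could look like this. The archive URL and the byte-offset index are hypothetical; the point is just that a single Range header can name several byte ranges at once:

```ts
// Sketch: fetch several entries from one tar archive with a single
// multi-part range request.
interface TarEntry {
  offset: number; // byte position of the entry's data within the archive
  size: number;
}

async function fetchEntries(
  archiveUrl: string,
  index: Record<string, TarEntry>,
  names: string[]
): Promise<Response> {
  // e.g. "bytes=1024-2047, 5120-6143" -- one range per word file we need
  const ranges = names
    .map((n) => index[n])
    .filter((e): e is TarEntry => e !== undefined)
    .map((e) => `${e.offset}-${e.offset + e.size - 1}`)
    .join(", ");

  // A 206 response comes back as Content-Type: multipart/byteranges and
  // must be split on its boundary in JavaScript -- there's no built-in parser.
  return fetch(archiveUrl, { headers: { Range: `bytes=${ranges}` } });
}
```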

I think I'm onto something here but I need to do more research.

andrewkesper (Collaborator) commented:

Some more thoughts after a bit of research...

> This would be a problem because we'd have to download tens (maybe even hundreds) of files before we had enough data to work with.

If we can rely on HTTP/2 multiplexing (and it looks like all browsers now support it) then hundreds of small requests may not even be an issue.
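In that case the per-word files could simply be fetched in parallel over the one connection. A quick sketch (URLs are placeholders):

```ts
// Sketch: over HTTP/2, many small requests share a single connection,
// so each candidate word's file is just fetched in parallel.
async function fetchWordFiles(words: string[]): Promise<string[][]> {
  return Promise.all(
    words.map(async (word) => {
      const res = await fetch(`addresses/${word}.json`);
      if (!res.ok) return []; // tolerate missing word files
      return res.json() as Promise<string[]>;
    })
  );
}
```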

> But what if the second level of search indexes (i.e. the data for each word) resided in a single tar archive, and the geocoder downloaded everything it needed at once using a multipart range request?

This idea works but with two big caveats.

(1) Limited server-side support. Digital Ocean Spaces works if you add 'Range' as an allowed header in the CORS settings. ABC via Akamai (e.g. abc.net.au/res) does not work.

(2) Multipart responses can't use Content-Encoding. In other words, there's no automatic compression. To use compression we'd have to compress each file before adding it to the tar file, and decompress each part of the response in the browser via JavaScript. Zstandard is a good compression option if we wanted to take this approach - it's fast, compresses better than gzip, and the decompression algorithm weighs only 28KB.
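The browser side of that might look roughly like the sketch below. It assumes a JavaScript Zstandard decoder (e.g. the small fzstd package) exposing a decompress(Uint8Array) function; I haven't verified that exact API, so treat it as an assumption:

```ts
// Sketch: decompress one zstd-compressed tar entry in the browser.
// Assumes a JS Zstandard decoder exposing decompress(Uint8Array): Uint8Array
// (API not verified here).
import { decompress } from "fzstd";

function readCompressedEntry(part: ArrayBuffer): string {
  const bytes = decompress(new Uint8Array(part));
  return new TextDecoder().decode(bytes); // back to JSON text
}
```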

But if we can rely on HTTP/2 multiplexing, then all this would be irrelevant anyway.
