
Multi-word street names sometimes cause problems #6

Open
drzax opened this issue Oct 27, 2022 · 2 comments

Comments

drzax (Member) commented Oct 27, 2022

The indexing and address parsing system doesn't do a good job of accommodating multi-word street names. For example:

  • 13 old pioneer crescent
  • 11 lilly pilly court, burpengary
  • 68 lake manchester road
  • 300 kelvin grove road
andrewkesper (Collaborator) commented:

I've been thinking about this issue over the past few days.

To minimise the search index size (and the amount of data needed by the browser) the geocoder uses many search indexes instead of one huge index. Each unique address is assigned to a search index based on the street name's first letter and metaphone (a phonetic algorithm). This allows fuzzy matching as long as you correctly enter the first letter of the street name.
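As a rough sketch, the assignment could look something like the following. The `metaphone()` stand-in and the index file naming are made up for illustration, not the actual build code:

```ts
// Sketch only: assign each address to an index file keyed by the street
// name's first letter plus its metaphone encoding.

// Crude placeholder for a real phonetic encoder (a proper metaphone
// library would be used in practice).
function metaphone(word: string): string {
  return word.toUpperCase().replace(/[^A-Z]/g, "").replace(/[AEIOU]/g, "").slice(0, 4);
}

interface Address {
  number: string;
  street: string; // e.g. "Kelvin Grove Road"
  suburb: string; // e.g. "Kelvin Grove"
}

function indexKey(address: Address): string {
  const street = address.street.toLowerCase();
  // One index file per (first letter, metaphone) pair, e.g. "k-KLVN.json"
  return `${street[0]}-${metaphone(street)}.json`;
}
```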

Here's the problem: when parsing user input I'm trying to find the street name in order to load the correct search index. But that parsing isn't always accurate, and accuracy gets worse when the street name contains multiple words.

A potential solution would be to assign each address to multiple indexes, one for each word of the street name and suburb. But I fear this would make the search indexes too large.

To offset the size increase perhaps we could use finer-grained search indexes. But then we're potentially publishing hundreds of thousands of data files.

An idea I'm contemplating is to have two levels of search indexes:

(1) For each letter of the alphabet, a list of words from street names and suburbs starting with that letter.

(2) For each word, the data for every single address containing that word.

Say a user entered "11 lily pily crt" and the intended address was "11 Lilly Pilly Court, Burpengary" (notice the typos). The geocoder would perform a fuzzy search against all words starting with "L", "P" and "C". It should find "Lilly", "Pilly" and many other candidate words.

The geocoder would then download all addresses containing those words and run a fuzzy search against those.
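A minimal sketch of that two-level lookup is below. The file names, URLs and the Levenshtein-based matcher are placeholders standing in for whatever fuzzy matching the geocoder actually uses:

```ts
// Minimal Levenshtein distance, standing in for a proper fuzzy matcher.
function editDistance(a: string, b: string): number {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      d[i][j] = Math.min(
        d[i - 1][j] + 1,
        d[i][j - 1] + 1,
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
  return d[a.length][b.length];
}

async function lookup(query: string): Promise<string[]> {
  const words = query.toLowerCase().split(/\s+/).filter((w) => !/^\d+$/.test(w));

  // Level 1: per-letter word lists, e.g. words/l.json -> ["lilly", "lake", ...]
  const candidateWords: string[] = [];
  for (const letter of new Set(words.map((w) => w[0]))) {
    const list: string[] = await (await fetch(`words/${letter}.json`)).json();
    for (const w of words.filter((q) => q[0] === letter))
      candidateWords.push(...list.filter((c) => editDistance(c, w) <= 2));
  }

  // Level 2: per-word address files, e.g. addresses/lilly.json
  const addresses = new Set<string>();
  for (const word of new Set(candidateWords)) {
    const entries: string[] = await (await fetch(`addresses/${word}.json`)).json();
    entries.forEach((a) => addresses.add(a));
  }

  // Final fuzzy pass against the full address strings.
  return [...addresses]
    .sort((a, b) => editDistance(a.toLowerCase(), query) - editDistance(b.toLowerCase(), query))
    .slice(0, 10);
}
```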

This would be a problem because we'd have to download tens (maybe even hundreds) of files before we had enough data to work with.

But what if the second level of search indexes (i.e. the data for each word) resided in a single tar archive, and the geocoder downloaded everything it needed at once using a multipart range request?
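Roughly, the request side could look like this. The archive URL and the byte-offset index are hypothetical; the point is just that a single Range header can name several byte ranges at once:

```ts
// Sketch: fetch several entries from one tar archive with a single
// multi-part range request.
interface TarEntry {
  offset: number; // byte position of the entry's data within the archive
  size: number;
}

async function fetchEntries(
  archiveUrl: string,
  index: Record<string, TarEntry>,
  names: string[]
): Promise<Response> {
  // e.g. "bytes=1024-2047, 5120-6143" -- one range per word file we need
  const ranges = names
    .map((n) => index[n])
    .filter((e): e is TarEntry => e !== undefined)
    .map((e) => `${e.offset}-${e.offset + e.size - 1}`)
    .join(", ");

  // A 206 response comes back as Content-Type: multipart/byteranges and
  // must be split on its boundary in JavaScript -- there's no built-in parser.
  return fetch(archiveUrl, { headers: { Range: `bytes=${ranges}` } });
}
```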

I think I'm onto something here but I need to do more research.

andrewkesper (Collaborator) commented:

Some more thoughts after a bit of research...

> This would be a problem because we'd have to download tens (maybe even hundreds) of files before we had enough data to work with.

If we can rely on HTTP/2 multiplexing (and it looks like all browsers now support it) then hundreds of small requests may not even be an issue.
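In that case the per-word files could simply be fetched in parallel over the one connection. A quick sketch (URLs are placeholders):

```ts
// Sketch: over HTTP/2, many small requests share a single connection,
// so each candidate word's file is just fetched in parallel.
async function fetchWordFiles(words: string[]): Promise<string[][]> {
  return Promise.all(
    words.map(async (word) => {
      const res = await fetch(`addresses/${word}.json`);
      if (!res.ok) return []; // tolerate missing word files
      return res.json() as Promise<string[]>;
    })
  );
}
```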

> But what if the second level of search indexes (i.e. the data for each word) resided in a single tar archive, and the geocoder downloaded everything it needed at once using a multipart range request?

This idea works but with two big caveats.

(1) Limited server-side support. Digital Ocean Spaces works if you add 'Range' as an allowed header in the CORS settings. ABC via Akamai (e.g. abc.net.au/res) does not work.

(2) Multipart responses can't use Content-Encoding. In other words, there's no automatic compression. To use compression we'd have to compress each file before adding it to the tar file, and decompress each part of the response in the browser via JavaScript. Zstandard is a good compression option if we wanted to take this approach - it's fast, compresses better than gzip, and the decompression algorithm weighs only 28KB.
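The browser side of that might look roughly like the sketch below. It assumes a JavaScript Zstandard decoder (e.g. the small fzstd package) exposing a decompress(Uint8Array) function; I haven't verified that exact API, so treat it as an assumption:

```ts
// Sketch: decompress one zstd-compressed tar entry in the browser.
// Assumes a JS Zstandard decoder exposing decompress(Uint8Array): Uint8Array
// (API not verified here).
import { decompress } from "fzstd";

function readCompressedEntry(part: ArrayBuffer): string {
  const bytes = decompress(new Uint8Array(part));
  return new TextDecoder().decode(bytes); // back to JSON text
}
```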

But if we can rely on HTTP/2 multiplexing, then all this would be irrelevant anyway.
