
Explanation for Hash Key length rationale? #39

Open
atyshka opened this issue Jul 6, 2020 · 1 comment



atyshka commented Jul 6, 2020

I'm trying to understand the documentation on the hash key length:

However if your data is dense and hashKeyLength too short, more RCUs will be needed to read a hash key and a higher proportion will be discarded by server-side filtering.

Let's imagine an extreme scenario where our hash key length is 1 and all of our data is stored under this one hash key. DynamoDB queries only consume RCUs for items that are actually read, and we are still restricting the query to the range between minvalue and maxvalue on the sort key. As I understand it, only items between minvalue and maxvalue count as read and therefore contribute to the consumed RCUs, and the number of items in that range is the same regardless of the hash key length. Therefore, using a small hashKeyLength should have no impact on RCUs. Of course, it would lead to a hot partition, but that is a separate issue. I also understand why a large hashKeyLength is bad, since it results in more query operations. What I still don't understand is why a small hash key length wouldn't be the optimal choice. Is there something I'm misunderstanding about the geohash algorithm or DynamoDB RCUs?
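For concreteness, this is roughly the query shape I have in mind. It's only a sketch: the table and attribute names are placeholders, and minvalue/maxvalue stand for the geohash range the library derives for a covering cell.

```ts
import { DynamoDB } from 'aws-sdk';

const ddb = new DynamoDB.DocumentClient();

// Sketch of one of the Query calls a radius search decomposes into.
// With hashKeyLength = 1 there is effectively a single partition key,
// but the key condition still bounds the sort key, so only items whose
// geohash falls inside [minvalue, maxvalue] should be read and billed.
async function queryCell(hashKey: number, minvalue: number, maxvalue: number) {
  return ddb
    .query({
      TableName: 'geo-table', // placeholder table name
      KeyConditionExpression: 'hashKey = :h AND geohash BETWEEN :min AND :max',
      ExpressionAttributeValues: { ':h': hashKey, ':min': minvalue, ':max': maxvalue },
    })
    .promise();
}
```

If that's accurate, items outside the range never count toward consumed capacity, no matter how coarse the hash key is.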

@gham-khaled

Actually, I don't think it works this way. Let's suppose you want to query within a 20 m radius and your hash key length is 1:

  • You would have to retrieve every item whose geohash starts with that same single character (probably all of your items, unless your locations are spread across the world), and the filtering would then be done server-side with the Haversine formula (I don't think a min and max condition on the sort key applies here). This is the same as scanning the whole table and comparing positions one by one, as if you weren't taking advantage of the geohash at all.

Now suppose your hash key length is 10:

  • You would only have to retrieve the items sharing the same 10 leading geohash characters (probably the items within about 50 m, though I'm not sure), and then filter the rest with the same formula. This way you optimize the search by filtering out most of the locations in your table from the start (see the sketch below).
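To make the prefix effect concrete, here is a small illustration. It uses string geohashes for readability (the library's numeric geohash is truncated analogously), and the cell sizes are the standard geohash precision figures, not something measured from this library:

```ts
// Illustration only: derive the partition (hash) key by keeping the
// leading characters of the geohash.
function hashKeyFor(geohash: string, hashKeyLength: number): string {
  return geohash.slice(0, hashKeyLength);
}

const geohash = 'u4pruydqqvj8'; // the classic example point in Denmark

// hashKeyLength = 1: a single character names a cell roughly
// 5,000 km x 5,000 km, so nearly every item in a regional dataset
// lands under the same hash key.
console.log(hashKeyFor(geohash, 1)); // "u"

// hashKeyLength = 10: ten characters name a cell on the order of
// one metre, so a query only touches items that are already nearby.
console.log(hashKeyFor(geohash, 10)); // "u4pruydqqv"
```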
Hope this is clear.
