We have deterministic keys on our entities, and in certain cases those keys share prefixes.
Basically, we'd like to query by this prefix, which is currently also stored in a ComputedProperty (that property drives the query in its current iteration).
To visualize:
25 million entities with key IDs that start with "abc"
2 million entities with key IDs that start with "def"
Using the current query, we see very poor sharding. One shard handles everything in the "def" range.
Knowing that our keys are prefixed this way, we can easily determine the first and last Key for the KeyRange, but I'm not sure how we would use this knowledge to create the sharding we'd like to see.
Any ideas?
There are two ways to do sharding: splitting lexicographically, or using the scatter property. The latter is better and is used automatically whenever possible. If you are running a MR over all entities of a given type that happen to have strangely distributed IDs, that should just work out of the box, because it will use the scatter property to find the split points. So I'm assuming your problem is actually more complicated, and that your table looks like:
123-a
123-b
...
456-abc-1
...
456-abc-200
...
456-def-1
...
789-a
and you only want to include the ones starting with "456", but the sub-ranges "abc" and "def" under it are very uneven. In this case you need to cajole your use case into working with scatter. A prefix match will not help, because a prefix of 456 gets translated into the range "id >= 456 and id < 457", and the only way to split a key range like that is lexicographically. Instead, add a property with the value "456" to the entities. If you then run a MR over the whole keyspace with a filter on that property ("newproperty=456"), it will split accurately regardless of how the IDs are distributed.
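To make the difference concrete, here is a minimal Python sketch. It is not the library's code: `prefix_range`, `make_ids`, `split_by_samples` and `shard_sizes` are hypothetical helpers, the data is a scaled-down version of the skew described above, and plain random sampling stands in for the scatter property as an assumption for illustration.

```python
import bisect
import random


def prefix_range(prefix):
    """Return (start, end) such that start <= id < end matches the prefix."""
    # Bump the last character to get the exclusive upper bound: "456" -> "457".
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


def make_ids():
    """Hypothetical skewed data: 2500 "abc" ids and 200 "def" ids under "456"."""
    abc_ids = ["456-abc-%d" % i for i in range(1, 2501)]
    def_ids = ["456-def-%d" % i for i in range(1, 201)]
    return abc_ids + def_ids


def split_by_samples(ids, shard_count, sample_size, rng):
    """Choose split points from a random sample of the ids.

    Assumption: random sampling stands in for the scatter property here.
    """
    sample = sorted(rng.sample(ids, sample_size))
    step = len(sample) // shard_count
    return [sample[i * step] for i in range(1, shard_count)]


def shard_sizes(ids, split_points):
    """Count ids per shard; split_points must be sorted ascending."""
    sizes = [0] * (len(split_points) + 1)
    for id_ in ids:
        # The shard index is the number of split points <= this id.
        sizes[bisect.bisect_right(split_points, id_)] += 1
    return sizes


ids = make_ids()
print(prefix_range("456"))            # ('456', '457')
print(shard_sizes(ids, ["456-def"]))  # [2500, 200]: split on the sub-prefix
# Split points drawn from a sample follow the data, so shards come out similar.
print(shard_sizes(ids, split_by_samples(ids, 4, 512, random.Random(0))))
```

Splitting on the sub-prefix boundary reproduces the lopsided shards, while split points drawn from a sample of the actual IDs land where the data is dense. That is the reason to prefer a property filter, which lets scatter-based splitting apply, over a key range.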