-
Notifications
You must be signed in to change notification settings - Fork 4
Hash Keys
Background
JImmutable Collections includes two basic types of maps (Map Tutorial): sorted and unsorted. A third type, ordered maps, use an unsorted map for their implementation internally so they won't be addressed here. Sorted maps are implemented using balanced binary trees. Unsorted maps are implemented using hash array mapped tries (HAMT).
Most hash map implementations use an array sufficiently large to hold all of the elements in the map plus some extra capacity to allow for growth and minimize hash collisions. These implementations compute an integer hash code for each key (in Java the hashCode() method is used for this purpose) and then take the modulus of this code and the size of the array to produce an index into the array. Since hash codes are 32-bit integers the arrays themselves are tiny in comparison to the 4 billion potential hash codes. When two different hash codes map to the same array index this is called a "hash collision". Allowing extra space in the array and using a prime number for its size can help to reduce the number of collisions.
In contrast a HAMT is a shallow tree-like structure which maps every possible hash code to a unique location in the trie. The trie's potential capacity is exactly equal to the number of possible hash codes so two hash codes can never "collide" due to limited capacity of the structure. In that sense a HAMT is an ideal structure for hashing so it would seem that hash collisions would be impossible.
However there is another potential source of hash collisions. Specifically a poorly implemented hash function can generate the same hash code for many keys. An obvious example would be one that hashes strings by simply adding the UTF-32 integer values of each character. This would obviously produce exactly the same hash code for the strings "ABCD", "DCBA", and "ADBC".
Since hash collisions are always possible hash map implementations have to adopt strategies for dealing with collisions.
Collision Handling Strategies
Versions of JImmutable Collections prior to 1.6 relied on the hash functions producing very few collisions. Based on that simplifying assumption they adopted the simple strategy of using linked lists of key/value pairs to resolve collisions at any given location in the HAMT. For example if 9 different keys stored in the hash map have exactly the same hash code the map would store all nine elements in a linked list at the HAMT node corresponding to that hash code. Whenever the get() method is called for one of those keys the map would find the node with the appropriate hash code in the HAMT and then walk the linked list looking for a matching key. Obviously this implementations performance would degrade in the presence of poor hashCode() methods that generate frequent collisions.
A second hash collision strategy has been available beginning with version 1.6. This second strategy uses balanced trees instead of linked lists at each node. In the example above with 9 keys in the same HAMT node a linked list could require up to 9 list nodes to be visited searching for a key but with a balanced tree at most 4 tree nodes would have to be visited. With 128 collisions in a single node the list would require visiting up to 128 list nodes but the balanced tree would require at most 7 tree nodes.
Obviously the balanced tree strategy offers superior performance. However since it relies on the ability to keep keys sorted in the tree it can only be used for certain types of keys. Specifically keys must implement the Comparable
interface. Since the factory method IMaps.hashed()
, does not require that keys implement Comparable
each map waits to decide which strategy to use until the first key is added to the map. On the first call to insert() the map determines if the key is Comparable
. If it is the map uses the tree strategy. Otherwise it falls back on the linked list strategy.
Best Practices for Unsorted Map Keys
All of this background boils down to a few simple best practices to follow when deciding which objects to use as keys in your unsorted maps.
- Ensure that all keys in the map are implementations of a single class.
- Ensure the key class is immutable.
- Ensure the key class has an excellent
hashCode()
method. - Ensure the key class implements
Comparable<K>
(where K is the key class).
Restricting your keys to a single class ensures that all of the keys will always be comparable to one another. It also ensures that all keys either implement Comparable
or do not. For example if you have a class A
which is not Comparable
and a subclass B
of A
which is Comparable
then an unsorted map will encounter exceptions if the first key inserted is of class B
but a later key is of class A
. In that case the map will choose to use the tree strategy when the B
key is inserted but when the map tries to cast the A
key to Comparable
an exception will be thrown. All keys in a given map must either implement Comparable
or not implement Comparable
. You cannot have a mix of the two in a single map. The easiest way to prevent this problem from happening is to use homogeneous hash keys.
Restricting your keys to be immutable is required not just in the JImmutable maps but also in the java.util maps. If a key is mutable it would be possible for its hashCode()
to change after the key has already been added to the map. If that were to happen later attempts to delete or move the key could corrupt the map.
Having an excellent hashCode()
will make it highly unlikely that any collisions will happen in a HAMT. Integer is an obvious example of a class whose hash codes never collide. (note: if your keys are Integers though you could use a IArray
for better efficiency) There is always a trade off of performance vs collision avoidance. For example computing an MD5 digest and using the first 32 bits would probably yield a fantastic hash code but would be extremely slow. Do some reading on how to implement efficient hashCode()
methods. The time expended will pay off in better performance for your program.
Restricting your keys to those classes which implement Comparable
will allow the map to use a tree for maximum performance if collisions do happen. Imagine the worst case scenario of a hashCode()
that always generates the same number. In that case a map using the tree strategy will have O(log(N)) performance while a map using linked list strategy will have O(N) performance. So the small amount of time needed to add a well written compareTo()
method can pay off in better performance if your hashCode()
method proves to be less than stellar.
Of course the easiest strategy is to simply use one of Java's built in value types as keys. String, Integer, Double, etc all implement Comparable
and have good hashCode() implementations.