Saturday 22 November 2008

Item 9: Always override hashCode when overriding equals

This one is pretty straightforward - none of the joys that doing a proper equals function gives you, since you're only concerned with yourself.

So, why do it? The short answer is "Because you're told to".

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Object.html#hashCode()

Second bullet point says two objects which are equal according to equals(Object) must have the same hash code. The default Object.hashCode() will nearly always give different answers for different objects, despite what equals() may say, so it's unsuitable.

He then goes into a little more detail on why this is. Here's my way of thinking about it. Consider a simple value object - IIRC his is a phone number. Let's make a phone book - a Hashtable, key this phone number object, value being the name associated with this phone number. (ok, it's a backwards phone book).

This phone book has an entry for 555-1234 - for Fred, who lives on USian TV. If I want to find out who's associated with 555-1234, I need to create a phone number entry and ask the hash table for the value associated with this. But if I've not overridden hashCode, there's no reason the hash table will be able to find it - it'll look in the hash bucket for this test object (here I'm assuming everybody understands how hash tables work - is this the case?), but the one for Fred could well be in a different hash bucket. Oops, it's lost.

So, we now know we need to write one. Great - but how to do it? He then describes a simple algorithm for doing so. Lots of multiplying by 17 or 37 (prime numbers) - gives a good spread.

The choice of which fields to consider when writing hashCode is important. The main rule is you mustn't let anything which isn't used in equals affect hashCode - otherwise you can end up with different hashCodes for equal objects.

You can be slack about what you use - you don't need to use all the fields. But you want to be sure your choice gives you a good range - otherwise you'll end up with all your values in not many hash buckets, which means you're effectively doing a very stupid scan on the bucket to find stuff - full table scan in database-land is the equivalent.

Hash calculations obviously want to be reasonably fast. He mentions caching them, and also using lazy evaluation in an example - but notes this is only really relevant for big complex stuff. (The cached value mustn't be used in the hash calculation :-) )

In the 1st ed, he goes on about how the built-in ones (eg for String) have had a bit of a chequered past - ie they didn't necessarily work terribly well. Not sure if he's kept that in the 2nd ed - I'm hoping these days they all just work.

There is a temptation to say "well, I'm not going to use this class in a hashed collection". If so, returning the same value for all of them would work - but you'd need to be really confident nobody was going to change their mind later, and writing the function properly is sufficiently easy that you may as well just do it and be safe.

I think I've had about one value class with equals() and hashCode(). I had strings as the primary keys, so just using the hashcode from them was sufficient for me.

Clive George

No comments: