Generating hashCode questions.

_vk_vk
edited October 2014 in Questions about Code

Hi, I'm overriding hashcode for some classes. Well I don't really know how to properly do this, but I boldly :P tried.

The first class is a data holder (Data) , it has an int that has the range of 0-9 and a float from 0.0-1.0 (for now they are only 0 or 1). So I came up with ,

int hashcode = myInt * 100 + int(myFloat); // e.g. 7 and 1.0 =   701
// it will need adjust if float are fully used..

And happily found that my tests indicates that this is working.

Now the second class have an int[] + an arrayLIst of the elements of previous class (Data). Looking into how to do this I found a lot of different information...

Is that correct to assume that listofData.hashCode() will work if Data hashCode is fine? Same for the array can I just use arrayOfInts.hashCode() ?

And then how to put both together safely?

I did this, and my firsts tests seems to show it is working, but...

//@ override
  public  int hashCode() {
   //scenario 
    return arrayOfInts.hashCode() * listOfData.hashCode();
  }//hashCode

I feel that I should look for a more robust stuff here. Then I found this link that has a solution to implement hashcode in any situation... Looks good doesn't it?

http://www.javapractices.com/topic/TopicAction.do?Id=28

Any thoughts and advices are welcome.

thanks

Answers

  • edited October 2014 Answer ✓

    Unfortunately in Java, hashCode() is a mere 4-byte/32-bit int value.
    Unless a class doesn't have much to store, the whole hash will never fit in such tiny 4-byte space!
    That's why a hashCode() isn't expect to be unique. Simply an attempt for it!
    The better the attempt algorithm, the less hashCode() collision among instances w/ diff. field values.

    In your 1st class, you said you only needed a range from 0 to 9 + a range from 0.0 to 1.0.
    For such tiny ranges, even an 8-bit byte could comfortably store both!
    That'd still leave ya w/ 24 bits vacant from an int! :P

    Here's a class example which generates a unique hashCode() for its 3 fields:

    // forum.processing.org/two/discussion/7691/generating-hashcode-questions-
    
    class DataHolder {
      byte  myByte;    // range -128 to 127.     (1 byte)
      float myFloat;   // range 0.00 to 1.00.    (1 byte)
      short myShort;   // range -32768 to 32767. (2 bytes)
    
      DataHolder() {
      }
    
      DataHolder(int b, float f, int s) {
        myByte  = (byte)  b;
        myFloat = constrain(f, 0.0, 1.0);
        myShort = (short) s;
      }
    
      @ Override int hashCode() { // 4-byte unique hash:
        return myByte<<030 | (int)(myFloat*100)<<020 | myShort;
      }
    
      @ Override boolean equals(Object o) {
        return o instanceof DataHolder
          ? o.hashCode() == hashCode() : false;
      }
    
      @ Override String toString() {
        return "Byte: " + myByte + "\t\tFloat: "
          + nf(myFloat, 1, 2) + "\tShort: " + myShort;
      }
    }
    
    static final int NUM = 10;
    final DataHolder[] dats = new DataHolder[NUM];
    
    void setup() {
      for (int i = 0; i != NUM; ++i) {
        byte  b = (byte)random(Byte.MAX_VALUE + 1);
        float f = (int)random(100)/100.0;
        short s = (short)random(Short.MAX_VALUE + 1);
    
        println( dats[i] = new DataHolder(b, f, s) );
        println( hex(dats[i].hashCode()) + "\t\t[" + i + "]\n" );
      }
    
      exit();
    }
    
  • Answer ✓
    public int hashCode(){
      String s = myInt + "" + myFloat;
      return s.hashCode();
    }
    
  • edited October 2014

    @quark, String's own hashCode() is pretty good, but not unique.
    There exists a slightly chance that diff. combo values of myInt + "" + myFloat yield same hash! L-)
    But w/ limited ranges, perhaps it ends up being unique! Who knows? 8->
    Either way, as aforementioned, uniqueness isn't required. But gr8 when achievable! :-bd

  • There exists a slightly chance that diff. combo values of myInt + "" + myFloat yield same hash!

    Unless there are two objects that have exactly the same values for myInt and myFloat then there is no more chance of two objects having the same hashCode than any other method. The advantage of using the String class hashCode is obvious, it takes the responsibility of ensuring a good code distribution from us, and we can assume that the Java developers have done a reasonable job of it.

    Writing your own hashCode method from scratch is not recommended because you would have make sure you have an even distribution so they can be used effectively in hash maps and sets.

    Creating a unique String from the objects attribute values and using the String hashCode is a good way of getting round it.

    Also if you provide your own hashCode you should override the equals method because any two objects that are considered equal MUST produce the same hashCode.

  • BTW if you have two objects that are considered unequal, in other words equals() returns false, it is NOT a requirements that their hashCodes are different.

    The hasCode value should NOT be used be used to check for equality between objects, that is the purpose of the equals() method

  • edited October 2014

    @quark, dunno if you paid attention, but myByte<<030 | (int)(myFloat*100)<<020 | myShort;
    yields a unique hashCode() if expected range is obeyed! Those 4 fields together fit in 32 bits!
    Like I said, even if String's hashCode() algorithm is good, it's very far from the definition of unique!
    And they were developed for texts, not numerical values: 3:-O
    http://docs.oracle.com/javase/8/docs/api/java/lang/String.html#hashCode--

  • edited October 2014

    @GoToLoop

    You seem to be under the impression that it is important that unequal objects produce a different hashCodes. That is NOT true, what is important is the distribution of values produced, they should be evenly distributed over the 32 bit range. The only reason most hashCodes are usually unique is that they are evenly distributed over 2^32 unique values, difficult to get duplicates even if you have a million unequal objects.

    The sole purpose of the hashCode is to ensure efficient object storage in hash structures, they should NEVER be used to compare objects for equality.

    I saw your hashCode example but the only way to guarentee an even distribution is if the values for the individual attributes are evenly dstributed over their respective ranges. For instance the values of myByte must be evenly distributed over the range -128 to 127; if myByte is always positive then bit 32 of the hash would always be 0. A good hasCode implementation would give it a 50/50 chance.

    Converting the values to a long String then using the String class hashCode avoids that probem.

  • edited October 2014

    Well, in my simple case, my hashCode() is equivalent to equals().
    I simply fail to see why a unique hash should be lowered to a non-unique hash! 8-}

    As soon it's not possible to fit a hash into a 32 bit int value,
    that's where we should start looking for a good distribution hash algorithm.
    I've seen many online examples of it, but never 1 that turned itself into a String! ~:>

  • edited October 2014

    You seem to be under the impression that it is important that unequal objects produce a different hashCodes.

    From my 2nd reply I said:

    Either way, as aforementioned, uniqueness isn't required. But gr8 when achievable! :-bd

  • Thanks @quark and @GoToLoop. I think this String method is what I was looking for.

    I'am overriding equals() as well.

    But the hash seems to be important to avoid duplicates in Sets. Without overriding it my Sets keep having duplicated elements. Once hashCode() was override it starts to work as expected.

    BTW what about the linked page, did you guys had time to look at it? What about that?

    cheers

    vk

  • did you guys had time to look at it?

    Just a little. It contains diff. algorithms for diff. types. Very diff. than make everything a String! :-$

  • quark: "The sole purpose of the hashCode is to ensure efficient object storage in hash structures, they should NEVER be used to compare objects for equality."
    Not the "sole purpose", I think.
    I believe it is used for quick comparison, not for equality, but for inequality. If hash codes are different, then the objects must be different. If they are equals, then the algorithm MUST use equals() to ensure a real equality.

    That's why good practices tell us to always write both equals() and hashCode() at the same time, and to use the same fields for them. Although some people tell hashCode can use a subset of equals' fields.

    quark, I am not sure I would use your algorithm of hashCode(). This function is called a lot, eg. when walking a tree structure, and called on each step of the algorithm. But your method creates a StringBuffer (or Builder) on each call, converts the fields, compute the hash. It is very costly in resources.

    If you want a really good algorithm; import the Guava library and use their methods for hashCode / equals.

    @ Override
    public boolean equals(Object obj)
    {
        if (!(obj instanceof Book))
            return false;
        if (obj == this)
            return true;
    
        final Book that = (Book) obj;
        return Objects.equal(m_isbn, that.m_isbn) && Objects.equal(m_title, that.m_title) &&
                Objects.equal(m_authors, that.m_authors) && Objects.equal(m_publisher, that.m_publisher);
    }
    
    @ Override
    public int hashCode()
    {
        return Objects.hashCode(m_authors, m_publisher, m_isbn, m_title);
    }
    

    Note that they have also helpers for compareTo and toString:

    @ Override
    public int compareTo(Book that)
    {
        return ComparisonChain.start()
                .compare(m_title, that.m_title)
                .compare(getAuthorList(), that.getAuthorList())
                .compare(m_publisher, that.m_publisher)
                .compare(m_publishingDate.orNull(), that.m_publishingDate.orNull(), Ordering.natural().nullsLast())
                .result();
    }
    
    @ Override
    public String toString()
    {
        return Objects.toStringHelper(this).omitNullValues()
                .add("Title", m_title).add("Authors", getAuthorList()).add("Publisher", m_publisher)
                .add("ISBN", m_isbn).add("Number of pages", m_pageNumber).add("Price", m_price)
                .add("Published", m_publishingDate.isPresent() ? m_publishingDate.get().toString("yyyy-MM-dd") : null)
                .toString();
    }
    

    If you ask Eclipse to generate these functions for you, it will come with a robust method itself, purely based on integer algebra...

    Note that for Map and Set, if there is a conflict of hashCode (ie. same hash code, but equals says they are different), they are still stored in the collection, but in a chained list, so it is less efficient because retrieval must iterate on the lists. Thus the importance of a good algorithm with an even distribution, as quark said.

  • edited October 2014

    @PhiLho

    I believe it is used for quick comparison, not for equality, but for inequality. If hash codes are different, then the objects must be different. If they are equals, then the algorithm MUST use equals() to ensure a real equality.

    True hashCode might be used to test for inequality, but surely that's the purpose of equals or !equals. Also I am not convinced that it would be quicker to generate the hashCode rather than use !equals(...).

    Yes Eclipse provides boiler plate hashCode and equals methods but that doesn't help in Processing IDE.

    I am not familiar with the Guava library but at 2.3Mb seems a lot if you are only going to use the hashCode and equals implementations. I suspect there is a lot more useful stuff in there so I must have a look. :)

    I am not sure I would use your algorithm of hashCode(). This function is called a lot, eg. when walking a tree structure, and called on each step of the algorithm. But your method creates a StringBuffer (or Builder) on each call, converts the fields, compute the hash. It is very costly in resources.

    That is true for tree structures but not for case for maps and sets where the hash is calculated once for a put and a get and is used to determine the index (bucket) for the object, after that get uses 'equals' to locate the matching object.

    @_vk

    But the hash seems to be important to avoid duplicates in Sets. Without overriding it my Sets keep having duplicated elements

    If you don't override hashCode then it will use the default version in the Object class - extract from java.lang.Object

    As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java programming language.)

    It means that if you create 2 distinct objects and assign the same values to both objects they will have different hashCodes for instance the output from this program

    void setup(){
      MyClass c0, c1;
      c0= new MyClass();
      c1 = new MyClass();
      println(c0.hashCode());
      println(c1.hashCode());
    }
    
    class MyClass {
      int myInt = 2;
      float myFloat = 2.5;
      String myString = "quark";
    }
    

    Produced the hashcodes 2083592302 and 2067471732 so both will be stored in a HashSet or HashMap

  • edited October 2014

    I agree w/ @PhiLho too. When I perused about hashCode() some years ago, I've learnt stored calculated hash value is tested before equals() for Map & Set data structure types. If hashes don't match, equals() test is skipped.
    Hence hashCode() is indeed a quicker inequality test! :-B

    And of course, unnecessary &/or redundant object creations should be avoided as much as possible.
    Especially to something that is called many times! :-SS

    But like I said, If some class is able to uniquely fit in its hash into a 32-bit int value,
    mathematically its hashCode() == equals()! \m/

  • Hence hashCode() is indeed a quicker inequality test!

    Are you suggesting that given two objects a and b then line 1 is faster than line 2 in all cases? Surely it depends on how they are implemented?

    a.hashCode() != b.hashCode();
    !a.equals(b);
    
  • edited October 2014

    What I suggested is that storing the result of an object's hashCode() is faster than keep calling equals() all the time! That is, if saved hash values don't match, no need to bother going to equals() as well!

  • @GoToLoop

    PhiLho stated

    I believe it is used for quick comparison, not for equality, but for inequality. If hash codes are different, then the objects must be different. If they are equals, then the algorithm MUST use equals() to ensure a real equality.

    My interpretation of this is some people are using a.hashCode() != b.hashCode(); as an alternative to !a.equals(b); even if they were not using a HashMap or HashSet collection. In this situation I don't see how anyone can claim one is faster than the other it will depend on their implementation.

    Your statement

    that storing the result of an object's hashCode() is faster than keep calling equals() all the time!

    This is certainly true inside the HashMap or HashSet but is not relevant outside of these collections, which is what I was getting at.


    BTW I had a look at your DataHolder class

      @ Override int hashCode() { // 4-byte unique hash:
        return myByte<<030 | (int)(myFloat*100)<<020 | myShort;
      }
    
      @ Override boolean equals(Object o) {
        return o instanceof DataHolder
          ? o.hashCode() == hashCode() : false;
      }
    

    This would certainly meet the hashCode() contract but it would need extra code to ensure that the value of myFloat is ALWAYS constrained to >=0 and <1. At the very least myFloat should be declared private (although this won't work in an inner class, in which case it would need to be final) and if not final the class would need a public setter that constrains its value.

    Also what happens when myFloat differs in the third decimal point for instance 0.000 and 0.004 would result in the same hashcode.

  • edited October 2014

    ... although this won't work in an inner class, ...

    That's why I haven't made any extra effort to protect the myFloat field!
    Just left some comments about its working range which gotta be respected for whomever uses that class! 8-|
    Well, those warnings seem to work for Python & JS programmers after all! 8-}

  • Well, those warnings seem to work for Python & JS programmers after all!

    What does that mean :-?

  • Although I dunno Python very well, it seems it doesn't feature any private or similar keywords! :O)

  • _vk_vk
    edited October 2014

    Hey guys, you are amazing :). @PhiLho, thanks I will take a look in guava. I don't really wanted to bother about hashCode() creation, I mean I was just trying to have a Set to avoid duplicates... here is where I end up... :)

    I thought that I should go for the tools the language provides... Let's see where it goes... Anyway thanks a bunch you all. I certainly will be needing more help in those maters.

    :)

Sign In or Register to comment.