數據結構之哈希表(4)


1 Concept of Hashing

  The problem at hands is to speed up searching.We could search even faster if we know in advance the index at which that value is located in the array. Suppose we do have that magic function that would tell us the index for a given value. With this magic function our search is reduced to just one probe, giving us a constant runtime O(1). Such a function is called a hash function , such data sturcture is called hash (table).  A hash function hashes (converts) a number in a large range into a number in a smaller range. This smaller range corresponds  to the index numbers in an array. An array into which data is inserted using a hash function is called a hash table. 


Hash tables are significantly faster than trees, insertion and searching (and sometimes deletion) can take close to constant time: O(1) in big O notation.



Hash table disadvantage:

1)Hash tables are based on arrays, and arrays are difficult to expand after they’ve been created. For some kinds of hash tables, performance may degrade catastrophically when a table becomes too full, so the programmer needs to have a fairly accurate idea of how many data items will need to be stored (or be prepared to periodically transfer data to a larger hash table ( rehash), a time-consuming process).

2)There’s no convenient way to visit the items in a hash table in any kind of order (such as from smallest to largest). If you need this capability, you’ll need to look elsewhere.

2 Use example

  A similar widely used application for hash tables is in computer-language compilers, which maintain a symbol table in a hash table. The symbol table holds all the variable and function names made up by the programmer, along with the address where they can be found in memory. The program needs to access these names very quickly, so a hash table is the preferred data structure.

3 Hash Process

The following figure describes the process of hash : 


The process:

1) Hash code: If keys are not digit, use hash code to covert keys into digit keys;

2)Hash function: hash (converts) a number in a large range into a number in a smaller range;

3)Hash Table: This smaller range corresponds  to the index numbers in an array. An array into which data is inserted using a hash function is called a hash table.

Pseudocode:

//hash code
digitKey=hashCode(key);

//hash fuction
hashValue=hashFunction(digitKey); //hash the digit key

//Insertion
hashTable[hashValue].insert(key); //use hash table index(hashValue) to insert key

//Delete
hashTable[hashValue].delete(key); //insert at hash table

//Search
key= hashArray[hashValue].find(key); // get key

From the process of hash table, the following questions should solve:

1) How to implement a hash code ?

2) What is the size of the array(hash table is an array)?

3) How to implement a hash function?

4)How to solve the conflict if two keys has the same hash value?

4 How to implement a hash code?

  Hash code is the function which convert non-digit key to digit key.  If the key is not digit, how can we convert the key to digit key?

  In Java language world, a non-digit key is usually a string object. At first, we look at how digits come from. Like 324, we can write 324=3*10^2+2*10^1+4*10^0(the base is 10 in mathematics). Would a string can write like this? Certainly, if every char in a string object can equal to  a digit, a string can be written in the same way. Luckily, a char equals a corresponding int the ASCII code which a is 97, b is 98, and so on, up to 122 for z. For example, abc=97*10^2+98*10^1+99*10^0.

  We have solved the way how non-digit key transforms in digit key, but what is the base if we use the digit key in hash table? In Java, this base is 31. Now here is the Java code of the hash code:

/**
* Returns a hash code for this string. The hash code for a String object is computed as
* s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
* using int arithmetic, where s[i] is the i th character of the string,
* n is the length of the string, and ^ indicates exponentiation.
* @param key the string object
* @return a hash code value for the string object.
*/
public int hashCode1(String key){
int digitKey=0;
int power31=1; //the power

for(int i=key.length()-1;i>=0;i--){ //right to left
digitKey+=key.charAt(i)* power31;
power31*=31;
}//end for

return digitKey;
} //hashCode1()

Problem:

The hashCode() method is not as efficient as it might be. There are two multiplications and an addition inside the loop. We can eliminate a multiplication by taking advantage of a mathematical identity called Horner’s method(Horner規則). (Horner was an English mathematician, 1773–1827.) This states that an expression like

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

can be written as

((s[n-1]*31+s[n-2])*31+s[n-3])*31+ ...+ s[0]

So we have the following code

/**
* Returns a hash code for this string. The hash code for a String object is computed as
* s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
* With Horner's method, the above expression can be written as
* ((s[n-1]*31+s[n-2])*31+s[n-3])*31+ ...+ s[0]
* using int arithmetic, where s[i] is the i th character of the string,
* n is the length of the string, and ^ indicates exponentiation.
* @param key the string object
* @return a hash code value for the string object.
*/
public int hashCode2(String key){
int digitKey=0;
for(int i=0;i<key.length();i++){
digitKey=digitKey*31 + key.charAt(i);
}//end for

return digitKey;
} //end hashCode2()

Problem: overflow

   Note that Java's hashCode method might return a negative integer. If a string is long enough, its hashcode will be bigger than the largest integer we can store on 32 bits CPU. In this case, due to integer overflow, the value returned by hashCode can be negative.

   The hashCode2() method unfortunately can’t handle strings longer than about seven letters. Longer strings cause the value of digitKey to exceed the size of type int. (If we used type long, the same problem would still arise for somewhat longer strings.) Can we modify this basic approach so we don’t overflow any variables? 

   One solution for hashFuction(hashValue=digitKey % arraysize). Here we use the hashFuction which will be described later as an example.Notice that the key we eventually end up with is always less than the array size because we apply the modulo operator. It’s not the final index that’s too big; it’s the intermediate key values. It turns out that with Horner’s formulation we can apply the modulo operator (%) at each step in the  calculation. This gives the same result as applying the modulo operator once at the end but avoids overflow. Like this:

The Java code like this

/**
* With the hash function : hashValue=digitKey % arraysize
* Returns a hash value for this string. The hash value for a String object is computed as
* s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
* With Horner's method, the above expression can be written as
* ((s[n-1]*31+s[n-2])*31+s[n-3])*31+ ...+ s[0]
* using int arithmetic, where s[i] is the i th character of the string,
* n is the length of the string, and ^ indicates exponentiation.
* @param key the string object
* @param arraysize the size of the hash table(an array)
* @return a hash value for the string object.
*/
public int hashFuction(String key,int arraysize){ //left to right
int hashValue=0;

for(int i=0;i<key.length();i++){
hashValue=(hashValue*31 + key.charAt(i)) % arraysize;
}//end for

return hashValue;
} //end hashCode2()

Why does Java's hashCode() in String use 31 as a multiplier?

  According to Joshua Bloch's Effective Java (from Chapter 3, Item 9: Always override hashcode when you override equals, page 48):

  The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

  But I think maybe basically using 31 gives you a more even set-bit probability distribution for the hash function.

5 What is the size of the array(hash table is a array)?

  Use a Prime Number for the Modulo Base in a hash function, so the array size is best a prime number. 

  This is true because, if many keys share a divisor with the array size, they may tend to hash to the same location, causing clustering. Using a prime table size eliminates this possibility. For example, if the table size is a multiple of 50 in our car-part example, the category codes will all hash to index numbers that are multiples of 50. However, with a prime number such as 53, you are guaranteed that no keys will divide evenly into the table size. 

例如:設有一個哈希函數H( c ) = c % N;
當N取一個合數時,最簡單的例子是取2^n,比如說取2^3=8,這時候
H( 11100(二進制) ) = H( 28 ) = 4
H( 10100(二進制) ) =  H( 20 )= 4
       這時候c的二進制第4位(從右向左數)就”失效”了,也就是說,無論第c的4位取什么值,都會導致H( c )的值一樣.這時候c的第四位就根本不參與H( c )的運算,這樣H( c )就無法完整地反映c的特性,增大了導致沖突的幾率.取其他合數時,都會不同程度的導致c的某些位”失效”,從而在一些常見應用中導致沖突.
       但是取質數,基本可以保證c的每一位都參與H( c )的運算,從而在常見應用中減小沖突幾率.

       在常見應用中,往往有些數據會比較相近,這時候用質數比較好,比如要存放的數據是壓縮的狀態,比如存儲一個描述當前搜索狀態的表,的這時候哈希不用質數沖突機率就比較大。

 One rule of thumb suggests make the table size about double, so that there is room to expand, and hopefully keep the number of collisions small.
 Another rule of thumb is to assume that you are doing some sort of modulo related hashing, then round your table size up to the next largest prime number, and use that prime number as the modulo value.

6 How to implement a hash function?

  A hash function is a function which when given a key, generates an address(the array index) in the hash table. A hash function has the following properties:

  • it always returns the index of hash table for an object.
  • two equal objects will always have the same number
  • two unequal objects not always have different numbers
  • quick computation :A good hash function is simple, so it can be computed quickly.  A hash function with many multiplications and divisions is not a good idea. (The bit-manipulation(二進制處理) facilities of Java or C++, such as shifting bits(移位) right to divide a number by a multiple of 2, can sometimes be used to good advantage.)

The common three hash functions :

1)除法散列法 

arrayIndex = hugeNumber % arraySize;

學過匯編的都知道,求模數其實是通過一個除法運算得到的,所以叫“除法散列法”。

2)平方散列法 

求index是非常頻繁的操作,而乘法的運算要比除法來得省時(對現在的CPU來說,估計我們感覺不出來),所以我們考慮把除法換成乘法和一個位移操作。公式:arrayIndex = (value * value) >> 28  右移,除以2^28。記法:左移變大,是乘。右移變小,是除。)

如果數值分配比較均勻的話這種方法能得到不錯的結果。value如果很大,value * value不會溢出嗎?答案是會的,但我們這個乘法不關心溢出,因為我們根本不是為了獲取相乘結果,而是為了獲取arrayIndex

3)斐波那契(Fibonacci)散列法

平方散列法的缺點是顯而易見的,所以我們能不能找出一個理想的乘數,而不是拿value本身當作乘數呢?答案是肯定的。

  • 對於16位整數而言,這個乘數是40503 
  • 對於32位整數而言,這個乘數是2654435769 
  • 對於64位整數而言,這個乘數是11400714819323198485

 這幾個“理想乘數”是如何得出來的呢?這跟一個法則有關,叫黃金分割法則,而描述黃金分割法則的最經典表達式無疑就是著名的斐波那契數列,即如此形式的序列:01123581321345589144,233377610, 987, 1597, 2584, 4181, 6765, 10946,…。另外,斐波那契數列的值和太陽系八大行星的軌道半徑的比例出奇吻合。

7 How to solve the conflict if two keys has the same hash value?

Solution:Separate chaining(鏈接法),Open addressing(開放地址法)


  When we put objects into a hash table, it is possible that different objects (by the equals() method) might have the same hashcode. This is called a collision. Here is the example of collision. Two different strings ""Aa" and "BB" have the same key: .

"Aa" = 'A' * 31 + 'a' = 2112
"BB" = 'B' * 31 + 'B' = 2112

  How to resolve collisions? Where do we put the second and subsequent values that hash to this same location? There are several approaches in dealing with collisions. One of them is based on idea of putting the keys that collide in a linked list! A hash table then is an array of lists!! This technique is called a separate chaining collision solution.


  Because of collisions, we cannot guarantee the constant runtime in the worst-case. Why? Imagine that all our objects collide into the same index. Then searching for one of them will be equivalent to searching in a list, that takes a liner runtime. However, we can guarantee an expected constant runtime, if we make sure that our lists won't become too long. This is usually implemnted by maintaining a load factor that keeps a track of the average length of lists. If a load factor approaches a set in advanced threshold, we create a bigger array and rehash all elements from the old table into the new one.

  Another technique of collision resolution is a linear probing. If we cannot insert at index k, we try the next slot k+1. If that one is occupied, we go to k+2, and so on. This is quite simple approach but it requires new thinking about hash tables. Do you always find an empty slot? What do you do when you reach the end of the table?


8 What is a load factor(加載因子)?

  The load factor is the ratio of the number of items in a hash table to its size. 
     loadFactor=n/arraySize ;其中n代表對象元素個數,arraySize表示hash table 這個數組的大小.

       加載因子是表示Hsah表中元素的填滿的程度.若加載因子越大,填滿的元素越多,好處是,空間利用率高了,但:沖突的機會加大了。反之,加載因子越小,填滿的元素越少,好處是:沖突的機會減小了,但:空間浪費多了.沖突的機會越大,則查找的成本越高.反之,查找的成本越小.因而,查找時間就越小.因此,必須在 "沖突的機會"與"空間利用率"之間尋找一種平衡與折衷.這種平衡與折衷本質上是數據結構中有名的"時-空"矛盾的平衡與折衷.  In Java language the default load factor is 0.75, the following words come from Java HashMap API:

  As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put).  The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations.  If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

  Let’s say that the hash table consists of arraySize elements, each of which holds a list, and that N data items have been inserted in the table. Then, on the average, each list will hold N divided by arraySize items. This is the same as the definition of the load factor: loadFactor = N / arraySize. So the average list length equals the load factor.

  In a successful search, the algorithm hashes to the appropriate list and then searches along the list for the item. On the average, half the items must be examined before the correct one is located. Thus, the search time is 1 + loadFactor / 2.

 This is true whether the lists are ordered or not. In an unsuccessful search, if the lists are unordered, all the items must be searched, so the time is 1 + loadFactor. These formulas are graphed in the following Figure:


                                        Separate-chaining performance

  For an ordered list, only half the items must be examined in an unsuccessful search, so the time is the same as for a successful search. In separate chaining it’s typical to use a load factor of about 1.0 (the number of data items equals the array size). Smaller load factors don’t improve performance significantly, but the time for all operations increases linearly with load factor, so going beyond 2 or so is generally a bad idea.

9 Hash Table implement in Java language

// hashChain.java
// demonstrates hash table with separate chaining
// to run this program: C:>java HashChainApp
import java.io.*;
////////////////////////////////////////////////////////////////
class Link
{ // (could be other items)
private int iData; // data item
public Link next; // next link in list
// -------------------------------------------------------------
public Link(int it) // constructor
{ iData= it; }
// -------------------------------------------------------------
public int getKey()
{ return iData; }
// -------------------------------------------------------------
public void displayLink() // display this link
{ System.out.print(iData + " "); }
} // end class Link
////////////////////////////////////////////////////////////////
class SortedList
{
private Link first; // ref to first list item
// -------------------------------------------------------------
public void SortedList() // constructor
{ first = null; }
// -------------------------------------------------------------
public void insert(Link theLink) // insert link, in order
{
int key = theLink.getKey();
Link previous = null; // start at first
Link current = first;
// until end of list,
while( current != null && key > current.getKey() )
{ // or current > key,
previous = current;
current = current.next; // go to next item
}
if(previous==null) // if beginning of list,
first = theLink; // first --> new link
else // not at beginning,
previous.next = theLink; // prev --> new link
theLink.next = current; // new link --> current
} // end insert()
// -------------------------------------------------------------
public void delete(int key) // delete link
{ // (assumes non-empty list)
Link previous = null; // start at first
Link current = first;
// until end of list,
while( current != null && key != current.getKey() )
{ // or key == current,
previous = current;
current = current.next; // go to next link
}
// disconnect link
if(previous==null) // if beginning of list
first = first.next; // delete first link
else // not at beginning
previous.next = current.next; // delete current link
} // end delete()
// -------------------------------------------------------------
public Link find(int key) // find link
{
Link current = first; // start at first
// until end of list,
while(current != null && current.getKey() <= key)
{ // or key too small,
if(current.getKey() == key) // is this the link?
return current; // found it, return link
current = current.next; // go to next item
}
return null; // didn't find it
} // end find()
// -------------------------------------------------------------
public void displayList()
{
System.out.print("List (first-->last): ");
Link current = first; // start at beginning of list
while(current != null) // until end of list,
{
current.displayLink(); // print data
current = current.next; // move to next link
}
System.out.println("");
}
} // end class SortedList
////////////////////////////////////////////////////////////////
class HashTable
{
private SortedList[] hashArray; // array of lists
private int arraySize;
// -------------------------------------------------------------
public HashTable(int size) // constructor
{
arraySize = size;
hashArray = new SortedList[arraySize]; // create array
for(int j=0; j<arraySize; j++) // fill array
hashArray[j] = new SortedList(); // with lists
}
// -------------------------------------------------------------
public void displayTable()
{
for(int j=0; j<arraySize; j++) // for each cell,
{
System.out.print(j + ". "); // display cell number
hashArray[j].displayList(); // display list
}
}
// -------------------------------------------------------------
public int hashFunc(int key) // hash function
{
return key % arraySize;
}
// -------------------------------------------------------------
public void insert(Link theLink) // insert a link
{
int key = theLink.getKey();
int hashVal = hashFunc(key); // hash the key
hashArray[hashVal].insert(theLink); // insert at hashVal
} // end insert()
// -------------------------------------------------------------
public void delete(int key) // delete a link
{
int hashVal = hashFunc(key); // hash the key
hashArray[hashVal].delete(key); // delete link
} // end delete()
// -------------------------------------------------------------
public Link find(int key) // find link
{
int hashVal = hashFunc(key); // hash the key
Link theLink = hashArray[hashVal].find(key); // get link
return theLink; // return link
}
// -------------------------------------------------------------
} // end class HashTable
////////////////////////////////////////////////////////////////
class HashChainApp
{
public static void main(String[] args) throws IOException
{
int aKey;
Link aDataItem;
int size, n, keysPerCell = 100;
// get sizes
System.out.print("Enter size of hash table: ");
size = getInt();
System.out.print("Enter initial number of items: ");
n = getInt();
// make table
HashTable theHashTable = new HashTable(size);

for(int j=0; j<n; j++) // insert data
{
aKey = (int)(java.lang.Math.random() *
keysPerCell * size);
aDataItem = new Link(aKey);
theHashTable.insert(aDataItem);
}
while(true) // interact with user
{
System.out.print("Enter first letter of ");
System.out.print("show, insert, delete, or find: ");
char choice = getChar();
switch(choice)
{
case 's':
theHashTable.displayTable();
break;
case 'i':
System.out.print("Enter key value to insert: ");
aKey = getInt();
aDataItem = new Link(aKey);
theHashTable.insert(aDataItem);
break;
case 'd':
System.out.print("Enter key value to delete: ");
aKey = getInt();
theHashTable.delete(aKey);
break;
case 'f':
System.out.print("Enter key value to find: ");
aKey = getInt();
aDataItem = theHashTable.find(aKey);
if(aDataItem != null)
System.out.println("Found " + aKey);
else
System.out.println("Could not find " + aKey);
break;
default:
System.out.print("Invalid entry\n");
} // end switch
} // end while
} // end main()
//--------------------------------------------------------------
public static String getString() throws IOException
{
InputStreamReader isr = new InputStreamReader(System.in);
BufferedReader br = new BufferedReader(isr);
String s = br.readLine();
return s;
}
//-------------------------------------------------------------
public static char getChar() throws IOException
{
String s = getString();
return s.charAt(0);
}
//-------------------------------------------------------------
public static int getInt() throws IOException
{
String s = getString();
return Integer.parseInt(s);
}
//--------------------------------------------------------------
} // end class HashChainApp
////////////////////////////////////////////////////////////////

The following code implements a hash table with JSE LinkedList as hash table array. The heart of the hash table is hash function method.

//hash table interface
public interface IHashTable {
//hash code
public int hashCode(String key) ;

//hash function
public int hashFun(String key);

//add
public boolean add(String key);

//remove
public boolean remove(String key);

//find
public boolean find(String key);
} //end interface IHashTable

//hash table implement
public class HashTable implements IHashTable {
LinkedList[] hashArr=null; //the hash table array
int size=0; //the hash table size

//constructor
public HashTable(int size){ //size can not be negitive number or zero
this.size=size;
hashArr=new LinkedList[size];

//create array:fill array with lists
for(int i=0; i<size; i++){
hashArr=new LinkedList();
}//end for
}//end constructor

//hash code
@override
public int hashCode(String key) {
int digitKey=0;

for(int i=0; i<key.length();i++){
digitKey=digitKey*31+key.charAt(i);
}//end for

return digitKey;
} //end hashCode()

//hash function
public int hashFun(String key) {
int hashValue=0;

for(int i=0; i<key.length();i++){
hashValue= (hashValue*31+key.charAt(i) ) % size;
}//end for

return hashValue;
} //end hashFun()

//add
@override
public boolean add(String key) {
return hashArr.add(key);
} //end add()

//remove
@override
public boolean remove(String key) {
return hashArr.remove(key);
} //end remove()

//find
@override
public boolean find(String key) {
boolean isFound=false; //flag whether found
String temp=null;

Iterator it=ll.iterator(); //iterator
while(it.hasNext()){
temp=(String)it.next();
if(temp.equals(key)){
isFound=true;
}//end if
}//end while

return isFound;
} //end remove()
} //end class HashTable

8 External Storage and Hashing

Use External Storage Process Big Data

9 The examples of HashSet and HashMap in JSE

HashSet

import java.util.*;
class HashSetTest
{
public static void main(String[] args)
{
HashSet hs=new HashSet();
hs.add(new Student(1,"zhangsan"));
hs.add(new Student(2,"wangwu"));
hs.add(new Student(2,"wangwu"));

Iterator it=hs.iterator();
while(it.hasNext())
{
System.out.println(it.next());
}
}
}


class Student
{
int num;
String name;
Student(int num,String name)
{
this.num=num;
this.name=name;
}
/*用HashSet存放對象時,需要重寫基類的hashCode、equals方法*/
public int hashCode()
{
//寫一個能唯一標識對象的值
return num*name.hashCode();
}
public boolean equals(Object o)
{
Student s=(Student)o;
return num==s.num && name.equals(s.name);
}
public String toString()
{
return num+":"+name;
}
}

HashMap

import java.util.*;
class HashMapTest
{
public static void printElements(Collection c)
{
Iterator it=c.iterator();
while(it.hasNext())
{
System.out.println(it.next());
}
}
public static void main(String[] args)
{
HashMap hm=new HashMap();
//插入元素
hm.put("one","zhangsan");
hm.put("two","lisi");

System.out.println(hm.get("one"));

//key的值
Set keys=hm.keySet();
printElements(keys);
//value的值
Collection values=hm.values();
printElements(values);

//鍵值對(鍵和值),返回Map.Entry類型
/*打印結果:two=lisi one=zhangsan three=wangwu */
Set entry=hm.entrySet();
//printElements(entry);
Iterator it=entry.iterator();
while(it.hasNext())
{
Map.Entry me=(Map.Entry)it.next();
//獲得關鍵字getKey(),獲得值getValue()
System.out.println(me.getKey()+":"+me.getValue());
}

}
}

注意!

本站转载的文章为个人学习借鉴使用,本站对版权不负任何法律责任。如果侵犯了您的隐私权益,请联系我们删除。



 
  © 2014-2022 ITdaan.com 联系我们: