Hashing

这一课笔记的逻辑结构和课件略有区别，总体根据最初设定的两个优化目标分为两部分。

第一部分找到一种数据结构（hash table），解决了search tree的generalization的问题；第二部分优化这个结构，解决了performance的问题。

Limits of Search Tree sets

requires the objects to be comparable
performance not good enough (Θ(log N))

因此之后的改进主要在这两方面：1）generalize to more types of objects；2）节约内存、降低复杂度。

Generalization to various objects

第一部分找到一种数据结构（hash table），解决了search tree的generalization的问题

A Better Structure: Data Indexed Sets

这一部分从generalization入手，从Data Indexed Integer Set (Int -> index）引入，扩展到Data Indexed English Word Set（EN word -> Int -> index），最后一步步generalize到Data Indexed String Set（String -(based on ASCII)-> Int -> index）。

首先想到的是Data Indexed Integer Set
- 缺点是它太浪费memory了，并且无法表示除了integer以外的其他对象

其次，为了能够表示其他种类的对象，通过进制转换将字母转换成integer的index。
- 在进制的选择上，为了防止index重复（collision），选择base >= N（item种类）
- 比如将27作为base转换26个字母
因此就有了Data Indexed English Word Set。
- 缺点是这个方法无法应对字符（“eGg!”）

为了进一步从英文单词generalize到字符，采用ASCII码/Unicode将String转化为index。
于是进化出了Data Indexed String Set
- 缺点是index过大，会造成overflow，超过java的限制（+2,147,483,647）。
- 而overflow会进一步导致index重复（collision）

为了“解决”这个generalization部分的遗留问题，将引入hash code的编码方式。

Handling Collision: Hash Code and Hash Table

Definition:

a hash code “projects a value from a set with many (or

even an infinite number of) members to a value from a set with a fixed

number of (fewer) members.”

首先“解决”刚才的overflow导致collision的问题。

turns out collision无法避免，只能应对。

应对方法是把hashcode相同的item放进一个bucket。可以用LinkedList、ArrayList、ArraySet等等。
- bucket在示意图里是每一个index格子所指向的一筐item。
于是它又进化成了Separate Chaining Data Indexed Array

终于处理好了collision——这也意味着generalization部分的问题至此彻底解决。

Improving Performance

第二部分优化hash table的结构，解决了performance的问题

接下来就是开头的第二个问题performance（memory & complexity）。

这个问题可以分解为两部分，一是几万个bucket太浪费memory，二是complexity可能会很高（比如所有的item都连成了一串）。

Saving Memory

为了节约memory，改成只用N个bucket。

方法是HashCode % NumBuckets把hashCode重新归档。比如，只用10个bucket，就把hashCode%10：

以上完成的就是Hash Table。完整流程示意图如下：

Reducing Complexity

为了让每个bucket里的平均item数量保持恒定，需要动态改变bucket数量——resizing。

具体来说，

已知N为item总数，M为bucket总数
设定load factor = N / M （即最多平均每个bucket里有几个item）

下为示意图，如果load factor ≥ 1.5，将bucket数M翻倍：

Finale: Hash Table Runtime Analysis

Worst Case Runtime:

	contains(x)	add(x)
Bushy BSTs	Θ(log N)	Θ(log N)
DataIndexedArray	Θ(1)	Θ(1)
Separate Chaining Hash Table With No Resizing	Θ(N)	Θ(N)
… With Resizing	Θ(1)†	Θ(1)*†

*: Indicates “on average”. †: Assuming items are evenly spread. 这里尤其要注意，如果hashCode()不好（无法平均分配items），那么还是会像without resizing的HT一样有Θ(N)。

有了resizing后，worst case operation数

不再是遍历全部连成一串的N个item【$\Theta(N)$】
而是遍历最多N/M个item【$\Theta(N/M) = \Theta(1)$】

由此performance的问题也解决了。

(Op) How to Compute a hashCode

Q: How to write a good hashCode() method?

A: Use a small prime base, for it yields better randomness and cost less to compute

下面是两个hashCode()方法的例子：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


// Hashing a collection
@Override
public int hashCode() {
    int hashCode = 1;
	for (Object o : this) {
        hashCode = hashCode * 31; // elevate the current hash code
        hashCode = hashCode + o.hashCode(); // add new item’s hash code
	}
	return hashCode; 
}

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


// Hashing a recursive structure (e.g. a tree)
@Override
public int hashCode() {
    if (this.value == null) {
    	return 0; 
    }
    return this.value.hashCode() + 
        31 * this.left.hashCode() +
    	31 * 31 * this.right.hashCode();
}

Hashing#

Limits of Search Tree sets#

Generalization to various objects#

A Better Structure: Data Indexed Sets#

Handling Collision: Hash Code and Hash Table#

Improving Performance#

Saving Memory#

Reducing Complexity#

Finale: Hash Table Runtime Analysis#

(Op) How to Compute a hashCode#