跳转至

CS61b_Lec19_Hashing

理解hash table,让我们从最简单的开始:

DataIndexedArray

Create an array of booleans indexed by data! - Initially all values are false. - When an item is added, set appropriate index to true. 添加了0、5、10在这个DataIndexedArray里:

Pasted image 20240427193346.png =100x

如果用这样的数组来存数据的话,仅仅是存数字就需要很多空间,并且还很浪费,而且一些其他的数据,比如字符串,汉字这些没法存,虽然它很快: Pasted image 20240427193551.png =100x

接下来我们继续改进: 开始存字符串进去,比如cat、dog这些,数字是本身就是index,而字符串如果只是用首位的话就很容易产生冲突 --- cat存第三个,那我cap存第几个呢?

Avoiding Collisions

Use all digits by multiplying each by a power of 27. Pasted image 20240427194100.png =100x 这种算index的方法和进制很像: Pasted image 20240427194158.png =100x

上面只存了小写字母,非常受限,我们再一次改进: 先引入ascii的概念,第33 - 126位是可见字符,超出ascii就是unicode了,扩大基数(base) 举例一个中文短语编码(unicode) Pasted image 20240427194735.png =100x

Integer Overflow

数的大小有限制,超出则会溢出,这就导致可能会有两个算出来是一样的index,就是冲突: Pasted image 20240427195013.png =100x

(箭头那里是返回true了)

Overflow can result in collisions, causing incorrect answers.

引入Pigeonhole Principle: 大致意思是十个鸽子放入九个笼子中,必定有一个笼子有两只鸟(每个笼子都需要有鸽子) 潜在哈希冲突:项目数量超过了哈希函数的输出情况 Pigeonhole principle tells us that collisions are inevitable due to integer overflow冲突无法避免

那么: - 如何解决哈希冲突(collision handling) - 如何计算任意一个object的哈希码?(computing a hashCode)

Make a Bucket

别急,接着继续改进: Suppose N items have the same numerical representation h: - Instead of storing true in position h, store a “bucket” of these N items at position h.

How to implement a “bucket”? - Conceptually simplest way: LinkedList. - Could also use ArrayLists. - Could also use an ArraySet. - Will see it doesn’t really matter what you do. bucket本身的性质并不重要

有冲突的话直接全塞进原地,hash table初具雏形 Pasted image 20240427211300.png =100x

但是我们不需要这么长的list,so looong 的bucket,这不跟链表没差吗。 答案是模,把长长的bucket斩断分到其他bucket里去

Hash Table

然后 hash table上场!! - Data is converted by a hash function into an integer representation called a hash code.  - The hash code is then reduced to a valid index, usually using the modulus operator, e.g. 2348762878 % 10 = 8. Pasted image 20240427211716.png =100x

Hash Table Runtime

  • The good news: We use way less memory and can now handle arbitrary data.
  • The bad news: Worst case runtime is now Θ(Q), where Q is the length of the longest list.

find x的算法scale: Pasted image 20240427212000.png =100x

Load factor

这样显然不太行,都N了你还不如直接上链表呢,继续改进: Suppose we have: - An increasing number of buckets M. - An increasing number of items N.
As long as M = Θ(N), then O(N/M) = O(1).

N在变大的同时,bucket的数量也跟着大,相比一除就是1了,之前那个bucket定死的是N/5,不就是N。实现的大致方法是判断N/M的比值,比如 > 5时M double一下,这里的比值就是下面的load factor

Even distribution of item is critical for good hash table performance.

  • Both tables below have load factor of N/M = 1.
  • Left table is much worse!
    • Contains is Θ(N) for x. Pasted image 20240427212605.png =100x

Default implementation simply returns the memory address of the object. Java里面默认的hashcode就是返回当前object的地址,这个方法真聪明。

Warning

  • Never store objects that can change in a HashSet or HashMap!
  • Never override equals without also overriding hashCode.