YongXMan's Blog » 首页 » 关于 » 归档

## 基数估计

HyperLogLog是一种基数统计算法，算法的神奇之处是充分发挥数学之美，在保证相对误差的前提下，用极少的内存完成了同样的工作。redis实现中，12KB内存就可以完成2^64个数据的统计。

HyperLogLog实现的原理基本思想是利用数字的bit信息中第一个1出现的位置预估整体基数，并采用分桶的方式来减少误差。

antirez举了个形象的例子：

Here I’ll cover only the basic idea using a very clever example found at [3]. Imagine you tell me you spent your day flipping a coin, counting how many times you encountered a non interrupted run of heads. If you tell me that the maximum run was of 3 heads, I can imagine that you did not really flipped the coin a lot of times. If instead your longest run was 13, you probably spent a lot of time flipping the coin.

However if you get lucky and the first time you get 10 heads, an event that is unlikely but possible, and then stop flipping your coin, I’ll provide you a very wrong approximation of the time you spent flipping the coin. So I may ask you to repeat the experiment, but this time using 10 coins, and 10 different piece of papers, one per coin, where you record the longest run of heads. This time since I can observe more data, my estimation will be better.

## 实现原理

1. 对于输入，计算hash
2. 去hash后的二进制比特最右6bit来进行分桶，桶的个数是2^6=64个，对应的桶是6
3. 计算的hash值左侧的18位用来进行比特统计，从右侧数第一个1bit位的位置为3
4. 设置第6个register的值为6（由于原来是0，6大于0，所以更新。如果原来的值比6大，则不更新）
5. 完成元素添加

（完）