布隆过滤器 Java 实现

介绍

使用 Java 实现布隆过滤器有两种方式，一种是自己实现，另一种是使用 Google 开源的 Guava 中自带的布隆过滤器

自己实现

版本一：简单模拟实现

public static class BloomFilter {

    private byte[] data;

    public BloomFilter(int initSize) {
        this.data = new byte[initSize * 2]; // 默认创建大小 * 2 的空间
    }

    public void add(int key) {
        int location1 = Math.abs(hash1(key) % data.length);
        int location2 = Math.abs(hash2(key) % data.length);
        int location3 = Math.abs(hash3(key) % data.length);

        data[location1] = data[location2] = data[location3] = 1;
    }

    public boolean contains(int key) {
        int location1 = Math.abs(hash1(key) % data.length);
        int location2 = Math.abs(hash2(key) % data.length);
        int location3 = Math.abs(hash3(key) % data.length);

        return data[location1] * data[location2] * data[location3] == 1;
    }

    private int hash1(Integer key) {
        return key.hashCode();
    }

    private int hash2(Integer key) {
        int hashCode = key.hashCode();
        return hashCode ^ (hashCode >>> 3);
    }

    private int hash3(Integer key) {
        int hashCode = key.hashCode();
        return hashCode ^ (hashCode >>> 16);
    }
}

这里很简单内部仅维护了一个 byte 类型的 data 数组，实际上 byte 仍然占有一个字节之多，可以优化成 bit 来代替，这里也仅仅是用于方便模拟。另外我也创建了三个不同的 hash 函数，其实也就是借鉴 HashMap 哈希抖动的办法，分别使用自身的 hash 和右移不同位数相异或的结果。并且提供了基础的 add 和 contains 方法。

测试：

public static void main(String[] args) {
    Random random = new Random();
    // 假设我们的数据有 1 百万
    int size = 1_000_000;
    // 用一个数据结构保存一下所有实际存在的值
    LinkedList<Integer> existentNumbers = new LinkedList<>();
    BloomFilter bloomFilter = new BloomFilter(size);

    for (int i = 0; i < size; i++) {
        int randomKey = random.nextInt();
        existentNumbers.add(randomKey);
        bloomFilter.add(randomKey);
    }

    // 验证已存在的数是否都存在
    AtomicInteger count = new AtomicInteger();
    AtomicInteger finalCount = count;
    existentNumbers.forEach(number -> {
        if (bloomFilter.contains(number)) {
            finalCount.incrementAndGet();
        }
    });
    System.out.printf("实际的数据量： %d, 判断存在的数据量: %d \n", size, count.get());

    // 验证10个不存在的数
    count = new AtomicInteger();
    while (count.get() < 10) {
        int key = random.nextInt();
        if (existentNumbers.contains(key)) {
            continue;
        } else {
            // 这里一定是不存在的数
            System.out.println(bloomFilter.contains(key));
            count.incrementAndGet();
        }
    }
}

输出如下：

实际的数据量： 1000000, 判断存在的数据量: 1000000 
false
true
false
true
true
true
false
false
true
false

这就是前面说到的，当布隆过滤器说某个值 存在时，这个值 可能不存在，当它说某个值 不存在时，那就 肯定不存在，并且还有一定的误判率…

版本二：参考

import java.util.BitSet;

public class MyBloomFilter {

    /**
     * 位数组的大小
     */
    private static final int DEFAULT_SIZE = 2 << 24;
    /**
     * 通过这个数组可以创建 6 个不同的哈希函数
     */
    private static final int[] SEEDS = new int[]{3, 13, 46, 71, 91, 134};

    /**
     * 位数组。数组中的元素只能是 0 或者 1
     */
    private BitSet bits = new BitSet(DEFAULT_SIZE);

    /**
     * 存放包含 hash 函数的类的数组
     */
    private SimpleHash[] func = new SimpleHash[SEEDS.length];

    /**
     * 初始化多个包含 hash 函数的类的数组，每个类中的 hash 函数都不一样
     */
    public MyBloomFilter() {
        // 初始化多个不同的 Hash 函数
        for (int i = 0; i < SEEDS.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, SEEDS[i]);
        }
    }

    /**
     * 添加元素到位数组
     */
    public void add(Object value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }

    /**
     * 判断指定元素是否存在于位数组
     */
    public boolean contains(Object value) {
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /**
     * 静态内部类。用于 hash 操作！
     */
    public static class SimpleHash {

        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        /**
         * 计算 hash 值
         */
        public int hash(Object value) {
            int h;
            return (value == null) ? 0 : Math.abs(seed * (cap - 1) & ((h = value.hashCode()) ^ (h >>> 16)));
        }

    }
}

使用 Google 开源的 Guava 中自带的布隆过滤器

Guava 中布隆过滤器的实现算是比较权威的，所以实际项目中我们不一定需要自己手动实现一个布隆过滤器。

首先我们需要在项目中引入 Guava 的依赖：

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.0-jre</version>
</dependency>

实际使用如下：

我们创建了一个最多存放最多 1500 个整数的布隆过滤器，并且我们可以容忍误判的概率为百分之（0.01）

// 创建布隆过滤器对象
BloomFilter<Integer> filter = BloomFilter.create(
        Funnels.integerFunnel(),
        1500,
        0.01);
// 判断指定元素是否存在
System.out.println(filter.mightContain(1));
System.out.println(filter.mightContain(2));
// 将元素添加进布隆过滤器
filter.put(1);
filter.put(2);
System.out.println(filter.mightContain(1));
System.out.println(filter.mightContain(2));

在我们的示例中，当 mightContain() 方法返回 true 时，我们可以 99% 确定该元素在过滤器中，当过滤器返回 false 时，我们可以 100% 确定该元素不存在于过滤器中。

Guava 提供的布隆过滤器的实现还是很不错的，但是它有一个重大的缺陷就是只能单机使用（另外，容量扩展也不容易），而现在互联网一般都是分布式的场景。为了解决这个问题，我们就需要用到 Redis 中的布隆过滤器了。

算法

#算法

布隆过滤器 Java 实现

https://flepeng.github.io/030-算法-布隆过滤器-布隆过滤器-Java-实现/

作者

Lepeng

发布于

2020年8月8日

许可协议

bitmap 上一篇

布隆过滤器 Python 实现下一篇