Analyzing Redis Bloom Filters and Their Applications in Detail

What’s Bloom Filter

A Bloom Filter is a clever probabilistic data structure proposed by Howard Bloom in 1970, designed to determine whether a certain element exists in a set. It can indicate with a certain probability whether something definitely does not exist or might exist. Specifically, when a Bloom Filter indicates the presence of something, that something may not actually be present; when it indicates the absence of something, then that something definitely does not exist.

Compared to common data structures like Sets and Maps, Bloom Filters are more efficient in terms of insertion and query operations, and they occupy less space. However, they come with a drawback, which is the possibility of false positives when determining the existence of an element. Nevertheless, with appropriate parameter settings, the accuracy can be controlled within a relatively precise range, minimizing the probability of false positives.

Bloom Filters in Redis

In Redis, prior versions utilized bitmap operations to implement Bloom Filters, until Redis version 4.0 introduced plugin functionality, officially incorporating Bloom Filters as part of Redis. The Bloom Filter is loaded into Redis Server as a plugin, providing powerful deduplication functionality to Redis.

The basic commands for Bloom Filters in Redis

The basic commands for Bloom Filters in Redis are as follows:

  • bf.add: Adds an element to the Bloom Filter, similar to the sadd command for sets. However, the bf.add command can only add one element at a time. To add multiple elements at once, the bf.madd command can be used.
  • bf.exists: Checks if a specific element is in the filter, similar to the sismember command for sets. The bf.exists command, however, can only query one element at a time. To query multiple elements at once, the bf.mexists command can be used.

Here are some examples:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
> bf.add one-more-filter fans1
(integer) 1
> bf.add one-more-filter fans2
(integer) 1
> bf.add one-more-filter fans3
(integer) 1
> bf.exists one-more-filter fans1
(integer) 1
> bf.exists one-more-filter fans2
(integer) 1
> bf.exists one-more-filter fans3
(integer) 1
> bf.exists one-more-filter fans4
(integer) 0
> bf.madd one-more-filter fans4 fans5 fans6
1) (integer) 1
2) (integer) 1
3) (integer) 1
> bf.mexists one-more-filter fans4 fans5 fans6 fans7
1) (integer) 1
2) (integer) 1
3) (integer) 1
4) (integer) 0

In the example above, the scenario did not demonstrate false positives because the number of elements was relatively small. When dealing with a larger number of elements, the likelihood of false positives increases. How can we reduce the occurrence of false positives?

Advanced Usage of Bloom Filters

The Bloom Filter used in the previous example is a default parameter Bloom Filter, automatically created when we first use the bf.add command. Redis also provides the option to create a custom parameter Bloom Filter to minimize false positives. Before using the bf.add command to add elements, one can use the bf.reserve command to create a custom Bloom Filter. The bf.reserve command takes three parameters:

  • key: The key for the Bloom Filter.
  • error_rate: The desired error rate. A lower error rate requires more space.
  • capacity: Initial capacity. When the actual number of elements exceeds this initial capacity, the false positive rate increases.

For example:

1
2
>  bf.reserve one-more-filter 0.0001 1000000
OK

If the corresponding key already exists, executing the bf.reserve command will result in an error. If one chooses not to use the bf.reserve command for creation and instead relies on the Bloom Filter created automatically by Redis, the default error_rate is 0.01, and the default capacity is 100.

The error rate of a Bloom Filter is inversely proportional to the required storage space. For scenarios where high precision is not crucial, a slightly larger error rate may be acceptable. Setting the capacity of the Bloom Filter too large will waste storage space, while setting it too small can impact accuracy. Therefore, it is crucial to estimate the number of elements as accurately as possible before usage. Additionally, including some redundancy space is recommended to account for the possibility of the actual number of elements significantly exceeding the set value unexpectedly. In summary, both the error_rate and capacity need to be set to appropriate values.

Introduction to the Principles of Bloom Filters

Having explored the usage of Bloom Filters, let’s delve into the principles of how Bloom Filters work to understand both the outcome and the reasoning behind it.

The data structure of a Bloom Filter in Redis consists of a large bit array and several different unbiased hash functions (capable of evenly distributing the hash values of elements, ensuring a relatively random placement of elements in the bit array).

When adding an element to the Bloom Filter, multiple unbiased hash functions are used to hash the element, resulting in several integer indices. These indices are then subjected to a modulo operation with the length of the bit array to obtain positions. Each unbiased hash function produces a different position. Subsequently, the corresponding positions in the bit array are set to 1, completing the operation of the bf.add command.

When querying the existence of an element in the Bloom Filter, a similar process is followed. The hash functions calculate the positions, and it is checked whether the corresponding positions in the bit array are all set to 1. If any position is 0, it indicates that the element is not present in the Bloom Filter. However, if all positions are 1, it does not conclusively confirm the existence of the element. There is a possibility that these positions are set to 1 due to the presence of other elements. This is the reason Bloom Filters can result in false positives.

Applications of Bloom Filters

Addressing Cache Breakdown Issues

In typical scenarios, one first checks if the desired data is in the cache. If the data is not found in the cache, a subsequent query to the database is made. When the data is also absent in the database, each query results in a database access, leading to a situation known as cache breakdown. The problem with cache breakdown is that when a large number of requests query for data that doesn’t exist in the database, it puts significant pressure on the database, potentially overwhelming it.

Bloom Filters can be employed to tackle the issue of cache breakdown. Existing data keys can be stored in the Bloom Filter. When a new request comes in, the Bloom Filter is queried first to check for the existence of the data. If the data is not found, the request is handled accordingly; if the data exists, then the cache is queried, and if necessary, the database is consulted.

Blacklist Verification

Performing specific actions for items found in a blacklist is a common use case. For instance, in spam detection, any email from an address listed in the blacklist is identified as spam. Assuming the blacklist contains a massive number of entries, storing it can be storage-intensive. Bloom Filters present an effective solution in this context. All blacklist entries can be placed in the Bloom Filter. When an email is received, it is checked against the Bloom Filter to determine if the email address is in the blacklist, enabling efficient and space-saving blacklist verification.