
How to Filter 1 Billion Requests Without Crashing Your Database

An Intro to Bloom Filters



Imagine you are building a URL spam filter for a service like Gmail. You have a list of 100 million malicious URLs, and you need to check every single link clicked by your 1 billion active users against this list.

The obvious approach is to store the bad URLs in a SQL database and query it every time a user clicks a link.

If you try this in production, your infrastructure will crash in minutes.

In this post, I want to explore why the "obvious" solution fails at scale and how a probabilistic data structure called a Bloom Filter solves the problem by trading a tiny bit of accuracy for massive speed.

The Challenge: The "Scale" Wall

When you are dealing with web-scale traffic, you hit three distinct bottlenecks:

  1. Storage: A blacklist of 100 million URLs takes up about 10 GB of space. That’s far too big for CPU caches and expensive to hold in RAM on every application server, forcing lookups onto slower storage.

  2. Throughput: With 1 billion daily active users, you might face peaks of 30,000 queries per second (QPS). A standard database instance simply cannot handle that many read operations without an expensive cluster.

  3. Latency: Users expect instant page loads. A round-trip to a database (network + disk I/O) takes about 2 milliseconds. That sounds fast, but in the world of high-frequency requests, it’s an eternity.
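The numbers above come from simple back-of-envelope arithmetic. The sketch below reproduces them; the average URL length and clicks-per-user figures are assumptions chosen to match the post's estimates, not measured values.

```python
# Back-of-envelope math behind the three bottlenecks.

NUM_BAD_URLS = 100_000_000
AVG_URL_BYTES = 100  # assumed average stored-URL size

storage_gb = NUM_BAD_URLS * AVG_URL_BYTES / 1e9
print(f"Blacklist size: ~{storage_gb:.0f} GB")  # ~10 GB

DAILY_USERS = 1_000_000_000
CLICKS_PER_USER = 2.6  # assumed daily clicks per user
qps = DAILY_USERS * CLICKS_PER_USER / 86_400  # seconds per day
print(f"Load: ~{qps:,.0f} QPS")  # ~30,000 QPS
```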

But here is the real kicker: 99% of your traffic is safe.

If you use a traditional database, you are wasting expensive computing resources to check 990 million "safe" links every day. You are effectively burning money to answer "No" over and over again.

The Solution: The Bloom Filter

A Bloom Filter is a space-efficient probabilistic data structure that tells you whether an element might be in a set or is definitely not in the set.

Think of it as a highly efficient gatekeeper. Instead of storing the full URL string (which is heavy), we project it onto a compact array of bits (0s and 1s) using hash functions.
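To make the "project it onto bits" idea concrete, here is a minimal sketch of a Bloom filter. It is illustrative only: it derives its k bit positions from a SHA-256 digest for simplicity, whereas a production filter would use fast non-cryptographic hashes (e.g. MurmurHash).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions each set one bit per item."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)  # compact bit array

    def _positions(self, item: str):
        # Derive k positions from two base hashes (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means *definitely not* in the set; True means *maybe*.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Adding a URL sets a handful of bits; a lookup checks those same bits. If any one of them is still 0, the URL was definitely never added.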

How It Changes the Architecture

We place the Bloom Filter in the memory (RAM) of the application server itself. Before we ever bother the database, we ask the filter: "Have you seen this URL?"

  1. If the Filter says "No": The URL is 100% safe. We let the user proceed immediately. No database call. Zero network latency.

  2. If the Filter says "Maybe": It might be a malicious URL, or it might be a "False Positive." Only then do we check the database to confirm.
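The two-step flow above can be sketched as a single function. The `bloom_check` and `db_lookup` callables here are hypothetical stand-ins for the in-memory filter and the authoritative database.

```python
def is_malicious(url, bloom_check, db_lookup) -> bool:
    """Gatekeeper flow: consult the in-memory filter first,
    and hit the database only on a 'maybe'."""
    if not bloom_check(url):
        return False       # definitely not blacklisted: zero DB calls
    return db_lookup(url)  # maybe: confirm, weeding out false positives
```

The database is the source of truth, so a false positive from the filter costs only one extra lookup; it never mislabels a safe URL as malicious.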

The Impact: By The Numbers

The difference between the "Traditional Approach" and the "Bloom Filter Approach" is not incremental; it's orders of magnitude.

| Metric | Traditional DB Lookup | With Bloom Filter | Improvement |
| --- | --- | --- | --- |
| Storage (RAM) | ~10 GB | ~125 MB | 99% smaller |
| Database load | 30,000 QPS | ~300 QPS | 99% less load |
| Latency | ~2 milliseconds | Nanoseconds | ~20,000x faster |

By accepting a standard 1% false positive rate, we instantly eliminate 99% of the unnecessary traffic that would otherwise hit our database. We shrink a 10 GB problem into a ~125 MB solution that fits in the RAM of a cheap laptop.
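The ~125 MB figure is not arbitrary: it follows from the standard Bloom filter sizing formulas, m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. Plugging in n = 100 million and p = 1% gives roughly 120 MB and 7 hash functions:

```python
import math

n = 100_000_000  # items (malicious URLs)
p = 0.01         # target false-positive rate

# Standard optimal-sizing formulas for a Bloom filter
m = -n * math.log(p) / math.log(2) ** 2  # total bits
k = (m / n) * math.log(2)                # hash functions

print(f"~{m / 8 / 1e6:.0f} MB, {round(k)} hash functions")
```

Roughly 9.6 bits per element, regardless of how long the URLs themselves are: that independence from item size is where the 99% storage saving comes from.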

Industry Adoption

  • Google (Safe Browsing)

  • Medium (Feed Deduplication)

  • Cassandra (Database Speed)

  • Akamai (CDN Caching)

  • Bitcoin (Wallet Syncing)

Conclusion: The Power of "Good Enough"

In system design, we are often taught that accuracy is paramount. But Bloom Filters teach us that sometimes, knowing what is definitely NOT true is just as valuable as knowing what IS true.

If you can tolerate saying "Maybe" occasionally, but can never afford to be slow, use a Bloom Filter.

For a deeper dive into how Bloom filters work, with an interactive demo, check out https://samwho.dev/bloom-filters.