How bloom filters made SQLite 10x faster

URL: avi.im

Small aside based on some comments here -

People frequently bring up write concurrency issues in SQLite, with the implied idea that if two physical people are using your app at once they'll get errors. But of course the limit is concurrency at the transaction level, which is vastly different: if your writes take 1ms (in SQLite they might even be faster), you can support 1,000 writes per second. And if your users generate a write every 5 seconds (a very engaged user base; in most apps more users are readers), you can support 5,000 simultaneous physical people before you need to start worrying about scaling, replication, and distributed shenanigans.

If only 5% of those users are writers and the rest are readers / lurkers, you can support 100,000 users simultaneously.

I suspect close to 99% of distributed app architectures are premature optimizations for a scale the company will never reach.

The issue isn’t quite just that. If you are running your app and database on one single server and don’t need to move past that, SQLite is a great solution. But let’s say you have a web server and a worker server or more than one of each. Now you can’t access SQLite from multiple hosts as it doesn’t have a network access model. So your architecture is somewhat limited. You can still utilize it by either putting a network front end on it or setting up an API server in front of it that everything else uses. But at that point you have basically abstracted away SQLite enough that just using Postgres is easier. And with Postgres you get a lot more features that go hand in hand with multiple servers: replication, failover, etc.

I am a huge fan of SQLite but its best use case is a local database for a local application. You can use it for a client-server setup in some special circumstances but in my mind you’d need a compelling reason to do that rather than using the standard solution of something like Postgres.

Exactly. As Dr. Hipp writes on the web site, "SQLite does not compete with client/server databases. SQLite competes with fopen()."

Note that the measurements in the paper were made before they fixed a bug where they confused bits and bytes. So SQLite only used 1/8 of the reserved bloom filter space, thus increasing the false positive rate significantly:

https://sqlite.org/src/info/56d9bb7aa63043f5

I found and reported the bug because I wanted to know how the bloom filters work in SQLite for my uni seminar paper. I'm still wondering whether one could find those kinds of bugs with test cases.

On top of that, I don't think it's fair to say it's 10x faster when the B-tree was tested only on an integer primary key column. Benchmarks with such bold claims should include short string (1-16 chars maybe) and UUID indexes at least.

I do not know if it is still the case, but the last time I looked into the source code, SQLite hashed all strings to the exact same value.

So the bloom filter optimization does not work there.

It had to do with the different ways strings can be compared with collating functions, as strings may be equal even if they have different bytes: https://sqlite.org/forum/forumpost/0846211821

Why do you want a UUID index? Use an integer index and have the UUID in another column.

Because if you want to refer to things by a UUID, now you have two indexes

UUIDs are very wasteful [1]. For most use cases you can replace them with much shorter strings and still have very low chances of collisions [2].

[1] https://henvic.dev/posts/uuid/

[2] https://alex7kom.github.io/nano-nanoid-cc/

Call me crazy, but I'm simply splitting my UUID into its high and low 64 bits and indexing off those.

I.e.:

    CREATE TABLE foo(
        id_ms    UNSIGNED BIG INT NOT NULL,
        id_ls    UNSIGNED BIG INT NOT NULL,
        PRIMARY KEY (id_ms, id_ls)
    ) WITHOUT ROWID;
That works well with UUIDv7 and stores just 128 bits rather than a full string. In most languages it's pretty trivial to turn two longs into a UUID and vice versa.
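
For illustration, a minimal sketch of that round trip (using Python's uuid module; note that SQLite's INTEGER is signed 64-bit, so the unsigned halves may need reinterpreting before storing):

    import uuid

    MASK64 = (1 << 64) - 1

    def uuid_to_longs(u: uuid.UUID) -> tuple[int, int]:
        # Most-significant and least-significant 64 bits of the 128-bit UUID.
        return u.int >> 64, u.int & MASK64

    def longs_to_uuid(ms: int, ls: int) -> uuid.UUID:
        return uuid.UUID(int=((ms & MASK64) << 64) | (ls & MASK64))

    def to_signed64(x: int) -> int:
        # SQLite INTEGER is signed 64-bit; reinterpret values >= 2**63 before storing.
        return x - (1 << 64) if x >= (1 << 63) else x

    u = uuid.uuid4()
    ms, ls = uuid_to_longs(u)
    assert longs_to_uuid(ms, ls) == u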

Sure, at cost of increased complexity of access. Sometimes the waste is worth the simplicity.

It's also a problem in machine learning. Your data might be mangled due to a bug but the NN will still extract something useful out of it. Or, on the flip side, if you make a change to the data and things do break (learning stops converging), you never really know if it's the architecture or the data that's the issue.

How much of a slowdown did you estimate this bug caused?

SQLite only knows nested loop joins, and the bloom filter can just tell us "no need to do the inner lookup, there is definitely no matching entry".

If it has a false positive all the time (the worst case) then the performance is the same as before the bloom filter optimization was implemented (besides the small bloom filter overhead).

As the bloom filter size in SQLite directly depends on the table size, I estimated a false positive rate of 63.2% due to this bug, while it could have been just 11.75%.
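
For reference, the back-of-the-envelope math behind those two numbers (assuming, as the figures above imply, a single hash function and a filter intended to be roughly one byte per row, i.e. 8n bits when correct and only n bits with the bug):

    import math

    # Single-hash bloom filter false positive rate: p ~ 1 - exp(-n/m),
    # with n keys hashed into m bits.
    print(1 - math.exp(-1.0))      # n keys in n bits  (the bits/bytes bug) -> ~0.632
    print(1 - math.exp(-1.0 / 8))  # n keys in 8n bits (the intended size)  -> ~0.1175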

So it would still have performed faster than without the bloom filter optimization, but the false positive rate drastically increased.

I guess the person is asking how much of a slowdown the whole query received.

90%?

The description of nested loop join is confusing; it's mainly a single pass through the outer table with one B-tree probe per row to each inner table.

The linked paper is clearer:

"However, the inner loops in the join are typically accelerated with existing primary key indexes or temporary indexes built on the fly."

"Note that SQLite probes the part table index for every tuple in the lineorder table."

The Bloom filter does not reduce the cardinality of the join; it simply replaces each B-tree probe and filter with a Bloom probe against a pre-filtered key set.

This technique is well-known; the paper cites several examples, and Bloom filter pushdown is common to many commercial systems.

Thanks to its simplicity for development and hosting, SQLite has become our first choice for new apps. We use a number of different ways to work around its single concurrent writer limitation [1]. Whilst we're currently using Litestream for replication, we're evaluating switching to SQLite's native rsync utility [2].

[1] https://servicestack.net/posts/scalable-sqlite

[2] https://www.sqlite.org/rsync.html

Is switching to SQLite really making hosting your web apps less of a headache? Most hosting providers make spinning up your standard client-server RDBMSs (MySQL, Postgres) a breeze.

Dumb question, but how do you use SQLite on Kubernetes? Indeed SQLite doesn't work over NFS (and I guess other remote file shares), so how do you share access to it with other pods?

Without an additional layer you will have to be happy with a single, vertically scaled instance of your application. If you want to resort to horizontal scaling, you can look into something like LiteFS.

Litestream replicates continuously. sqlite3_rsync takes a snapshot. How do you plan to use the latter?

Bloom filters are great; I wish more people knew about them. The most important property of a bloom filter:

It will never have a false negative (and only sometimes a false positive).

We used this to vastly improve render times for comments pages on reddit. We used two tricks. The first was to store the time of your last vote as a first class property on your user object. If you loaded a comments page for a link that was submitted after your last vote, we knew that you couldn't have voted on any of those comments.

But if you had voted afterwards, we had to look up every single comment on the page to see if you had voted on it (we couldn't only do the comments made before your last vote because we didn't know the creation time until after we looked up the comment, and it was faster to just look up the vote).

But with a bloom filter, we could very quickly look up all the comments and get back a list of all the ones you voted on (with a couple of false positives in there). Then we could go to the cache and see if your actual vote was there (and whether it was an upvote or a downvote). Only after a cache miss did we have to actually go to the database.

But that bloom filter saved us from doing sometimes 1000s of cache lookups.
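
For anyone unfamiliar, a toy sketch of the structure in Python (hypothetical names, nothing like reddit's actual code):

    import hashlib

    class BloomFilter:
        # No false negatives; the false positive rate depends on num_bits vs. items added.
        def __init__(self, num_bits: int, num_hashes: int):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray((num_bits + 7) // 8)

        def _positions(self, item: str):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, item: str) -> None:
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    votes = BloomFilter(num_bits=80_000, num_hashes=4)
    votes.add("user123:comment456")
    if votes.might_contain("user123:comment789"):
        pass  # maybe voted: check the cache, then the database
    # False here means "definitely never voted", so no lookup is needed at all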

What happens if the table is one with a large number of deletes? The Bloom filter's false positive rate will keep increasing as time goes on.

One way to address this is to recalculate it every n deletes, but this sounds similar to the AUTOVACUUM issues in PostgreSQL and might result in unexpected drops in performance.

We do rotating filters for that. Items are added to both the current and the next bloom filter; periodically we stop serving from the current one, start serving from the next, delete the former, and start a fresh next filter.
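
Roughly like this (a sketch building on a BloomFilter like the one above; all names are made up). Deleted items simply stop being re-added, so they age out at the next rotation:

    class RotatingBloom:
        # Two generations: new items go into both, lookups only hit the current one.
        def __init__(self, make_filter, rotate_every: int):
            self.make_filter = make_filter
            self.rotate_every = rotate_every
            self.inserts = 0
            self.current = make_filter()
            self.upcoming = make_filter()

        def add(self, item: str) -> None:
            self.current.add(item)
            self.upcoming.add(item)
            self.inserts += 1
            if self.inserts >= self.rotate_every:
                # Drop the old generation, promote the next, start a fresh one.
                self.current, self.upcoming = self.upcoming, self.make_filter()
                self.inserts = 0

        def might_contain(self, item: str) -> bool:
            return self.current.might_contain(item)

    # votes = RotatingBloom(lambda: BloomFilter(80_000, 4), rotate_every=50_000)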

Another option is a cuckoo filter; it is like a bloom filter but allows deletion.

The article says "at the start of the join operation the bits will be set in the bloom filter".

So maybe this is built for every query? It would be needed anyway if there is a WHERE clause on the joined table.

You might reset the single bit position (which would invalidate all matching items), which is better than recalculating everything or mass invalidation.

My guess though is that many use cases are likely data stores that don't experience deletes at all.

SQLite is getting better and better. I am using it in production for a bunch of websites and have never had a problem.

I've been really impressed with SQLite + LiteFS on Fly.io. It's pretty easy to get distributed reads with single-master writes.

If you never got a problem then how is it getting better?

What if we didn't know we had a problem? And now it's "ooooh it's better now".

It packs in more features and improvements over time. One can read the release notes to see that it gets better.

Perhaps it's getting faster?

It should be fine for read-only data. If you want to write, be aware that only one process can write at a time, and if you forget to set busy_timeout at the start of the connection, it defaults to zero milliseconds and you'll get an error if another process has locked the database for writing while you try to read or write it. Client-server databases tend to handle concurrent writers better.
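
For example, a sketch of setting that from Python's sqlite3 (the same pragma works from any client):

    import sqlite3

    con = sqlite3.connect("app.db", timeout=5.0)  # sqlite3's own busy wait, in seconds
    con.execute("PRAGMA busy_timeout = 5000")     # or set it in SQLite itself, in milliseconds
    # With either in place, a locked database is retried for up to ~5s
    # instead of failing immediately with SQLITE_BUSY.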

I think people overstate this.

Yes, the SQLite concurrency model is a bad choice if you have a high degree of concurrent writes. However, for many applications that simply isn't true. When it comes to websites, I think people significantly overestimate the amount of concurrent writes.

This is a baseline benchmark I published of various p9x latencies for multiple readers/writers with a single SQLite database with different configurations set: https://github.com/mqudsi/sqlite-readers-writers

Even if you don’t use message passing to funnel all updates/writes through one thread, it still performs very solidly for many use cases.

No, it's indeed a very real problem. I ran into it with a very small service.

You want to configure it so it has a timeout. Take turns. That’s how locks work.

The only difference between SQLite and Postgres write locking is the granularity.

It's not a trivial difference like you suggest.

I agree it’s not a trivial difference in implementation, and the concurrent write performance is probably a lot worse.

But it’s talked about as if it were a categorical limitation, as though the app will fail if there is any concurrency. Really it’s just a question of how much time will be spent waiting on locks.

A website with a 16 process pool for handling requests will be fine.

SQLite also polls for lock availability.

Depending on the amount of write throughput you need and whether you care about latency, "concurrent writes" aren't necessarily a problem either. You can just shove them into a queue and then have a single thread pull stuff out of the queue and push it into SQLite. That still scales, up to a point.

FYI this is basically what Redis does (which does not support concurrent writes or reads afaik), and that still scales quite a lot.
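
A minimal sketch of that queue-plus-single-writer pattern (Python; the `events` table and all names are purely illustrative):

    import queue
    import sqlite3
    import threading

    write_queue: queue.Queue = queue.Queue()

    def writer_loop(db_path: str) -> None:
        # The only thread that ever writes; everything else just enqueues work.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS events (name TEXT)")
        while True:
            sql, params = write_queue.get()
            con.execute(sql, params)
            con.commit()
            write_queue.task_done()

    threading.Thread(target=writer_loop, args=("app.db",), daemon=True).start()

    # Request handlers enqueue writes instead of contending for the write lock:
    write_queue.put(("INSERT INTO events (name) VALUES (?)", ("signup",)))
    write_queue.join()  # optional: wait until the write has been applied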

No, because your readers are still blocked by writers and still randomly fail if you forgot to set busy_timeout, which is something at least one person didn't know they had to do until they read this comment.

> readers are still blocked by writers

We're all using WAL mode these days, right?

SQLite really isn't meant to be used exactly like a hosted solution. I don't know who is advocating for this.

If you are sharing your database between processes or machines, you are probably doing "the fancy new SQLite thing" wrong.

If you need to share information contained in a SQLite database with other processes or machines, you can write application-level code to handle this concern.

Sharing between processes? Absolutely. Machines? No.

For example, LAMP stack applications could swap the M for SQLite (and I think it would have been better historically if they had).

Sharing between processes isn't impossible, but this is where you get all of your performance caveats.

I think it is a bit unfair to argue against SQLite on performance grounds but then not explore ways to utilize instances of it within a single process scope where it is most effective.

Sharing a single SQLite connection across all threads within a process is where you get the best performance outcomes. Everything is serialized within the SQLite provider - assuming you are using a mainstream build. So, if your storage subsystem is fast (NVMe & friends), you will achieve very good outcomes. Any utilization model that requires a file open/close operation every time you want to touch the data is a complete non-starter by comparison. The "unit of work" paradigm that hosted providers recommend is catastrophic for performance with SQLite.

One option for the "I absolutely must share this SQLite instance with 30+ services" scenario is to simply wrap it with a REST API or similar.

I don't see why performance would be significantly different between multiple threads using the same SQLite db and multiple processes on the same machine. Can you explain more what you mean?

See section 5:

https://www.sqlite.org/lockingv3.html

In the single process access model, you can connect exactly once and remain in a reserved lock state the entire time.

> but this is where you get all of your performance caveats.

You mean the one where it locks on write? It’s totally fine; if you wrote any cross-process code yourself, it’s probably going to do similar locking.

[deleted]

So configure busy_timeout, that’s what it’s for.

I thought WAL mode solves this. Am I misunderstanding the docs, or this SQLite running without a write ahead log?

There are advantages and disadvantages to using WAL instead of a rollback journal. Advantages include:

    WAL is significantly faster in most scenarios.
    WAL provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently.
    Disk I/O operations tends to be more sequential using WAL.
    WAL uses many fewer fsync() operations and is thus less vulnerable to problems on systems where the fsync() system call is broken.*
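
Enabling it is a one-line, persistent pragma per database file; for example, from Python's sqlite3 (any client works the same way):

    import sqlite3

    con = sqlite3.connect("app.db")
    # journal_mode=WAL is sticky: once set, the file stays in WAL mode for all
    # future connections until it is explicitly changed back.
    print(con.execute("PRAGMA journal_mode=WAL").fetchone())  # -> ('wal',)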

Then you have to remember to enable WAL mode. And it still has its caveats, but different ones: it's possible that the WAL file can grow without bound under certain conditions.

[flagged]

I bet you're great to work with.

Next should be this -> https://x.com/lemire/status/1869752213402157131

What progress we have made with these. Amazing times.

Maybe not such a great fit for SQLite:

> One of the challenges with binary fuse filters, is that they are immutable once populated, so data cannot be added incrementally, and they consume a significant amount of memory during the populate process

Same restriction with cuckoo filters. Are there any filters better than bloom filters without this restriction?

I think you may be confusing the strict immutability of binary fuse filters with the "degraded" behavior (i.e. needing to resize the hash table once it's "full") of cuckoo filters (https://en.wikipedia.org/wiki/Cuckoo_filter) under high hash table load.

In any event, to address your question, most of the time people truncate to some round number of bytes (8, 16, 32, 64 bits, really) because this is very cheap, but it’s actually barely more expensive to truncate to any number of bits smaller than 64 with masks & shifts. Doing so with a back-to-back array of such truncated b-bit numbers (https://github.com/c-blake/adix/blob/master/adix/sequint.nim) structured as a regular hash table (such as Robin Hood linear probing (https://github.com/c-blake/adix/blob/master/adix/bltab.nim), where a hit or miss will likely only induce a nearly guaranteed single cache line miss up to ~90..95+% utilization) lets you make a filter with many fewer CPU cache misses (~10X fewer, but everything always depends on where you land in the parameter space) than a Bloom filter at a small 2..4X cost in more space.

There are some example numbers and a “calculus analysis” at the bottom of https://github.com/c-blake/adix/blob/master/tests/bl.nim, but it’s all only a few hundred lines of Nim and you could re-do a test in your favorite ProgLang. This table does eventually "fill up" like Cuckoo or any fixed malloc'd arena at which point people usually double/whatever to make more space resulting in a linear amortized cost growing up from zero. I would just call this a b-bit hash existence filter or maybe b-filter or bit-level filter if you want to get brief.

FWIW, I believe this was even understood by the aboriginal paper by Bloom in his 1970 CACM article (https://cacm.acm.org/research/space-time-trade-offs-in-hash-...) based upon his footnote 2, though I think he was referring less to a CPU cache and more to "loading a whole word of memory at a time" like the 36-bit-word IBM mainframes of the day, though these are similar mathematically (just think of a 64B cache line as a 512-bit aligned word). Somehow it got lost in the teaching of Bloom filters.

To speculate on that "somehow", once you have a new dimension (accuracy in space-time-accuracy here), it is easy/natural to only consider "projections", but such oversimplifications can lead one astray. E.g., most people I know optimize for space only to indirectly optimize for time, but most discussion on this topic is about space-accuracy.

You can convert static data structures like these into dynamic ones with about a logarithmic slowdown. So it might still be worthwhile.

(You could also combine a static filter with a dynamic bloom filter in front. A bit like generational garbage collection.)

It's also not clear that a better filter would result in a substantial improvement for SQLite. If the improvement is on the order of tens or hundreds of nanoseconds per query, that's nothing compared to hitting disk.

Edit: I guess you could get really efficient at detecting things not in the database, but as soon as you get a single false or true positive, the process comes to the equivalent of a screeching halt if you need to actually check whether it's there.

[deleted]

Just a thought: just because a general problem is NP-hard doesn't mean that we can't find specific solutions quickly or that a given input is hard to search. If the downstream effect is an order of magnitude less work, it makes sense; it's just a tradeoff.

The one I liked was the Postgres genetic query optimizer.

https://www.postgresql.org/docs/17/geqo-pg-intro.html

The theory being that an exhaustive search of all possible query plans is NP-hard and would take too long, so you do a limited, iterative, best-fit search.

My understanding is it never worked super great and would only be used if your query exceeded some high level of complexity. I distinctly remember it being removed at some point, but I see it mentioned in current docs, so I am probably wrong about that.

Anyway I wonder if, with some of the new advances in machine learning, it would be worth revisiting this approach to optimization.

Well yes, heuristics for query planning are a very well researched field.

I was thinking more about solving NP-hard problems. Modern CPUs are fast; if the benefit is worth it for the downstream task, just do it.

Most instances of most NP hard problems are fast and easy to solve in practice.

E.g. you have to go to quite a bit of effort to construct a knapsack problem that's hard to solve.

What complexity class would the problem of constructing only hard-to-solve knapsack (or other) instances be in?

I like to think it's more than a nit to pick, but does anyone else absolutely despise the way the SQL is written in the example? Join table1, table2, table3, table4... then the "on" logic in the WHERE clause, without explicitly defining which columns belong to which table? Completely unsupportable, and it wastes so much time years from now. Please don't write SQL like that, everyone.

"update ... from ..." still uses that syntax and it throws me for a loop every time. I have to stop and think about it while convincing myself it is doing the same thing as "join on".

https://www.postgresql.org/docs/16/sql-update.html

> SQLite does Nested Loop join

Only that? Never anything better? Really?

EDIT: Really.

Section titled Joins here https://sqlite.org/optoverview.html

states:

"SQLite implements joins as nested loops."

That's quite shocking.

When working with MySQL and Postgres, if a nested loop showed up in EXPLAIN, in almost all cases I knew I had botched my query and/or indexes.

If you mean MySQL's EXPLAIN output "Using join buffer (Block Nested Loop)": it's not slow because a nested loop algorithm is being used, it's slow because of the join buffer part, which is an optimization used when it's not possible to immediately get the right row of the inner table via an index.

As far as I know (I might be wrong, I'm not really familiar with MySQL internals), MySQL (like SQLite) generally uses nested loop joins all the time. EXPLAIN only says something in the join buffer case; when using a simple nested loop join, EXPLAIN does not mention the fact that it is using that algorithm.

Fair enough, most of my memories about building fast, complex queries come from my Postgres times.

Though I remember one instance where, while using MySQL in a web app, it turned out that N+1 queries were faster than doing a JOIN.

[flagged]

SQLite is self-described as not open contribution. So yes, by their own measure they've made it more difficult to mainline features (and intentionally so).

I submitted a bug report on SQLite a year or so back (a simple test case only, not a solution). The folks were super nice, and their patch went into the next release.

Me too, repeatedly. I've asked questions, reported bugs, asked for enhancements, made suggestions, submitted patches for consideration, and was always welcomed. Even when I'm asking for stuff that doesn't necessarily align with their goals.

OTOH, I've requested clarification (just some basic documentation really) on the “open contribution” fork of SQLite… and they never documented their own code.

And I'm sorry, I know sarcasm isn't the way here, and is impolite, but that was exactly the point.

Less than a week ago we had a whole thread where, again, we discussed the impossibility of improving SQLite from the outside because it's not “open contribution.”

Well, this is just a great example of a much larger feature that was developed in collaboration with them.

Even if true, it seems like they're doing a pretty good job on their own.

Open contribution isn’t a good in and of itself.