This post has been coming for some time now. I am in love with this new world of a thousand databases. Whatever your use case is, there is a database for you. In the old days you figured out how to keep up MySQL or Postgres and that was the hammer you used to bang in every screw. Forget screwdrivers, that’s another thing you’d have to learn and it took you a long time just to figure out this hammer. Which is why I’m not all that surprised when everyone finds their new shiny screwdriver and are trying to bang on all their nails with it.
The database world has spent a long time reassuring you that a database is a place for you to put things permanently and that they won’t corrupt or degrade over time if you do things right. Remember when memcached wasn’t a database? That’s because you don’t put things in to memcached permanently. Now we call memcached a database, not because it changed, but because how we evaluate software to determine if it’s a database has changed. Databases are more dynamic about where they sacrifice durability and consistency for performance and the old requirements on durability aren’t the test we use to determine if a database is good for our use case, instead we have to understand our use case and decide what we need.
I’m going to pick on MongoDB because they have a cavalier attitude toward durability, in the pursuit of perceived performance, that isn’t clear to most of the developers that 10gen is encouraging to drop existing durable solutions (mostly MySQL and Postgres). I’m a CouchDB guy and I work on CouchDB stuff for a living but I’m going to talk a lot about Redis and Cassandra and even some relational databases because they help tell a bigger story about performance and durability in the larger world of databases that you couldn’t understand by just looking at CouchDB and MongoDB or just Postgres and MongoDB. This is from me, not my employer, as a database geek.
Redis is an in-memory data structure server. Think of it like a place on the network to stick your global objects so that you can scale your application servers horizontally sharing their state in Redis. It’s incredibly fast, necessarily, to handle it’s use case properly. By default Redis returns a response once it’s finished putting your write in to memory, in other words once it’s accessible to be read. Optionally you can set Redis to only return after it’s written to it’s append-only log (by log I mean an internal transaction log, most SQL databases do something similar, more on this later).
Cassandra has a long list of options you can use for what kind of assurances you want when you do a write which you can set on the client. These allow you to say things like “only return a response after you’ve written this data to 3 other nodes”. This is one of the things that they mean by “fully consistent”, you have assurances that your data is persisted to multiple nodes in a cluster if you like.
CouchDB has two options. One returns responses as soon as the document is available and does an fsync() every second (delayed_commits). But, we encourage people to run in production with delayed_commits off. This means that all the pending writes are fsync()’d and when that fsync() returns any other pending writes are flushed (this is sometimes referred to as a “group commit”) and the responses to the clients are only returned after the fsync() is finished. Under concurrent load running with delayed_commits off has the same throughput as fsync()ing every minute, the only difference is that the responses to the clients might return 10-100ms after but it’s probably better to have on-disc assurances and in the real world a write delay to a client that is sub-second isn’t a huge deal. Relational databases have very similar “group commit” and “delayed commit” features in their fsync() strategy.
MongoDB, by default, doesn’t actually have a response for writes. You just write your data to the socket and assume it’s going to be available when you try to read it. Under concurrent load you can’t expect to reliably store stateful data like session information like this. It’s kind of like, in your webapp, if you were to spawn a thread to do some work and at the end set some global but you return a response to the client immediately. Where there wasn’t any load you would be fine because your thread is faster than the roundtrip time to the client. But under heavy load the operations in the thread could take too long and a subsequent request wouldn’t have access to that global because it wasn’t set yet. This is why you can’t store session data reliably in MongoDB without changing the default client option to return a response, because no matter how “fast” it is if you send a response to the client (or no response at all) before the data is available for read it’s useless.
To people who don’t live in databases all day it’s hard to explain just how odd this choice is. I don’t know of another database that even allows you to return a response before the write data is accessible much less not even have a response by default. This is kind of like using UDP for data that you care about getting somewhere, it’s theoretically faster but the importance of making sure your data gets somewhere and is accessible is almost always more important.
It’s no secret that people are hitting IO concurrency issues with traditional Python and Ruby web frameworks. Solutions to IO concurrency problems are gaining traction; erlang, node.js, nginx, EventMachine, Tornado, and all of these technologies use at least some non-blocking IO to limit the amount of overhead a connection has. Languages like Ruby and Python traditionally use threads, or in some cases system processes, which have a static per-connection cost taken out of the available resources (usually available memory). Once you hit the upper limit of how many connections you can have open at once you have to start limiting the amount of time each connection is open if you want to continue to increase your requests per second. Not waiting for a response from MongoDB cuts down the overall connection time. This must seem like a silver bullet for people scaling their Rails or Django app but it’s entirely possible that under load your users aren’t actually seeing the data they write or are having to hit refresh in order to finish their login a few seconds after the session data is actually available.
After many many years of database engineering most databases have come to the same conclusion. Using an append-only file is the preferred, sane and most assured way to handle data loss or corruption. This is hard for a lot of people to understand but it’s not just a
SQL databases tend to keep an append-only “log”, a sequential list of every transaction on the database. This way it can always be replayed to recover from corruption. Redis keeps a similar log in an append-only file. CouchDB actually uses an append-only btree on disc for it’s entire database removing the need for a traditional “log”.
The trick to append-only file formats is to write a “header” to the end of the file after every operation. If a crash happens in the middle of a write you just find that last header and disregard everything after it. If you notice some corruption on disc it’s easy to isolate the space between operations. If you write to a file format “in place” instead of using append-only the complexity of tracking down corruption and invalid writes is mind boggling.
The catch with append-only is that you have to “compact” or “vacuum” from time to time. Since every delete and update operation is actually adding to the size of the file you’ll eventually want to write a new file with only the latest versions of the data dramatically reducing disc usage.
MongoDB does not keep an append-only log, and does not use an append-only file format for it’s db. It writes in place to it’s on disc format. They’ve stated that this is “faster” and that compaction is costly. The actual write being faster is a puzzling statement and I believe it to be entirely untrue. The advantage of writing to an append-only file is that all the writes are sequential which are significantly faster on spinning discs and even SSDs, seeks aren’t free so the individual write operations in place will never be faster than append-only.
In CouchDB compaction does not lock the database (nothing can lock the database) so the only way compaction might be “costly” is if you triggered it during heavy write load (the new writes and the compaction task would both be competing for use of the disc). Which is why CouchDB doesn’t do automatic compaction, you are supposed to trigger it when your write load isn’t peaking.
As I said before, at some point in the life of your database a disc write will fail. Not keeping something append-only around is incredibly concerning. Stories like this are dubious not because they expose a few bugs in MongoDB but because they show inherent architectural problems you cannot overcome long term without something append-only. MongoDB encourages you to throw heavy load at it by touting their performance but everyone’s load looks a little different and when MongoDB does fall it falls hard and you’re left with whatever the last backup was assuming that predates any of your corruption.
For the most part, when you write data you want to be assured that it’s going to stay there and be accessible. I touched on this a little earlier when I talked about the request/response differences between everyone and MongoDB. Understanding the difference between when something is availability and when something is persisted to disc is also important. Traditionally something isn’t to be considered “guaranteed” until the fsync() to disc finishes.
Some databases, like Cassandra, actually take this a little further offering “full consistency” across a cluster of nodes giving assurances that fsync’s across multiple nodes are complete. CouchDB uses “eventual consistency” which is to say that a single CouchDB node has your data on disc and will replicate with other nodes at some point in the future. CouchDB allows you to take nodes offline and bring your entire database with you on your devices including mobile phones so “full consistency” across nodes that might be offline is actually impossible. This is a good example of differences in use cases when you decide what granularity of consistency you need for you application.
Redis is an in-memory database (with a soon to be released version using a hybrid of memory and virtual memory) but they keep around an append-only file so that you can also bring back up a node after it’s crashed and get your data. There are 3 options for how it will fsync(). One is “never”, it just sends writes to the kernel and lets the kernel flush the data to the append only file when it feels like it. Another option is every second and the last option is “always” which does a “group commit” style continuous fsync().
MongoDB writes to a mem-mapped file and lets the kernel fsync it whenever the kernel feels like it. More recent versions added a feature that does an fsync() to disc every minute if the kernel hasn’t done it already. So, at any time you could lose up to one minute of your data if the node goes down. That’s longer than most, if not all, other databases that persist at all to disc (when Redis lets the kernel handle it exclusively it might actually be longer).
When you look at MongoDB more critically I don’t see how you could actually justify using it for anything resembling the traditional role of a database. They have a great feature list which makes you think of all these things you can do with it but most if not all of those things will require it to not lose the data you put in it. If you wanted a cache with great indexing, or you needed to store data quickly that was too large to put in to memory but didn’t care about losing it, MongoDB would be a good choice. Of course, this is not what they market it as. 10gen’s sizable marketing effort promotes MongoDB as the new M in your LAMP stack, or Ruby/Python equivalent, without addressing their differences in durability with your existing M and almost any alternative “NoSQL” database.
The clincher for me was when Josh Berkus told me that he had an in-memory version of Postgres and once you turn off the log and all the durability it’s neck and neck with MongoDB write performance and I thought “who would be crazy enough to run that in production” but then I realized it’s about the same reliability you get with MongoDB.
It’s sad but in this gold rush of new databases where you can find a database that fits your exact use case so many people are choosing one that just isn’t really fitting for theirs. If “NoSQL” is going to survive as a movement of replacements for relational databases it’ll have to do it with proper durability and consistency guarantees like their RDBMS counterparts when the use case necessitates it. Most have delivered on these guarantees, some have not.
I work on CouchDB and to people who are writing webapps in Ruby and Python it probably looks like we’re competing with MongoDB but in reality we aren’t. We’re putting CouchDB on your mobile phone so you can take your applications and data with you and work on it offline. We’re trying to extend the web platform to mobile and be the glue HTML5 uses cross-mobile. But, I still find uses for Redis and love it when I need it. I don’t have a warehouse full of data but if I did I’m sure I’d take a serious look at Cassandra. There are great databases out there and people should understand them and use them when you have the use case they are built for.