Strange Loop Conference 2009: A St. Louis software developer conference



Speaker Focus: Mike Dirolf on MongoDB

17 Aug 2009
Posted by stloopadm

Mike Dirolf (@mdirolf) will be speaking at the Strange Loop conference about the non-relational database MongoDB. Michael works on MongoDB, specifically on the Python and Ruby client drivers.

Strange Loop: Perhaps the best place to start is the whole notion of a "non-relational" database, which seems to be a hot topic of late. Can you explain what a "non-relational" database is and how data is stored in MongoDB?

Mike: Non-relational databases of various forms have been around for a long time; non-relational is a broad term meaning any database system that doesn't use the relational model. Recently we've seen a resurgence in the use of (and excitement about) non-relational databases, primarily because of their promise of allowing greater scalability than the traditional RDBMS. Popular websites like Google, Facebook, Amazon and Yahoo are all using non-relational databases to scale out their data storage.

Data is stored in MongoDB using the BSON format, which is a general purpose binary representation of "JSON-like" data. Since the database understands this format it is able to "reach inside" of the data and do things like building indexes on specific keys and performing complex queries, even on embedded documents.
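As a rough illustration of what "reaching inside" a document can mean, here is a minimal Python sketch. It is not MongoDB's implementation and does not use BSON: plain dicts stand in for documents, and the helper names (get_path, build_index) and the sample posts are made up for this example.

```python
from collections import defaultdict

# Illustrative only: plain dicts stand in for BSON documents.

def get_path(doc, dotted_key):
    """Follow a dotted key such as 'author.name' into nested dicts."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None  # key is missing somewhere along the path
        value = value[part]
    return value

def build_index(docs, dotted_key):
    """Map each value found at dotted_key to the positions of matching docs."""
    index = defaultdict(list)
    for position, doc in enumerate(docs):
        index[get_path(doc, dotted_key)].append(position)
    return index

# Hypothetical sample data with an embedded "author" document.
posts = [
    {"title": "Hello", "author": {"name": "mike", "city": "NYC"}},
    {"title": "BSON", "author": {"name": "ann", "city": "St. Louis"}},
    {"title": "Again", "author": {"name": "mike", "city": "NYC"}},
]

# Index on a key that lives inside the embedded document.
by_author = build_index(posts, "author.name")
mike_titles = [posts[i]["title"] for i in by_author["mike"]]
```

Because the store can see the keys inside each document, the same lookup works for top-level and embedded keys alike; an opaque key-value store could only hand back the whole value.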

Strange Loop: Do you see developers trying MongoDB or other key-value stores mostly as a way to achieve greater scale at a lower cost or because the programming model is simpler?

Mike: We tend to refer to MongoDB as a document database - the unit of storage is a "document", which can be more complex and information rich than a simple key-value pair (you can think of a document in MongoDB as something akin to a JSON object). In fact, the document model seems to resonate very well with developers - it tends to line up well with the way they think about data in the problems they're solving.

In my experience a lot of the interest and "buzz" about the non-relational space has been the promise of scalability - everybody would like to think that they're working on the next Twitter, so it's reasonable for them to want to store their data using a highly scalable system. What I find really interesting, however, is how many people try MongoDB because of its promise of scalability but end up settling on it because of its ease of use. There's a lot about working with MongoDB that seems to be intuitive to people - the document model I mentioned above, its schema-free nature and dynamic queries all seem to combine to make development fun.

This is true of a lot of the key-value stores as well. They simplify the storage model so much that it can make working with them much easier than working with a more traditional RDBMS. We think that MongoDB is a nice trade-off between the simplicity of a key-value store and the power of an RDBMS, though.

Strange Loop: It seems like the key-value store offers a simpler interface for database creation and data access, but you are trading that off with storing data that embeds more structure and complexity. One benefit of having highly structured data in a relational database is the ability to specify rich declarative SQL queries and let the database optimize the retrieval path. How does MongoDB address the concept of database queries in the face of more complex data? Does the programmer end up doing more work to optimize retrieval?

Mike: I think you point out a major problem that I see with the key-value approach - for a lot of problems your data has more structure than just simple key-value pairs, and it's nice (or even necessary) to be able to exploit that structure and perform queries more complex than just getting a value by key. With MongoDB we're trying to hit the sweet spot - maintaining the performance and scalability of a key-value store while providing as much of the complex functionality of an RDBMS as possible. As mentioned above, the BSON format allows MongoDB to reach into documents and perform complex queries with no extra work on the part of the developer.

One interesting performance-related aspect of MongoDB is the query optimizer. We use concurrent query plan evaluation to ensure good worst-case performance on queries. MongoDB is also lockless - these two features go a long way towards ensuring predictability of query performance (we think that unpredictable performance is a very bad thing).
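One way to picture concurrent query plan evaluation is to advance several candidate plans a step at a time and keep whichever finishes first. The Python sketch below shows that general idea only; it is an assumption-laden toy, not a description of how MongoDB's optimizer is built, and the plan functions and the race helper are hypothetical names.

```python
def full_scan(docs, key, wanted):
    """Plan A: look at every document."""
    found = []
    for doc in docs:
        if doc.get(key) == wanted:
            found.append(doc)
        yield  # one unit of work done
    return found

def index_lookup(docs, index, wanted):
    """Plan B: visit only the positions listed in a prebuilt index."""
    found = []
    for position in index.get(wanted, []):
        found.append(docs[position])
        yield  # one unit of work done
    return found

def race(plans):
    """Step each plan in turn; return (name, result) of the first to finish."""
    running = dict(plans)
    while running:
        for name, plan in list(running.items()):
            try:
                next(plan)
            except StopIteration as done:
                return name, done.value  # generator's return value
    raise ValueError("no plans given")

# Hypothetical data: 1000 documents, 100 of which have n == 3.
docs = [{"n": i % 10} for i in range(1000)]
index = {}
for position, doc in enumerate(docs):
    index.setdefault(doc["n"], []).append(position)

winner, result = race([
    ("full_scan", full_scan(docs, "n", 3)),
    ("index_lookup", index_lookup(docs, index, 3)),
])
```

The worst case is bounded by the fastest plan plus the interleaving overhead, which is the kind of predictability the answer above is describing.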

Strange Loop: How does MongoDB approach the notion of scalability? How does scaling a MongoDB deployment compare to scaling a traditional relational database like MySQL or Oracle?

Mike: Scaling a traditional RDBMS vertically (by adding more power to a single node) is generally pretty simple - the difficulty comes in scaling horizontally (by adding more nodes). The general solution to this problem is partitioning data ("sharding" for the buzzword inclined). Sharding a traditional database generally requires an added layer of application level complexity to determine which node is responsible for which data. More importantly, sharding a traditional database becomes a nightmare when using complex transactions, joins, etc.

MongoDB makes some simplifying assumptions (no complex transaction support, no joins) which allow sharding to be done effectively. We also have an auto-sharding layer which automatically handles sharding your data based on a shard-key which you specify. This can be done in combination with replication, so you can have both failover and scalability. The sharding layer in MongoDB is still in alpha, but the basic functionality (allowing infinite scalability by adding nodes dynamically) is already there.
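The routing decision that a sharding layer takes off the application's hands can be sketched in a few lines of Python. This is a simplified range-based router written for illustration, assuming fixed split points; the class name, node names and split points are hypothetical, and MongoDB's own auto-sharding layer is not shown here.

```python
import bisect

class RangeRouter:
    """Route documents to nodes by comparing a shard key to split points."""

    def __init__(self, shard_key, split_points, nodes):
        if len(nodes) != len(split_points) + 1:
            raise ValueError("need exactly one more node than split points")
        self.shard_key = shard_key
        self.split_points = sorted(split_points)
        self.nodes = nodes

    def node_for(self, doc):
        # bisect_right counts how many split points are <= the key value,
        # which is the position of the node that owns that key range.
        slot = bisect.bisect_right(self.split_points, doc[self.shard_key])
        return self.nodes[slot]

# Hypothetical three-node cluster sharded on "username".
router = RangeRouter("username", ["h", "q"], ["node-a", "node-b", "node-c"])

alice_node = router.node_for({"username": "alice"})
mike_node = router.node_for({"username": "mike"})
zed_node = router.node_for({"username": "zed"})
```

Without joins or multi-document transactions, each read or write touches the single node that owns its shard-key range, which is why those simplifying assumptions make sharding tractable.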

Strange Loop: Another popular non-relational database with similar goals is CouchDB. How does MongoDB differ from CouchDB? What would lead one to choose one versus the other?

Mike: One difference is CouchDB's focus on master-master replication and MVCC conflict resolution. While master-master setups are possible with MongoDB they are not a goal of the system - our replication system is designed for master-slave and auto-failover. If your problem involves intense versioning or offline databases that resync later you probably will have better luck with CouchDB.

A difference that a lot of users point out is the way queries are expressed. CouchDB uses a clever index building scheme to generate indexes which support particular queries. This is an interesting technique, but requires pre-declaring these structures for any queries you want to execute (queries are "static"). MongoDB uses more traditional dynamic queries and allows specifying indexes as the developer sees fit. Dynamic queries are nice for rapid development and for inspecting data administratively. Keeping indexes optional is useful if we don't want an index (e.g. for an insert intensive application). The MongoDB query mechanism also seems to be a lot easier for new users to wrap their heads around.
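To show what "dynamic" means here, the following Python sketch evaluates a query that is just a dict built at run time, with nothing declared in advance. It is a toy loosely modeled on the shape of MongoDB query documents, supporting only equality and a "$gt" comparison; the matches and find helpers and the sample data are invented for this example and are not driver APIs.

```python
def matches(doc, query):
    """True if doc satisfies every condition in query."""
    for key, condition in query.items():
        value = doc.get(key)
        if isinstance(condition, dict) and "$gt" in condition:
            # Comparison condition: the value must exist and be greater.
            if value is None or not value > condition["$gt"]:
                return False
        elif value != condition:
            # Plain condition: the value must be equal.
            return False
    return True

def find(docs, query):
    """Return the documents matching an ad hoc query dict."""
    return [doc for doc in docs if matches(doc, query)]

# Hypothetical sample data.
people = [
    {"name": "ann", "age": 31, "city": "St. Louis"},
    {"name": "bob", "age": 24, "city": "St. Louis"},
    {"name": "cy", "age": 40, "city": "Chicago"},
]

# The query is ordinary data, so it can be assembled on the fly.
older_locals = find(people, {"city": "St. Louis", "age": {"$gt": 30}})
```

This version scans every document; an index would only change how quickly the matching documents are located, not how the query is written.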

MongoDB also focuses heavily on performance - this focus can be seen in a lot of the design decisions behind MongoDB. A good example of this is the query interface - CouchDB relies on REST to interact with the database while MongoDB uses native drivers. The use of native drivers makes driver development a little more complex, but greatly increases performance.

Another difference is that MongoDB plans on supporting auto-sharding as a core component of the system - as far as I know CouchDB is planning on keeping any sharding support as a separate project talking to the database over HTTP (that said, there are some projects working on auto-sharding for CouchDB).

A really good resource for those interested in the differences between the two approaches is here.

Strange Loop: Thanks Mike! I look forward to seeing your talk at Strange Loop!

Mike: I can't wait!
