Why NoSQL is still relational

NoSQL has become a buzz word over the last decade, and has gained large popularity over that time. At the time of this writing, MongoDB, the most popular document database, has nearly equal usage to PostgreSQL.

NoSQL databases can be great for getting to market quickly if your timeline is particularly restrictive. You are able to create less tables/collections, create a thin data layer, and even skip the traditional REST routes. Granted, the latter two aren't long-term solutions, but they get the ship sailing. Additionally, if on a rather slim budget, using the full MEAN or MERN stack allows you to only have to worry about hiring developers that know Javascript. No specialists required! From a technical perspective, documents can be retrieved with some auxiliary information already attached with no additional work necessary (i.e. getting a user with an address object already attached, as opposed to having to join two tables to get the same result). As Martin Fowler points out in his 2012 talk, this can have benefits when running on clusters, as you will have less trips between members of a cluster to communicate with different tables.

A trend among new technologies is the notion that they must be adopted in their most "purest" form. However, I argue that NoSQL databases, particularly document databases, should not cause implementers to cast aside the concept of relational models.

Within the category of a NoSQL database lie two sub-categories: Key-Value databases, and Document databases. While Key-Value databases are part of this overarching categorization, we will be focusing on Document databases for the purposes of this discussion, as they are the more common.

Let's look at a situation I faced three years ago when I built a finance management platform using MongoDB as the database of choice. I needed to represent the concept of a user, which had one or more monthly bills, as well as one or more accounts, each with one or more transactions.

In a document database such as MongoDB, the simplest way to represent this would be to have a single user collection, with objects like this:

{
	_id: "1",
	username: "myusername",
	password: "hashedpassword",
	bills: [
		{
			name: "Phone Bill"
			dayOfMonth: 15,
			amount: 56.78,
		},
		{
			name: "Mortgage",
			dayOfMonth: 1,
			amount: 1234.56
		}
	],
	accounts: [
		{
			name: "Bill Expense",
			transactions: [
				{
					date: "2019-01-01T00:00:00.000Z",
					description: "Mortgage",
					amount: -1234.56,
				},
				{
					date: "2019-01-15T00:00:00.000Z",
					description: "Phone Bill",
					amount: -56.78,
				}
			]
		},
		{
			name: "General Expense",
			transactions: [
				{
					date: "2019-01-07T00:00:00.000Z",
					description: "Dunkin' Donuts coffee",
					amount: -2.94
				}
			]
		}
	]
}

Initially, this looks great and simple, and has a few obvious benefits:

  1. We only have to make one top-level query in our UI to get everything that we need, and can just pass it to child components.
  2. We only have to maintain a single schema using something like mongoose.
  3. We only have write, test, and maintain a single endpoint.

Unfortunately, looks can be deceiving, and this comes with a laundry list of drawbacks:

  1. In general, MongoDB hosting providers like MLab only allow up to a set data transfer limit per query, per payment level.
  2. Transferring data over HTTP only allows up to a certain size for data sent in a single request. Users with a large amount of transactions will exceed this. In simple terms: It's not scalable.
  3. Large amounts of memory will be required for simple transactions. Even if you only want a user's accounts, the entire user object must be retrieved.
  4. Testing becomes more complicated. You will generally need to fake the entire user object, bills, accounts, and transactions included, in order to test subsystems.
  5. Migrations become more risky. To simply change the shape of transactions, entire user objects will have to be overwritten.
  6. Two words: race condition. Any change to a bill, account, transaction, or user information will require writing to the same document. In order to avoid race conditions, endpoints will have to be written to use $push to add transactions, and endpoints will have to be more specific in general to avoid overwriting entire user objects per transaction.

Now, let's fast forward to when I had to deal with these situations in reality. I eventually used this application to the point where a single account may have 1000+ transactions. Each transaction had a minimum of three properties: description, amount, and date. Assuming one character each, this still results in 3000 characters, and that's an extremely conservative estimate. The user objects were simply getting too large, and I ultimately faced 413 Payload Too Large errors.

This led to transactions eventually getting their own collection. In the end, the database looked similar to a relational one.

Users collection

{
	"_id": "1",
	username: "myusername",
	password: "hashedpassword",
	accounts: [
		{
			_id: "a",
			name: "Bill Expense"
		},
		{
			_id: "b",
			name: "General Expense"
		}
	],
	bills: [
		{
			name: "Phone Bill"
			dayOfMonth: 15,
			amount: 56.78,
		},
		{
			name: "Mortgage",
			dayOfMonth: 1,
			amount: 1234.56
		}
	]
}

Transactions collection

{
    _id: "f",
    accountId: "a",
    date: "2019-01-01T00:00:00.000Z",
    description: "Mortgage",
    amount: -1234.56
}

Conclusion

NoSQL does not mean non-relational. Relational techniques are about organization and that principle is not null and void when using a NoSQL database like MongoDB. It is entirely okay to think relationally when working with these databases. In the end, queries will be leaner, the API will be under less stress, and with the advent of API methodologies like GraphQL and features like MongoDB's aggregations, the same powerful querying will still be available. Note that accounts and bills are still stored on the user. We still get plenty of data for free when simply requesting a user by _id or email. If I had to boil down the best practice that I've discovered, I would say this:

  • When dealing with an array of objects with a small, finite number of elements, use nested documents. This is exemplified in the bills array in the example above. A bill will not be shared by multiple users, and a user will not have a massive amount of unique bills (100+), so there's no problem having them nested.
  • When dealing with a one-to-very-many relationship, get relational! In the above scenario, transactions are a great example. A user has an unlimited amount of transactions per account. Thus, it makes the most sense to give them their own collection. In this case, no matter how many transactions there are, we will never be forced to query for every single one, giving us full control of how intense the database access needs to be.
  • When the size of any given element in an array is potentially large, give it its own collection too. On one project I've worked on, a particular document would have an array of objects where one property was a Base64 image string. This means there need only be a few before the parent document gets particularly bloated. In this case, you will benefit from having the objects containing the image strings in their own collection. (Or better yet, host the images somewhere and only store a link)