Databases are misguided
Dan Dascalescu, July 2009
Claim: databases only exist because CPU-addressable computer memory is volatile and persistent storage is slow.
Introduction: RDBMS vs. ODBMS
Why does magnetic or optical storage exist at all? Because RAM loses its contents when powered off, and CPUs can only deal directly with data in RAM. This means that if you have a set of objects in RAM and wish to "save" their state, you have to "persist" them in external storage. This is commonly done in two ways:
using traditional relational databases (RDBMS). This necessitates converting the complex relationships between objects in memory into sets of simple relationships that can be stored in sets of related flat tables and then retrieved using SQL queries. For example, tree structures can't be stored directly in an RDBMS:
Most users at one time or another have dealt with hierarchical data in a SQL database and no doubt learned that the management of hierarchical data is not what a relational database is intended for. The tables of a relational database are not hierarchical (like XML), but are simply a flat list. Hierarchical data has a parent-child relationship that is not naturally represented in a relational database table.
This is called "hitting the relational wall".
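The adjacency-list workaround behind the "relational wall" can be sketched with Python's standard sqlite3 module (table and column names here are illustrative, not from the original article): a tree has to be flattened into (id, parent_id) rows, and rebuilding even a small subtree then takes one extra query per node visited.

```python
import sqlite3

# A small category tree flattened into (id, parent_id, name) rows --
# the usual "adjacency list" workaround for hierarchies in an RDBMS.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)")
rows = [(1, None, "Electronics"),
        (2, 1,    "Televisions"),
        (3, 1,    "Portable"),
        (4, 3,    "MP3 players")]
db.executemany("INSERT INTO category VALUES (?, ?, ?)", rows)

def subtree(parent_id):
    """Rebuild the hierarchy: one extra query per node visited."""
    children = db.execute(
        "SELECT id, name FROM category WHERE parent_id = ?", (parent_id,)
    ).fetchall()
    return {name: subtree(node_id) for node_id, name in children}

print(subtree(1))  # {'Televisions': {}, 'Portable': {'MP3 players': {}}}
```

The tree exists only by convention in the parent_id column; the database itself sees a flat list and offers no native way to ask for "the whole subtree".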
using object-oriented databases (ODBMS). ODBMS have a tremendous conceptual advantage: they can serialize objects to disk and deserialize them back transparently; there is no relational wall and the programmer doesn't have to use any tricks to represent any data structures, hierarchical or not.
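The ODBMS-style transparency described above can be approximated in Python with the standard pickle module (a rough analogy, not a full ODBMS): an arbitrary object graph, hierarchy included, round-trips in one call, with no mapping to flat tables in between.

```python
import pickle

class Node:
    """An arbitrary in-memory structure -- here, a tree node."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

root = Node("root", [Node("left"), Node("right", [Node("leaf")])])

# "Persist" the whole object graph, hierarchy and all, in one call...
blob = pickle.dumps(root)
# ...and restore it later -- no relational wall, no schema.
restored = pickle.loads(blob)

assert restored.children[1].children[0].name == "leaf"
```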
While RDBMS benefit from decades of development, they are a conceptual strait-jacket for object-oriented programming. Relational databases also tend to be used for many more types of applications than they were designed to handle. Read more at SQL Databases Are An Overapplied Solution (And What To Use Instead).
Why ODBMS at all
We have seen that what computer programs really need to persist their data is a way to transparently store objects to disk and restore them at a later time. They need this for two reasons:
- RAM has orders of magnitude smaller capacity than disk storage
- RAM is volatile
Also, it would be highly impractical for CPUs to read/write small chunks of object data from disk all the time and use extremely little RAM, because
- disk storage is orders of magnitude slower than RAM
But what if these three facts about RAM and disk storage stop holding true in the near future?
It turns out that with the advent of solid-state memory, external storage has become much faster. But the key is not really super-fast SSDs: it is nanowire memory, which, while still in the works, promises terabyte capacity with data reading, retrieval and erasing 1000 times faster than Flash memory.
Fast-forward a few years into the future: external storage is as fast as RAM, non-volatile, and larger than today's largest hard drives. What will be the point of shuffling data between RAM and a "NanoDrive", and why would a program care where its data is, as long as it's accessible? At a higher level, this kind of location-agnosticism of an object's storage already happens all the time. Programs don't generally know whether portions of their working memory have been swapped out of RAM. Code that simply accesses my_object will transparently load it from the swap file into RAM. With non-volatile, high-capacity, fast RAM, the swapping layer will simply disappear. Programs could just as well be stored in nano-memory and "frozen", along with all their variables, when the system is powered down. The CPU can access and manipulate the nano-memory directly, and we finally attain the confluence of RAM and external storage.
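Today's closest everyday analogue of this location-agnosticism is memory-mapped I/O, sketched below with Python's standard mmap module (file name and contents are illustrative): the program indexes into what looks like ordinary memory, and the operating system pages the bytes in from disk on demand.

```python
import mmap
import os
import tempfile

# Put some bytes in a backing file on disk.
path = os.path.join(tempfile.mkdtemp(), "backing")
with open(path, "wb") as f:
    f.write(b"persistent bytes")

with open(path, "r+b") as f:
    mem = mmap.mmap(f.fileno(), 0)  # map the whole file into the address space
    # Plain slicing -- the OS transparently pages data in from disk;
    # the code never says "load this from external storage".
    first_word = bytes(mem[:10])
    mem.close()

print(first_word)  # b'persistent'
```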
This is very likely to happen. And it does make databases, especially relational ones, misguided.
But we still have to use databases at the moment, so ODBMS are the way to go for now. As for serialization, it will only be necessary in the future scenario for data transfer, because network connections are by definition serial.
A number of objections could be brought to this argument:
Q: What about enormous databases that use server farms, cluster computing, or similar technologies? They sure need external storage.
A: Each server in a cloud currently needs its own CPU (or set of CPUs running in parallel) to access a set of hard drives. Imagine we replace the hard drives with nano-memory, and change nothing in the logic of the database server software, except the low-level disk access. Everything still works as before. At that stage, what will be the point of shuffling data between RAM and the "NanoDrives"?
Q: How about isolating data from code?
A: Object-oriented programming facilitates that already: properties (attributes) are separated from methods. It is trivial to delineate areas of RAM that contain only data from those that contain code. Such data can be exported to other forms of storage, for complete isolation.
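This separation can be made concrete with a minimal Python sketch (class and attribute names are invented for illustration): the methods live on the class, the data lives on the instance, and the instance's data can be exported on its own.

```python
import json

class Account:
    """Methods (code) live on the class; attributes (data) on the instance."""
    def __init__(self, owner, balance):
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount

acct = Account("alice", 100)
acct.deposit(50)

# The instance's data is already isolated in its __dict__ and can be
# exported to another form of storage without any of the class's code.
exported = json.dumps(acct.__dict__)
print(exported)  # {"owner": "alice", "balance": 150}
```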
Q: RDBMS are closer in their data representation to physical storage than ODBMS.
A: Not really. See Berkeley DB, an extremely fast key-value database. BDB can be used as a backend for the Perl ODBMS library KiokuDB with code as simple as:
$uuid = $kioku->store($object);
# ...
$object = $kioku->lookup($uuid);
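For comparison, the same store-by-key, look-up-by-key pattern can be sketched with Python's standard shelve module, which pickles objects into a dbm key-value file (class and file names here are illustrative, and shelve is only a rough stand-in for KiokuDB/BDB):

```python
import os
import shelve
import tempfile
import uuid

class Preferences:
    """Any picklable object can go into the shelf."""
    def __init__(self, theme):
        self.theme = theme

path = os.path.join(tempfile.mkdtemp(), "store")
key = str(uuid.uuid4())

# Store an object under a key; the shelf pickles it into a dbm file...
with shelve.open(path) as shelf:
    shelf[key] = Preferences("dark")

# ...and it can later be looked up by the same key.
with shelve.open(path) as shelf:
    restored = shelf[key]

print(restored.theme)  # dark
```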
Q: It will probably take decades before nanowire memory becomes commercially practical.
A: Possible; but keep in mind that we're going through a rethinking of how manufacturing is done. For example, the "big 3" automakers used to tightly control their suppliers of automotive parts, with no outside contractors being allowed to bring their innovations into the system. That is now changing quite radically (read the excellent Wired article Beyond Detroit: On the Road to Recovery, Let the Little Guys Drive).
Companies will not really have to retool their production lines and transform them to make nanowire memory instead of Flash memory. New, small, dynamic producers will probably show up on the storage scene, and competition will select the best.
Q: What if nanowire memory turns out to be vaporware?
A: My point that databases are misguided doesn't depend on nanowire memory; any technology with equivalent performance will do. To claim that such technology will not be developed in the next 10-15 years is a bit pessimistic. To start new projects using RDBMS is simply short-sighted.
Q (thanks Mike Schilli): Databases have rock-solid, tested code. If your application has a bug, it may corrupt all the data in the persistent RAM, and there's no going back.
A: In any modular project, application-level code is separated from the ORM-layer code. The latter would be a standard library (like KiokuDB above), with robustness equivalent to that of a database kernel, and would not corrupt data on its own. Also, garbage in, garbage out: if your application-level code has a bug and writes corrupt data to the database, the data is toast anyway, unless you made backups. So the solution is to make backups of the data in RAM (see the data-isolation question above).