To get started, let's first try to understand the CAP theorem. The CAP theorem has three ingredients:
- Consistency: every node in the cluster has the same data at any given instant.
- Availability: the system can always serve requests, with no downtime and the lowest possible response time.
- Partition Tolerance: the system continues to serve even if some link fails and your cluster is split into two or more parts. A message could be lost or a node could crash, but you still want to be able to serve.
Now the CAP theorem states that you can carry home only two out of these three. This is where the difference between RDBMS and NoSQL lies! Let's look at the three combinations we can form here:
- CA - data is consistent between all nodes, as long as all nodes are online, and you can read/write from any node and be sure that the data is the same.
- CP - data is consistent between all nodes, and the system maintains partition tolerance by becoming unavailable when a node goes down.
- AP - nodes remain online even if they can't communicate with each other and will resync data once the partition is resolved, but you aren't guaranteed that all nodes will have the same data (either during or after the partition).
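As a rough illustration of the CP and AP choices above, here is a minimal Python sketch of a two-replica cluster. The `Node` and `Cluster` classes, the `mode` flag, the `partitioned` switch and the `demo` helper are all hypothetical names made up for this example; real systems rely on quorums and replication protocols that this toy model leaves out.

```python
class Node:
    """A single replica holding a small key-value store."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Two replicas joined by a link that can be partitioned."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Node("a")
        self.b = Node("b")
        self.partitioned = False

    def write(self, node, key, value):
        other = self.b if node is self.a else self.a
        if not self.partitioned:
            # Healthy link: apply the write to both replicas.
            node.data[key] = value
            other.data[key] = value
            return
        if self.mode == "CP":
            # Give up availability: refuse the write so replicas never diverge.
            raise RuntimeError("unavailable: other replica unreachable")
        # AP: stay available and accept the write locally; the other replica is now stale.
        node.data[key] = value


def demo(mode):
    """Write once, cut the link, write again; report what each replica holds."""
    cluster = Cluster(mode)
    cluster.write(cluster.a, "stock", 5)
    cluster.partitioned = True
    try:
        cluster.write(cluster.a, "stock", 4)
        accepted = True
    except RuntimeError:
        accepted = False
    return accepted, cluster.a.data["stock"], cluster.b.data["stock"]


print(demo("CP"))  # (False, 5, 5): write refused, replicas still agree
print(demo("AP"))  # (True, 4, 5): write accepted, replicas disagree
```

In this sketch the CP cluster stays consistent by rejecting the second write, while the AP cluster keeps serving and ends up with two different values for the same key until the replicas resync.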
Now let's look at some popular NoSQL users, and then come back to why NoSQL is a good fit for them while RDBMS, in my opinion, will continue to co-exist.
Let's talk about amazon.com first. Their business model requires them to be available all the time. They wouldn't want their site to be down, or to respond slowly, at any moment. So it is essential for them to have the 'A' and 'P' attributes of the CAP theorem, and they would rather give up some of the 'C' for it. Getting an apology from amazon.com saying that an item is out of stock, although it was shown as available earlier, is not as bad as the site itself going down. So if there were one item left and two people put it into their carts simultaneously, that could happen, but given their business model they can find ways to spare their customers this situation. For instance, they could always keep some extra items in stock.
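The single-item scenario can be sketched in a few lines of Python. This is only an illustration under assumptions: the `reconcile` function, the `(timestamp, customer)` order tuples and the customer names are invented for the example and say nothing about how Amazon actually handles orders.

```python
def reconcile(orders_by_replica, stock_on_hand):
    """Merge orders that replicas accepted independently.

    Each order is a (timestamp, customer) tuple. Orders are served oldest
    first; whatever exceeds the physical stock becomes a regret.
    """
    merged = sorted(order for orders in orders_by_replica for order in orders)
    return merged[:stock_on_hand], merged[stock_on_hand:]


# Both replicas showed "1 in stock" and each accepted an order for it.
replica_a = [(1, "alice")]
replica_b = [(2, "bob")]

fulfilled, regretted = reconcile([replica_a, replica_b], stock_on_hand=1)
print(fulfilled)  # [(1, 'alice')]
print(regretted)  # [(2, 'bob')]

# With one extra unit kept in stock as a buffer, nobody gets a regret.
fulfilled, regretted = reconcile([replica_a, replica_b], stock_on_hand=2)
print(regretted)  # []
```

The site stayed available to both customers; the inconsistency only shows up later, at reconciliation time, where a little spare stock can absorb it.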
Similarly, when you think of facebook.com, suppose you post a picture on your wall. It's not a big deal if one of your friends can see that picture right away and another only sees it a few moments later. Again, it doesn't care as much about consistency as it does about availability.
Let's now think about why a cluster, or farm, of servers was needed in the first place. It's because everything you do on the internet is stored in a database. Google, Facebook, Amazon, etc. are examples of companies that keep all this data to provide personalized search, recommendations and so on. This huge amount of data, on the order of petabytes or zettabytes, cannot be stored on one disk. Trying to store all of it on one disk and replicate it to more such disks is a pain, and that is why Google chose to use a farm of several servers with smaller disks. The traditional RDBMS was built to serve best from a single disk, and that is why the people with this much data came up with BigTable, DynamoDB etc.
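One simple way to spread data over a farm of servers is to hash each key and use the hash to pick a server. The sketch below assumes hypothetical server names and a hypothetical `shard_for` helper; it shows plain modulo placement, which is the simplest scheme and not necessarily what any of the systems named above use.

```python
import hashlib
from collections import Counter


def shard_for(key, servers):
    """Pick the server responsible for a key using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 8 bytes as an integer and map it onto the server list.
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["server-0", "server-1", "server-2", "server-3"]  # hypothetical names

# The same key always lands on the same server.
assert shard_for("user:42", servers) == shard_for("user:42", servers)

# Many keys spread out across the farm.
placement = Counter(shard_for(f"user:{i}", servers) for i in range(10_000))
print(placement)
```

A known drawback of modulo placement is that changing the number of servers moves most keys to a different server, which is why schemes such as consistent hashing are often preferred when servers are added or removed frequently.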
And as we near the end of this article, it's important to have a look at some NoSQL databases. There are many out there, which can be broadly divided into 4 categories:
- Column: HBase, Accumulo
- Document: MarkLogic, MongoDB, Couchbase
- Key-value: Dynamo, Riak, Redis, Cache, Project Voldemort
- Graph: Neo4J, Allegro, Virtuoso
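To make the four categories a little more concrete, here is a rough sketch of how the same user record might be shaped under each model, using plain Python structures. The record, the field names and the layouts are assumptions chosen for illustration; they are not the storage format of any of the products listed.

```python
import json

# Key-value: the store only sees a key and an opaque value.
key_value = {"user:42": json.dumps({"name": "Ada", "city": "London"})}

# Document: the store understands the nested structure and can query inside it.
document = {"_id": 42, "name": "Ada", "address": {"city": "London"}, "follows": [7]}

# Column: row key -> column family -> column -> value; rows may have different columns.
column = {
    "42": {
        "profile": {"name": "Ada", "city": "London"},
        "follows": {"7": ""},
    }
}

# Graph: entities are nodes, relationships are first-class edges.
nodes = {42: {"name": "Ada"}, 7: {"name": "Alan"}}
edges = [(42, "FOLLOWS", 7)]

# The same fact ("user 42 is called Ada") is reachable in every shape.
assert json.loads(key_value["user:42"])["name"] == "Ada"
assert document["name"] == "Ada"
assert column["42"]["profile"]["name"] == "Ada"
assert nodes[42]["name"] == "Ada"
```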
Note that there isn't a concrete line between the 4 types. As an example, the document-oriented databases and the key-value databases can resemble each other to some extent at times. So the boundaries are a little fuzzy. To conclude, I would say NoSQL databases are popular and are good in certain circumstances, but when you come to something like banking you really need ACID compliance and therefore an RDBMS. So in my opinion they will co-exist, as they do today.
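To show what ACID compliance buys you in the banking case, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the account names and the `transfer` and `balances` helpers are hypothetical; the point is only that a failed transfer leaves no half-applied change behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts atomically; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # If this debit violates the CHECK constraint, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 30))   # True
print(balances(conn))                       # {'alice': 70, 'bob': 80}
print(transfer(conn, "alice", "bob", 500))  # False: alice cannot cover it
print(balances(conn))                       # {'alice': 70, 'bob': 80}
```

The second transfer credits bob first and then fails on the debit, yet bob's balance is unchanged afterwards: either both updates happen or neither does.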