We have seen a lot of paradigm shifts in how we think about persisting data and retrieving it as efficiently as possible. For the longest time, SQL has been our go-to solution for persisting data. SQL is definitely an awesome solution, and that is why it has survived the test of time for so long (fun fact: the first commercial RDBMS was Oracle, released in 1979 [1]) and is in no danger of going away in the near future.
In my opinion, SQL gave us all a near-perfect, generic solution for persisting data. SQL places great importance on things such as data integrity, atomicity, and normalization, so the data is always in a consistent state while maintaining query performance, and it has its own mechanisms, such as joins and foreign keys, to tie that data together. Of course, life is not always fair: this magic comes at the price of scale. SQL datastores are often not horizontally scalable and require manual, application-level logic to shard the data and spread the load. Another big challenge with SQL is the single point of failure, which again comes down to its inability to scale horizontally.
We were taught to think about database design this way, and so that is what we do best. But the majority of applications do not care about the size of the data or how it is stored. Nor do we actually mind if the data is replicated in multiple locations. In fact, storage has become so cheap that we would love to duplicate data wherever possible.
With the intrusion of big data into almost every domain, it does not always make sense to hold on to the SQL way of doing things. I am in no way suggesting that SQL has no future, or that everyone who depends on big data needs to resort to the NoSQL way of doing things. The idea is to prioritize the feature set you need your datastore to have, and be open to picking what serves you best. Keep in mind that using NoSQL and SQL hand-in-hand is not considered a bad practice at all. Just make sure that you are not over-engineering your use case.
There are multiple options on the market right now, but we are going to talk about one such datastore: Cassandra. Why Cassandra? Because that is the one I have the most insight into. Cassandra has now reached a very mature state, with many big shots using it as their datastore; a few worth mentioning are Netflix, Apple, and SoundCloud. So, what is Cassandra all about? Cassandra is a very powerful, battle-tested datastore that provides high availability with high read and write throughput.
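That availability comes largely from replication: every keyspace declares how many copies of each row the cluster keeps, so the loss of a single node does not take the data offline. As a rough sketch (the keyspace name and datacenter name here are invented for illustration), the CQL looks something like:

```cql
-- Hypothetical keyspace: keep 3 replicas of every row in datacenter "dc1",
-- so reads and writes keep succeeding even when a node goes down.
CREATE KEYSPACE IF NOT EXISTS playlists
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3
  };
```

The replication factor, together with the consistency level chosen per query, is how Cassandra trades durability and availability against latency.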
I did talk about how letting go of a few features that relational databases provide can give you benefits such as the ability to scale horizontally, increased availability, and so on. So, what is the compromise we have to make if we choose Cassandra? It is the data model.
The data model is the best way to fine-tune your Cassandra cluster. Relational databases deal with creating logical entities called tables, and relationships among those tables using foreign keys. Since Cassandra does not have any kind of joins or relationships, the data has to be stored in a denormalized fashion. This is actually not a bad thing in Cassandra, but the catch is that we need to know the query access patterns before designing the model. If done properly, the performance we get out of Cassandra is phenomenal. Here is a link that talks about the benchmarking of a Cassandra cluster at Netflix.
I would like to get this going as a series, so I will stop talking about Cassandra now, and we will start off with how to model your data in Cassandra in a different post.