CAP Theorem
CAP Theorem was proposed by Eric Brewer (professor of computer science at the University of California, Berkeley, and vice president of infrastructure at Google) in year 2000. Eric found that there are 3 attributes of a distributed system which has multiple nodes and communicate with each other over network. CAP Theorem is an acronym which stands for Consistency Availability & Partition Tolerance. These 3 technical terms can be further elaborated as below. CAP theorem is primarily related with distributed persistence layer of your application having multiple system/nodes and indicates how robust, flexible and available your data set no matter what is the scale and how many transactions it is supporting.
C which stands for Consistency (ACID Transaction)indicates that every read will have access to recently persisted (or written) data set. So if you write a data set to node-1 and try to read it from node-2, if system promises consistency, it must be return the recently written data set.A which stands for Availability (Total Redundancy)indicates that every slave or data node which is close to data set (data locality) execute queries if alive and in cluster. P which stands for Partition Tolerance(infinite scale out) indicateseven if the connections between nodes in a cluster are down, the Availability of data and consistency of data set promises, are intact.
Any system which satisfies two out of these 3 properties is a distributed system. In context of Hadoop , it supports the Availability(A) and PartitionTolerance(P) property. The Consistency property is not supported because only name-node has the information of where the replicas are placed. This information is not available with each and every node of the cluster.
How to achieve all 3 in CAP & limitations
- High Availability (HA) or availability is achieved by duplicating/replicating a copy of data set (or state of data) across different nodes in a clustered environment.
- Consistency or access to recently updated data (most recent state of data) is achieved by updating several nodes before allowing further reads.
- Total partitioning, meaning failure of part of the system is rare. However, we could look at a delay, a latency, of the update between nodes, as a temporary partitioning. It will then cause a temporary decision between A and C
CAP Theorem And Supporting Technology
C+A : Traditional relational database like MySQL, Oracle, PostgreSQL etc. These system are available and highly consistent and primarily used for transaction-al system development where data availability and consistency are primarily objective. Low latency is achieved by different techniques like indexing and caching for master data and high availability is achieved using passive or slave machines.
A+P: There are cases where we expect a minor degree of inaccuracy for consistency like travel site where prices may vary over time and end user is ready to compromise with it. When we are using travel site, it may happen that while booking my choice, system tells that the seats or reservations are no more available and in such cases, we can use HBase, couchDb, Riak, DynamoDB or Cassandra. These storage systems support clustered data base (partitioned tolerance) and 100% availability but consistency will not be met (and they call it eventually consistent system). So cassandra is used for your like counts or wishlist item etc.
C+P: Now there are cases like chat system where messages should be displayed in correct order even if it comes with little delay but they should be consistent. Storage system like HBase, MongoDB, Redis or MemcacheDB are they technology where Consistency is paramount and it also support partitioned tolerance but availability is in question.