An overview of architecture and modeling in Cassandra

April 16, 2020, by Priya Saha

Cassandra uses a peer-to-peer architecture rather than a master-slave architecture, which is prone to single point of failure (SPOF) problems. Cassandra is deployed on multiple machines, with each machine acting as a node in a cluster. Data is autosharded, that is, automatically distributed across the nodes using key-based sharding: the key of each data element determines which node stores it. Each key-value data element is also replicated to other nodes in the cluster (a replication factor of 3 is typical; it is configured per keyspace) for high availability and fault tolerance. If a node goes down, the data can be served from another node that holds a copy.
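The key-based placement described above can be sketched as a small token ring. This is a minimal illustration, not Cassandra's implementation: the `Ring` class, the node names, and the use of MD5 as the hash are assumptions made for the example, and replicas are simply taken as the next nodes clockwise on the ring.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is used purely for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical token ring: each node owns the key range ending at its token."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self.tokens = sorted((token(name), name) for name in nodes)

    def replicas(self, key):
        """Return the owner of `key` followed by the next nodes clockwise."""
        positions = [t for t, _ in self.tokens]
        # First node whose token is >= the key's token, wrapping around the ring.
        start = bisect.bisect_left(positions, token(key)) % len(self.tokens)
        count = min(self.replication_factor, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = Ring(nodes)
owners = ring.replicas("user:42")
print(owners)  # three distinct nodes; any of them can serve "user:42"
```

Because the placement depends only on the key and the node tokens, every node can compute the same replica list independently.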

Sharding is an old concept used for distributing data across different systems. Sharding can be horizontal or vertical. In horizontal sharding, in the case of an RDBMS, data is distributed on the basis of rows, with some rows residing on one machine and the remaining rows residing on other machines. Vertical sharding is similar to columnar storage, where columns can be stored separately in different locations.
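The difference between the two styles can be shown on a tiny in-memory table. This is only a sketch: the sample rows, the modulo rule for picking a shard, and the function names are assumptions chosen for illustration.

```python
rows = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Linus", "city": "Helsinki"},
    {"id": 3, "name": "Grace", "city": "New York"},
    {"id": 4, "name": "Alan", "city": "Manchester"},
]


def horizontal_shards(rows, shard_count):
    """Split by row: each shard holds complete rows for a subset of ids."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[row["id"] % shard_count].append(row)
    return shards


def vertical_shards(rows, column_groups):
    """Split by column: each shard holds some columns (plus the id) for every row."""
    return [
        [{"id": row["id"], **{col: row[col] for col in group}} for row in rows]
        for group in column_groups
    ]


by_row = horizontal_shards(rows, 2)
by_column = vertical_shards(rows, [["name"], ["city"]])
```

Here `by_row` contains two shards with two full rows each, while `by_column` contains two shards that each cover all four rows but only one of the non-key columns.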

The Hadoop Distributed File System (HDFS) uses volume-based sharding, where a single big file is split into blocks of a fixed size and the blocks are distributed across multiple machines. For example, if the block size is 64 MB, a 640 MB file will be split into 10 chunks and placed on multiple machines.
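The block arithmetic in this example is easy to check. The helper below is a hypothetical function written for this article, not part of any Hadoop API; it only computes block sizes and says nothing about where the blocks are placed.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of the given size is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller than the block size
    return sizes


blocks = split_into_blocks(640)
print(len(blocks))  # 10
```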

The same autosharding capability is used when new nodes are added to Cassandra: the new node becomes responsible for a specific key range of the data. The details of which node holds which key ranges are coordinated and shared across the cluster using the gossip protocol. So, whenever a client wants to access a specific key, any node can locate the key and its associated data within a few milliseconds. When the client writes data to the cluster, the data is written to the nodes responsible for that key range. However, if a node responsible for that key range is down or not reachable, Cassandra uses a mechanism called Hinted Handoff, which lets another node in the cluster hold the write temporarily and deliver it to the responsible node once that node is back in the cluster.
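The Hinted Handoff idea can be sketched in a few lines. This is a simplified model under stated assumptions: the `Node` class, the `write` and `replay_hints` functions, and the in-memory hint list are all hypothetical, and real hints are also subject to storage and expiry rules that are ignored here.

```python
class Node:
    """Hypothetical in-memory node used only for this illustration."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}   # key -> value stored on this node
        self.hints = []  # (target_node, key, value) held for unreachable nodes


def write(coordinator, replicas, key, value):
    """Write to every replica; keep a hint on the coordinator for any that are down."""
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
        else:
            coordinator.hints.append((replica, key, value))


def replay_hints(coordinator):
    """Deliver stored hints to targets that are back up; keep the rest for later."""
    pending = []
    for target, key, value in coordinator.hints:
        if target.up:
            target.data[key] = value
        else:
            pending.append((target, key, value))
    coordinator.hints = pending


a, b, c = Node("a"), Node("b"), Node("c")
c.up = False                        # replica c is unreachable during the write
write(a, [b, c], "user:42", "Priya")
c.up = True                         # c rejoins the cluster
replay_hints(a)                     # a hands the missed write over to c
```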

Replication raises the concern of data inconsistency, because replicas can hold different states for the same data. Cassandra uses mechanisms such as anti-entropy and read repair to solve this problem and synchronize data across the replicas. Anti-entropy is used at the time of compaction, a concept borrowed from Google BigTable. Compaction in Cassandra refers to the merging of SSTables; it optimizes data storage and improves read performance by reducing the number of seeks across SSTables.

Compaction also handles deletion. Unlike in a traditional RDBMS, all deletes in Cassandra are soft deletes: the records still exist in the underlying data store but are marked with a special flag so that they do not appear in query results. Records marked as deleted are called tombstones. Major compactions handle these tombstones by removing them from the SSTables in the underlying file store.

Cassandra, like Dynamo, uses a Merkle tree data structure to represent the data state at the column family level in a node. This Merkle tree representation is used during major compactions to find the differences in data state across nodes so that they can be reconciled.
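The way compaction merges SSTables and drops tombstones can be sketched as follows. The sketch assumes each SSTable is a plain dictionary of `key -> (timestamp, value)`, and the `TOMBSTONE` marker and `compact` function are hypothetical names; it also purges tombstones immediately, whereas Cassandra keeps them for a grace period before removing them.

```python
TOMBSTONE = object()  # marker standing in for a soft-deleted value


def compact(sstables, purge_tombstones=True):
    """Merge SSTables, keeping only the newest (timestamp, value) for each key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    if purge_tombstones:
        # A tombstone that won the merge means the key is deleted: drop it entirely.
        merged = {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}
    return dict(sorted(merged.items()))  # SSTables are kept sorted by key


older = {"a": (1, "apple"), "b": (1, "banana")}
newer = {"b": (2, TOMBSTONE), "c": (2, "cherry")}
result = compact([older, newer])
print(result)  # {'a': (1, 'apple'), 'c': (2, 'cherry')}
```

After the merge, a read for any key touches one table instead of two, and the deleted key `b` no longer occupies space.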

A Merkle tree, or hash tree, is a tree in which every leaf is labeled with the hash of a block of data and every non-leaf node is labeled with the hash of the labels of its child nodes. This allows efficient and secure verification of the contents of a large data structure: if two trees have the same root hash, the underlying data is the same.
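A small sketch shows how two replicas could use such a tree to find out which data differs without comparing every record. The code assumes both replicas split their data into the same number of leaves in the same order; the function names, the SHA-256 hash, and the `key=value` leaf format are illustrative choices, not Cassandra's actual format.

```python
import hashlib


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(leaves):
    """Build the tree bottom-up; returns a list of levels, leaves first, root last."""
    level = [_hash(leaf.encode("utf-8")) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # pair an odd last node with itself
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(levels_a, levels_b):
    """Walk down from the root, following only subtrees whose hashes differ."""
    suspects = [0]
    for depth in range(len(levels_a) - 1, 0, -1):
        children = []
        for i in suspects:
            if levels_a[depth][i] != levels_b[depth][i]:
                for child in (2 * i, 2 * i + 1):
                    if child < len(levels_a[depth - 1]):
                        children.append(child)
        suspects = children
    return [i for i in suspects if levels_a[0][i] != levels_b[0][i]]


replica_a = ["k1=v1", "k2=v2", "k3=v3", "k4=v4"]
replica_b = ["k1=v1", "k2=v2", "k3=stale", "k4=v4"]
tree_a, tree_b = merkle_levels(replica_a), merkle_levels(replica_b)
print(differing_leaves(tree_a, tree_b))  # [2]
```

When the root hashes match, the walk stops immediately, which is why the comparison is cheap when replicas are already in sync.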

Cassandra, like Dynamo, falls under the AP part of the CAP theorem and offers a tunable consistency level. Cassandra provides multiple consistency levels, as illustrated in the following table:

| Operation | ZERO | ANY | ONE | QUORUM | ALL |
| --- | --- | --- | --- | --- | --- |
| Read | Not supported | Not supported | Reads from one node | Reads from a majority of the nodes with replicas | Reads from all the nodes with replicas |
| Write | Asynchronous write | Writes on one node, including hints | Writes on one node, with commit log and Memtable | Writes on a majority of the nodes with replicas | Writes on all the nodes with replicas |
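The trade-off behind tunable consistency can be expressed with simple arithmetic: a read is guaranteed to see the latest write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N). The sketch below assumes the level names ONE, QUORUM, and ALL for "one node", "a majority of nodes", and "all nodes"; the function names are hypothetical.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a request at the given level."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def reads_see_latest_write(read_level, write_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > N)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


print(reads_see_latest_write("QUORUM", "QUORUM", 3))  # True
print(reads_see_latest_write("ONE", "ONE", 3))        # False
```

With a replication factor of 3, reading and writing at QUORUM touches 2 + 2 replicas, so at least one replica in every read has the latest write; reading and writing at ONE is faster but gives no such guarantee.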

A summary of the features in Cassandra

The following table summarizes the key features of Cassandra with respect to its origins in Google BigTable and Amazon Dynamo:

| Feature | Cassandra implementation | Google BigTable | Amazon Dynamo |
| --- | --- | --- | --- |
| Architecture | Peer-to-peer architecture, ring-based deployment architecture | No | Yes |
| Data model | Multidimensional map: (row, column, timestamp) -> bytes | Yes | No |
| CAP theorem | AP with tunable consistency | No | Yes |
| Storage architecture | SSTable, Memtables | Yes | No |
| Storage layer | Local filesystem storage | No | No |
| Fast reads and efficient storage | Bloom filters, compactions | Yes | No |
| Programming language | Java | No | Yes |
| Client programming language | Multiple languages supported: Java, PHP, Python, REST, C++, .NET, and so on | Not known | Not known |
| Scalability model | Horizontal scalability; multi-node deployment rather than a single-machine deployment | Yes | Yes |
| Version conflicts | Timestamp field (not a vector clock, as is usually assumed) | No | No |
| Hard deletes/updates | Data is always appended using the timestamp field; deletes and updates are soft appends that are cleaned up asynchronously as part of major compactions | Yes | No |


Cassandra packs the best features of two technologies proven at scale: Google BigTable and Amazon Dynamo. However, Cassandra has since evolved beyond these origins with unique, enterprise-ready features such as the Cassandra Query Language (CQL), support for collection columns, lightweight transactions, and triggers.
