Data distribution and storage in Apache Cassandra | Perfomatix | Full Stack Engineering Company
Vast amounts of data are generated every day, and once stored, it grows exponentially. Every bit of data has to be collected, stored, refined, queried, analyzed, and operationalized for continuous improvement. This leads to digital products and services that are more efficient and safer.
As people use multiple devices to access and upload data, this asymmetrical growth, especially in IoT networks, produces data that is heterogeneous and often too irregular to fit a rigid schema. Wide-column databases help gain control over such dynamic data.
Consider that you need a data storage solution for an IoT network or application.
The following functionalities are required from a database:
- Storing events of variable length
- Query the massive, fast-growing dataset for insights and iterative, perpetual improvements
- A distributed database that can accommodate evolving and variable-length data on a large scale.
- Scalability and high availability of data without compromising performance
- Manage the data with a query language everyone understands
Apache Cassandra is an open-source, NoSQL, wide-column database that stores and processes large amounts of data. It was initially developed at Facebook to power its inbox search and is presently used by CERN, Apple, GitHub, Netflix, and many other organizations that handle dynamic and exponentially growing data. A Cassandra cluster consists of identical data nodes, which removes bottlenecks and single points of failure, protecting data against loss or corruption.
Cassandra adopts a peer-to-peer data distribution model, whereas most databases, such as Postgres, use a master-slave replication model. The master-slave hierarchy creates bottlenecks in data transfer, since all writes go to a master node while reads are executed on slaves. In Cassandra's cluster architecture, every node can accept both reads and writes and communicates directly with its peers.
Key advantages of using Apache Cassandra as a database are:
- Decentralized database: Each node is capable of serving end-user requests, holding a complete or a partial replica of the database.
- Distributed: Cassandra is distributed among many data nodes or data centers.
- Highly scalable: Each node communicates with only a constant number of other nodes; this allows the application to scale linearly over a massive number of nodes.
- Fault-tolerant: Because data is replicated across a decentralized network, it remains available even if several nodes are unavailable or a data center crashes.
- Variable consistency: The availability and consistency of Cassandra nodes are adjustable by configuring the replication factor and consistency level settings. For example, with a replication factor of 3 on a 3-node cluster and the consistency level set to ALL, all three replicas must acknowledge an operation, giving maximum consistency for that cluster.
- Deployable on cloud or hybrid data environment
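The trade-off between the replication factor and the consistency level can be reduced to simple arithmetic. The sketch below is illustrative only (it is not driver code); it shows how QUORUM, a majority-based consistency level, relates to how many replica failures an operation can tolerate.

```python
# Illustrative sketch: how replication factor (RF) and consistency level (CL)
# interact. QUORUM requires a majority of replicas to acknowledge.

def quorum(replication_factor: int) -> int:
    """Number of replicas that must acknowledge a QUORUM read/write."""
    return replication_factor // 2 + 1

def tolerates_node_loss(replication_factor: int, required_acks: int) -> int:
    """How many replicas can be down while the operation still succeeds."""
    return replication_factor - required_acks

rf = 3
print(quorum(rf))                            # 2 acks needed for QUORUM at RF=3
print(tolerates_node_loss(rf, quorum(rf)))   # 1 replica may be down
print(tolerates_node_loss(rf, rf))           # 0: CL=ALL tolerates no failures
```

This is why CL=ALL gives maximum consistency but minimum availability: losing a single replica blocks the operation, whereas QUORUM survives a minority of failures.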
How does data distribution happen in Apache Cassandra?
The peer-to-peer distribution model in Cassandra allows data to be fully distributed across the cluster as variable-length rows, located by their partition keys. This distribution spans multiple data centers and cloud availability zones to ensure the continuous availability and scalability of the database.
- Tokens are used to determine which node holds what data. A token is a 64-bit integer, and Cassandra assigns ranges of these tokens to nodes. Each token range is owned by exactly one node; adding or removing nodes from a cluster therefore requires redistributing these token ranges among the remaining nodes.
- A row’s partition key is run through a partitioner (a hash function that computes the token of a partition key; Murmur3Partitioner by default) to determine which node owns that row. That’s how Cassandra finds data replicas.
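The mapping from partition key to owning node can be sketched in a few lines. This is a simplified model, not Cassandra internals: MD5 stands in for the hash (as Cassandra's older RandomPartitioner used; the default today is Murmur3), and the node names and token boundaries are made up for illustration.

```python
# Minimal sketch of partition key -> token -> owning node on a token ring.
import hashlib
from bisect import bisect_left

def token_for(partition_key: str) -> int:
    # Hash the key, then interpret the first 8 bytes as a signed 64-bit token.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Each node owns the token range *ending* at its token (illustrative ring).
ring = [(-2**62, "node-a"), (0, "node-b"), (2**62, "node-c")]
tokens = [t for t, _ in ring]

def owner(partition_key: str) -> str:
    t = token_for(partition_key)
    idx = bisect_left(tokens, t)
    # Tokens past the last boundary wrap around to the first node.
    return ring[idx % len(ring)][1]

print(owner("sensor-42"))  # the same key always maps to the same node
```

Because the hash is deterministic, any node can compute the owner of any key locally, with no central directory; this is what makes the peer-to-peer model work.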
Data Modeling in Cassandra
Cassandra exposes a dialect similar to SQL called CQL for its data definition language (DDL) and data manipulation language (DML). While similar to SQL, there is a notable omission: Apache Cassandra does not support join operations or subqueries.
The populated table could look like this:
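The table illustration from the original post did not survive; the following CQL is a hypothetical reconstruction (table and column names are illustrative only) of a sparsely populated wide-column table where different rows set different columns.

```sql
-- Hypothetical 'events' table: each row populates only some columns.
CREATE TABLE events (
    event_id uuid PRIMARY KEY,
    name text,
    address text,
    phone text,
    shoe_size int,
    favorite_color text
);

INSERT INTO events (event_id, name, address, phone)
VALUES (uuid(), 'Alice', '12 Main St', '555-0100');

INSERT INTO events (event_id, name, shoe_size, favorite_color)
VALUES (uuid(), 'Bob', 42, 'green');

-- SELECT * FROM events; returns both rows, with the columns a row
-- never set coming back as null.
```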
In a wide-column store, each row in a table appears to contain all columns, but only some need to be populated; the rest read back as NULL values.
The variable-width-row concept gives flexibility in the events a table can store: one event (row) can have the columns name (string), address (string), and phone (string), while the next event has name (string), shoe_size (int), and favorite_color (string). Both events can be stored as rows in the same table.
How is data added to Cassandra?
Any one of the following conditions needs to be satisfied to add new values or rewrite existing ones.
- Columns already exist in the schema — unused columns in new rows are populated with NULL values during an insert operation;
- Applications can dynamically run alter table commands to add new columns to the schema.
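The second path, evolving the schema at runtime, can be sketched in CQL as follows (the table and column names are illustrative, continuing the hypothetical events example):

```sql
-- Hypothetical: extend the schema on the fly, then insert an event
-- that uses the new column.
ALTER TABLE events ADD loyalty_tier text;

INSERT INTO events (event_id, name, loyalty_tier)
VALUES (uuid(), 'Carol', 'gold');
```

Existing rows are untouched by the ALTER; they simply read back NULL for the new column.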
How is data read in Cassandra?
When a client selects a row with a select statement, all the mutations of that row are gathered and applied in order of their timestamps. If an insert happens first and is followed by an update, the resulting row contains the inserted columns, with the update overwriting the values of any columns it includes. On the other hand, if an update is followed by an insert, the insert overwrites all the columns from the updated row.
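This timestamp-ordered reconciliation can be modeled in a few lines. The sketch below is a toy model, not Cassandra internals: mutations are applied oldest-first, and for each column the value with the newest timestamp wins.

```python
# Toy model of last-write-wins reconciliation at read time.

def resolve_row(mutations):
    """mutations: list of (timestamp, {column: value}) pairs, any order."""
    row = {}
    for _, columns in sorted(mutations, key=lambda m: m[0]):
        row.update(columns)  # newer mutations overwrite older column values
    return row

insert = (100, {"name": "Alice", "phone": "555-0100"})
update = (200, {"phone": "555-0199"})

print(resolve_row([update, insert]))
# {'name': 'Alice', 'phone': '555-0199'} -- the update overwrites only the
# column it contains, regardless of the order the mutations arrive in
```

Note that the result depends only on timestamps, never on arrival order, which is what lets any replica reconcile mutations independently.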
Cassandra is scalable and durable: new machines can be added to increase throughput without downtime. In a master-slave architecture, when the master node shuts down, the database can’t process new writes until a new master is appointed. Cassandra, which doesn’t rely on a master-slave architecture, simply redirects writes to any available node without shutting down the system.
Originally published at https://www.perfomatix.com on March 3, 2020.