If you're looking for Cassandra interview questions and answers for experienced engineers or freshers, you are at the right place. There are a lot of opportunities at many reputed companies around the world. According to research, Cassandra has a market share of about 0.3%, so you still have the opportunity to move ahead in your career in Cassandra engineering. Mindmajix offers Advanced Cassandra Interview Questions 2018 to help you crack your interview and build your dream career as a Cassandra engineer.
Q: Can you define the meaning of Cassandra?
Cassandra is one of the most popular NoSQL systems: a distributed database management solution from Apache. It is an open source technology designed to manage very large volumes of data without a single point of failure. It is highly scalable, was originally designed at Facebook, is written in Java, and supports flexible schemas. Among the different types of NoSQL databases, Cassandra is a hybrid of a column-oriented store and a key-value store.
Q: What are the types of composite in Cassandra?
Cassandra has a composite type that lets a column name or key be built from components of different data types. It is used in two places, namely:
1. Row Key
2. Column Name
Q: Can you explain more about the Cassandra Data Model?
Generally, the Cassandra data model has four main components:
1. Keyspace: the namespace that groups multiple column families, typically one per application
2. Cluster: made up of multiple nodes, and can hold multiple keyspaces
3. Column: consists of a name, a value and a timestamp
4. Column family: a set of columns referenced by a row key
Q: Can you explain the keyspace in Cassandra?
A keyspace specifies how data is replicated across the nodes. Typically there is one keyspace per application.
Q: Can you explain in brief on CQL?
Users access Cassandra through its nodes using the Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers use cqlsh, a command-line prompt, to work with CQL.
Q: Explain SSTable in detail.
SSTable stands for Sorted String Table. It is the important data file in Cassandra: memtables are regularly flushed to disk as SSTables. SSTables exist for each Cassandra table. Because an SSTable is immutable, data items cannot be added or removed once it has been written. For every SSTable, Cassandra also creates companion files such as the summary, the index and the bloom filter.
Q: Explain Thrift in brief.
Thrift is the name of the Remote Procedure Call (RPC) client that is used to communicate with the Cassandra server.
Q: What is Cassandra-Cqlsh?
cqlsh is the command-line shell through which users communicate with the database using CQL. With cqlsh, it is possible to perform:
1. Data insertion
2. Schema defining
3. Query execution
Q: Explain the bloom filter in detail.
A bloom filter is an off-heap data structure that is checked to see whether an SSTable may contain the requested data before any disk I/O is performed.
Q: Explain different components of the Cassandra write
A Cassandra write has three components, in this order:
1. Commitlog write
2. Memtable write
3. SSTable write
Cassandra first writes the data to the commit log, then to the in-memory memtable structure, and finally to an SSTable.
Q: What is zero consistency?
Under zero consistency, write operations are handled asynchronously in the background. It is the quickest possible way to write data.
Q: Explain the values that are stored in a Cassandra column.
A Cassandra column stores the following values:
1. Timestamp
2. Value
3. Column Name
Q: Explain Kundera in detail.
Kundera is an object-relational mapping (ORM) implementation for Cassandra, written using Java annotations.
Q: What are the types of NoSQL databases?
It is categorized into four different types which are:
1. Column Stores
2. Document Stores
3. Key-Value Stores
4. Graph Stores
Q: Do you know the meaning of Commit log in Cassandra?
The commit log is Cassandra's crash-recovery mechanism. Every write operation is written to the commit log.
Q: What is memtable?
Like a table, a memtable holds data, but it is an in-memory, write-back cache of content in key/column format. The data in a memtable is sorted by key. Each ColumnFamily has its own distinct memtable, from which column data is retrieved by key. The memtable accumulates writes until it is full, and is then flushed to disk.
Q: When is the right time to avoid indexes?
Secondary indexes should generally be avoided on columns that contain a high count of unique values, because queries on such columns are likely to produce few results.
Q: At what point does a Cassandra write put the changed data into the commitlog?
Cassandra appends changed data to the commitlog first. The commitlog acts as a crash-recovery log for the data. A write operation is not considered successful until the changed data has been appended to the commitlog.
Q: Can you brief about the Cassandra Compaction?
Compaction is a maintenance process in Cassandra in which SSTables are reorganized so that the data is optimized for the on-disk data structure. Compaction comes into play as memtables are flushed into new SSTables. There are two types:
1. Minor compaction: started automatically when a new SSTable is created. It merges SSTables of similar size into one.
2. Major compaction: triggered manually with nodetool. It merges all SSTables of a ColumnFamily into one.
Q: What is Replication Factor?
The replication factor is the number of copies of the data that exist in the cluster. Raising the replication factor increases the number of copies kept in the cluster.
Q: Why should we choose Cassandra?
1. Compared with traditional or other databases, Cassandra delivers near real-time performance, which makes developers' work much simpler. It is designed for software engineers, developers, administrators and data analysts, to name a few.
2. Instead of a master-slave design, Cassandra uses a peer-to-peer architecture, so there is no single point of failure.
3. It offers great flexibility, since nodes can be added to a Cassandra cluster in any datacenter.
4. Cassandra is highly scalable: you can easily scale up and down as needed.
5. It is considered one of the best platforms for data-intensive applications.
Q. What is NoSQL?
NoSQL (sometimes expanded to “not only SQL”) is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways.
In contrast to RDBMS, NoSQL systems:
NoSQL implementations can be categorised by their manner of implementation:
Q. Explain what is Cassandra?
Cassandra is an open source data storage system developed at Facebook for inbox search and designed for storing and managing large amounts of data across commodity servers. It can serve as both
Q. Why Cassandra? Why not another NoSQL database such as HBase?
Apache Cassandra is an open source, free to use, distributed, decentralized, elastically and linearly scalable, highly available, fault-tolerant, tunably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web. Cassandra is usually placed in the AP (availability and partition tolerance) bucket of the CAP theorem, with consistency that can be tuned per operation.
Our use case was write-intensive. Cassandra provides high availability together with tunable consistency, which our use case required, so we preferred Cassandra.
HBase is really good for low-latency read/write use cases.
Q. Explain Cassandra Data Model.
– The Cassandra data model has 4 main concepts: cluster, keyspace, column and column family.
– Clusters contain many nodes (machines) and can contain multiple keyspaces.
– A keyspace is a namespace to group multiple column families, typically one per application.
– A column contains a name, value and timestamp.
– A column family contains multiple columns referenced by a row key.
Q. Explain about Cassandra NoSQL.
Cassandra is an open source, scalable and highly available “NoSQL” distributed database management system from Apache. Cassandra claims to offer fault-tolerant linear scalability with no single point of failure. Cassandra sits in the Column-Family NoSQL camp. The Cassandra data model is designed for large-scale distributed data and trades ACID-compliant data practices for performance and availability. Cassandra is optimized for very fast and highly available writes. Cassandra is written in Java and can run on a vast array of operating systems and platforms.
Q. Explain how Cassandra writes.
Cassandra writes first to a commit log on disk for durability then commits to an in-memory structure called a memtable. A write is successful once both commits are complete. Writes are batched in memory and written to disk in a table structure called an SSTable (sorted string table). Memtables and SSTables are created per column family. With this design Cassandra has minimal disk I/O and offers high speed write performance because the commit log is append-only and Cassandra doesn’t seek on writes. In the event of a fault when writing to the SSTable Cassandra can simply replay the commit log.
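The write path described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: the `ToyNode` class, its `memtable_limit` flush threshold and the in-memory lists standing in for on-disk files are all hypothetical and are not Cassandra's actual implementation.

```python
class ToyNode:
    """Toy model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # in-memory key -> value map
        self.sstables = []     # each flushed SSTable is an immutable sorted tuple
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the log for durability
        self.memtable[key] = value            # 2. then update the in-memory table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return True  # acknowledged once both steps are done

    def flush(self):
        # 3. write the memtable out as a sorted, immutable table
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []  # flushed data no longer needs replay

    def replay(self):
        # After a crash the memtable is lost; rebuild it from the commit log.
        self.memtable = dict(self.commit_log)


node = ToyNode()
node.write("b", 2)
node.write("a", 1)  # second write reaches the limit and triggers a flush
node.write("c", 3)  # stays in the memtable and the commit log
```

In this sketch the first two writes end up in one sorted SSTable, while the third is recoverable only through `replay()` until the next flush.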
Q. What platforms does Cassandra run on?
Cassandra is a Java application, meaning that a compiled binary distribution of Cassandra can run on any platform that has a Java Runtime Environment (JRE), which provides the Java Virtual Machine (JVM). DataStax strongly recommends using the Oracle Sun Java Runtime Environment (JRE), version 1.6.0_19 or later, for optimal performance. Packaged releases are provided for RedHat, CentOS, Debian and Ubuntu Linux platforms.
Q. What is the CQL Language?
Cassandra 0.8 is the first release to introduce the Cassandra Query Language (CQL), the first standardized query language for Apache Cassandra. CQL pushes all of the implementation details to the server in the form of a CQL parser. Clients built on CQL only need to know how to interpret query result objects. CQL is the start of the first officially supported client API for Apache Cassandra. CQL drivers for the various languages are hosted with the Apache Cassandra project.
CQL Syntax is based on SQL (Structured Query Language), the standard for relational database manipulation. Although CQL has many similarities to SQL, it does not change the underlying Cassandra data model. There is no support for JOINS, for example.
Q. What management tools exist for Cassandra?
DataStax supplies both a free and a commercial version of OpsCenter, which is a visual, browser-based management tool for Cassandra. With OpsCenter, a user can visually carry out many administrative tasks, monitor a cluster for performance, and do much more. Downloads of OpsCenter are available on the DataStax website.
A number of command line tools also ship with Cassandra for querying/writing to the database, performing administration functions, etc.
Cassandra also exposes a number of statistics and management operations via Java Management Extensions (JMX). JMX is a Java technology that supplies tools for managing and monitoring Java applications and services. Any statistic or operation that a Java application has exposed as an MBean can then be monitored or manipulated using JMX.
During normal operation, Cassandra outputs information and statistics that you can monitor using JMX-compliant tools such as JConsole, the Cassandra nodetool utility, or the DataStax OpsCenter centralized management console. With the same tools, you can perform certain administrative commands and operations, such as flushing caches or doing a repair.
Q. Briefly Explain CAP theorem.
The CAP theorem (also called Brewer’s theorem after its author, Eric Brewer) states that within a large-scale distributed data system, there are three requirements that have a relationship of sliding dependency: Consistency, Availability, and Partition Tolerance.
CAP theorem states that in any given system, you can strongly support only two of these three.
Q. Why is Cassandra called a decentralized NoSQL database?
Cassandra is distributed, which means that it is capable of running on multiple machines while appearing to users as a unified whole. Cassandra is decentralized, which means that there is no single point of failure. All of the nodes in a Cassandra cluster function exactly the same; there is no master and no slave.
Q. What do you understand by Elastic Scalability?
Elastic scalability means that your cluster can seamlessly scale up and scale back down. Adding more servers to the cluster improves and scales its performance in a linear fashion without any manual intervention, and the reverse is equally true.
Q. Cassandra is said to be tunably consistent. Why?
Consistency essentially means that a read always returns the most recently written value. Cassandra allows you to easily decide the level of consistency you require, in balance with the level of availability. This is controlled by parameters like replication factor and consistency level.
Q. How does Cassandra achieve high availability and fault tolerance?
Cassandra is highly available. You can remove a few failed nodes from the cluster without losing any data and without bringing the whole cluster down. In a similar fashion, you can also improve performance by replicating data to multiple data centers.
Q. What is the basic difference between a data center and a cluster in terms of Cassandra?
A data center is a collection of related nodes. It can be a physical data center or a virtual data center. Replication is set per data center, and depending on the replication factor, data can be written to multiple data centers. However, a data center should never span physical locations, whereas a cluster contains one or more data centers and can span physical locations.
Q. What is a commit log?
It is a crash-recovery mechanism. All data is written first to the commit log (file) for durability. After all its data has been flushed to SSTables, it can be archived, deleted, or recycled.
Q. What is SSTable? Is it similar to RDBMS table?
A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are append only and stored on disk sequentially and maintained for each Cassandra table.
An RDBMS table, by contrast, is a collection of ordered columns fetched by row.
Q. What is Gossip protocol?
Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster.
Q. What is Bloom Filter?
Bloom filters are quick, nondeterministic algorithms for testing whether an element is a member of a set. They can report false positives but never false negatives. Bloom filters are consulted on every query.
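A minimal Python sketch of a bloom filter makes the membership test concrete. The bit-array size, the number of hash functions and the use of salted MD5 digests are assumptions chosen for illustration; they are not Cassandra's actual parameters or hash function.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: may give false positives, never false negatives."""

    def __init__(self, size=64, hashes=3):
        self.size = size            # number of bits (illustrative value)
        self.hashes = hashes        # number of hash functions (illustrative value)
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions by salting the item with an index.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("row-1")
bf.add("row-2")
```

A `False` answer from `might_contain` lets a reader skip the disk lookup entirely; a `True` answer still has to be confirmed against the actual data.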
Q. What is Order Preserving partitioner?
This is a kind of Partitioner that stores rows by key order, aligning the physical structure of the data with your sort order. Configuring your column family to use order-preserving partitioning allows you to perform range slices, meaning that Cassandra knows which nodes have which keys. This partitioner is somewhat the opposite of the Random Partitioner; it has the advantage of allowing for efficient range queries, but the disadvantage of unevenly distributing keys.
The order-preserving partitioner (OPP) is implemented by the org.apache.cassandra.dht.OrderPreservingPartitioner class. There is a special kind of OPP called the collating order-preserving partitioner (COPP). This acts like a regular OPP, but sorts the data in a collated manner according to English/US lexicography instead of byte ordering. For this reason, it is useful for locale-aware applications. The COPP is implemented by the org.apache.cassandra.dht.CollatingOrderPreservingPartitioner class.
Q. What are key spaces and column family in Cassandra?
In Cassandra, the logical division that associates similar data is called a column family. The basic Cassandra data structures are the column, which is a name/value pair (plus a client-supplied timestamp of when it was last updated), and the column family, which is a container for rows that have similar, but not identical, column sets. Each row has a unique identifier called a row key. A keyspace is the outermost container for data in Cassandra, corresponding closely to a relational database.
Q. What is the use of HELP command?
It is used to display a synopsis and a brief description of all cqlsh commands.
Q. What is the use of the CAPTURE command?
The CAPTURE command captures the output of a command and appends it to a file.
Q. What is a materialized view? Why is it normal practice in Cassandra to have it?
“Materialized” means storing a full copy of the original data so that everything you need to answer a query is right there, without forcing you to look up the original data. Because you don’t have a SQL WHERE clause, you recreate this effect by writing your data to a second column family that is created specifically to represent that query.
Q. Why is the timestamp so important while inserting data in Cassandra?
It is important because Cassandra uses timestamps to determine the most recent write value.
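The "most recent write wins" rule can be sketched in a few lines of Python. The `resolve` helper and the sample values are hypothetical, made up purely for illustration.

```python
def resolve(replica_values):
    """Pick the (value, timestamp) pair with the highest timestamp."""
    return max(replica_values, key=lambda pair: pair[1])


# Three replicas answer with (value, write timestamp); one has a newer write.
responses = [
    ("alice@old.example", 1000),
    ("alice@new.example", 2000),
    ("alice@old.example", 1000),
]
winner = resolve(responses)
```

Here `winner` is the pair carrying timestamp 2000, because it is the most recent write among the responses.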
Q. Why are super columns in Cassandra no longer favoured?
Super columns suffer from a number of problems, not least of which is that it is necessary for Cassandra to deserialize all of the sub-columns of a super column when querying (even if the result will only return a small subset). As a result, there is a practical limit to the number of sub-columns per super column that can be stored before performance suffers.
In theory, this could be fixed within Cassandra by properly indexing sub-columns, but consensus is that composite columns are a better solution, and they work without the added complexity.
Q. What are advantages and disadvantages of secondary indexes in Cassandra?
Querying becomes more flexible when you add secondary indexes to table columns. You can add indexed columns to the WHERE clause of a SELECT.
When to use secondary indexes: you want to query on a column that isn’t the primary key and isn’t part of a composite key, and the column has few unique values. For example, a Town column is a good choice for secondary indexing because lots of people will be from the same town; date of birth, however, is not such a good choice.
When to avoid secondary indexes: avoid secondary indexes on columns that contain a high count of unique values and whose queries will produce few results. Remember that an index makes writing to the database much slower, that you can find a value only by exact match on the index, and that a lookup by index needs to send requests to all servers in the cluster.
Q. How do you query Cassandra?
We query Cassandra using CQL (Cassandra Query Language). We use cqlsh for interacting with the database.
Q. What is cqlsh?
It’s a Python-based command-line client for Cassandra.
Q. Does Cassandra work on Windows?
Yes, Cassandra works pretty well on Windows. Right now we have Linux-compatible and Windows-compatible versions available.
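Q. Why is denormalization preferred in Cassandra?
This is because Cassandra does not support joins. Data is denormalized so that each query can be answered from one place; if a join is still needed, the client application has to perform it on its own side.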
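Q. What is the use of the CONSISTENCY command?
The CONSISTENCY command in cqlsh shows the current consistency level or sets a new one. Copying data between Cassandra and a file is the job of the separate COPY command.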
Q. Does Cassandra Support Transactions?
Yes and No, depending on what you mean by ‘transactions’. Unlike relational databases, Cassandra does not offer fully ACID-compliant transactions. There are no locking or transactional dependencies when concurrently updating multiple rows or column families. But if by ‘transactions’ you mean real-time data entry and retrieval, with durability and tunable consistency, then yes.
Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation. Nor does it roll back when a write succeeds on one replica but fails on other replicas. It is possible in Cassandra to have a write operation report a failure to the client, but still actually persist the write to a replica.
However, this does not mean that Cassandra cannot be used as an operational or real time data store. Data is very safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.
Q. What is Compaction in Cassandra?
The compaction process merges keys, combines columns, evicts tombstones, consolidates SSTables, and creates a new index in the merged SSTable.
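The merge step can be sketched in Python as follows. This is a simplified illustration: SSTables are modelled as plain dictionaries of key to (value, timestamp), the `TOMBSTONE` marker and `compact` function are hypothetical names, and tombstones are dropped immediately, whereas a real system keeps them for a grace period before evicting them.

```python
TOMBSTONE = object()  # hypothetical marker for a deleted value


def compact(sstables):
    """Merge SSTables given as dicts of key -> (value, timestamp)."""
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            # Keep only the newest version of each key.
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    # Evict tombstones and emit the surviving keys in sorted order.
    return {k: merged[k] for k in sorted(merged) if merged[k][0] is not TOMBSTONE}


old = {"a": ("1", 10), "b": ("2", 10)}
new = {"b": (TOMBSTONE, 20), "c": ("3", 20), "a": ("1b", 5)}
result = compact([old, new])
```

In this example key `b` disappears because its newest version is a tombstone, and key `a` keeps the value with the higher timestamp.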
Q. What is Anti-Entropy?
Anti-entropy, or replica synchronization, is the mechanism in Cassandra for ensuring that data on different nodes is updated to the newest version.
Q. What do you understand by Consistency in Cassandra?
Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas.
Q. Explain Zero Consistency?
At this level, write operations are handled in the background, asynchronously. It is the fastest way to write data, and the one that offers the least confidence that operations will succeed.
Q. Explain Any Consistency?
It assures that our write operation was successful on at least one node, even if the acknowledgment is only for a hint. It is a relatively weak level of consistency.
Q. Explain ONE consistency?
It is used to ensure that the write operation was written to at least one node, including its commit log and memtable.
Q. Explain QUORUM consistency?
A quorum is a number of nodes that is used to represent the consensus on an operation. It is determined by (replication factor / 2) + 1, with the division rounded down.
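As a worked example, a quorum is commonly computed as the replication factor divided by two (rounded down) plus one. The helper names below are hypothetical; the second function checks the usual rule of thumb that reads and writes overlap on at least one replica when the read and write replica counts together exceed the replication factor.

```python
def quorum(replication_factor):
    """Number of replicas that form a quorum."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)  # 2 replicas out of 3
overlap = reads_see_latest_write(q, q, rf)
```

With a replication factor of 3, a quorum is 2, so quorum reads combined with quorum writes always share at least one replica.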
Q. What is the use of SOURCE command?
SOURCE command is used to execute a file that contains CQL statements.
Q. Explain ALL consistency?
Every node as specified in your configuration entry must successfully acknowledge the write operation. If any nodes do not acknowledge the write operation, the write fails. This has the highest level of consistency and the lowest level of performance.
Q. What are consistency levels for read operations?
Q. What do you mean by hinted handoff?
It is a mechanism to ensure availability, fault tolerance, and graceful degradation. If a write operation occurs and a node that is intended to receive that write goes down, a note (the “hint”) is given (“handed off”) to a different live node to indicate that it should replay the write operation to the unavailable node when it comes back online. This does two things: it reduces the amount of time that it takes for a node to get all the data it missed once it comes back online, and it improves write performance in lower consistency levels.
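The idea can be sketched with a toy model in Python. The `Replica` class and the `write` and `replay_hints` functions are hypothetical names for illustration only; real hint storage, expiry and delivery are considerably more involved.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target replica, key, value) kept for down nodes


def write(coordinator, replicas, key, value):
    """Write to every live replica; store a hint for each one that is down."""
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
        else:
            coordinator.hints.append((replica, key, value))


def replay_hints(coordinator):
    """Deliver stored hints to targets that are back online."""
    remaining = []
    for target, key, value in coordinator.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining


a, b, c = Replica("a"), Replica("b"), Replica("c")
c.up = False
write(a, [a, b, c], "k", "v")  # c misses the write; a keeps a hint for it
c.up = True
replay_hints(a)                # c receives the missed write
```

After the replay, replica `c` holds the same value as the others and the coordinator's hint list is empty.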
Q. What is Merkle Tree? Where is it used in Cassandra?
Merkle tree is a binary tree data structure that summarizes in short form the data in a larger dataset. Merkle trees are used in Cassandra to ensure that the peer-to-peer network of nodes receives data blocks unaltered and unharmed.
Q. What do you mean by multiget?
It means a query by column name for a set of keys.
Q. What is Multiget Slice?
It means query to get a subset of columns for a set of keys.
Q. What is a SEED node in Cassandra?
A seed is a node that already exists in a Cassandra cluster and is used by newly added nodes to get up and running. The newly added node can start gossiping with the seed node to get state information and learn the topology of the node ring. There may be many seeds in a cluster.
Q. What is Slice and Range slice in Cassandra?
This is a type of read query. Use get_slice() to query by a single column name or a range of column names. Use get_range_slice() to return a subset of columns for a range of keys.
Q. What is Tombstone in Cassandra world?
Cassandra does not immediately delete data following a delete operation. Instead, it marks the data with a “tombstone,” an indicator that the column has been deleted but not removed entirely yet. The tombstone can then be propagated to other replicas.
Q. What is Thrift?
Thrift is the name of the RPC client used to communicate with the Cassandra server.
Q. What is Batch Mutates?
Like a batch update in the relational world, the batch_mutate operation allows grouping calls on many keys into a single call in order to save on the cost of network round trips. If batch_mutate fails in the middle of its list of mutations, there will be no rollback, so any updates that have already occurred up to this point will remain intact.
Q. What is Hector?
Hector is an open source project written in Java using the MIT license. It was one of the early Cassandra clients and is used in production at Outbrain. It wraps Thrift and offers JMX, connection pooling, and failover.
Q. What is Kundera?
Kundera is an object-relational mapping (ORM) implementation for Cassandra written using Java annotations.
Q. What is Random Partitioner?
This is a kind of Partitioner that uses a BigIntegerToken with an MD5 hash to determine where to place the keys on the node ring. This has the advantage of spreading your keys evenly across your cluster, but the disadvantage of causing inefficient range queries. It was the default partitioner in early Cassandra releases; since Cassandra 1.2 the default is the Murmur3Partitioner.
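Hash-based placement on a ring can be sketched in Python as below. This is a simplified model: the `md5_token` and `Ring` names are hypothetical, the token is the raw 128-bit MD5 value, and the real partitioner's token arithmetic differs in detail.

```python
import bisect
import hashlib


def md5_token(key: str) -> int:
    """Map a key to a position on the ring using its MD5 hash."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, node_tokens):
        # node_tokens maps a node name to the token it owns on the ring.
        self.entries = sorted((token, name) for name, token in node_tokens.items())
        self.tokens = [token for token, _ in self.entries]

    def owner(self, key):
        # A key belongs to the first node whose token is >= the key's token,
        # wrapping around to the first node at the end of the ring.
        i = bisect.bisect_left(self.tokens, md5_token(key))
        return self.entries[i % len(self.entries)][1]


ring = Ring({"n1": 0, "n2": 2**127})
node_for_user = ring.owner("user:42")
```

Because the hash scatters keys uniformly, neighbouring keys such as `user:42` and `user:43` usually land on unrelated parts of the ring, which is why range queries are inefficient with this scheme.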
Q. What is Read Repair?
This is another mechanism to ensure consistency throughout the node ring. In a read operation, if Cassandra detects that some nodes have responded with data that is inconsistent with the response of other, newer nodes, it makes a note to perform a read repair on the old nodes. The read repair means that Cassandra will send a write request to the nodes with stale data to get them up to date with the newer data returned from the original read operation. It does this by pulling all the data from the node, performing a merge, and writing the merged data back to the nodes that were out of sync. The detection of inconsistent data is made by comparing timestamps and checksums.
Q. What is Snitch in Cassandra?
A snitch is Cassandra’s way of mapping a node to a physical location in the network.
It helps determine the location of a node relative to another node in order to assist with discovery and ensure efficient request routing.
Q: Tell us how Cassandra writes.
Cassandra's write performance comes from applying each write as two commits. It first writes to the commit log on disk and then commits to an in-memory structure called the memtable. Once both commits are complete, the write is successful. The writes are later written to disk in a table structure called an SSTable. Cassandra is known for fast write performance.
Q: Explain the management tools in Cassandra:
DataStax OpsCenter is a web-based management and monitoring platform for Cassandra clusters and DataStax. It is free to download and includes an additional edition of OpsCenter. SPM mainly administers Cassandra metrics along with various OS and JVM metrics. Besides Cassandra, SPM also monitors Hadoop, Storm and ZooKeeper, to name a few. The main features of SPM include correlation of events and metrics, real-time graphs with zooming, anomaly detection and heartbeat alerting.
Q: Explain the concept of SuperColumn in Cassandra:
A super column is a special element that holds a collection of similar data. It is a key-value pair whose values are columns. It is a well-defined group of columns, which follow a hierarchy when in use.
Q: When is the right time to avoid secondary indexes?
Avoid secondary indexes on columns that contain a high count of unique values, where queries will produce few results.
Q: At what point does Cassandra write changed data to the commitlog?
Cassandra appends changed data to the commitlog, which works as the crash-recovery log for the data. Until the changed data has been appended to the commitlog, the write operation is not considered successful.
Q: What is tunable consistency in Cassandra? Could you please elaborate:
Tunable consistency is a characteristic that makes Cassandra a favored database choice for analysts and developers. Consistency refers to how up-to-date and synchronized the data rows are on all of their replicas. Cassandra’s tunable consistency lets users select a consistency level that suits their use case. It supports two kinds of consistency: strong consistency and eventual consistency.
Q: Explain the Components of Cassandra Write?
The components of a Cassandra write are categorized into:
1. Memtable write
2. Commitlog write
3. SSTable write
Each component has its own role. Cassandra first writes the data to the commit log, then to the in-memory table structure (the memtable), and finally to an SSTable.
Q: Explain CAP Theorem.
When there is a strong need to scale a system and extra resources are required, the CAP theorem can be very helpful. It guides the scaling strategy and is an effective way to reason about scaling in a distributed system. The theorem states that in a distributed system such as Cassandra, users can have only two of the three characteristics (consistency, availability and partition tolerance) at once.
One of them has to be given up. Consistency guarantees that the client gets the most recent write. Availability means the system returns a reasonable response within a short time. Partition tolerance means the system continues to operate even when a network partition occurs. CP and AP are the two options that are available.
Q: What is zero consistency?
Zero consistency is a mode in which the write operation is handled in the background. It is considered the fastest way to write data, at the cost of the least confidence that the write will succeed.
The concepts above are simple and should help you advance in your career. However, make sure you also practice answering the questions clearly and in the best possible manner.