Apache Scala Interview Questions
Apache Scala Interview Question And Answers:
Q.What are the advantages of using Apache Spark over Hadoop MapReduce for big data processing?
Simplicity, Flexibility and Performance are the major advantages of using Spark over Hadoop.Spark is 100 times faster than Hadoop for big data processing as it stores the data in-memory, by placing it in Resilient Distributed Databases (RDD).
- Spark is 100 times faster than Hadoop for big data processing as it stores the data in-memory, by placing it in Resilient Distributed Databases (RDD).
- Spark is easier to program as it comes with an interactive mode.
- It provides complete recovery using lineage graph whenever something goes wrong.
Q.What is Shark?
Most of the data users know only SQL and are not good at programming. Shark is a tool, developed for people who are from a database background – to access Scala MLib capabilities through Hive like SQL interface. Shark tool helps data users run Hive on Spark – offering compatibility with Hive metastore, queries and data.
Q.List some use cases where Spark outperforms Hadoop in processing.
- Sensor Data Processing –Apache Spark’s ‘In-memory computing’ works best here, as data is retrieved and combined from different sources.
- Spark is preferred over Hadoop for real time querying of data
- Stream Processing – For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best solution.
Q. What is a Sparse Vector?
A sparse vector has two parallel arrays –one for indices and the other for values. These vectors are used for storing non-zero entries to save space.
Q.What is RDD?
RDDs (Resilient Distributed Datasets) are basic abstraction in Apache Spark that represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters, in a fault tolerant manner. RDDs are read-only portioned, collection of records, that are –
- Immutable – RDDs cannot be altered.
- Resilient – If a node holding the partition fails the other node takes the data.
Q.Explain about transformations and actions in the context of RDDs.
Transformations are functions executed on demand, to produce a new RDD. All transformations are followed by actions. Some examples of transformations include map, filter and reduceByKey.
Actions are the results of RDD computations or transformations. After an action is performed, the data from RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.
Q.What are the languages supported by Apache Spark for developing big data applications?
Scala, Java, Python, R and Clojure
Q.Can you use Spark to access and analyse data stored in Cassandra databases?
Yes, it is possible if you use Spark Cassandra Connector.
Q.Is it possible to run Apache Spark on Apache Mesos?
Yes, Apache Spark can be run on the hardware clusters managed by Mesos.
Q. Explain about the different cluster managers in Apache Spark.
The 3 different clusters managers supported in Apache Spark are:
- Apache Mesos -Has rich resource scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
- Standalone deployments – Well suited for new deployments which only run and are easy to set up.
Q. How can Spark be connected to Apache Mesos?
To connect Spark with Mesos-
- Configure the spark driver program to connect to Mesos. Spark binary package should be in a location accessible by Mesos. (or)
- Install Apache Spark in the same location as that of Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.
Q.How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
- Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large RDDs.
- Using Accumulators – Accumulators help update the values of variables in parallel while executing.
- The most common way is to avoid operations ByKey, repartition or any other operations which trigger shuffles.
Q.Why is there a need for broadcast variables when working with Apache Spark?
These are read only variables, present in-memory cache on every machine. When working with Spark, usage of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory which enhances the retrieval efficiency when compared to an RDD lookup ().
Q. Is it possible to run Spark and Mesos along with Hadoop?
Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
Q. What is lineage graph?
The RDDs in Spark, depend on one or more other RDDs. The representation of dependencies in between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of persistent RDD is lost, the data that is lost can be recovered using the lineage graph information.
Q. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.
Q.Explain about the major libraries that constitute the Spark Ecosystem
- Spark MLib– Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
- Spark Streaming – This library is used to process real time streaming data.
- Spark GraphX – Spark API for graph parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.
- Spark SQL – Helps execute SQL like queries on Spark data using standard visualization or BI tools.
Q.What are the benefits of using Spark with Apache Mesos?
It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
Q.What is the significance of Sliding Window operation?
Sliding Window controls transmission of data packets between various computer networks. Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
Q.What is a DStream?
Discretized Stream is a sequence of Resilient Distributed Databases that represent a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two operations –
- Transformations that produce a new DStream.
- Output operations that write data to an external system.
Q.When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?
Spark need not be installed when running a job under YARN or Mesos because Spark can execute on top of YARN or Mesos clusters without affecting any change to the cluster.
Q.What is Catalyst framework?
Catalyst framework is a new optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.
Q.Name a few companies that use Apache Spark in production.
Pinterest, Conviva, Shopify, Open Table.
Q.Which spark library allows reliable file sharing at memory speed across different cluster frameworks?