If you're looking for Apache Spark interview questions and answers for experienced engineers or freshers, you are at the right place. Many reputed companies around the world offer opportunities in this area. According to research, Apache Spark has a market share of about 0.1%, so there is still room to move ahead in your career in Apache Spark engineering. Mindmajix offers Advanced Apache Spark Interview Questions 2018 to help you crack your interview and build a career as an Apache Spark engineer.
Simplicity, flexibility and performance are the major advantages of using Spark over Hadoop. Spark can be up to 100 times faster than Hadoop MapReduce for big data processing because it keeps the data in memory, in Resilient Distributed Datasets (RDDs).
Most data users know only SQL and are not comfortable with programming. Shark is a tool developed for people from a database background, giving them access to Spark MLlib capabilities through a Hive-like SQL interface. Shark helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.
A sparse vector has two parallel arrays: one for indices and the other for values. Only the non-zero entries are stored, which saves space.
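The two-parallel-array layout can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Spark's own vector class; the `SparseVector` name and its `to_dense` and `dot` methods here are hypothetical helpers written for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SparseVector:
    """Toy sparse vector (illustrative only): parallel index and value arrays."""
    size: int
    indices: List[int]   # positions of the non-zero entries, in ascending order
    values: List[float]  # values[k] is the entry stored at position indices[k]

    def to_dense(self) -> List[float]:
        # Expand back to a full-length list, filling the gaps with zeros.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense

    def dot(self, other: List[float]) -> float:
        # Only the stored (non-zero) entries contribute to the dot product.
        return sum(v * other[i] for i, v in zip(self.indices, self.values))


# The dense vector (1.0, 0.0, 0.0, 3.0, 0.0) needs only two stored entries.
sv = SparseVector(size=5, indices=[0, 3], values=[1.0, 3.0])
dense = sv.to_dense()
```

The saving grows with the share of zeros: a vector with a million positions and ten non-zero entries stores ten indices and ten values instead of a million numbers.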
RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark and represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are –
Transformations are functions that produce a new RDD from an existing one. They are evaluated lazily: nothing is executed until an action is called on the resulting RDD. Some examples of transformations include map, filter and reduceByKey.
Actions trigger the execution of the RDD computations or transformations and return a result. After an action is performed, the data from the RDD moves back to the driver program on the local machine. Some examples of actions include reduce, collect, first, and take.
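The split between lazy transformations and eager actions can be sketched without Spark. The `LazyDataset` class below is a hypothetical single-machine stand-in for an RDD, written only for this illustration: `map` and `filter` record a step, and `collect` or `take` actually run the recorded steps.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for an RDD (not the Spark API): records steps, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of ("map" | "filter", function) pairs

    # Transformations: return a new dataset, execute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: execute the recorded steps and return plain Python values.
    def collect(self):
        return list(self._run())

    def take(self, n):
        return list(islice(self._run(), n))


calls = []  # records every element the mapped function actually sees


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyDataset([1, 2, 3, 4, 5])
pipeline = numbers.map(double).filter(lambda x: x % 4 == 0)
calls_before_action = len(calls)  # still 0: only the plan has been built
result = pipeline.collect()       # the action runs the whole pipeline
```

Here `double` is not called while the pipeline is being built; it runs once per element only when `collect` is invoked.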
Scala, Java, Python, R and Clojure
Yes, it is possible if you use Spark Cassandra Connector.
Yes, Apache Spark can be run on the hardware clusters managed by Mesos.
The three cluster managers supported in Apache Spark are:
To connect Spark with Mesos:
Minimizing data transfers and avoiding shuffling helps in writing Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
Broadcast variables are read-only variables cached in memory on every machine. When working with Spark, using broadcast variables eliminates the need to ship a copy of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table in memory, which makes retrieval more efficient than an RDD lookup().
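The "ship once per machine, not once per task" idea can be modelled in plain Python. The `Worker` class, the `fetch_table` function and the `"countries"` id below are hypothetical names invented for this sketch; real Spark handles the distribution and caching internally.

```python
class Worker:
    """Hypothetical worker that caches each broadcast value the first time it is needed."""

    def __init__(self):
        self.cache = {}
        self.transfers = 0  # how many times a value had to be shipped to this worker

    def get_broadcast(self, broadcast_id, fetch):
        if broadcast_id not in self.cache:
            self.cache[broadcast_id] = fetch()  # simulated network transfer
            self.transfers += 1
        return self.cache[broadcast_id]

    def run_task(self, codes, broadcast_id, fetch):
        # Every task reads the same cached, read-only lookup table.
        table = self.get_broadcast(broadcast_id, fetch)
        return [table.get(code, "unknown") for code in codes]


def fetch_table():
    # Stand-in for pulling the lookup table from the driver.
    return {"US": "United States", "DE": "Germany", "IN": "India"}


worker = Worker()
outputs = [
    worker.run_task(["US", "DE"], "countries", fetch_table),
    worker.run_task(["IN"], "countries", fetch_table),
    worker.run_task(["FR", "US"], "countries", fetch_table),
]
```

Three tasks ran on the worker, but the table was transferred only once; without the cache it would travel with each task.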
Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
RDDs in Spark depend on one or more other RDDs. The representation of dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of a persisted RDD is lost, the lost data can be recovered using the lineage graph information.
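Recovery through lineage can be sketched with a toy class in which each dataset remembers its parent and the function that derives it. `LineageRDD` and its methods are hypothetical names for this illustration and model a single linear chain, whereas real lineage graphs can branch and are tracked per partition.

```python
class LineageRDD:
    """Toy dataset (not the Spark API) that remembers how it was derived."""

    def __init__(self, name, parent=None, fn=None, source=None):
        self.name = name
        self.parent = parent   # the dataset this one depends on, if any
        self.fn = fn           # how to derive this dataset from the parent's data
        self.source = source   # raw input, only set on the root dataset
        self._cached = None

    def lineage(self):
        # Walk the parent links back to the root and report the chain in order.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def compute(self):
        if self._cached is not None:
            return self._cached
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.compute())  # rebuild from the lineage

    def persist(self):
        self._cached = self.compute()
        return self

    def lose_cache(self):
        # Simulates losing the persisted data, e.g. when a node fails.
        self._cached = None


base = LineageRDD("base", source=[1, 2, 3, 4])
squared = LineageRDD("squared", parent=base, fn=lambda xs: [x * x for x in xs])
big = LineageRDD("big", parent=squared, fn=lambda xs: [x for x in xs if x > 4]).persist()

before = big.compute()   # served from the persisted copy
big.lose_cache()
after = big.compute()    # recomputed by replaying base -> squared -> big
```

Because the chain of derivations is kept, the lost result is rebuilt rather than restored from a replica.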
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long-running jobs into different batches and writing the intermediary results to the disk.
Running Spark on Mesos provides scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.
In networking, a sliding window controls the transmission of data packets between computers. The Spark Streaming library provides windowed computations, where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
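A windowed computation can be sketched over a plain list of batches, where each inner list stands in for the RDD of one batch interval. The `windowed` function and its parameter names are assumptions made for this example, and the sketch only emits full windows, which simplifies what a real streaming engine does at startup.

```python
def windowed(batches, window_length, slide_interval):
    """Combine batches over a sliding window (illustrative sketch, not the Spark API).

    window_length and slide_interval are measured in number of batches.
    """
    windows = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        # Merge the batches that fall inside the current window.
        windows.append([x for batch in window for x in batch])
    return windows


# Five batch intervals of incoming records.
batches = [[1], [2, 3], [4], [5, 6], [7]]

# Window covers 3 batches and slides forward by 2 batches each time.
windows = windowed(batches, window_length=3, slide_interval=2)
sums = [sum(w) for w in windows]  # one result per window position
```

With a window of three batches sliding by two, consecutive windows overlap by one batch, so the record `4` is counted in both.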
A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two operations –
Spark need not be installed on every node when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
The Catalyst framework is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries and makes it possible to add new optimizations to build a faster processing system.
Pinterest, Conviva, Shopify, OpenTable.