Apache Spark Architecture and Its Core Components
Building Blocks of Apache Spark
In this section, we will discuss on Apache Spark architecture and its core components. Apache spark is built upon 3 main components as Data Storage, API and Resource Management. In this section, we will discuss about these 3 building blocks of the framework.
As per Data Storage, Spark is built upon an HDFS file system and capable of handling data from HBase or Cassandra systems as well. Spark API consists of interfaces to develop applications based on it in Java, Python and Scala languages. Using Spark, resource management can be done either in a single server instance or using a framework such as Mesos or YARN in a distributed manner.
Resilient Distributed Data Set
Apache Spark is built upon RDD (Resilient Distributed Data Set) concept. Similar to a table in a database, an RDD can hold data in various formats and is an immutable distributed collection of records. Spark can alternatively execute multiple programs between RDDs and RDD can recover faults efficiently through recomputing lost partitions upon a failure.
If you call any transformation upon an RDD, new RDD will be returned, while the original remains unmodified, since RDDs are immutable objects.
An RDD has 2 sets of parallel operations as Transformation and Action. A Transformation operation will always return an RDD, not a value and no evaluation happens in that case. Example Transformation operations are map, filter, flatMap, groupByKey, reduceByKey, sample, union, etc.
An Action operation, evaluates called functions on RDDs they are called upon,executing the queries and returns a value as the result. Example Action operations are count, reduce, collect, take, etc.
Getting Hands Dirty with Spark
Let us now install Apache spark and run a simple word count application. You can either use Spark setups available from vendors like Cloudera, MapR or HortonWorks or use it in the cloud. Spark needs Java installed on the system for it to run on your local machine. Hence we will first set up Java Development Kit. The steps described below are for a machine running Windows operating system. Note that,steps to set up Apache Spark on Linux or Mac OS will be similar but, the manner of setting up the environment variables may differ.
Installing JDK is quite straight forward. Download the JDK (Version 1.7 recommended) from the official vendor (Oracle) website and run the installer. Once installation is completed, verify successful completion by running below in the command line. Upon successful installation, it will show the Java version.
Now let us install Spark.
To install Spark on the system, navigate to the official Apache Spark website, download the latest version,unzip the file if necessary and move it to a convenient location. Move to that folder and launch the Spark Shell. An example is shown below, assuming it has been extracted to the location: C:\spark_setup.
cd C:\spark_setup\ spark-1.4.0-bin-hadoop2.1
If you see the below prompt, Whola!! You have installed Spark successfully.
15/07/21 13:15:32 INFO Http-Server:- Starting HTTP-Server
15/07/21 13:15:32 INFO Utils:- service is successfully in progress ‘HTTP-class_
server’ on the port of 58132.
Making Use of Scala (version is 2.10.4) (Java HotSpot)
Enter the expressions to evaluate them.
Type:- Taking help to find further info.
15/07/21 13:15:41INFO BlockManagerMaster: Registered BlockManager
15/07/21 13:15:41 INFO Spark-I Loop: Spark-context is created..
Spark-context can be made available – as sc.
To verify whether Spark shell executes properly, try below
To quit the shell, use below command.
Windows does not come with Python interpreter and hence, to run the Spark Python shell, we need to setup Pyhon in our environment. Python official website provides an installer for Windws or we can use a package like Anaconda, which comes with an added collection of computational tools written in Python.
Once python has been installed, you can launch the Spark Python Shell by executing pyspark in Spark installation directory. An example is given below:
cd C:\spark_setup\ spark-1.4.0-bin-hadoop2.1
That is all we need to run Apache Spark interactive shell in Scala or Python. It also comes with a web console. Let us see how to use Apache Spark web console.
Using Apache Spark web Console
While working with Spark, to view analysis results and other information, navigate to below URL.
MasterURL for different modes:
Connection to Spark engine can be done in different modes. When running Spark locally or on cloud, this is done configuring the ‘MasterUrl’ parameter as per below.
Setting MasterURL parameter to:
- Local : Runs Spark as one process with no threads in the local machine.
- Local[n] : n refers to the number of threads to be run on parallel. It is recommended to set n to the number of cores in the local machine.
- Local[*]: Creates threads equal to the number of logical cores in the local machine. Spark://host_name:port_number: Runs on specified Spark cluster(host_name) running on the given port_number. By default in most of the systems, the port number is 7077.
- Mesos:// host_name:port_number: Runs on specified Mesos cluster(host_name) running on the given port_number. By default in most of the systems, the port number is 5050. Note that if the cluster is using Zookeeper, this may need to be changed as below.
- Yarn-client: Connects to Yarn cluster as a client. The cluster location is to be configured using HADOOF_CONF_DIR variable.
- Yarn-cluster: Connects to Yarn cluster as a cluster. The cluster location is to be configured using HADOOF_CONF_DIR variable.
Two types of shared variables can be used in Apache Spark to speed up the applications running on a cluster.
- Broadcast Variables: These can be used to cache read-only values locally on each machine so that for each task, sending a copy of the value is not needed. This enables to exchange large datasets between cluster nodes efficiently. Below, it is an example way of using broadcast variables in Scala prompt.
// Initializing broadcast variables
valbroadCastElement = sc.broadcast(Array(‘Nirman’, ‘Shan’, ‘Srini’))
// using broadcast variables
- Accumulators: Accumulator variables are always ‘added’ through an associative operation. Hence, they are optimal for implementations to be run in parallel such as counters or sums. Once created using a given initial value, tasks on clusters can be added to the variable using ‘add’ command. But, its value can only be read by the driver program and will not be available for tasks running on clusters.
Below, is an example way of using an accumulator in Scalaprompt.
//Usage of accumulator variables
ValaccumuatorVar = sc.accumulator(0, “Examle Accumulator Variable”)
sc.parallelize(Array(‘Nirman’, ‘Shan’, ‘Srini’)).foreach(i =>accumuatorVar += i)
With the tools in hand, let us now collect the pieces together and build our simple word count application.
Word Count Application with Apache Spark
Using Spark API, data can be easily read from text files and processed. With the below example in Scalashell and we’ll see how they can be used.
To run the conventional word count application,in a Scala shell, run below commands.
valtextFileRead = “sample_data.md”
valtextFileData = sc.textFile(textFileRead)
Calling cache(),stores the RDD in cache and it can be easily read in further queries.cache() will be lazy-evaluated, meaning that it will be storing data not immediately, but whenever an action is called upon the RDD.
To read number of lines in the text file, run below command.
Now, can print out the word count next to each word in the file, as below.
valwordCountData= textFileData.flatMap(list =>list.split(” “)).map(word => (word, 1)).reduceByKey(_ + _)