Faster, Easier Application Development with Apache Spark
As noted in previous chapters, Apache Spark is a comprehensive framework for large-scale data analytics. Its supporting utilities help developers develop, build, and deploy Spark applications faster. In this chapter, we will look at the different forms a Spark application can take and how to build and run one using a build tool.
An Apache Spark application can be run within the interactive shell or submitted as a packaged application, either on the local machine or on a cluster in a distributed manner. In the following sections, we will discuss these two methods further.
Data Analysis with Spark Interactive Shell
We saw an example of running the word count application in the Spark interactive shell in Chapter 1. Let’s now see how to develop, build, and run an application in Scala with a build tool.
sbt: Simple Build Tool
‘sbt’ is a build tool written in Scala that can be used to build Spark applications. Moving on from the word count application, let us develop an application that returns the number of lines containing the word ‘and’. With Apache Spark already set up on the machine, we also need ‘sbt’ installed. Download the sbt installer for your platform from the official sbt website and install it before trying the commands below.
To build Apache Spark, open a command prompt, move to the Spark installation folder, and execute the following:
Now that we have built Spark, create a new folder for our application. We will name it ‘SparkLineCountApplication’.
Move to the project folder and create the build definition file shown below. It can have any name, as long as it is saved with the .sbt extension.
Below is the content of SparkLineCountApplication/lineCount.sbt:
name := "SparkLineCountApplication"
The ‘resolvers’ setting in the build definition refers to the repositories that sbt may need to consult when resolving the application’s dependencies.
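For reference, a complete build definition of this kind could look roughly like the sketch below. Only the name setting comes from the text above. The version numbers, the choice of the spark-core dependency, and the extra resolver are assumptions made for illustration, so adjust them to match the Spark and Scala versions installed on your machine.

```scala
// lineCount.sbt: a minimal sketch; every version number here is an assumption.
name := "SparkLineCountApplication"

version := "1.0"

// Assumed Scala version; it must match the Scala version your Spark build targets.
scalaVersion := "2.12.18"

// Assumed Spark version; %% appends the Scala binary version to the artifact name.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.1"

// An additional repository to search for dependencies:
// here, the local Maven repository (typically ~/.m2).
resolvers += Resolver.mavenLocal
```

sbt consults its default repositories first, so a resolvers entry is only needed when a dependency lives somewhere else.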
Next, the implementation of the line count application should be stored in a file named LineCountApplication.scala inside the SparkLineCountApplication project (sbt conventionally looks for Scala sources under src/main/scala), containing the code below.
Those are the two files needed for the line count application. To build and run the application, move to the SparkLineCountApplication project folder and run the commands below.
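A minimal sketch of such an application follows. It is not the original listing: the helper containsWord, the default input file README.md, and the fallback to a local master are all assumptions chosen to keep the example short and runnable.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineCountApplication {

  // Hypothetical helper: true when `word` occurs in `line` as a whole word,
  // ignoring case. Splitting on non-word characters keeps "sand" from matching "and".
  def containsWord(line: String, word: String): Boolean =
    line.toLowerCase.split("\\W+").contains(word.toLowerCase)

  def main(args: Array[String]): Unit = {
    // Assumption: the input path is the first argument, defaulting to README.md.
    val inputPath = args.headOption.getOrElse("README.md")

    val conf = new SparkConf().setAppName("SparkLineCountApplication")
    // Assumption: fall back to a local master when none was supplied,
    // so the application can also be started without spark-submit.
    conf.setIfMissing("spark.master", "local[*]")

    val sc = new SparkContext(conf)
    try {
      val lines = sc.textFile(inputPath)
      val matching = lines.filter(line => containsWord(line, "and")).count()
      println(s"Lines containing the word 'and': $matching")
    } finally {
      sc.stop() // release the resources held by the context
    }
  }
}
```

With a matching sbt build definition in the project folder, sbt package should compile the code and produce a jar, and sbt run should start the application locally. A plain line.contains("and") would be simpler, but it would also count lines that only contain words such as "sand".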
Submitting Data Analysis Tasks in Spark
An application can also be submitted to Spark to run on a cluster, which requires a cluster manager that supports Spark. Vendors such as Cloudera provide cluster managers. To submit an application, use the spark-submit script in the bin folder of the Spark installation. If the application has dependencies, they should be bundled with it into a single assembly (also called “uber”) jar before it is sent to the cluster. The packaging can be done with a build tool such as sbt or Maven.
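Running Tasks with spark-submit
Once packaged with its dependencies, an application can be submitted using spark-submit. Additional parameters are passed to the script in --key value format. A spark-submit call has the following general form: spark-submit --class <main-class> --master <master-url> --deploy-mode <deploy-mode> --conf <key>=<value> <application-jar> [application-arguments]
Let us see what the frequently used arguments of spark-submit refer to: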
+ --class: The main class of your application (such as example.src.main.ApplicationMain).
+ --master: The master URL of the cluster (discussed in Chapter)
+ --deploy-mode: Either ‘cluster’ or ‘client’. It determines where the driver is deployed: on one of the worker nodes (cluster) or locally, as an external client (client). The default is client.
+ --conf: Any other Spark configuration property, specified in ‘key=value’ format. If the value contains spaces, wrap the whole pair in double quotes, as in "key=value".
+ application-jar: Location of the jar in which the application and its dependencies have been bundled. This location must be globally visible inside the cluster.
+ application-arguments: Arguments to be passed to the main method of the main class of the application.
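To make the order of these arguments concrete, the sketch below assembles a spark-submit command line from them in plain Scala. The SubmitOptions case class and the SparkSubmitCommand object are hypothetical helpers written for this illustration and are not part of Spark. The jar path, master URL, and configuration value in the usage example are assumptions as well. Only the flag names (--class, --master, --deploy-mode, --conf) are actual spark-submit options.

```scala
// Hypothetical container for the arguments listed above (not a Spark class).
final case class SubmitOptions(
    mainClass: String,
    master: String,
    applicationJar: String,
    deployMode: String = "client",               // spark-submit's default
    conf: Map[String, String] = Map.empty,
    applicationArgs: Seq[String] = Seq.empty
)

object SparkSubmitCommand {

  // Options come first, then the application jar, then the application's own arguments.
  def build(o: SubmitOptions): Seq[String] = {
    // One "--conf key=value" pair per property, sorted by key for a stable order.
    val confArgs = o.conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
    Seq("spark-submit",
        "--class", o.mainClass,
        "--master", o.master,
        "--deploy-mode", o.deployMode) ++
      confArgs ++
      Seq(o.applicationJar) ++
      o.applicationArgs
  }

  def main(args: Array[String]): Unit = {
    val command = build(SubmitOptions(
      mainClass = "LineCountApplication",
      master = "local[4]",                         // assumed master URL
      applicationJar = "path/to/line-count-app.jar", // hypothetical jar location
      conf = Map("spark.executor.memory" -> "2g"),
      applicationArgs = Seq("README.md")
    ))
    // Printed for inspection only; values containing spaces would need shell quoting.
    println(command.mkString(" "))
  }
}
```

Keeping the command as a sequence of separate arguments, rather than one string, means a value with spaces stays a single argument if the sequence is later handed to a process API instead of a shell.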