As noted in previous chapters, Apache Spark is a comprehensive framework for large-scale data analytics. With its supporting utilities, developers can develop, build, and deploy applications on Spark faster. In this chapter, we will look at the different forms of Spark applications and how to build and run them using a build tool.

An Apache Spark application can be run within the interactive shell or submitted as a job, either on the local machine or on a cluster in a distributed manner. The following sections discuss these two methods.

Data Analysis with Spark Interactive Shell

We saw an example of how to run the word count application with the Spark interactive shell in Chapter 1. Let’s now see how to develop, build, and run an application in Scala with a build tool.

sbt : Simple Build Tool

‘sbt’ is a build tool written in Scala that can be used to build Spark applications. Moving on from the word count application, let us develop an application that returns the number of lines containing the word ‘and’. With Apache Spark already set up on the machine, we also need ‘sbt’ installed for this. The preferred sbt installer can be downloaded from the official sbt website and must be installed before trying out the commands below.

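To build Apache Spark from source, open a command prompt, move to the Spark installation folder, and run the build script that ships with it (for example, build/sbt package in recent source releases, or sbt/sbt assembly in early ones).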

Now that we have built Spark, create a new folder with a preferred name for our application. We will name it ‘SparkLineCountApplication’.

Move to the project folder and create a build definition file with a name of your choice, making sure it is saved with the .sbt extension.

The content of SparkLineCountApplication/lineCount.sbt is as follows:
name := "SparkLineCountApplication"

The ‘resolvers’ parameter in this file refers to the repositories that sbt may need to consult when resolving the various dependencies.
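A fuller build definition would typically also declare the Scala version, the Spark dependency, and any extra resolvers. The following is a sketch only: the version numbers and the resolver shown are assumptions made for illustration and should be matched to the Spark distribution installed on the machine.

```scala
// lineCount.sbt: a hypothetical, fuller build definition (all values are assumptions)
name := "SparkLineCountApplication"

version := "1.0"

// Should match the Scala version the installed Spark was built with (2.12 assumed here)
scalaVersion := "2.12.18"

// %% appends the Scala binary version to the artifact name (spark-core_2.12)
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0"

// An extra repository to search, shown for illustration; Maven Central is searched by default
resolvers += "Apache Snapshots" at "https://repository.apache.org/snapshots/"
```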

Next, the implementation of the line count application should be stored under the SparkLineCountApplication project folder, in a file named LineCountApplication.scala.
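As a minimal sketch, LineCountApplication.scala might look like the following. The default input path (README.md), the local[*] fallback master, and the helper name containsWord are assumptions made for illustration. The check is a plain substring match, so a line containing only ‘band’ would also be counted.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineCountApplication {

  // Hypothetical helper: true if the line contains the word as a substring
  def containsWord(line: String, word: String): Boolean = line.contains(word)

  def main(args: Array[String]): Unit = {
    // Assumed input: first program argument, else README.md in the working directory
    val inputPath = args.headOption.getOrElse("README.md")

    val conf = new SparkConf()
      .setAppName("SparkLineCountApplication")
      // Fall back to a local master so that `sbt run` works without spark-submit
      .setIfMissing("spark.master", "local[*]")

    val sc = new SparkContext(conf)
    try {
      // Read the file as an RDD of lines and keep only those containing "and"
      val count = sc.textFile(inputPath)
        .filter(line => containsWord(line, "and"))
        .count()
      println(s"Lines containing 'and': $count")
    } finally {
      sc.stop() // release Spark resources even if the job fails
    }
  }
}
```

sbt also compiles Scala sources placed directly in the project's base folder, although src/main/scala is the conventional location.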

Those are the two files needed for the line count application. To build and run the application, move to the SparkLineCountApplication project folder and run the commands below.

sbt package
sbt run

Submitting Data Analysis Tasks in Spark

A Spark application can be submitted to run on a cluster, which requires a cluster manager that supports Spark. Vendors such as Cloudera provide cluster managers. To submit an application, use the spark-submit script in the bin folder of the Spark installation. If the application has dependencies, they should be bundled with it into a single assembly (also termed “uber”) jar before it is sent to the cluster. The packaging can be done with a build tool such as sbt or Maven.
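With sbt, one common way to produce such an assembly jar is the sbt-assembly plugin. The fragment below is a sketch under assumptions: the plugin and Spark version numbers are illustrative, and the file names follow sbt conventions rather than anything prescribed in this chapter.

```scala
// project/plugins.sbt (hypothetical plugin version)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// lineCount.sbt: mark Spark as "provided" so the cluster's own Spark jars
// are not bundled into the assembly jar (version is an assumption)
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0" % "provided"
```

Running sbt assembly in the project folder then writes the bundled jar under the target folder. Note that a "provided" dependency is not on the classpath for sbt run, so this setting suits jars meant for spark-submit.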

Running Tasks with Spark-submit

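Once packaged with its dependencies, an application can be submitted using spark-submit. Additional parameters are passed to the script in --key value format. The general form of a spark-submit invocation is:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  <application-jar> \
  [arguments-for-application]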

Let us see what the frequently used arguments of spark-submit refer to:

Class: The main class of your application (such as example.src.main.ApplicationMain).

Master: Master URL (discussed in Chapter)

Deploy mode: Either ‘cluster’ or ‘client’. This determines where the driver is deployed: on one of the worker nodes (cluster) or locally as an external client (client). The default value is client.

Conf: Any other Spark configuration property, specified in ‘key=value’ format. If a value contains spaces, wrap the whole pair in quotes, as in “key=value”.

Application-jar: Location of the jar in which the application and its dependencies are bundled. This location should be visible globally to the cluster.

Arguments-for-application: Arguments to be passed to the main method of the application's main class.
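To illustrate how these arguments line up on the command line, the sketch below assembles a spark-submit argument vector in plain Scala. The Submission case class, its field names, and the sample values (class name, master URL, jar path) are hypothetical and are not part of Spark itself.

```scala
// Hypothetical description of one submission; field names are ours, not Spark's
final case class Submission(
    mainClass: String,
    master: String,
    deployMode: String = "client", // spark-submit's default deploy mode
    conf: Map[String, String] = Map.empty,
    applicationJar: String,
    appArgs: Seq[String] = Seq.empty
)

object SparkSubmitCommand {

  // Options come first, then the application jar, then the application's own arguments
  def build(s: Submission, script: String = "./bin/spark-submit"): Seq[String] = {
    // Each configuration property becomes its own "--conf key=value" pair
    val confArgs = s.conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
    Seq(script, "--class", s.mainClass, "--master", s.master, "--deploy-mode", s.deployMode) ++
      confArgs ++ Seq(s.applicationJar) ++ s.appArgs
  }

  def main(args: Array[String]): Unit = {
    val cmd = build(Submission(
      mainClass = "LineCountApplication",
      master = "local[4]",
      conf = Map("spark.executor.memory" -> "2g"),
      applicationJar = "path/to/line-count-assembly.jar", // hypothetical path
      appArgs = Seq("README.md")
    ))
    println(cmd.mkString(" "))
    // To launch it for real: import scala.sys.process._ and call cmd.!
  }
}
```

Keeping the command as a sequence of separate arguments, rather than one string, means a configuration value with spaces stays a single argument and needs no extra shell quoting.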

Last updated: 03 Apr 2023

About Author

Ravindra Savaram

Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.