Launching Apache Spark Clusters on Amazon EC2
Spark on EC2
In Spark's ec2 directory there is the spark-ec2 script, which is used to launch, shut down, and manage Spark clusters on Amazon EC2. It automatically sets up Spark, Shark, and HDFS on the cluster for you. If you have not yet created an EC2 account, begin by creating one on Amazon.
First of all, you will need an Amazon EC2 key pair, so create one. To do this, log in to your Amazon Web Services account through the AWS console, click "Key Pairs" on the left sidebar, then create and download a key. You also have to ensure that the permissions on the private key file are set to 600, meaning that only you can read and write it; otherwise ssh will refuse to use it. To use the script, you then set a few parameters as needed.
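As a concrete sketch of the permissions step (the file name my-key-pair.pem is an assumption; use whatever name you gave your downloaded key):

```shell
# my-key-pair.pem is a placeholder name for the private key you downloaded.
KEY_FILE=my-key-pair.pem
touch "$KEY_FILE"        # stands in for the downloaded key in this sketch
chmod 600 "$KEY_FILE"    # owner read/write only, as required for ssh
stat -c %a "$KEY_FILE"   # prints 600 on Linux
```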
How to Launch a Cluster
1. Begin by navigating to the ec2 directory of the Spark distribution you downloaded.
2. You can then run the following command from this directory:
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
In the command above, <keypair> is the name of your EC2 key pair, which should be the name you provided when you created it. <key-file> is the private key file identifying your key pair. <num-slaves> is the number of slave nodes to launch. <cluster-name> is the name you want to give to your cluster.
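For instance, with some invented example values (key pair my-key-pair, two slaves, cluster name my-spark-cluster), the command expands as shown below; it is echoed rather than executed here so you can inspect the result before running it for real from Spark's ec2 directory:

```shell
# All values here are illustrative placeholders; substitute your own.
KEYPAIR=my-key-pair
KEY_FILE=my-key-pair.pem
NUM_SLAVES=2
CLUSTER=my-spark-cluster

# Echo the fully expanded launch command (remove 'echo' to actually run it).
echo ./spark-ec2 -k "$KEYPAIR" -i "$KEY_FILE" -s "$NUM_SLAVES" launch "$CLUSTER"
```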
3. Once everything has launched, check that the cluster scheduler is up and that it sees all the slaves by opening the web UI, whose address is printed at the end of the script's output. To see more usage options, run the following command:
./spark-ec2 --help
Some of the options that will be printed include the following:
- --instance-type=<INSTANCE_TYPE>: used to specify the type of EC2 instance to use. At the moment, only 64-bit instance types are supported by the script, and a default is provided.
- --region=<EC2_REGION>: used to specify the EC2 region in which to launch the instances.
- --zone=<EC2_ZONE>: used to specify the EC2 availability zone in which to launch the instances. Note that a single zone may not have enough capacity, so you might occasionally get an error; if this happens, try launching in another zone.
- --ebs-vol-size=<SIZE>: used to attach an EBS volume with the given amount of space to each of the nodes, so that you can have a persistent HDFS cluster.
- --spot-price=<PRICE>: used to launch the worker nodes as spot instances, with the given maximum price.
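Putting several of these options together (all values invented for illustration), a launch command might look like the following; it is echoed as a dry run so the expanded form can be inspected:

```shell
# Illustrative values only; remove 'echo' to actually launch.
echo ./spark-ec2 -k my-key-pair -i my-key-pair.pem -s 2 \
  --instance-type=m1.large --region=us-west-1 --zone=us-west-1a \
  --ebs-vol-size=50 --spot-price=0.05 launch my-spark-cluster
```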
How to Run Applications
To run applications on the cluster, follow the steps below:
1. Begin by navigating to the ec2 directory of the Spark distribution you downloaded.
2. Next, ssh into the cluster by running the following command:
./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>
The parameters <keypair> and <key-file> are the same as described earlier.
3. To deploy code or data to the cluster, log in and use the script "~/spark-ec2/copy-dir", which copies a given directory to the same path on all the slaves.
4. If your application needs fast access to large datasets, the quickest way is to load them from Amazon S3 or from an Amazon EBS device into an instance of the Hadoop Distributed File System (HDFS) on your nodes. The spark-ec2 script sets up a HDFS instance for you. It is installed in "/root/ephemeral-hdfs", and it can be accessed using the "bin/hadoop" script in that directory. Note that the data in this HDFS goes away once you stop and restart a machine.
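As a sketch (the file name my-data.txt and the HDFS path /data are assumptions), copying a local file into the ephemeral HDFS from the master node would look like this; the commands are echoed as a dry run here since they only work on a launched cluster:

```shell
# Run these on the master node of a launched cluster; echoed as a dry run.
echo /root/ephemeral-hdfs/bin/hadoop fs -mkdir /data
echo /root/ephemeral-hdfs/bin/hadoop fs -put my-data.txt /data/
echo /root/ephemeral-hdfs/bin/hadoop fs -ls /data
```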
5. The directory "/root/persistent-hdfs" contains a persistent HDFS instance, which keeps its data across cluster restarts. By default each node in the cluster has relatively little space for persistent data, but this can be changed (with the --ebs-vol-size option at launch time) so that each node can store more.
6. If you get errors while running your application, look at the slaves' logs inside the scheduler's work directory. You can also use the web UI to view the status of your cluster.
For configuration, the file "/root/spark/conf/spark-env.sh" on each machine of the cluster can be edited to set Spark configuration options, such as JVM options. For changes to take effect, the file has to be copied to every machine, which is tiresome to do by hand. The "copy-dir" script makes this easier: edit "spark-env.sh" on the master, then run "~/spark-ec2/copy-dir /root/spark/conf" to rsync it to all of the workers.
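A minimal sketch of that workflow, using a temporary directory as a stand-in for /root/spark/conf so it can be tried anywhere (the SPARK_JAVA_OPTS value is just an example option):

```shell
# CONF_DIR stands in for /root/spark/conf so this sketch runs anywhere.
CONF_DIR=$(mktemp -d)

# 1. Edit spark-env.sh on the master; here we append an example JVM option.
echo 'export SPARK_JAVA_OPTS="-Xmx2g"' >> "$CONF_DIR/spark-env.sh"

# 2. On the real cluster, rsync the conf directory to every worker
#    (dry run here; the real path is /root/spark/conf):
echo '~/spark-ec2/copy-dir /root/spark/conf'
```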
You also need to know that data on EC2 nodes cannot be recovered once the nodes have been destroyed, so copy off anything you want to keep before shutting them down. To destroy a cluster, navigate to the ec2 directory of the Spark distribution you downloaded and execute the following command:
./spark-ec2 destroy <cluster-name>
Pausing and Restarting of Clusters
With spark-ec2, a cluster can easily be paused. When this happens, the VMs are stopped but not terminated: all data on the ephemeral disks is lost, but data in the persistent HDFS and on the root partitions is kept. Stopped machines will not cost you any EC2 instance hours, but you will still pay for EBS storage.
1. If you need to stop a single cluster, navigate to the EC2 directory, and then execute the following command:
./spark-ec2 stop <cluster-name>
2. If you need to restart the cluster, then run the command given below:
./spark-ec2 -i <key-file> start <cluster-name>
3. You might also need to totally destroy the cluster so that it consumes no more EBS space. To do this, just run the following command:
./spark-ec2 destroy <cluster-name>
However, there are some limitations to Spark on EC2. Support for "cluster compute" instances is limited: there is no way to specify a locality group. As a workaround, you can launch slave nodes manually in your <clusterName>-slaves security group and then use spark-ec2 launch --resume to start a cluster with them.
How to Access Data in S3
Spark's file interface makes it possible to process data in Amazon S3 using the same URI formats that are supported for Hadoop. A path in S3 can be specified as input through a URI of the form "s3n://<bucket>/path." You will also have to provide your Amazon security credentials. This can be done by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before your program runs, or through SparkContext.hadoopConfiguration. Full instructions on accessing S3 using the Hadoop input libraries can be found on the Hadoop S3 page.
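For example, the credentials can be exported as environment variables before launching your program. The values below are AWS's published example credentials, not real ones; substitute your own, and the bucket name my-bucket is likewise an assumption:

```shell
# AWS's documented example credentials; replace with your own keys.
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# A Spark job started from this shell can now read input such as
# s3n://my-bucket/path (bucket name assumed for illustration).
```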