Learn How to Configure Spark Properly and Utilize its API
The Magic with Apache Spark
The very first purpose for which Spark was used by Capital One was to generate product recommendations for 10 million products and 25 million users. This is one of the biggest sets of data for this modeling type. Furthermore, there was also the aim to put into consideration all the products possible including the products that were not always purchased by customers. This makes the issue more difficult as you will have 250e12 different combinations in order to achieve all the possible pairings of products and users. This information cannot be stored on a memory due to its size. It is thus important to learn how to configure Spark properly and utilize its API in a way that things won’t break. Spark allows you to work without really bothering about what happens at the background. You would however, need the knowledge of the background working to write functional and effective codes in Spark.
We are going to first review the major players in the infrastructure of Spark. While we won’t discuss every component, we will cover some of the basics. They include caching, shuffle, serialization and partitioning operation.
Caching is a component in Spark that can help to enhance performance. This is something very special to be utilized so if you have a frequently used data structure e.g. a lookup table. It could however be challenging if you have many caches as they always consume a large memory chunk.
You would need to move data around in some cases, even it is an expensive process. For instance, you might need to perform a certain operation that would require a single node consolidated data to make it possible to co-locate it in the memory. An example of this operation is a reduction operation on all values which are linked to a certain key using key-value RDD (reduce By Key()). This expensive data reorganization is called shuffle.
In computing distribution, there is the need to evade the expensive back and forth data writing. The easiest way to achieve this is to be able to use different code on the same data which has led to using JVM on a lot of frameworks. The process of code translation to a format compressed ideally for transfer efficiency through the network is called Serialization.
Resilient Distributed Dataset also known as RDD is the basic abstraction of Spark. Complex operations such as groupBy or join are simplified through the RDD in Spark. This also covers the fragmented data that you are working with behind the scene.
We will now discuss the combination of these components in writing a program in Spark:
1: Memory Spark Gorges
One of the ability of Spark is memory caching. The disadvantage of memory caching is that it consumes a lot of memory. In the first instance, if you use YARN and JVM, they take up a good chunk of memory. This reduces significantly, the amount of memory you will have for other operation such as caching and movement of data. Furthermore, metadata could be a result of shuffle operations accrue on the driver which could lead to problems on jobs that run for a very long time. Lastly, overhead might also be introduced by Scala or Java classes in your RDDs. Up to 60 bytes can be consumed by a Java string that is 10-character long.
It is therefore advisable to first of all tackle this issue. It is therefore important to apply wisdom in partitioning especially if you are a starter. This will help in memory management pressure and to make sure resource utilization is complete. Furthermore, you should always know the details of your data including types, size and distribution. This is to avoid the data being distributed in a skewed way. A custom partitioner can be used to solve this problem. Finally, as earlier mentioned, it is more efficient and faster to use Kryo serialization.
As regards solving the problem of metadata accumulation, there are two alternatives. The first alternative is to use automatic cleanup parameters which can be triggered with spark.cleaner.ttl. It can however cause problems as it would also clean any RDDs that is still persisting. The second alternative which is recommended is to use batches to divide jobs that run for a long time. This allows you to run every batch of job on a new environment which would not have any accumulated metadata.
2: Evade Movement of Data
Minimizing data transfer and evading shuffles would lead to faster running and more reliable programs. You should however note that once in a while, shuffling of data can help.
So how do you evade data shipping? The straightforward way is to evade data shuffling operations such as coalesce and repartition. You should also avoid join operations like join and cogropu as well as ByKey operations such as reduceByKey and groupByKey. The only ByKey operation that would not trigger shuffling is for counting. Two mechanisms provided by Spark to help include Broadcast variables and Accumulators.
3: Evading Movement of Data is Difficult.
Using the mechanisms listed above can be challenging. This is because you need to first collect() an RDD from the driver node before you can broadcast it. You would also need the data to be serialized and aggregated at the driver before the results of the execution distribution can be accumulated. This results in memory pressure increase at the driver. This could result in running out of memory quickly on the driver. It is however possible to increase the allocated memory by using the spark.driver.memory, but its effect is limited.
You would also require a good understanding of your data and how they come together to be able to manage partition increases and how Akka is used by Spark for messaging. The default buffer for Akka is 10MB. This can however be evaded by using the akka.frameSize operation.
Proper understanding of the three lessons initially discussed would lead to program stability. This lesson would however help to improve the performance of your program. Firstly, you must use your caching liberally. This is especially due to the fact that your cache memory is limited. Instead of using a data twice, place it in a cache memory. Secondly, broadcast variables are very helpful. You can use utilize them for lookup tables and large maps.
Lastly, you might have to execute Spark programs in a parallel way. Using one key at a time could lead to terrible utilization of resources. Using this method would also deny you the advantages of the in-built parallelism of Spark.
Spark can thus only be useful to you as much as you are able to aid it to be functional.