Let’s Build Lambda Architecture With The Help of Spark Streaming
After designing a data concept’s proof ingest pipeline and implementing it, you would most likely make some observations. The observations would include liking the design and how validation of data is done as soon as they arrive. You might also like the fact that raw data can be stored for explorative analysis and the ability to give aggregates that are pre-computed to business analysts which will result to response times that would be faster. You would also admire the fact that data can be handled automatically even when they arrive late and that the data algorithms and structures would be automatically changed. You might however wish that a component that is real-time is available to eliminate the delay of about one hour that occurs after data is received before it is accessible on the dashboards. The major reason for the one hour delay is so that unclean data can be filtered out which would lead to the efficiency of the system being improved. There are however, some cases whereby you would prefer to have new data the moment they come in, clean or unclean so that you can respond to them as soon as possible. You might thus want to include a component for real-time so that it would be easier to impress your users.
The solution to the one hour delay problem is a feature known as lambda architecture. The feature puts together the real-time and batch components. You would need the 2 components due to the fact that real time data arrival always contains fundamental problems. You do not have the assurance that an individual event will be received all at once. Furthermore, the data could have double that would be a sort of distraction for the data. There is also the probability that due to server instability or network, problems might arise as a result of a data arriving late. Experts came up with a solution, whereby you are able to get your fast and probably unreliable data immediately, while the slow and reliable one arrives later.
Why You Should Use Spark?
There is a cost that comes with using the Lambda Architecture approach. The same logic of business would have to be implemented and maintained on two systems that are different. Using Apache Storm to implement your system that is real time and Apache Pig or Apache Hive to implement batch system will result in re-writing related aggregates in Java and in SQL. The situation could however lead to a maintenance nightmare very quickly as noted by some experts.
If you want to use Lambda Architecture however, it would be advisable to use Apache Spark. This is because if the system had been developed with Storm, you would have to do a re-implementation of the whole logical aggregation with storm.
Spark is famous as a structure that can be used to learn machine. It is however quite capable of also undertaking ETL tasks. The APIs for spark can be used easily and they are clean. You can easily read them, and when compared to MapReduce, they have lower boilerplace code. You can also quickly do a logic prototyping quickly using the REPL interface with business operators. This leads to a faster execution of the system when compared with MapReduce.
Spark Streaming, however, happens to be the best advantage of Spark. Spark streaming would allow you to use aggregates which are same with the ones you used as a data stream in real-time for your batch application as well. You would thus not need to do a business logic re-implementation or maintain and test a base for second code. It is thus much easier and faster to achieve a component that is real-time on your system. This would not only be appreciated by the management of the developers and the developers but also by the users.
Do it Yourself
This is a simple and quick example you can follow to implement the Lambda Architecture. To make it easier the grasp, the only included steps are the required ones.
1. You first of all, write a function for business logic implementation. For this instance, we want the quantity of errors for a day to be counted and collected in an events log. The event logs would be composed of the time and date as well as the level of the log, the process of the logging and the message itself:
To be able to get the number of errors recorded in a day, you will have to filter the log level before counting the amount of messages you got for the day:
From the function, you notice that the lines that had “ERROR” were filtered, then use the function, map to place the line’s first word (which is the date) to be the key. You then use the reduced by key function to get the quantity of errors which you got for the day.
You would thus, observe that the RDD is transformed by the function into another. The main data framework of Spark are RDDs which are partitioned essentially, the collections are replicated. You would not be able to see the complex distributed collection handling as Spark is able to hide it. Just like any of the remaining collection, you can easily work with them.
2. Data can be read to an RDD from the HDFS using a function during the process of the Spark ETL. Furthermore they can also be used to count the number of errors and the results saved to the HDFS:
In this instance, a code was executed within a Cluster in Spark by initializing a SparkContext. (You should note that the initialization will not be required if you are using REPL in Spark, as SparkContext initialization is automatic in this case. Once the initialization of the SparkContext has been achieved, there is the need for lines to be read into an RDD from the file before the function for the error count is executed and the result is saved into the file.
The Uniform Resource Locators in the errCount.saveAsTextFile and spark.textFile can be kept in the HDFS through the use of hdfs://….Another alternative is Amazon S3 or local filesystem files amongst others.
3. In the event that you don’t have the luxury of a whole day before you would need the number of errors as you would prefer to document the results on a per minute basis. It would not be necessary for you to do an aggregation re-implementation. All you will need to do is to just use the code for streaming again.
You would have to do a context initialization once more. In this case, an event stream (which would be gotten from socket in a network; Apache Kafka, which is a service that is reliable will also be used by production architecture) would be taken from StreamingContext and converted into an RDD stream.
One of the stream’s micro-batching will be represented by each RDD. The time frame is configurable for each of the micro-batch (which are batches in 60- second in this instance) which can be used as a balance among the larger batches known as throughout and the smaller batches known as latency.
On the DStream, you will perform a map job run, using the function, countErrors, to convert each of the RDD of the lines to RDD in the form (date, errorCount) from the stream.
For each of the RDD, we would display the number of error for the particular batch, and then you can perform a stream update with the counts of the totals that are running using the same RDD. The running totals would be printed by this stream.
For ease of use, you could display the output on the screen, while you further save it in the HDFS, Kafka, or Apache HBase where it can be used by real-time users and applications.