Get Help of Spark to Resolve Big Data Errors

Get Help of Spark (Lambda Architecture)

Building a reliable, well-designed and functional big data application that meets different end-user latency requirements can be very challenging. Keeping pace with rapidly changing technology is intimidating enough, and the challenge grows when the software also has to solve the problem at hand. It is therefore better to begin slowly and build software applications one at a time, especially as a beginner. Many data architecture designs have been created to help beginners picture how different types of software work together in a big data system. This chapter explains some of these architectural designs and technologies, along with their functions.

Some of the major requirements for such an architecture are:

  • Fault tolerance against human errors and hardware failures.
  • Support for different use cases and updates, including low-latency querying.
  • Linear scale-out, which means that more load can be handled by adding more machines.
  • Extensibility, so that new features can easily be added to the system.

One of the major uses of the Lambda architecture, therefore, is to recover from human errors and hardware failures. Its 3 major parts are the batch layer, the serving layer and the speed layer. For example, the datasets for the batch layer could be stored in a distributed file system, and MapReduce could be used to create batch views, which are then sent to the serving layer. This chapter discusses errors at big data companies that were solved through the Lambda architecture or something similar to it.
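The way the three layers fit together can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production design: the page-view events, the `build_batch_view` function and the `query` function are hypothetical names chosen for the example, and in-memory lists stand in for the distributed file system and the MapReduce job.

```python
from collections import Counter

# Batch layer: an append-only master dataset (hypothetical page-view events).
master_dataset = [
    {"page": "/home", "ts": 1},
    {"page": "/home", "ts": 2},
    {"page": "/about", "ts": 3},
]

def build_batch_view(events):
    """Recompute the whole view from scratch, as a batch job would."""
    return Counter(e["page"] for e in events)

# Serving layer: holds the latest precomputed batch view for fast lookups.
batch_view = build_batch_view(master_dataset)

# Speed layer: covers only events that arrived after the last batch run.
recent_events = [{"page": "/home", "ts": 4}]
realtime_view = Counter(e["page"] for e in recent_events)

def query(page):
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view[page] + realtime_view[page]

print(query("/home"))   # 2 from the batch view + 1 from the speed layer
print(query("/about"))  # only present in the batch view
```

The design choice to notice is that the batch view is always rebuilt from the raw events, so the speed layer only has to cover the short gap since the last batch run.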

Examples of Errors from Big Data Companies

Many big data companies have encountered problems in the past that took their systems offline. These companies include Facebook, Twitter, RBS and Google.


On the 23rd of September, 2010, Facebook went offline for almost 2.5 hours due to a fault that the company described as the unfortunate handling of an error condition. A system that automatically verifies configuration values ended up causing more problems than it solved. In layman's terms, the automated system checks for invalid configuration values. When it finds one, it automatically replaces it with a valid value from the persistent store. As long as the value in the persistent store is valid, the system works perfectly. When the value in the persistent store is itself invalid, however, the system cannot obtain a valid value to replace the invalid one. This is what took Facebook offline. To overcome the problem, they had to restart the entire Facebook site, which implies that they had to go back to the last reliable batch of data.


Twitter suffered a similar fate on the 21st of June, 2012. On that day, Twitter crashed and the site could not display anything to its users. The problem that led to the crash was described as a cascading bug. A fault developed in one part of the system and, like a wave, spread to other parts, which then developed faults of their own. That is why it was called a cascading bug: one failure affects a functional part and makes it faulty, and the newly faulty part affects another functional part in turn, until the whole system crashes. The irony was that Twitter's new headquarters had hosted an engineering open house on reliability just 2 days earlier. To solve the problem, the Twitter site was rolled back to the last stable version.


In June of 2012, RBS also had a technical outage, caused by an upgrade process. The outage affected more than a million customers for a full week. Even after the problem was identified and rectified, manual intervention was still needed in the complex, automated batch processing environment, which created a large backlog of daily data and information to process. The bank claimed that the error cost the firm about £125m, and the firm went on to report a total loss of £1.5bn for that quarter. The problem would have been solved more easily if the Lambda architecture had been used for the initial software. Manual intervention would have been unnecessary, or at least minimal, because they could have immediately rolled back to the latest stable version.


On the 24th of January 2014, Google and its other services, including Gmail, Documents, Calendar and Google+, were inaccessible for about 25 minutes. For some users, it took about 55 minutes before they could access any Google service. The problem was caused by a software bug in a system that sends messages telling other systems what to do. The bug caused that system to send wrong messages to the other systems for about 15 minutes. This wrong information caused data requests from users to be ignored, which led to errors. The effects began to be felt at about 2 minutes after 11 a.m. Once the original error was removed and correct messages were being generated again, the effects began to subside and Google services were restored. The Lambda architecture could have served a similar purpose here, restoring Google services as soon as the fault was observed.

Lambda Architecture and Resolving of Big Data Errors

The Lambda architecture was built on the acceptance that hardware is going to fail, and that the failure has to be handled in software. When we look at a platform, we have a platform that can scale, and we have built fault tolerance into it via software. But what happens when you get to developing the software you want to run on this platform? There is always a chance that an update to a database mistakenly deletes part of that database. The Lambda architecture was developed to fix this type of problem by providing human fault tolerance. As the system grows more complex, you need to understand that mistakes are going to be made. It is inevitable, so you have to deal with it. You have to put measures in place that spare you too much stress when you mistakenly delete something important or when an error occurs.

This is where fault tolerance comes in. As discussed earlier, even big companies sometimes have disastrous errors that have to be solved as soon as possible. The Lambda architecture concept isn't brand new, but it was only recently documented. The concept as a whole has been around for a while, but Nathan Marz came along and wrote it down very succinctly. He clarified it, gave it a name, and produced something that pretty much anybody should be able to understand and implement. He created Storm, which has substantial fault tolerance. His experience at that time, and after he became part of Twitter, led to the understanding that we have to protect against both hardware failures and human errors. So when you have pushed out a bad configuration file, how do you recover? In enterprise application platforms and support operations, you need to make sure you have support for both that kind of recovery and low latency. You could build a Lambda architecture for low-latency, OLAP or OLTP workloads. You would have to build a relational database and put indexes in certain places to support the way certain types of queries are written.

You will have data going into your Lambda architecture and into your batch layer. You will also have to add updates to your database, and you will have to decide how long you want to keep data, perhaps for 7 years. You then have to build a reliable, distributed master data set in your batch layer. It has to be built in a way that supports low-latency use of the data as well as a history of click events. You could also have data streaming into your database. The major benefit of the Lambda architecture is that whenever you have a bug, you can go into your batch side and fix it. You can then recreate all the corrupted records that were affected. It gives you, as an enterprise, a way to recover from failures and bugs, and it protects you from all the headaches that may come your way.
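The idea of fixing a bug on the batch side and then recreating the affected records can be illustrated with a small sketch. Everything here is an assumption made for the example: the purchase events, the `totals_buggy` and `totals_fixed` functions, and the in-memory list standing in for the master data set. What matters is that the raw events are never modified, so a corrected job can simply be rerun over them.

```python
# Immutable master data set: raw events are appended, never edited in place.
raw_events = [
    {"user": "ann", "amount_cents": 1250},
    {"user": "bob", "amount_cents": 400},
    {"user": "ann", "amount_cents": 350},
]

def totals_buggy(events):
    """First version of the batch job; it overwrites instead of adding."""
    view = {}
    for e in events:
        view[e["user"]] = e["amount_cents"]  # bug: should accumulate
    return view

def totals_fixed(events):
    """Corrected batch job: accumulate the amounts per user."""
    view = {}
    for e in events:
        view[e["user"]] = view.get(e["user"], 0) + e["amount_cents"]
    return view

bad_view = totals_buggy(raw_events)   # the view that was served while the bug was live
good_view = totals_fixed(raw_events)  # rebuilt from the same untouched raw events

# Only the records whose values differ were affected by the bug.
affected = {u for u in good_view if good_view[u] != bad_view.get(u)}
print(sorted(affected))
```

Because the view is derived data, there is nothing to repair by hand: the bad view is thrown away and replaced with the recomputed one.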

For example, if you have flight data, you have an immutable data set stored in your batch layer. Over time, you add the different events that have occurred: take-offs, landings and so on. Your data will grow as the days go by, and you have to be prepared for this by understanding the different scaling practices. You also gain the ability to come in and start asking questions by building the queries you care about. How many planes are airborne right now? How many are airborne per airline? How many planes are currently sitting on the tarmac at every single airport? These are simple questions. The important thing, however, is that simple questions can fall into either category, batch or streaming, depending on how far back you want to look at the data and what you need to support your use cases. When you have an immutable data set, you can go in and change the rules of your world any time you want to. So if you want to implement this architecture in your own business, what do you do?
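The flight questions above can be answered by replaying the immutable event log. The following is a minimal sketch under stated assumptions: the tuple layout, the flight and airport codes, and the function names are all invented for illustration, and a plane is treated as airborne if its most recent event is a take-off.

```python
from collections import Counter

# Hypothetical immutable event log: (timestamp, flight, airline, event, airport)
events = [
    (1, "AA10", "AA", "takeoff", "JFK"),
    (2, "BA20", "BA", "takeoff", "LHR"),
    (3, "AA11", "AA", "takeoff", "ORD"),
    (4, "AA10", "AA", "landing", "LAX"),
]

def latest_state(events, as_of=None):
    """Replay events in time order, keeping the last event seen per flight."""
    state = {}
    for ts, flight, airline, event, airport in sorted(events):
        if as_of is None or ts <= as_of:
            state[flight] = (airline, event, airport)
    return state

def airborne(events, as_of=None):
    """How many planes are airborne?"""
    states = latest_state(events, as_of).values()
    return sum(1 for _, event, _ in states if event == "takeoff")

def airborne_per_airline(events, as_of=None):
    """How many planes are airborne per airline?"""
    states = latest_state(events, as_of).values()
    return Counter(airline for airline, event, _ in states if event == "takeoff")

def on_ground_per_airport(events, as_of=None):
    """How many planes are sitting at each airport after landing?"""
    states = latest_state(events, as_of).values()
    return Counter(airport for _, event, airport in states if event == "landing")

print(airborne(events))            # current answer
print(airborne(events, as_of=3))   # the same question asked about an earlier moment
```

The `as_of` parameter is the point of the example: because the events are never overwritten, the same query can look back as far as the data is kept.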

Role of Spark

There are a lot of different technologies that fit into this stack, and they can fit in different places. Some of these components can meet the needs of more than one layer. It is, however, good to understand what technologies are out there. You need a way to store historic data, generate queries, view the data and support streaming. You also need a reliable batch stream that you can use to fix errors and bugs. To achieve this, you can use platforms such as Kafka, Storm and HBase, among others. You might see it as a lot of technology to learn and use, but all of this is just showing you how to protect yourself from yourself. With that in mind, you can now throw Spark in here. You may have reasons why you don't want Spark in certain places, or reasons why you want Spark and no other technology in other situations. You might have a whole bunch of legacy code that was written 2 to 3 years ago, sitting in your data center and running operations, and you don't want to go through the effort of replacing it. This is one of the advantages of Spark. When you look at these technologies as an integrated approach, there is software out there: Twitter has released something, and there is a project called Lambdoop that has not yet delivered a complete Lambda architecture but is working towards it. And then there is Apache Spark. Apache Spark fundamentally delivers all of the pieces necessary to satisfy the Lambda architecture. So the takeaway for most people here is, hopefully, that you can implement the Lambda architecture simply, with Spark.

Is Spark Going to Be Every Single Facet of Your Lambda Implementation?

Maybe, or maybe not! The architecture doesn't dictate that. Spark just happens to have all the pieces necessary to support your use cases. You can generate views with Spark SQL, you can support the speed layer with Spark Streaming, and you have the Spark execution engine underneath. From MapR's perspective, there is a customer, Cisco, that has effectively implemented the Lambda architecture with Spark. They use it for security analysis and other security-related work. Again, the benefit of this technology stack and this architecture is that it saves you from panic.
