Spark is an open-source framework for cluster computing. It processes data in memory, which can make analytic applications run up to 100 times faster than some competing technologies. The framework was developed in the AMPLab at UC Berkeley. It greatly reduces the complexity of working with data and improves processing speed, which in turn makes it easier to build intelligence into mission-critical applications.
The framework runs on a variety of platforms, which reflects its flexibility. Writing algorithms with Spark is straightforward, which helps users extract insight from very complex data. In 2014, Spark was promoted to a top-level Apache project. Development is still ongoing, so the framework is likely to keep growing. Spark is also well suited to machine learning algorithms. To use Spark, you need a distributed storage system and a cluster manager. For cluster management, Spark supports its own standalone mode, Hadoop YARN, and Apache Mesos.
For distributed storage, Spark can interface with a wide variety of systems.
Ignite Solutions With Apache Spark
Millions of people communicate through massively connected networks, generating vast amounts of data every day. With the advancement of technology, researchers can now collect a huge amount of sample data within a few hours. These kinds of applications leave us with one major concern: how are we supposed to process and analyse such vast amounts of data, and cope with the speed at which it arrives, to provide better and faster solutions?
In this post, we introduce Apache Spark, an open-source framework for big data analysis, look at the features it offers, and then install it and perform a simple analysis as an example.
How Spark Ignited
Apache Spark originated in the AMPLab at the University of California, Berkeley in 2009, was open-sourced in 2010, and was donated to the Apache Software Foundation in 2013. It emerged as a fast and convenient solution for complex analysis of large-scale data. A significant advantage of Apache Spark over existing technologies such as Hadoop and Storm is that it is a complete solution: it can analyze data arriving from different sources, whether in real time or in batches, and in various formats such as images, text, and graphs. We will compare it with Hadoop, an established MapReduce solution, later in this post.
Beyond MapReduce-style processing, the Spark framework comes with tools for machine learning, data streaming, graph processing, and SQL queries. These functions can be used individually or combined into a pipeline, according to the user's requirements.
Some eye-catching features of Apache Spark are its built-in set of more than 80 high-level operators, which make it easy to query data, and its support for programming in Java, Python, or Scala. It can also speed up an application running in a Hadoop cluster by up to 100 times in memory and 10 times on disk.
Next, let us look in more detail at the interesting features Spark brings to its users.
Apache Spark Support for Large-scale Data Analysis
Compared to existing technologies, Spark significantly speeds up analysis and can deliver results in near real time by keeping data in memory. Spark can operate both in memory and on disk, so it also handles cases where a data set does not fit in the combined memory of the cluster. Whether data is kept in memory or on disk can be adjusted according to your application's requirements. Because Spark keeps intermediate results in memory across many operations, it delivers better performance when you repeatedly use the same set of data.
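The benefit of keeping intermediate results in memory can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Spark's API: the Dataset class, its cache() and collect() methods, and expensive_source() are hypothetical names invented for this example.

```python
class Dataset:
    """Toy stand-in for a distributed data set (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # function that produces the records
        self._use_cache = False
        self._cached = None        # in-memory copy, filled on first use
        self.computations = 0      # how many times the data was rebuilt

    def cache(self):
        # Ask for the records to be kept in memory after the first computation.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached    # served from memory, no recomputation
        self.computations += 1
        data = list(self._compute())
        if self._use_cache:
            self._cached = data
        return data


def expensive_source():
    # Stands in for reading and parsing a large input.
    return (x * x for x in range(5))


uncached = Dataset(expensive_source)
cached = Dataset(expensive_source).cache()

# Run two analyses over each data set.
for ds in (uncached, cached):
    total = sum(ds.collect())
    largest = max(ds.collect())

print(uncached.computations, cached.computations)  # prints "2 1"
```

The uncached data set is rebuilt for every analysis, while the cached one is built once and then reused, which is the effect described above for workloads that keep returning to the same data.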
Apache Spark optimizes MapReduce-style functions, reducing the computational cost of processing data, and lets the user optimize data processing pipelines through lazy evaluation: transformations are only recorded when they are declared and are executed when a result is actually requested. As mentioned above, Spark supports more functionality than MapReduce for data processing and analysis, and it also performs well when executing arbitrary operator graphs.
Spark is written in Scala and runs inside a Java Virtual Machine (JVM). It provides programming interfaces in Scala, Java, and Python, and an interactive command-line shell is available for Scala and Python. In addition to these three languages, applications that run on Spark can also be implemented in R and Clojure.
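Lazy evaluation is easier to see in a small example. The sketch below uses plain Python rather than Spark itself; LazyPipeline and its map, filter, and collect methods are hypothetical names chosen to resemble the general style of such APIs, and the calls list exists only to show when work actually happens.

```python
class LazyPipeline:
    """Records transformations and runs them only when collect() is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Nothing is computed here; the step is only recorded.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # The recorded steps are chained and executed in a single pass.
        records = iter(self._source)
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return list(records)


calls = []

def double(x):
    calls.append(x)    # remember every record that was actually processed
    return x * 2


pipeline = LazyPipeline(range(6)).map(double).filter(lambda x: x % 4 == 0)
calls_before = len(calls)
print(calls_before)    # prints 0: declaring the pipeline did no work

result = pipeline.collect()
print(result)          # prints [0, 4, 8]
```

Because the whole chain is known before anything runs, an engine built this way has the chance to reorder or combine steps before touching the data, which is where the optimization opportunity comes from.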
Comparison with Hadoop
As far as big data analysis is concerned, Hadoop has been a leading solution for over a decade. With its added features and optimizations, Apache Spark is a promising alternative to Hadoop MapReduce. Let's dig deeper to see how.
Hadoop is a good fit for sequential data processing, where each step consists of a Map and a Reduce function with a single pass over the input. For applications that need several passes over the input, however, it incurs the cost of running a separate MapReduce job at each step of the computation. This workflow requires constantly writing intermediate results to cluster storage and reading them back, and the time taken by these operations can slow the system down significantly. Other considerations that make Hadoop painful to use are the complexity of configuring and maintaining clusters, and the need to integrate different third-party tools depending on what the application involves, such as machine learning or the processing of streamed data.
With its optimized design, Apache Spark can run on top of storage such as the Hadoop Distributed File System (HDFS) and achieves better performance than the existing approach while providing added functionality. Because its operations are organized as Directed Acyclic Graphs (DAGs), Spark makes it possible to develop pipelines with multiple processing stages. It shares data between these stages in memory and allows the same set of data to be processed in parallel. Spark comes with utilities for building applications that combine different types of data analysis and processing, and so serves as a comprehensive solution, as illustrated in the next section.
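The single Map and Reduce step described above can be sketched as a word count in plain Python. This is a local, in-process illustration of the programming model only, not Hadoop's API; map_phase, shuffle, and reduce_phase are hypothetical helper names.

```python
from collections import defaultdict


def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: combine the values of each key into a single result.
    return {key: sum(values) for key, values in groups.items()}


lines = ["spark and hadoop", "spark and storm"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # prints {'spark': 2, 'and': 2, 'hadoop': 1, 'storm': 1}
```

An algorithm that needs several passes would repeat this whole cycle once per pass. In a MapReduce cluster each repetition is a separate job that writes its output to storage and reads it back, which is the overhead discussed in this section.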
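To make the idea of a Directed Acyclic Graph of operations more concrete, here is a small plain-Python sketch. It is not how Spark's scheduler is implemented; the stage names and the dag and stages dictionaries are hypothetical, and the ordering is done with the standard library's graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages whose output it needs.
dag = {
    "load": set(),
    "clean": {"load"},
    "count": {"clean"},
    "longest": {"clean"},
    "report": {"count", "longest"},
}

# The work done by each stage, given the results of its dependencies.
stages = {
    "load": lambda deps: ["spark  ", "hadoop", " storm"],
    "clean": lambda deps: [word.strip() for word in deps["load"]],
    "count": lambda deps: len(deps["clean"]),
    "longest": lambda deps: max(deps["clean"], key=len),
    "report": lambda deps: f"{deps['count']} words, longest is {deps['longest']}",
}

results = {}
# static_order() yields every stage after all of its dependencies.
for name in TopologicalSorter(dag).static_order():
    inputs = {dep: results[dep] for dep in dag[name]}
    results[name] = stages[name](inputs)

print(results["report"])  # prints "3 words, longest is hadoop"
```

The "clean" stage runs once and its in-memory result feeds both "count" and "longest". Those two stages do not depend on each other, so an engine working from the same graph would be free to run them in parallel.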
Apache Spark Libraries
In addition to its core, Spark provides a useful set of libraries for large-scale data analytics. A summary of some important libraries is presented in the table below.
On top of these comes a large set of additional libraries that provide further functionality. One is BlinkDB, an approximate query engine that lets you run SQL queries interactively over huge amounts of data, view the results with error bars, and trade query accuracy against response time. Another is Tachyon, a distributed file system that speeds up file sharing between frameworks such as Spark and Hadoop by working in memory, so that frequently read data sets can be cached and accessed quickly.
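The trade-off between accuracy and response time in approximate querying can be illustrated with simple random sampling. The sketch below is plain Python and does not use BlinkDB or reflect its implementation; approximate_mean is a hypothetical helper, and the error bar assumes a normal approximation and ignores the finite population correction.

```python
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a random sample and return (estimate, margin)."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the mean; 1.96 gives roughly a 95% interval.
    stderr = statistics.stdev(sample) / sample_size ** 0.5
    return estimate, 1.96 * stderr


# A synthetic "table column" with a known exact answer.
population = [i % 100 for i in range(100_000)]
exact = statistics.fmean(population)

results = {n: approximate_mean(population, n) for n in (100, 10_000)}
for n, (estimate, margin) in results.items():
    print(f"n={n}: {estimate:.2f} +/- {margin:.2f} (exact {exact:.2f})")
```

The smaller sample is cheaper to scan but gives a wider error bar, while the larger sample narrows it. This is the kind of choice between query accuracy and response time described above.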