Apache Spark Tutorial
This tutorial gives an overview of Apache Spark and introduces its fundamental components.
- The Spark project consists of multiple components: Spark Core and Resilient Distributed Datasets (RDDs), Spark SQL, Spark Streaming, the MLlib machine learning library, and GraphX.
- Spark Core and Resilient Distributed Datasets (RDDs): Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionality. Its fundamental programming abstraction is the Resilient Distributed Dataset (RDD), a logical collection of data partitioned across machines. RDDs can be created by referencing datasets in external storage systems, or by applying coarse-grained transformations (e.g. map, filter, reduce, join) to existing RDDs. The RDD abstraction is exposed through a language-integrated API in Java, Python, and Scala that resembles local, in-process collections. This reduces programming complexity, because applications manipulate RDDs in much the same way as they manipulate local collections of data.
- Spark SQL: Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language to manipulate SchemaRDDs in Scala, Java, or Python. It also provides SQL language support, with command-line interfaces and an ODBC/JDBC server.
- Spark Streaming: Spark Streaming leverages Spark Core’s fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, on a single engine.
- MLlib Machine Learning Library: MLlib is a distributed machine learning framework on top of Spark. Because of Spark's distributed, memory-based architecture, MLlib is, according to benchmarks done by the MLlib developers, ten times as fast as the Hadoop disk-based Apache Mahout, and it scales even better than Vowpal Wabbit. It implements many common machine learning and statistical algorithms to simplify large-scale machine learning pipelines, including:
- summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
- classification and regression: SVMs, logistic regression, linear regression, decision trees, naive Bayes
- collaborative filtering: alternating least squares (ALS)
- clustering: k-means
- dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
- feature extraction and transformation
- optimization primitives: stochastic gradient descent, limited-memory BFGS (L-BFGS)
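The RDD and Spark Streaming descriptions above can be illustrated with a small sketch in plain Python. This is not the Spark API: the helper names `partition`, `count_errors`, and `mini_batches`, as well as the sample log lines, are hypothetical and exist only for illustration. The sketch assumes a local list standing in for a partitioned dataset. It shows coarse-grained operations (filter, map, reduce) applied per partition, and then the same function reused unchanged on mini-batches, which is the idea behind running batch code in a streaming setting.

```python
from functools import reduce
from typing import Iterable, Iterator, List

# All names below are hypothetical helpers, not part of Spark.

def partition(records: List[str], num_partitions: int) -> List[List[str]]:
    """Split a local list round-robin into chunks that stand in for partitions."""
    return [records[i::num_partitions] for i in range(num_partitions)]

def count_errors(partitions: List[List[str]]) -> int:
    """Filter and map within each partition, then reduce across partitions."""
    per_partition = [
        sum(map(lambda line: 1, filter(lambda line: "ERROR" in line, part)))
        for part in partitions
    ]
    return reduce(lambda a, b: a + b, per_partition, 0)

def mini_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an incoming stream of records into small fixed-size batches."""
    batch: List[str] = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch

# Made-up sample data for illustration only.
logs = ["INFO start", "ERROR disk full", "INFO retry", "ERROR timeout", "INFO done"]

# Batch analytics: process the whole dataset at once.
batch_total = count_errors(partition(logs, 2))

# Streaming analytics: apply the same function to each mini-batch and combine.
stream_total = sum(count_errors(partition(b, 2)) for b in mini_batches(iter(logs), 2))

print(batch_total, stream_total)
```

In a real Spark application the partitioning, scheduling, and fault tolerance would be handled by Spark Core across a cluster rather than by local helper functions like these.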
This article is only a brief overview of Apache Spark. The Spark training sessions are designed to be more structured and to cover each component in greater depth.