Apache Spark Tutorial

This tutorial gives an overview of Apache Spark and introduces its fundamentals.

  • The Spark project consists of multiple components: Spark Core and Resilient Distributed Datasets (RDDs), Spark SQL, Spark Streaming, the MLlib machine learning library, and GraphX.
  • Spark Core and Resilient Distributed Datasets (RDDs): Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionality. The fundamental programming abstraction is the Resilient Distributed Dataset (RDD), a logical collection of data partitioned across machines. RDDs can be created by referencing datasets in external storage systems, or by applying coarse-grained transformations (e.g. map, filter, reduce, join) to existing RDDs. The RDD abstraction is exposed through a language-integrated API in Java, Python, and Scala. This reduces programming complexity, because applications manipulate RDDs much as they would manipulate local, in-process collections of data.
  • Spark SQL: Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame as of Spark 1.3), which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language to manipulate SchemaRDDs in Scala, Java, or Python. It also provides SQL language support, with command-line interfaces and an ODBC/JDBC server.
  • Spark Streaming: Spark Streaming leverages Spark Core’s fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, on a single engine.
  • MLlib Machine Learning Library: MLlib is a distributed machine learning framework on top of Spark. Because of Spark's distributed, memory-based architecture, MLlib is, according to benchmarks done by the MLlib developers, ten times as fast as the Hadoop disk-based Apache Mahout, and it even scales better than Vowpal Wabbit. It implements many common machine learning and statistical algorithms to simplify large-scale machine learning pipelines, including:
    • summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
    • classification and regression: SVMs, logistic regression, linear regression, decision trees, naive Bayes
    • collaborative filtering: alternating least squares (ALS)
    • clustering: k-means
    • dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
    • feature extraction and transformation
    • optimization primitives: stochastic gradient descent, limited-memory BFGS (L-BFGS)
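
The RDD and mini-batch ideas described above can be illustrated with a small sketch. This is plain Python using only the standard library, not Spark's API: the helper names (partition, word_lengths_total, mini_batches) and the sample data are hypothetical and chosen only for illustration. It shows a dataset split into partitions, a coarse-grained filter/map/reduce pipeline applied to each partition, and the same pipeline reused unchanged on mini-batches of a stream.

```python
from functools import reduce
from itertools import islice


def partition(data, num_partitions):
    """Split a list into chunks that stand in for RDD partitions (hypothetical helper)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def word_lengths_total(partitions):
    """Coarse-grained pipeline: filter, map, then reduce across all partitions."""
    partials = []
    for part in partitions:                # in a cluster, each partition could run on a different machine
        kept = filter(lambda w: w, part)   # filter: drop empty strings
        lengths = map(len, kept)           # map: word -> word length
        partials.append(sum(lengths))      # partial result for this partition
    # reduce: combine the per-partition partial results into one value
    return reduce(lambda a, b: a + b, partials, 0)


def mini_batches(stream, batch_size):
    """Yield consecutive fixed-size batches from an iterable, mimicking mini-batch ingestion."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


words = ["spark", "", "core", "rdd", "sql", "", "mllib", "graphx"]

# Batch analytics: process the whole dataset at once.
batch_total = word_lengths_total(partition(words, 3))

# Streaming analytics: the same pipeline applied to each mini-batch.
stream_totals = [word_lengths_total(partition(b, 2)) for b in mini_batches(words, 3)]

print(batch_total)    # 26
print(stream_totals)  # [9, 6, 11]
```

The per-batch totals add up to the batch result, which is the point of the design: one piece of application code serves both batch and streaming workloads. In real Spark the partitioning, scheduling, and fault tolerance are handled by Spark Core rather than by hand-written helpers like these.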

This article is only an overview of Apache Spark. The Spark training sessions are designed to be more structured and in-depth.

