Spark

Overview

This Spark training gives participants a technical introduction to the Spark architecture and how it works. Participants will learn the basic building blocks of Spark, which include RDDs and the distributed compute engine, as well as higher-level constructs that provide a simpler and more capable interface, including Spark SQL and DataFrames. The course also covers more advanced skills, such as using Spark Streaming to process streaming data, and provides an overview of Spark graph processing (GraphX and GraphFrames) and Spark machine learning (SparkML Pipelines). Participants will explore possible performance issues, cluster deployment techniques, troubleshooting, and optimization strategies. Participants will:
    • Understand the need for Spark in data processing and how the Spark architecture distributes computations across cluster nodes
    • Be familiar with the basic installation, setup, and layout of Spark
    • Understand the use of Spark for interactive and ad-hoc operations
    • Understand the use of Dataset, DataFrame, and Spark SQL to process structured data

Duration
2 Days

Pre-Requisites
  • Fundamental knowledge of any programming language
  • Basic understanding of databases, SQL, or another query language
  • Working knowledge of Linux/Unix-based systems

Course Outline

    • Overview, Motivations, Spark Systems
    • Spark Ecosystem
    • Spark vs. Hadoop
    • Typical Spark Deployment and Usage Environments
    • RDD Concepts, Partitions, Lifecycle, Lazy Evaluation
    • Working with RDDs: Creating and Transforming (map, filter, etc.)
    • Caching – Concepts, Storage Type, Guidelines
    • Introduction and Usage
    • Creating and Using a Dataset
    • Working with JSON
    • Using the Dataset DSL
    • Using SQL with Spark
    • Data Formats
    • Optimizations: Catalyst and Tungsten
    • Datasets vs. DataFrames vs. RDDs
    • Overview, Basic Driver Code, SparkConf
    • Creating and Using a SparkContext/SparkSession
    • Building and Running Applications
    • Application Lifecycle
    • Cluster Managers
    • Logging and Debugging
    • Overview and Streaming Basics
    • Structured Streaming
    • DStreams (Discretized Streams)
    • Architecture, Stateless, Stateful, and Windowed Transformations
    • Spark Streaming API
    • Programming and Transformations
    • The Spark UI
    • Narrow vs. Wide Dependencies
    • Minimizing Data Processing and Shuffling
    • Caching – Concepts, Storage Type, Guidelines
    • Using Caching
    • Using Broadcast Variables and Accumulators
    • Introduction
    • Constructing Simple Graphs
    • GraphX API
    • Shortest Path Example
    • Introduction
    • Feature Vectors
    • Clustering/Grouping, K-Means
    • Recommendations
    • Classifications