Spark

Overview

This Spark training gives participants a solid technical introduction to the Spark architecture and how it works. Participants will learn the basic building blocks of Spark, including RDDs and the distributed compute engine, as well as higher-level components that provide a simpler and more capable interface, including Spark SQL and DataFrames. The course also covers more advanced skills such as using Spark Streaming to process streaming data, and provides an overview of Spark graph processing (GraphX and GraphFrames) and Spark machine learning (Spark ML Pipelines). Participants will explore possible performance issues, cluster deployment techniques, troubleshooting, and optimization strategies. Participants will:

- Understand the need for Spark in data processing, and how the Spark architecture distributes computations across cluster nodes
- Be familiar with the basic installation, setup, and layout of Spark
- Understand the use of Spark for interactive and ad-hoc operations
- Understand the use of Datasets, DataFrames, and Spark SQL to process structured data

Duration
2 Days

Pre-Requisites

Fundamental knowledge of any programming language

Basic understanding of databases, SQL, or another query language

Participants must have a working knowledge of Linux/Unix-based systems

Course Outline

Introduction to Spark

Overview, Motivations, Spark Systems
Spark Ecosystem
Spark vs. Hadoop
Typical Spark Deployment and Usage Environments

RDDs and Spark Architecture

RDD Concepts, Partitions, Lifecycle, Lazy Evaluation
Working with RDDs: Creating and Transforming (map, filter, etc.)
Caching – Concepts, Storage Type, Guidelines

Datasets, DataFrames and Spark SQL

Introduction and Usage
Creating and Using a Dataset
Working with JSON
Using the Dataset DSL
Using SQL with Spark
Data Formats
Optimizations: Catalyst and Tungsten
Datasets vs. DataFrames vs. RDDs

Creating Spark Applications

Overview, Basic Driver Code, SparkConf
Creating and Using a SparkContext/SparkSession
Building and Running Applications
Application Lifecycle
Cluster Managers
Logging and Debugging

Spark Streaming

Overview and Streaming Basics
Structured Streaming
DStreams (Discretized Streams): Architecture, Stateless, Stateful, and Windowed Transformations
Spark Streaming API
Programming and Transformations

Performance Characteristics and Tuning

The Spark UI
Narrow vs. Wide Dependencies
Minimizing Data Processing and Shuffling
Caching – Concepts, Storage Type, Guidelines
Using Caching
Using Broadcast Variables and Accumulators

Spark GraphX Overview

Introduction
Constructing Simple Graphs
GraphX API
Shortest Path Example

MLlib Overview

Introduction
Feature Vectors
Clustering / Grouping, K-Means
Recommendations
Classifications
