Hadoop, Spark and Scala

Overview

Apache Spark is a general-purpose, distributed cluster computing framework, while Scala is the programming language in which Spark is written. Participants will:
    • Learn about Hadoop and traditional processing models
    • Understand HDFS Architecture
    • Understand MapReduce
    • Learn about Impala and Hive
    • Understand RDD lineage
    • Understand Pig
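As a taste of the functional style used throughout the Spark portion of the course, here is a minimal sketch of the classic word count in plain Scala. It mirrors the Spark RDD version (which chains `flatMap`, `map`, and `reduceByKey`) using only standard collections, so no Spark cluster is assumed:

```scala
object WordCountSketch {
  // Plain-Scala analogue of the Spark RDD word count:
  //   sc.textFile(...).flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))        // split each line into words
      .filter(_.nonEmpty)              // drop empty tokens
      .groupBy(identity)               // group equal words together
      .map { case (w, ws) => (w, ws.size) } // count each group

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("spark scala spark", "hadoop spark"))
    println(counts("spark")) // 3
  }
}
```

The same transformation chain, applied to an RDD instead of a `Seq`, runs in parallel across a cluster.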
Duration
3 Days

Pre-Requisites
  • Participants must have familiarity with Java
  • Participants must have an intermediate level of exposure to data analytics
Course Outline

    • Traditional models
    • Problems with Traditional Large-Scale Systems
    • Understanding of Hadoop
    • Hadoop Ecosystem
    • Distributed Processing: On a Cluster
    • Storage: HDFS-Architecture
    • Storage: Using-HDFS
    • Resource-Management: YARN
    • Resource-Management: YARN-Architecture
    • Resource-Management: Using YARN
    • MapReduce
    • Characteristics of MapReduce
    • Advanced MapReduce
    • Sqoop Overview
    • Basic Imports & Exports in Sqoop
    • Improving Sqoop's Performance
    • Limitations of Sqoop and Sqoop 2
    • Introducing Impala/Hive
    • Importance of Impala/Hive
    • Difference: Impala/Hive
    • How Impala & Hive Work
    • Hive & Traditional-Database
    • Understanding Meta-store
    • Creating: Databases & Tables in Hive & Impala
    • Loading Data into Tables of Hive & Impala
    • Understanding HCatalog
    • Impala: cluster
    • Various File Formats
    • Hadoop Tool Support: File Formats
    • Understanding Avro Schemas
    • Understanding Avro with Hive/Sqoop
    • Evolution: Avro Schema
    • Overview: Data File Partitioning
    • Partitioning: Impala/Hive
    • Using Partitions
    • Bucketing: Hive
    • Advanced Concepts: Hive
    • Understanding Apache Flume
    • Basic: Flume Architecture
    • Understanding Flume-Sources
    • Understanding Flume-Sinks
    • Understanding Flume-Channels
    • Configuration of Flume
    • Understanding HBase
    • Architecture: HBase
    • Data storage: HBase
    • Comparing HBase & RDBMS
    • Using HBase
    • Understanding Pig
    • Components: Pig
    • Comparing Pig & SQL
    • Using Pig
    • Understanding Apache Spark
    • What is Spark Shell
    • Understanding RDDs (Resilient Distributed Datasets)
    • Functional Programming: Spark
    • Exploring RDD
    • Key-Value Pair RDDs
    • Other Pair RDD Operations
    • Comparing Spark Applications/Spark Shell
    • Building a Spark Context
    • Creating: Spark-Application (Scala and Java)
    • Spark on YARN: Client-Mode
    • Spark on YARN: Cluster-Mode
    • Dynamic-Resource-Allocation
    • Configuration: Spark-Properties
    • Spark: Cluster
    • Understanding RDD-Partitions
    • Partitioning: File-based RDDs
    • HDFS & Data-Locality
    • Parallel Operations: Partitions
    • Understanding Stages & Tasks
    • Controlling Levels of Parallelism
    • Understanding RDD Lineage
    • Overview of Caching
    • Distributed-Persistence
    • Storage Levels of RDD Persistence
    • Choosing the Correct RDD Persistence Storage Level
    • RDD: Fault tolerance
    • Use Cases: Spark
    • Iterative Algorithms: Spark
    • Understanding Machine Learning
    • Graph Processing & Analysis
    • Example k-means
    • Spark SQL & the SQL Context
    • Creation of Data-Frames
    • Transforming/Querying of Data-Frames
    • Impala vs. Spark SQL
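The "Example k-means" topic above can be previewed with a short sketch: one Lloyd iteration over 1-D points in plain Scala (the data and initial centroids here are invented for illustration; Spark's distributed version performs the same assign-and-average step across RDD partitions):

```scala
object KMeansSketch {
  // One Lloyd iteration for 1-D points: assign each point to its
  // nearest centroid, then recompute each centroid as the mean of
  // its assigned points (unchanged if no points were assigned).
  def step(points: Seq[Double], centroids: Seq[Double]): Seq[Double] = {
    val assigned = points.groupBy(p => centroids.minBy(c => math.abs(c - p)))
    centroids.map(c => assigned.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
  }

  def main(args: Array[String]): Unit = {
    val pts = Seq(1.0, 2.0, 10.0, 12.0)
    println(step(pts, Seq(0.0, 9.0))) // List(1.5, 11.0)
  }
}
```

Iterating `step` until the centroids stop moving yields the full algorithm; this iterative, cache-friendly pattern is exactly why the course covers RDD persistence alongside it.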