Big Data with Hadoop

Overview

This training on Apache Hadoop is 50% lecture / 50% lab. Hands-on exercises make up the lab portion of the class, including Hadoop setup in pseudo-distributed mode, managing files in HDFS, writing MapReduce programs in Java, Hadoop monitoring, Sqoop, Hive & Pig.

Duration
5 Days

Pre-Requisites
Prior knowledge of Core Java and SQL is helpful but not mandatory.

Course Outline

Introduction to Big Data

  • What kind of data is called Big Data?
  • What are the business use cases for Big Data?
  • Big Data requirements in the traditional data warehousing & BI space
  • Big Data solutions

Introduction to Hadoop

  • The scale of data processed today
  • What Hadoop is and why it is important
  • Hadoop comparison with traditional systems
  • Hadoop history
  • Hadoop main components & architecture

Hadoop Distributed File System (HDFS)

  • Overview and design
  • Architecture
  • File storage
  • Component failures & recoveries
  • Block placement
  • Balancing the Hadoop cluster

Working with HDFS

  • Ways of accessing data in HDFS
  • Common HDFS operations & commands
  • Internals of a file read in HDFS
  • Data copying with ‘distcp’

Map-Reduce Abstraction

  • What is MapReduce?
  • Why is it popular?
  • The Big Picture of MapReduce
  • MapReduce process & terminology
  • MapReduce components failures & recoveries
  • Working with MapReduce
  • Lab: Working with MapReduce

Programming MapReduce Jobs

  • Java MapReduce implementation
  • Map() and Reduce() methods
  • Java MapReduce calling code
  • Lab: Programming Word Count
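
The word-count lab above is written against Hadoop's Mapper/Reducer API; as a warm-up, the same map() and reduce() logic can be sketched with plain Java collections so the map → shuffle → reduce flow is visible without a cluster. Class and method names here are illustrative, not Hadoop's.

```java
import java.util.*;

// Minimal sketch of word count: map() emits (word, 1) pairs, the
// driver simulates the shuffle by grouping pairs by key, and reduce()
// sums each key's values -- the same roles the Hadoop classes play.
public class WordCountSketch {

    // map(): emit (word, 1) for every word in one line of input
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // reduce(): sum the values the shuffle grouped under one key
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Driver: runs map over all lines, shuffles, then reduces per key.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> p : map(line)) {
                shuffled.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                        .add(p.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

In the actual lab, map() and reduce() become overrides of Hadoop's Mapper and Reducer classes, and Hadoop performs the shuffle between them.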

MapReduce Features

  • Joining Data Sets in MapReduce Jobs
  • How to write a Map-Side Join?
  • How to write a Reduce-Side Join?
  • MapReduce Counters
  • Built-in & user-defined counters
  • Retrieving MapReduce counters
  • Lab: Map-Side Join
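
The idea behind the map-side join lab can be sketched without a cluster: the small table is held in memory on each mapper (in Hadoop this would be shipped via the distributed cache), and each record of the large table is joined during map(), so no reduce phase is needed. The table contents and names below are made up for illustration.

```java
import java.util.*;

// Map-side (replicated) join sketch: "users" is the small in-memory
// table, "orders" the large streamed table; each order line
// "orderId,userId" is joined to the user's name during the map phase.
public class MapSideJoinSketch {

    public static List<String> join(Map<String, String> userNamesById,
                                    List<String> orderLines) {
        List<String> joined = new ArrayList<>();
        for (String line : orderLines) {
            String[] fields = line.split(",");
            String orderId = fields[0];
            String userId = fields[1];
            String name = userNamesById.get(userId);  // in-memory lookup
            if (name != null) {                       // inner join: skip misses
                joined.add(orderId + "," + userId + "," + name);
            }
        }
        return joined;
    }
}
```

A reduce-side join, by contrast, tags records from both tables with their join key and lets the shuffle bring matching records together at a reducer.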

Troubleshooting MapReduce Jobs

  • How to find & review logs for YARN MapReduce jobs
  • Understanding log messages
  • Viewing & Filtering MapReduce Activities

Hive

  • Hive Background
  • Hive Use Case
  • About Hive
  • Hive vs. Pig
  • Hive Architecture and Components
  • Metastore in Hive
  • Limitations of Hive
  • Comparison with Traditional Database
  • Hive Data Types and Data Models
  • Partitions and Buckets
  • Hive Tables – Managed Tables and External Tables
  • Importing Data
  • Querying Data
  • Managing Outputs
  • Hive Script
  • Hive UDF
  • Hive Demo on Healthcare Data set

Hands On:

  • Understanding the MapReduce flow behind HiveQL queries
  • Creating Static partition table
  • Creating Dynamic partition table
  • Loading an unstructured text file into a table using RegexSerDe
  • Loading a JSON file into a table using JsonSerDe
  • Creating transaction table
  • Creating view and indexes
  • Creating ORC, Parquet tables and using compression techniques
  • Creating Sequence file table 

Writing Java code for a Hive UDF

Writing Java code to connect to Hive & perform CRUD operations using JDBC
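
The JDBC exercise above follows the standard java.sql API. A hedged sketch, assuming a running HiveServer2 and the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) on the classpath; the host, port, database, and table names are placeholders:

```java
import java.sql.*;

// Sketch of CRUD against Hive over JDBC. Note Hive supports UPDATE and
// DELETE only on transactional (ACID) tables, so plain tables are
// limited to create/insert/select.
public class HiveJdbcSketch {

    // Builds the JDBC URL HiveServer2 expects: jdbc:hive2://host:port/db
    static String hiveUrl(String host, int port, String db) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db;
    }

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                 hiveUrl("localhost", 10000, "default"), "hive", "");
             Statement stmt = conn.createStatement()) {

            // CREATE (DDL)
            stmt.execute("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING)");
            // INSERT
            stmt.execute("INSERT INTO employees VALUES (1, 'Asha')");
            // READ
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM employees")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}
```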

  • Using Sqoop to:
    • import RDBMS data into HDFS
    • import RDBMS data into Hive
    • import RDBMS data into HBase
    • export data from HDFS into an RDBMS

Scala

Duration: 4 Hours 

Basics:

  • Hello World
  • Primitive Types
  • Type inference
  • Vars vs Vals
  • Lazy Vals
  • Methods
  • Pass By Name
  • No parens/Brackets
  • Default Arguments
  • Named Arguments

Classes:

  • Introduction
  • Inheritance
  • Main/Additional Constructors
  • Private Constructors
  • Uniform Access
  • Case Classes
  • Objects
  • Traits

Collections

  • Lists
  • Collection Manipulation
  • Simple Methods
  • Methods With Functions
  • Use Cases With Common Methods
  • Tuples

Types

  • Type parameterization
  • Covariance
  • Contravariance
  • Type Upper Bounds
  • ‘Nothing’ Type

Options

  • Option Implementation
  • Like Lists
  • Practice Application

Anonymous Classes

  • Introduction
  • Structural Typing
  • Anonymous Classes with Structural Typing

Special Methods

  • Apply
  • Update

Closures and Functions

Currying

  • Introduction
  • Applications
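
Currying is covered here as a Scala feature, but the idea is language-independent: a two-argument function is rewritten as a function that returns a function, so arguments can be supplied one at a time. A sketch in Java (the course's other language), using nested Function types:

```java
import java.util.function.Function;

// Curried addition: add is a function of one argument that returns
// another function of one argument, mirroring Scala's def add(a: Int)(b: Int).
public class CurryingSketch {

    public static Function<Integer, Function<Integer, Integer>> add = a -> b -> a + b;

    public static void main(String[] args) {
        Function<Integer, Integer> addFive = add.apply(5); // partial application
        System.out.println(addFive.apply(3)); // 8
    }
}
```

Partial application like addFive is the main practical payoff: fixing some arguments early to build specialized functions.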

Implicits

  • Implicit Values/Parameters
  • Implicit Conversions
  • With Anonymous Classes
  • Implicit Classes

For Loops

  • Introduction
  • Coding Style
  • With Options
  • And flatMap
  • Guards
  • Definitions

Var Args

  • Introduction
  • Ascribing the _* type

Partial Functions

  • Introduction
  • Match
  • Match Values/Constants
  • Match Types
  • Extractors
  • If Conditions
  • Or

Working with XML & JSON

Performance tuning guidelines 

Packaging and deployment

Introduction to Spark

  • Evolution of distributed systems
  • Why do we need a new generation of distributed systems?
  • Limitations of MapReduce in Hadoop
  • Understanding the need for batch vs. real-time analytics
  • Batch Analytics 
    • Hadoop Ecosystem Overview
    • Real Time Analytics Options
  • Introduction to stream and in memory analysis 
  • What is Spark?
  • A Brief History

Using Scala for Creating Spark Applications

  • Invoking Spark Shell
  • Creating the SparkContext 
  • Loading a File in Shell
  • Performing Some Basic Operations on Files in Spark Shell
  • Building a Spark Project with sbt
  • Running Spark Project with sbt
  • Caching Overview
  • Distributed Persistence
  • Spark Streaming Overview
  • Example: Streaming Word Count
  • Testing Tips in Scala
  • Performance Tuning Tips in Spark
  • Shared Variables: Broadcast Variables
  • Shared Variables: Accumulators

Hands On:

  • Installing Spark
  • Installing sbt & Maven for building the project
  • Writing code for 
    • converting HDFS data into an RDD
    • performing different transformations and actions
  • Understanding the tasks & stages of a Spark job
  • Writing code using different storage levels & caching
  • Creating & using broadcast variables & accumulators
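
The transformation/action distinction exercised in this lab can be previewed with plain Java streams, no cluster needed: like RDD transformations, intermediate stream operations are lazy, and nothing runs until a terminal operation, the analogue of a Spark action, is invoked. The log-filtering example is made up for illustration.

```java
import java.util.*;
import java.util.stream.*;

// RDD-style pipelines expressed with Java streams: map and filter are
// lazy "transformations"; count and collect are eager "actions".
public class RddAnalogy {

    public static long countErrors(List<String> logLines) {
        return logLines.stream()
                .map(String::toUpperCase)              // transformation: map
                .filter(l -> l.contains("ERROR"))      // transformation: filter
                .count();                              // action: count
    }

    public static List<Integer> lineLengths(List<String> lines) {
        return lines.stream()
                .map(String::length)                   // transformation
                .collect(Collectors.toList());         // action: collect
    }
}
```

In Spark the same shape appears as rdd.map(...).filter(...).count(), with the added point that the lazy pipeline is distributed across executors.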

Running SQL queries using Spark SQL

  • Starting Point: SQLContext
  • Creating DataFrames
  • DataFrame Operations
  • Running SQL Queries Programmatically
  • Interoperating with RDDs
  • Inferring the Schema Using Reflection
  • Data Sources
  • Generic Load/Save Functions
  • Save Modes
  • Saving to Persistent Tables
  • Parquet Files
  • Loading Data Programmatically
  • Partition Discovery
  • Schema Merging
  • JSON Datasets
  • Hive Tables
  • JDBC To Other Databases
  • HBase Integration
  • Reading Solr results as a DataFrame
  • Troubleshooting
  • Performance Tuning
  • Caching Data in Memory
  • Compatibility with Apache Hive
  • Unsupported Hive Functionality
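
"Inferring the schema using reflection" means Spark SQL derives DataFrame column names and types from the fields of a JavaBean (or Scala case class) when you call createDataFrame. A toy version of that idea, reading a bean class's declared fields with plain Java reflection; the Person bean is made up for illustration:

```java
import java.lang.reflect.Field;
import java.util.*;

// Toy schema inference: list a bean class's field names and types,
// the same information Spark turns into DataFrame columns.
public class SchemaInference {

    public static class Person {
        public String name;
        public int age;
    }

    // Returns column name -> simple type name, as a schema would list them.
    public static Map<String, String> inferSchema(Class<?> beanClass) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (Field f : beanClass.getDeclaredFields()) {
            schema.put(f.getName(), f.getType().getSimpleName());
        }
        return schema;
    }
}
```

In Spark itself this corresponds to sqlContext.createDataFrame(rdd, Person.class), after which the inferred columns can be queried by name in SQL.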

Hands On:

  • Writing code for creating SparkContext, HiveContext & HBaseContext objects
  • Writing code for running Hive queries using Spark SQL
  • Writing code for loading & transforming text file data and converting it into a DataFrame
  • Writing code for reading & storing JSON files as DataFrames inside the Spark code
  • Writing code for reading & storing Parquet files as DataFrames
  • Reading & writing data into an RDBMS using Spark SQL
  • Caching DataFrames
  • Java code for reading Solr results as a DataFrame

Spark Streaming

  • Micro batch
  • Discretized Streams – DStreams
  • Input DStreams and Receivers
  • DStream to RDD
  • Basic Sources
  • Advanced Sources
  • Transformations on DStreams
  • Output Operations on DStreams
  • Design Patterns for using foreachRDD
  • DataFrame and SQL Operations
  • Checkpointing
  • Socket stream
  • File Stream
  • Stateful operations & how they work
  • Window Operations
  • Join Operations

Tuning Spark

  • Data Serialization
  • Memory Tuning
  • Determining Memory Consumption
  • Tuning Data Structures
  • Serialized RDD Storage
  • Garbage Collection Tuning
  • Other Considerations
  • Level of Parallelism
  • Memory Usage of Reduce Tasks
  • Broadcasting Large Variables
  • Data Locality
  • Summary

Spark ML Programming 

  • Data types
  • Classification & regression
  • Collaborative filtering
  • Alternating least squares (ALS)

Hands On:

  • Writing code for 
    • Processing Flume data using Spark streaming
    • Processing network data using Spark streaming
    • Processing Kafka data using Spark streaming 
  • Writing code for SVMs, linear regression & logistic regression
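
The linear-regression exercise uses Spark MLlib; the underlying model can be previewed library-free with closed-form least squares for y = a + b*x over a small in-memory data set. The numbers in the test are made up.

```java
// Simple linear regression: the slope is cov(x, y) / var(x) and the
// intercept is meanY - slope * meanX, minimizing the sum of squared errors.
public class LeastSquares {

    // Returns {intercept a, slope b} for the line y = a + b*x.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0, var = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - meanX) * (y[i] - meanY);
            var += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = cov / var;
        double intercept = meanY - slope * meanX;
        return new double[] { intercept, slope };
    }
}
```

MLlib solves the same objective iteratively (e.g. with gradient descent) so it scales to data partitioned across a cluster.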

Data Loading: 

  • Learn the different data loading options available in Hadoop 
  • Learn about Flume & Sqoop, demonstrating how to bring various kinds of data into HDFS, such as 
    • web server logs 
    • stream data
    • RDBMS data
    • Twitter tweets

Flume and Sqoop 

Kafka

  • Introduction
  • Basic Kafka Concepts
  • Kafka vs Other Messaging Systems
  • Intra-Cluster Replication
  • Inside Look at Kafka’s Components
  • Cluster Administration
  • Using Kafka Connect to Move Data

Hands-On:

  • Using flume to capture and transport 
    • network data
    • web server log data
    • Twitter data
  • Creating a topic & configuring its replication factor & number of partitions
  • Loading data
  • Reading data
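
One reason the number of partitions chosen at topic creation matters: records with the same key always land in the same partition, because the producer picks the partition from a hash of the key. A simplified stand-in, using Java's hashCode() where Kafka's default partitioner actually uses murmur2 on the serialized key:

```java
// Toy keyed-partition assignment: hash the key, mask the sign bit, and
// take it modulo the partition count, so equal keys map to equal partitions.
public class PartitionerSketch {

    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

This is also why increasing the partition count later reshuffles key-to-partition assignments: the modulus changes.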