Big Data with Hadoop

Overview

This training on Apache Hadoop is 50% lecture / 50% lab. Hands-on exercises make up the lab portion of the class, including Hadoop setup in pseudo-distributed mode, managing files in HDFS, writing MapReduce programs in Java, Hadoop monitoring, Sqoop, Hive & Pig.

Duration
5 Days

Pre-Requisites
Prior knowledge of Core Java and SQL is helpful but not mandatory.

Course Outline

Introduction to Big Data

  • What kind of data is called Big Data?
  • What are the business use cases for Big Data?
  • Big Data requirements in the traditional data warehousing & BI space
  • Big Data solutions

Introduction to Hadoop

  • The scale of data processed today
  • What Hadoop is and why it is important
  • Hadoop comparison with traditional systems
  • Hadoop history
  • Hadoop main components & architecture

Hadoop Distributed File System (HDFS)

  • Overview and design
  • Architecture
  • File storage
  • Component failures & recoveries
  • Block placement
  • Balancing the Hadoop cluster

Working with HDFS

  • Ways of accessing data in HDFS
  • Common HDFS operations & commands
  • Internals of a file read in HDFS
  • Data copying with ‘distcp’

Map-Reduce Abstraction

  • What is MapReduce?
  • Why is it popular?
  • The Big Picture of MapReduce
  • MapReduce process & terminology
  • MapReduce components failures & recoveries
  • Working with MapReduce
  • Lab: Working with MapReduce

Programming MapReduce Jobs

  • Java MapReduce implementation
  • Map() and Reduce() methods
  • Java MapReduce calling code
  • Lab: Programming Word Count
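
The word-count lab above is written against Hadoop's Mapper/Reducer API; as a warm-up, the same map() and reduce() logic can be sketched with plain Java collections so the map → shuffle → reduce flow is visible without a cluster. Class and method names here are illustrative, not Hadoop's.

```java
import java.util.*;

// Minimal sketch of word count: map() emits (word, 1) pairs, the
// driver simulates the shuffle by grouping pairs by key, and reduce()
// sums each key's values -- the same roles the Hadoop classes play.
public class WordCountSketch {

    // map(): emit (word, 1) for every word in one line of input
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // reduce(): sum the values the shuffle grouped under one key
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Driver: runs map over all lines, shuffles, then reduces per key.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> p : map(line)) {
                shuffled.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                        .add(p.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

In the actual lab, map() and reduce() become overrides of Hadoop's Mapper and Reducer classes, and Hadoop performs the shuffle between them.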

MapReduce Features

  • Joining Data Sets in MapReduce Jobs
  • How to write a Map-Side Join?
  • How to write a Reduce-Side Join?
  • MapReduce Counters
  • Built-in & user-defined counters
  • Retrieving MapReduce counters
  • Lab: Map-Side Join
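
The idea behind the map-side join lab can be sketched without a cluster: the small table is held in memory on each mapper (in Hadoop this would be shipped via the distributed cache), and each record of the large table is joined during map(), so no reduce phase is needed. The table contents and names below are made up for illustration.

```java
import java.util.*;

// Map-side (replicated) join sketch: "users" is the small in-memory
// table, "orders" the large streamed table; each order line
// "orderId,userId" is joined to the user's name during the map phase.
public class MapSideJoinSketch {

    public static List<String> join(Map<String, String> userNamesById,
                                    List<String> orderLines) {
        List<String> joined = new ArrayList<>();
        for (String line : orderLines) {
            String[] fields = line.split(",");
            String orderId = fields[0];
            String userId = fields[1];
            String name = userNamesById.get(userId);  // in-memory lookup
            if (name != null) {                       // inner join: skip misses
                joined.add(orderId + "," + userId + "," + name);
            }
        }
        return joined;
    }
}
```

A reduce-side join, by contrast, tags records from both tables with their join key and lets the shuffle bring matching records together at a reducer.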

Troubleshooting MapReduce Jobs

  • How to find & review logs for YARN MapReduce jobs
  • Understanding log messages
  • Viewing & Filtering MapReduce Activities

Hive

  • Hive Background
  • Hive Use Case
  • About Hive
  • Hive vs. Pig
  • Hive Architecture and Components
  • Metastore in Hive
  • Limitations of Hive
  • Comparison with Traditional Database
  • Hive Data Types and Data Models
  • Partitions and Buckets
  • Hive Tables – Managed Tables and External Tables
  • Importing Data
  • Querying Data
  • Managing Outputs
  • Hive Script
  • Hive UDF
  • Hive Demo on Healthcare Data set

Hands On:

  • Understanding the MapReduce flow behind HiveQL queries
  • Creating Static partition table
  • Creating Dynamic partition table
  • Loading an unstructured text file into a table using RegexSerDe
  • Loading a JSON file into a table using JsonSerDe
  • Creating transaction table
  • Creating view and indexes
  • Creating ORC, Parquet tables and using compression techniques
  • Creating Sequence file table 

Writing Java code for a Hive UDF

Writing Java code to connect to Hive & perform CRUD operations using JDBC
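
The JDBC exercise above follows the standard java.sql API. A hedged sketch, assuming a running HiveServer2 and the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) on the classpath; the host, port, database, and table names are placeholders:

```java
import java.sql.*;

// Sketch of CRUD against Hive over JDBC. Note Hive supports UPDATE and
// DELETE only on transactional (ACID) tables, so plain tables are
// limited to create/insert/select.
public class HiveJdbcSketch {

    // Builds the JDBC URL HiveServer2 expects: jdbc:hive2://host:port/db
    static String hiveUrl(String host, int port, String db) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db;
    }

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                 hiveUrl("localhost", 10000, "default"), "hive", "");
             Statement stmt = conn.createStatement()) {

            // CREATE (DDL)
            stmt.execute("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING)");
            // INSERT
            stmt.execute("INSERT INTO employees VALUES (1, 'Asha')");
            // READ
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM employees")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}
```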

  • Using Sqoop to:
    • import RDBMS data into HDFS
    • import RDBMS data into Hive
    • import RDBMS data into HBase
    • export data from HDFS into an RDBMS

Scala

Duration: 4 Hours 

Basics:

  • Hello World
  • Primitive Types
  • Type inference
  • Vars vs Vals
  • Lazy Vals
  • Methods
  • Pass By Name
  • No parens/Brackets
  • Default Arguments
  • Named Arguments

Classes:

  • Introduction
  • Inheritance
  • Main/Additional Constructors
  • Private Constructors
  • Uniform Access
  • Case Classes
  • Objects
  • Traits

Collections

  • Lists
  • Collection Manipulation
  • Simple Methods
  • Methods With Functions
  • Use Cases With Common Methods
  • Tuples

Types

  • Type parameterization
  • Covariance
  • Contravariance
  • Type Upper Bounds
  • ‘Nothing’ Type

Options

  • Option Implementation
  • Like Lists
  • Practice Application

Anonymous Classes

  • Introduction
  • Structural Typing
  • Anonymous Classes with Structural Typing

Special Methods

  • Apply
  • Update

Closures and Functions

Currying

  • Introduction
  • Applications
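
Currying is covered here as a Scala feature, but the idea is language-independent: a two-argument function is rewritten as a function that returns a function, so arguments can be supplied one at a time. A sketch in Java (the course's other language), using nested Function types:

```java
import java.util.function.Function;

// Curried addition: add is a function of one argument that returns
// another function of one argument, mirroring Scala's def add(a: Int)(b: Int).
public class CurryingSketch {

    public static Function<Integer, Function<Integer, Integer>> add = a -> b -> a + b;

    public static void main(String[] args) {
        Function<Integer, Integer> addFive = add.apply(5); // partial application
        System.out.println(addFive.apply(3)); // 8
    }
}
```

Partial application like addFive is the main practical payoff: fixing some arguments early to build specialized functions.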

Implicits

  • Implicit Values/Parameters
  • Implicit Conversions
  • With Anonymous Classes
  • Implicit Classes

For Loops

  • Introduction
  • Coding Style
  • With Options
  • And flatMap
  • Guards
  • Definitions

Var Args

  • Introduction
  • Ascribing the _* type

Partial Functions

  • Introduction
  • Match
  • Match Values/Constants
  • Match Types
  • Extractors
  • If Conditions
  • Or

Working with XML & JSON

Performance tuning guidelines 

Packaging and deployment

Introduction to Spark

  • Evolution of distributed systems
  • Why do we need a new generation of distributed systems?
  • Limitations of MapReduce in Hadoop
  • Understanding the need for batch vs. real-time analytics
  • Batch Analytics 
    • Hadoop Ecosystem Overview
    • Real Time Analytics Options
  • Introduction to stream and in memory analysis 
  • What is Spark?
  • A Brief History

Using Scala for Creating Spark Applications

  • Invoking Spark Shell
  • Creating the SparkContext 
  • Loading a File in Shell
  • Performing Some Basic Operations on Files in Spark Shell
  • Building a Spark Project with sbt
  • Running Spark Project with sbt
  • Caching Overview
  • Distributed Persistence
  • Spark Streaming Overview
  • Example: Streaming Word Count
  • Testing Tips in Scala
  • Performance Tuning Tips in Spark
  • Shared Variables: Broadcast Variables
  • Shared Variables: Accumulators

Hands On:

  • Installing Spark
  • Installing sbt & Maven for building the project
  • Writing code for 
    • converting HDFS data into an RDD
    • performing different transformations and actions
  • Understanding the tasks & stages of a Spark job
  • Writing code using different storage levels & caching
  • Creating & using broadcast variables & accumulators
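
The transformation/action distinction exercised in this lab can be previewed with plain Java streams, no cluster needed: like RDD transformations, intermediate stream operations are lazy, and nothing runs until a terminal operation, the analogue of a Spark action, is invoked. The log-filtering example is made up for illustration.

```java
import java.util.*;
import java.util.stream.*;

// RDD-style pipelines expressed with Java streams: map and filter are
// lazy "transformations"; count and collect are eager "actions".
public class RddAnalogy {

    public static long countErrors(List<String> logLines) {
        return logLines.stream()
                .map(String::toUpperCase)              // transformation: map
                .filter(l -> l.contains("ERROR"))      // transformation: filter
                .count();                              // action: count
    }

    public static List<Integer> lineLengths(List<String> lines) {
        return lines.stream()
                .map(String::length)                   // transformation
                .collect(Collectors.toList());         // action: collect
    }
}
```

In Spark the same shape appears as rdd.map(...).filter(...).count(), with the added point that the lazy pipeline is distributed across executors.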

Running SQL queries using Spark SQL

  • Starting Point: SQLContext
  • Creating DataFrames
  • DataFrame Operations
  • Running SQL Queries Programmatically
  • Interoperating with RDDs
  • Inferring the Schema Using Reflection
  • Data Sources
  • Generic Load/Save Functions
  • Save Modes
  • Saving to Persistent Tables
  • Parquet Files
  • Loading Data Programmatically
  • Partition Discovery
  • Schema Merging
  • JSON Datasets
  • Hive Tables
  • JDBC To Other Databases
  • HBase Integration
  • Reading Solr results as a DataFrame
  • Troubleshooting
  • Performance Tuning
  • Caching Data in Memory
  • Compatibility with Apache Hive
  • Unsupported Hive Functionality
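
"Inferring the schema using reflection" means Spark SQL derives DataFrame column names and types from the fields of a JavaBean (or Scala case class) when you call createDataFrame. A toy version of that idea, reading a bean class's declared fields with plain Java reflection; the Person bean is made up for illustration:

```java
import java.lang.reflect.Field;
import java.util.*;

// Toy schema inference: list a bean class's field names and types,
// the same information Spark turns into DataFrame columns.
public class SchemaInference {

    public static class Person {
        public String name;
        public int age;
    }

    // Returns column name -> simple type name, as a schema would list them.
    public static Map<String, String> inferSchema(Class<?> beanClass) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (Field f : beanClass.getDeclaredFields()) {
            schema.put(f.getName(), f.getType().getSimpleName());
        }
        return schema;
    }
}
```

In Spark itself this corresponds to sqlContext.createDataFrame(rdd, Person.class), after which the inferred columns can be queried by name in SQL.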

Hands On:

  • Writing code for creating SparkContext, HiveContext & HBaseContext objects
  • Writing code for running Hive queries using Spark SQL
  • Writing code for loading & transforming text file data and converting it into a DataFrame
  • Writing code for reading & storing JSON files as DataFrames inside the Spark code
  • Writing code for reading & storing Parquet files as DataFrames
  • Reading & writing data into an RDBMS using Spark SQL
  • Caching DataFrames
  • Java code for reading Solr results as a DataFrame

Spark Streaming

  • Micro batch
  • Discretized Streams – DStreams
  • Input DStreams and Receivers
  • DStream to RDD
  • Basic Sources
  • Advanced Sources
  • Transformations on DStreams
  • Output Operations on DStreams
  • Design Patterns for using foreachRDD
  • DataFrame and SQL Operations
  • Checkpointing
  • Socket stream
  • File Stream
  • Stateful operations & how they work
  • Window Operations
  • Join Operations

Tuning Spark

  • Data Serialization
  • Memory Tuning
  • Determining Memory Consumption
  • Tuning Data Structures
  • Serialized RDD Storage
  • Garbage Collection Tuning
  • Other Considerations
  • Level of Parallelism
  • Memory Usage of Reduce Tasks
  • Broadcasting Large Variables
  • Data Locality
  • Summary

Spark ML Programming 

  • Data types
  • Classification & regression
  • Collaborative filtering
  • Alternating least squares (ALS)

Hands On:

  • Writing code for 
    • Processing Flume data using Spark streaming
    • Processing network data using Spark streaming
    • Processing Kafka data using Spark streaming 
  • Writing code for SVMs, linear regression & logistic regression
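
The linear-regression exercise uses Spark MLlib; the underlying model can be previewed library-free with closed-form least squares for y = a + b*x over a small in-memory data set. The numbers in the test are made up.

```java
// Simple linear regression: the slope is cov(x, y) / var(x) and the
// intercept is meanY - slope * meanX, minimizing the sum of squared errors.
public class LeastSquares {

    // Returns {intercept a, slope b} for the line y = a + b*x.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0, var = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - meanX) * (y[i] - meanY);
            var += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = cov / var;
        double intercept = meanY - slope * meanX;
        return new double[] { intercept, slope };
    }
}
```

MLlib solves the same objective iteratively (e.g. with gradient descent) so it scales to data partitioned across a cluster.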

Data Loading: 

  • Learn the different data loading options available in Hadoop 
  • Learn about Flume & Sqoop, demonstrating how to bring various kinds of data into HDFS, such as 
    • web server logs 
    • stream data
    • RDBMS data
    • Twitter tweets

Flume and Sqoop 

Kafka

  • Introduction
  • Basic Kafka Concepts
  • Kafka vs Other Messaging Systems
  • Intra-Cluster Replication
  • Inside Look at Kafka’s Components
  • Cluster Administration
  • Using Kafka Connect to Move Data

Hands-On:

  • Using flume to capture and transport 
    • network data
    • web server log data
    • Twitter data
  • Creating a topic & configuring its replication factor & number of partitions
  • Loading data
  • Reading data
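
One reason the number of partitions chosen at topic creation matters: records with the same key always land in the same partition, because the producer picks the partition from a hash of the key. A simplified stand-in, using Java's hashCode() where Kafka's default partitioner actually uses murmur2 on the serialized key:

```java
// Toy keyed-partition assignment: hash the key, mask the sign bit, and
// take it modulo the partition count, so equal keys map to equal partitions.
public class PartitionerSketch {

    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

This is also why increasing the partition count later reshuffles key-to-partition assignments: the modulus changes.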