Hadoop, Spark and Scala

Overview

Apache Spark is a general-purpose, distributed cluster computing framework, while Scala is the programming language in which Spark is written. Participants will:
    • Learn about Hadoop and traditional processing models
    • Understand HDFS Architecture
    • Understand MapReduce
    • Learn about Impala and Hive
    • Understand RDD lineage
    • Understand Pig
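As a taste of the functional style used throughout the Spark portion of the course, here is a minimal sketch of the classic word count in plain Scala. It mirrors the Spark RDD version (which chains `flatMap`, `map`, and `reduceByKey`) using only standard collections, so no Spark cluster is assumed:

```scala
object WordCountSketch {
  // Plain-Scala analogue of the Spark RDD word count:
  //   sc.textFile(...).flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))        // split each line into words
      .filter(_.nonEmpty)              // drop empty tokens
      .groupBy(identity)               // group equal words together
      .map { case (w, ws) => (w, ws.size) } // count each group

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("spark scala spark", "hadoop spark"))
    println(counts("spark")) // 3
  }
}
```

The same transformation chain, applied to an RDD instead of a `Seq`, runs in parallel across a cluster.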
Duration
3 Days

Pre-Requisites
  • Participants must have familiarity with Java
  • Participants must have an intermediate level of exposure to data analytics
Course Outline

    • Traditional models
    • Problems with Traditional Large-Scale Systems
    • Understanding of Hadoop
    • Hadoop Ecosystem
    • Distributed Processing: On a Cluster
    • Storage: HDFS-Architecture
    • Storage: Using-HDFS
    • Resource-Management: YARN
    • Resource-Management: YARN-Architecture
    • Resource-Management: Using YARN
    • MapReduce
    • Characteristics of MapReduce
    • Advanced MapReduce
    • Sqoop Overview
    • Basic Imports & Exports in Sqoop
    • Improving Sqoop's Performance
    • Limitations of Sqoop and Sqoop 2
    • Introducing Impala/Hive
    • Importance of Impala/Hive
    • Difference: Impala/Hive
    • How Impala & Hive Work
    • Hive & Traditional-Database
    • Understanding Meta-store
    • Creating: Databases & Tables in Hive & Impala
    • Loading Data into Tables of Hive & Impala
    • Understanding HCatalog
    • Impala: cluster
    • Various File Formats
    • Hadoop Tool Support: File Formats
    • Understanding Avro Schemas
    • Understanding Avro with Hive/Sqoop
    • Evolution: Avro Schema
    • Overview: Data File Partitioning
    • Partitioning: Impala/Hive
    • Using Partitions
    • Bucketing: Hive
    • Advanced Concepts: Hive
    • Understanding Apache Flume
    • Basic: Flume Architecture
    • Understanding Flume-Sources
    • Understanding Flume-Sinks
    • Understanding Flume-Channels
    • Configuration of Flume
    • Understanding HBase
    • Architecture: HBase
    • Data storage: HBase
    • Comparing HBase & RDBMS
    • Using HBase
    • Understanding Pig
    • Components: Pig
    • Comparing Pig & SQL
    • Using Pig
    • Understanding Apache Spark
    • What is Spark Shell
    • Understanding RDDs (Resilient Distributed Datasets)
    • Functional Programming: Spark
    • Exploring RDD
    • Key-Value Pair RDDs
    • Other Pair RDD Operations
    • Comparing Spark Applications/Spark Shell
    • Building a Spark Context
    • Creating: Spark-Application (Scala and Java)
    • Spark on YARN: Client-Mode
    • Spark on YARN: Cluster-Mode
    • Dynamic-Resource-Allocation
    • Configuration: Spark-Properties
    • Spark: Cluster
    • Understanding RDD-Partitions
    • Partitioning: File-based RDDs
    • HDFS & Data-Locality
    • Parallel Operations: Partitions
    • Understanding Stages & Tasks
    • Controlling Levels of Parallelism
    • Understanding RDD Lineage
    • Overview of Caching
    • Distributed-Persistence
    • Storage Levels of RDD Persistence
    • Choosing the Correct RDD Persistence Storage Level
    • RDD: Fault tolerance
    • Use Cases: Spark
    • Iterative Algorithms: Spark
    • Understanding Machine Learning
    • Graph Processing & Analysis
    • Example k-means
    • Spark SQL & the SQL Context
    • Creation of Data-Frames
    • Transforming/Querying of Data-Frames
    • Impala vs. Spark SQL
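The "Example k-means" topic above can be previewed with a short sketch: one Lloyd iteration over 1-D points in plain Scala (the data and initial centroids here are invented for illustration; Spark's distributed version performs the same assign-and-average step across RDD partitions):

```scala
object KMeansSketch {
  // One Lloyd iteration for 1-D points: assign each point to its
  // nearest centroid, then recompute each centroid as the mean of
  // its assigned points (unchanged if no points were assigned).
  def step(points: Seq[Double], centroids: Seq[Double]): Seq[Double] = {
    val assigned = points.groupBy(p => centroids.minBy(c => math.abs(c - p)))
    centroids.map(c => assigned.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
  }

  def main(args: Array[String]): Unit = {
    val pts = Seq(1.0, 2.0, 10.0, 12.0)
    println(step(pts, Seq(0.0, 9.0))) // List(1.5, 11.0)
  }
}
```

Iterating `step` until the centroids stop moving yields the full algorithm; this iterative, cache-friendly pattern is exactly why the course covers RDD persistence alongside it.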