Pentaho Data Integration

Overview

Pentaho Data Integration is a part of Pentaho Open-Source BI Suite, which includes software of all sort to support the business decision making. The objective of this session:
    • Getting Started: PDI
    • Getting Started: Transformations
    • Manipulating Real World Data
    • Filtering Searching & Performing other useful operation with Data
    • Transforming the Row set
    • Performing Advanced Operations with Databases
    • Creating Basic Task Flows
    • Real-Time Data Integration
    • Performance Tuning
    • Parallelization, Clustering, and Partitioning
Duration
5 Days

Pre-Requisites
Participants must have knowledge of ETL concepts

Course Outline

  • Pentaho Data Integration & Pentaho BI Suite  
  • Using PDI in real-world scenarios
  • Loading data warehouses or data marts
  • Integrating data
  • Data Cleansing
  • Migrating information
  • Exporting data
  • Integrating PDI along with other Pentaho tools

Demo – Installing PDI

Demo – starting & customizing Spoon

  • Setting preferences in the Options window
  • Storing transformations and jobs in a repository  

Demo  – creating a hello world transformation  Directing Kettle engine with transformations

  • Exploring the Spoon interface
  • Designing a transformation
  • Running and previewing the transformation

Demo – installing MySQL on Windows

  • Designing and previewing transformations
  • Demo – creating a simple transformation and getting familiar with the design process
  • Understanding the Kettle row set
  • Seeing the results in the Execution-Results-pane
  • Demo – generating a range of dates and inspecting the data as it is being created
  • Adding/modifying fields using various PDI steps
  • Demo – avoiding errors while converting the estimated time from string to integer
  • The error handling functionality
  • Demo – configuring the error handling to see the description of the errors
  • Personalizing the error handling

Reading data from files

Demo – reading results of football matches from files

  • Reading several files at once

Demo – reading all your files at a time using a single text file input step  

  • regular expressions
  • Troubleshooting reading files
  • Sending data to files

Demo – sending the results of matches to a plain file

  • Getting system information

Demo – reading and writing matches files with flexibility

  •  Running transformations from a terminal window

Demo – running the matches transformation from a terminal window

Demo – getting data from an XML file with information about countries

  • PDI transformation files
  • Getting data from XML files
  • Kettle variables

Sorting data

Demo – sorting information about matches with the Sort rows step

  • Calculations on groups of rows

Demo – calculating football match statistics by grouping data Group by Step

  • Numeric fields
  • Filtering

Demo – counting frequent words by filtering

Demo – refining the counting task by filtering even more

  • Filtering rows using the Filter rows step
  • Looking up data

Demo – finding out which language people speak

  • The Stream lookup step
  • Data cleaning

Demo – fixing words before counting them

  • Cleansing data with PDI
  • Controlling the Flow of Data
  • Splitting streams

Demo – browsing new features of PDI by copying a dataset

  • Copying rows
  • Distributing rows

Demo – assigning tasks by distributing

  • Splitting the stream based on conditions

Demo – assigning tasks by filtering priorities with the Filter rows step

  • PDI steps – splitting the stream-based on conditions

Demo – assigning tasks by filtering priorities with the Switch/Case step

  • Merging streams

Demo – gathering progress and merging it all together

  • PDI options for merging streams

Demo – giving priority to Bouchard by using the Append Stream

  • Treating invalid data by splitting and merging streams

Demo – treating errors in the estimated time to avoid discarding rows

  • Treating rows with invalid data
  • Transforming Your Data by Coding
  • Doing simple tasks with the JavaScript

Demo – counting frequent words by coding in JavaScript

  • Using the JavaScript language in PDI
  • Using transformation predefined constants
  • Testing the script using the Test script button
  • Reading and parsing unstructured files with JavaScript

Demo – changing a list of house descriptions with JavaScript

  • Looping over the dataset rows
  • Doing simple tasks with the Java Class

Demo – counting frequent words by coding in Java

  • Using the Java language in PDI
  • Sending rows to the next step
  • Data types equivalence
  • Transforming the dataset with Java

Demo – splitting the field to rows using Java

    • Avoiding coding by using purpose-built steps

Converting rows to columns

Demo – enhancing the file by converting rows to columns

    • Converting row data to column data by using the Row Demoralizer
    • Aggregating data with a Row Demoralizer

Demo – aggregating football matches data with the Row Demoralizer

  • Using Row Demoralizer for aggregating data
  • Normalizing data

Demo – enhancing the matches file by normalizing the dataset

  • Modifying the dataset with a Row Normalizer
  • Summarizing: PDI steps which operate on sets of rows
  • Generate a customized time-dimension dataset using Kettle-variables

Demo – creating the time dimension dataset

Getting variables

Demo – parameterizing the start and end date of the time dimension dataset

  • Introducing the Steel Wheels sample database
  • Connecting to the Steel Wheels database
  • Demo – creating a connection to the Steel Wheels database
    • Connecting with Relational Database Management Systems
    • Querying a database

Demo – getting data about shipped orders

  • Generating data from the database from Table input step
  • Generating a new dataset using the SELECT statement  
  • Making flexible queries using parameters

Demo – getting orders in a range of dates using parameters

  • Adding parameters to your queries
  • Making flexible queries by using Kettle variables

Demo – getting orders in a range of dates by using Kettle variables

  • Using Kettle variables in your queries
  • Sending data to a database

Demo – loading a table with a list of manufacturers

  • Inserting or updating data by using other PDI

Demo – inserting new products or updating existing ones

Demo – testing the update of existing products

  • Eliminating data from a database

Demo – deleting data about discontinued items

Demo – populating the Jigsaw database

    • Exploring the Jigsaw database model
    • Doing simple lookups

Demo – using a Database lookup step to create a list of products to buy

Performing complex lookups

Demo – using a Database join step to create a list of suggested products to buy

  • Joining data to the stream-data by using a Database-join
  • Introducing dimensional modeling
  • Loading dimensions with data

Demo – loading a region dimension with a Combination lookup/update step

  • Describing data with dimensions
  • Loading Type I SCD with a Combination lookup/update
  • Storing history of changes

Demo – keeping a history of changes in products by using the Dimension lookup/update

  • Keeping an entire history of data with a Type II SCD
  • Loading Type II SCDs with the Dimension lookup/update step

Introducing the Steel Wheels sample database

Connecting to the Steel Wheels database

Demo – creating a connection to the Steel Wheels database

    • Connecting with Relational Database Management Systems
    • Querying a database

Demo – getting data about shipped orders

  • Generating data from the database from Table input step
  • Generating a new dataset using the SELECT statement  
  • Making flexible queries using parameters

Demo – getting orders in a range of dates using parameters

  • Adding parameters to your queries
  • Making flexible queries by using Kettle variables

Demo – getting orders in a range of dates by using Kettle variables

  • Using Kettle variables in your queries
  • Sending data to a database

Demo – loading a table with a list of manufacturers

  • Inserting or updating data by using other PDI

Demo – inserting new products or updating existing ones

Demo – testing the update of existing products

  • Eliminating data from a database

Demo – deleting data about discontinued items

Demo – populating the Jigsaw database

    • Exploring the Jigsaw database model
    • Doing simple lookups

Demo – using a Database lookup step to create a list of products to buy

Performing complex lookups

Demo – using a Database join step to create a list of suggested products to buy

  • Joining data to the stream-data by using a Database-join
  • Introducing dimensional modeling
  • Loading dimensions with data

Demo – loading a region dimension with a Combination lookup/update step

  • Describing data with dimensions
  • Loading Type I SCD with a Combination lookup/update
  • Storing history of changes

Demo – keeping a history of changes in products by using the Dimension lookup/update

  • Keeping an entire history of data with a Type II SCD
  • Loading Type II SCDs with the Dimension lookup/update step
  • Introducing PDI jobs
  • Demo – creating a folder with a Kettle job
    • Executing processes with PDI jobs
    • Using Spoon to design and run jobs
  • Demo – creating a simple job and getting familiar with the design process
    • Changing the flow of execution on the basis of conditions
    • Look at the results – Execution results window
    • Running transformations from jobs
  • Demo – generating a range of dates and inspecting how things are running
    • Using the Transformation job entry
    • Receiving arguments and parameters in a job
  • Demo – generating a hello world file by using arguments and parameters
    • Using named parameters in jobs
    • Running jobs from a terminal window
  • Demo – executing the hello world job from a terminal window
  • Demo – calling the hello world transformation with fixed arguments and parameters
    • Deciding: use of a command-line argument or a named parameter
    • Creating Advanced Transformations and Jobs
    • Re-using part of your transformations
  • Demo– calculating statistics with the use of a sub-transformation
    • Creating a job as a process flow
  • Demo – generating top average scores by copying and getting rows
    • Use the copy/get rows mechanism to transfer data between transformations  
  • Demo – generating custom files by executing a transformation for  every input row
    • Executing for each row
    • Enhancing your processes with the use of variables
  • Demo – generating custom messages by setting a variable with the  name of the examination file
    • Case Study: Developing and Implementing a Sample Datamart
    • Exploring the sales data mart
    • Deciding the level of granularity
    • Loading the dimensions
  • Demo – loading the dimensions for the sales datamart
    • Extending the sales datamart model
    • Loading a fact table with aggregated data
  • Demo – loading the sales fact table by looking up dimensions
    • Get the data from the source – SQL queries
    • Translating the business keys into surrogate keys
    • Obtaining the surrogate key for Type I SCD
    • Obtaining the surrogate key for Type II SCD
    • Obtaining the surrogate key for the Junk dimension
    • Obtaining the surrogate key for the Time dimension
    • Getting facts and dimensions together
  • Demo – loading the fact table using a range of dates obtained from the command line
  • Demo – loading the SALES star schema model
    • Automating the administrative tasks
  • Demo – automating the loading of the sales data mart
  • Introduction to Real-Time ETL
  • Real-Time Challenges
  • Requirements
  • Transformation Streaming
  • A Practical Example of Transformation Streaming
  • Debugging
  • Third-Party Software and Real-Time Integration
  • Java Message Service
  • Creating a JMS Connection and Session
  • Transformation Performance: Finding the Weakest Link
  • Finding Bottlenecks by Simplifying
  • Finding Bottlenecks by Measuring
  • copying Rows of Data
  • Improving Transformation Performance
  • Using Lazy Conversion for Reading Text Files
  • Single-File Parallel Reading, Multi-File Parallel Reading
  • Configuring the NIO Block Size
  • Changing Disks and Reading Text Files
  • Improving Performance in Writing Text Files
  • Using Lazy Conversion for Writing Text Files
  • Parallel Files Writing
  • Improving Database Performance
  • Avoiding Dynamic SQL, Handling Roundtrips
  • Handling Relational Databases
  • Sorting Data, Sorting on the Database, Sorting in Parallel
  • Reducing CPU Usage
  • Optimizing the Use of JavaScript
  • Launching Multiple Copies of a Step
  • Selecting and Removing Values
  • Managing Thread Priorities
  • Adding Static Data to Rows of Data
  • Limiting the Number of Step Copies
  • Avoiding Excessive Logging
  • Improving Job Performance
  • Loops in Jobs
  • Database Connection Pools
  • Multi-Threading
  • Row Distribution, Row Merging, and Row Redistribution
  • Data Pipelining
  • Consequences of Multi-Threading
  • Database Connections
  • Order of Execution, Parallel Execution in a Job
  • Using Carte as a Slave Server
  • The Configuration File
  • Defining Slave Servers
  • Remote Execution, Monitoring Slave Servers
  • Carte Security, Services
  • Clustering Transformations
  • Defining a Cluster Schema, Designing Clustered Transformations
  • Execution and Monitoring
  • Metadata Transformations
  • Rules
  • Data Pipelining, Partitioning
  • Defining a Partitioning Schema
  • Objectives of Partitioning, Implementing Partitioning
  • Internal Variables
  • Database Partitions
  • Partitioning in a Clustered Transformation
  • Working with Repositories
  • Pan/Kitchen – Launching Transformations and Jobs from the Command-Line