Pentaho Data Integration

Overview

Pentaho Data Integration is a part of Pentaho Open-Source BI Suite, which includes software of all sort to support the business decision making. The objective of this session:

- Getting Started: PDI
- Getting Started: Transformations
- Manipulating Real World Data
- Filtering Searching & Performing other useful operation with Data
- Transforming the Row set
- Performing Advanced Operations with Databases
- Creating Basic Task Flows
- Real-Time Data Integration
- Performance Tuning
- Parallelization, Clustering, and Partitioning

Duration
5 Days

Pre-Requisites
Participants must have knowledge of ETL concepts

Course Outline

Getting Started with PDI

Pentaho Data Integration & Pentaho BI Suite
Using PDI in real-world scenarios
Loading data warehouses or data marts
Integrating data
Data Cleansing
Migrating information
Exporting data
Integrating PDI along with other Pentaho tools

Demo – Installing PDI

Demo – starting & customizing Spoon

Setting preferences in the Options window
Storing transformations and jobs in a repository

Demo – creating a hello world transformation Directing Kettle engine with transformations

Exploring the Spoon interface
Designing a transformation
Running and previewing the transformation

Demo – installing MySQL on Windows

Getting Started with Transformations

Designing and previewing transformations
Demo – creating a simple transformation and getting familiar with the design process
Understanding the Kettle row set
Seeing the results in the Execution-Results-pane
Demo – generating a range of dates and inspecting the data as it is being created
Adding/modifying fields using various PDI steps
Demo – avoiding errors while converting the estimated time from string to integer
The error handling functionality
Demo – configuring the error handling to see the description of the errors
Personalizing the error handling

Manipulating Real-world Data

Reading data from files

Demo – reading results of football matches from files

Reading several files at once

Demo – reading all your files at a time using a single text file input step

regular expressions
Troubleshooting reading files
Sending data to files

Demo – sending the results of matches to a plain file

Getting system information

Demo – reading and writing matches files with flexibility

Running transformations from a terminal window

Demo – running the matches transformation from a terminal window

Demo – getting data from an XML file with information about countries

PDI transformation files
Getting data from XML files
Kettle variables

Filtering/Searching/Performing related Useful Operations with Data

Sorting data

Demo – sorting information about matches with the Sort rows step

Calculations on groups of rows

Demo – calculating football match statistics by grouping data Group by Step

Numeric fields
Filtering

Demo – counting frequent words by filtering

Demo – refining the counting task by filtering even more

Filtering rows using the Filter rows step
Looking up data

Demo – finding out which language people speak

The Stream lookup step
Data cleaning

Demo – fixing words before counting them

Cleansing data with PDI
Controlling the Flow of Data
Splitting streams

Demo – browsing new features of PDI by copying a dataset

Copying rows
Distributing rows

Demo – assigning tasks by distributing

Splitting the stream based on conditions

Demo – assigning tasks by filtering priorities with the Filter rows step

PDI steps – splitting the stream-based on conditions

Demo – assigning tasks by filtering priorities with the Switch/Case step

Merging streams

Demo – gathering progress and merging it all together

PDI options for merging streams

Demo – giving priority to Bouchard by using the Append Stream

Treating invalid data by splitting and merging streams

Demo – treating errors in the estimated time to avoid discarding rows

Treating rows with invalid data
Transforming Your Data by Coding
Doing simple tasks with the JavaScript

Demo – counting frequent words by coding in JavaScript

Using the JavaScript language in PDI
Using transformation predefined constants
Testing the script using the Test script button
Reading and parsing unstructured files with JavaScript

Demo – changing a list of house descriptions with JavaScript

Looping over the dataset rows
Doing simple tasks with the Java Class

Demo – counting frequent words by coding in Java

Using the Java language in PDI
Sending rows to the next step
Data types equivalence
Transforming the dataset with Java

Demo – splitting the field to rows using Java

- Avoiding coding by using purpose-built steps

Transforming the Row set

Converting rows to columns

Demo – enhancing the file by converting rows to columns

- Converting row data to column data by using the Row Demoralizer
- Aggregating data with a Row Demoralizer

Demo – aggregating football matches data with the Row Demoralizer

Using Row Demoralizer for aggregating data
Normalizing data

Demo – enhancing the matches file by normalizing the dataset

Modifying the dataset with a Row Normalizer
Summarizing: PDI steps which operate on sets of rows
Generate a customized time-dimension dataset using Kettle-variables

Demo – creating the time dimension dataset

Getting variables

Demo – parameterizing the start and end date of the time dimension dataset

Working with Databases

Introducing the Steel Wheels sample database
Connecting to the Steel Wheels database
Demo – creating a connection to the Steel Wheels database
- Connecting with Relational Database Management Systems
- Querying a database

Demo – getting data about shipped orders

Generating data from the database from Table input step
Generating a new dataset using the SELECT statement
Making flexible queries using parameters

Demo – getting orders in a range of dates using parameters

Adding parameters to your queries
Making flexible queries by using Kettle variables

Demo – getting orders in a range of dates by using Kettle variables

Using Kettle variables in your queries
Sending data to a database

Demo – loading a table with a list of manufacturers

Inserting or updating data by using other PDI

Demo – inserting new products or updating existing ones

Demo – testing the update of existing products

Eliminating data from a database

Demo – deleting data about discontinued items

Performing Advanced Operations with Databases

Demo – populating the Jigsaw database

- Exploring the Jigsaw database model
- Doing simple lookups

Demo – using a Database lookup step to create a list of products to buy

Performing complex lookups

Demo – using a Database join step to create a list of suggested products to buy

Joining data to the stream-data by using a Database-join
Introducing dimensional modeling
Loading dimensions with data

Demo – loading a region dimension with a Combination lookup/update step

Describing data with dimensions
Loading Type I SCD with a Combination lookup/update
Storing history of changes

Demo – keeping a history of changes in products by using the Dimension lookup/update

Keeping an entire history of data with a Type II SCD
Loading Type II SCDs with the Dimension lookup/update step

Working with Databases

Introducing the Steel Wheels sample database

Connecting to the Steel Wheels database

Demo – creating a connection to the Steel Wheels database

- Connecting with Relational Database Management Systems
- Querying a database