Pentaho Data Integration
Overview
Pentaho Data Integration is a part of Pentaho Open-Source BI Suite, which includes software of all sort to support the business decision making. The objective of this session:-
- Getting Started: PDI
- Getting Started: Transformations
- Manipulating Real World Data
- Filtering Searching & Performing other useful operation with Data
- Transforming the Row set
- Performing Advanced Operations with Databases
- Creating Basic Task Flows
- Real-Time Data Integration
- Performance Tuning
- Parallelization, Clustering, and Partitioning
5 Days
Pre-Requisites
Participants must have knowledge of ETL concepts
Course Outline
- Pentaho Data Integration & Pentaho BI Suite
- Using PDI in real-world scenarios
- Loading data warehouses or data marts
- Integrating data
- Data Cleansing
- Migrating information
- Exporting data
- Integrating PDI along with other Pentaho tools
Demo – Installing PDI
Demo – starting & customizing Spoon
- Setting preferences in the Options window
- Storing transformations and jobs in a repository
Demo – creating a hello world transformation Directing Kettle engine with transformations
- Exploring the Spoon interface
- Designing a transformation
- Running and previewing the transformation
Demo – installing MySQL on Windows
- Designing and previewing transformations
- Demo – creating a simple transformation and getting familiar with the design process
- Understanding the Kettle row set
- Seeing the results in the Execution-Results-pane
- Demo – generating a range of dates and inspecting the data as it is being created
- Adding/modifying fields using various PDI steps
- Demo – avoiding errors while converting the estimated time from string to integer
- The error handling functionality
- Demo – configuring the error handling to see the description of the errors
- Personalizing the error handling
Reading data from files
Demo – reading results of football matches from files
- Reading several files at once
Demo – reading all your files at a time using a single text file input step
- regular expressions
- Troubleshooting reading files
- Sending data to files
Demo – sending the results of matches to a plain file
- Getting system information
Demo – reading and writing matches files with flexibility
- Running transformations from a terminal window
Demo – running the matches transformation from a terminal window
Demo – getting data from an XML file with information about countries
- PDI transformation files
- Getting data from XML files
- Kettle variables
Sorting data
Demo – sorting information about matches with the Sort rows step
- Calculations on groups of rows
Demo – calculating football match statistics by grouping data Group by Step
- Numeric fields
- Filtering
Demo – counting frequent words by filtering
Demo – refining the counting task by filtering even more
- Filtering rows using the Filter rows step
- Looking up data
Demo – finding out which language people speak
- The Stream lookup step
- Data cleaning
Demo – fixing words before counting them
- Cleansing data with PDI
- Controlling the Flow of Data
- Splitting streams
Demo – browsing new features of PDI by copying a dataset
- Copying rows
- Distributing rows
Demo – assigning tasks by distributing
- Splitting the stream based on conditions
Demo – assigning tasks by filtering priorities with the Filter rows step
- PDI steps – splitting the stream-based on conditions
Demo – assigning tasks by filtering priorities with the Switch/Case step
- Merging streams
Demo – gathering progress and merging it all together
- PDI options for merging streams
Demo – giving priority to Bouchard by using the Append Stream
- Treating invalid data by splitting and merging streams
Demo – treating errors in the estimated time to avoid discarding rows
- Treating rows with invalid data
- Transforming Your Data by Coding
- Doing simple tasks with the JavaScript
Demo – counting frequent words by coding in JavaScript
- Using the JavaScript language in PDI
- Using transformation predefined constants
- Testing the script using the Test script button
- Reading and parsing unstructured files with JavaScript
Demo – changing a list of house descriptions with JavaScript
- Looping over the dataset rows
- Doing simple tasks with the Java Class
Demo – counting frequent words by coding in Java
- Using the Java language in PDI
- Sending rows to the next step
- Data types equivalence
- Transforming the dataset with Java
Demo – splitting the field to rows using Java
- Avoiding coding by using purpose-built steps
Converting rows to columns
Demo – enhancing the file by converting rows to columns
- Converting row data to column data by using the Row Demoralizer
- Aggregating data with a Row Demoralizer
Demo – aggregating football matches data with the Row Demoralizer
- Using Row Demoralizer for aggregating data
- Normalizing data
Demo – enhancing the matches file by normalizing the dataset
- Modifying the dataset with a Row Normalizer
- Summarizing: PDI steps which operate on sets of rows
- Generate a customized time-dimension dataset using Kettle-variables
Demo – creating the time dimension dataset
Getting variables
Demo – parameterizing the start and end date of the time dimension dataset
- Introducing the Steel Wheels sample database
- Connecting to the Steel Wheels database
- Demo – creating a connection to the Steel Wheels database
- Connecting with Relational Database Management Systems
- Querying a database
Demo – getting data about shipped orders
- Generating data from the database from Table input step
- Generating a new dataset using the SELECT statement
- Making flexible queries using parameters
Demo – getting orders in a range of dates using parameters
- Adding parameters to your queries
- Making flexible queries by using Kettle variables
Demo – getting orders in a range of dates by using Kettle variables
- Using Kettle variables in your queries
- Sending data to a database
Demo – loading a table with a list of manufacturers
- Inserting or updating data by using other PDI
Demo – inserting new products or updating existing ones
Demo – testing the update of existing products
- Eliminating data from a database
Demo – deleting data about discontinued items
Demo – populating the Jigsaw database
- Exploring the Jigsaw database model
- Doing simple lookups
Demo – using a Database lookup step to create a list of products to buy
Performing complex lookups
Demo – using a Database join step to create a list of suggested products to buy
- Joining data to the stream-data by using a Database-join
- Introducing dimensional modeling
- Loading dimensions with data
Demo – loading a region dimension with a Combination lookup/update step
- Describing data with dimensions
- Loading Type I SCD with a Combination lookup/update
- Storing history of changes
Demo – keeping a history of changes in products by using the Dimension lookup/update
- Keeping an entire history of data with a Type II SCD
- Loading Type II SCDs with the Dimension lookup/update step
Introducing the Steel Wheels sample database
Connecting to the Steel Wheels database
Demo – creating a connection to the Steel Wheels database
- Connecting with Relational Database Management Systems
- Querying a database
Demo – getting data about shipped orders
- Generating data from the database from Table input step
- Generating a new dataset using the SELECT statement
- Making flexible queries using parameters
Demo – getting orders in a range of dates using parameters
- Adding parameters to your queries
- Making flexible queries by using Kettle variables
Demo – getting orders in a range of dates by using Kettle variables
- Using Kettle variables in your queries
- Sending data to a database
Demo – loading a table with a list of manufacturers
- Inserting or updating data by using other PDI
Demo – inserting new products or updating existing ones
Demo – testing the update of existing products
- Eliminating data from a database
Demo – deleting data about discontinued items
Demo – populating the Jigsaw database
- Exploring the Jigsaw database model
- Doing simple lookups
Demo – using a Database lookup step to create a list of products to buy
Performing complex lookups
Demo – using a Database join step to create a list of suggested products to buy
- Joining data to the stream-data by using a Database-join
- Introducing dimensional modeling
- Loading dimensions with data
Demo – loading a region dimension with a Combination lookup/update step
- Describing data with dimensions
- Loading Type I SCD with a Combination lookup/update
- Storing history of changes
Demo – keeping a history of changes in products by using the Dimension lookup/update
- Keeping an entire history of data with a Type II SCD
- Loading Type II SCDs with the Dimension lookup/update step
- Introducing PDI jobs
- Demo – creating a folder with a Kettle job
- Executing processes with PDI jobs
- Using Spoon to design and run jobs
- Demo – creating a simple job and getting familiar with the design process
- Changing the flow of execution on the basis of conditions
- Look at the results – Execution results window
- Running transformations from jobs
- Demo – generating a range of dates and inspecting how things are running
- Using the Transformation job entry
- Receiving arguments and parameters in a job
- Demo – generating a hello world file by using arguments and parameters
- Using named parameters in jobs
- Running jobs from a terminal window
- Demo – executing the hello world job from a terminal window
- Demo – calling the hello world transformation with fixed arguments and parameters
- Deciding: use of a command-line argument or a named parameter
- Creating Advanced Transformations and Jobs
- Re-using part of your transformations
- Demo– calculating statistics with the use of a sub-transformation
- Creating a job as a process flow
- Demo – generating top average scores by copying and getting rows
- Use the copy/get rows mechanism to transfer data between transformations
- Demo – generating custom files by executing a transformation for every input row
- Executing for each row
- Enhancing your processes with the use of variables
- Demo – generating custom messages by setting a variable with the name of the examination file
- Case Study: Developing and Implementing a Sample Datamart
- Exploring the sales data mart
- Deciding the level of granularity
- Loading the dimensions
- Demo – loading the dimensions for the sales datamart
- Extending the sales datamart model
- Loading a fact table with aggregated data
- Demo – loading the sales fact table by looking up dimensions
- Get the data from the source – SQL queries
- Translating the business keys into surrogate keys
- Obtaining the surrogate key for Type I SCD
- Obtaining the surrogate key for Type II SCD
- Obtaining the surrogate key for the Junk dimension
- Obtaining the surrogate key for the Time dimension
- Getting facts and dimensions together
- Demo – loading the fact table using a range of dates obtained from the command line
- Demo – loading the SALES star schema model
- Automating the administrative tasks
- Demo – automating the loading of the sales data mart
- Introduction to Real-Time ETL
- Real-Time Challenges
- Requirements
- Transformation Streaming
- A Practical Example of Transformation Streaming
- Debugging
- Third-Party Software and Real-Time Integration
- Java Message Service
- Creating a JMS Connection and Session
- Transformation Performance: Finding the Weakest Link
- Finding Bottlenecks by Simplifying
- Finding Bottlenecks by Measuring
- copying Rows of Data
- Improving Transformation Performance
- Using Lazy Conversion for Reading Text Files
- Single-File Parallel Reading, Multi-File Parallel Reading
- Configuring the NIO Block Size
- Changing Disks and Reading Text Files
- Improving Performance in Writing Text Files
- Using Lazy Conversion for Writing Text Files
- Parallel Files Writing
- Improving Database Performance
- Avoiding Dynamic SQL, Handling Roundtrips
- Handling Relational Databases
- Sorting Data, Sorting on the Database, Sorting in Parallel
- Reducing CPU Usage
- Optimizing the Use of JavaScript
- Launching Multiple Copies of a Step
- Selecting and Removing Values
- Managing Thread Priorities
- Adding Static Data to Rows of Data
- Limiting the Number of Step Copies
- Avoiding Excessive Logging
- Improving Job Performance
- Loops in Jobs
- Database Connection Pools
- Multi-Threading
- Row Distribution, Row Merging, and Row Redistribution
- Data Pipelining
- Consequences of Multi-Threading
- Database Connections
- Order of Execution, Parallel Execution in a Job
- Using Carte as a Slave Server
- The Configuration File
- Defining Slave Servers
- Remote Execution, Monitoring Slave Servers
- Carte Security, Services
- Clustering Transformations
- Defining a Cluster Schema, Designing Clustered Transformations
- Execution and Monitoring
- Metadata Transformations
- Rules
- Data Pipelining, Partitioning
- Defining a Partitioning Schema
- Objectives of Partitioning, Implementing Partitioning
- Internal Variables
- Database Partitions
- Partitioning in a Clustered Transformation
- Working with Repositories
- Pan/Kitchen – Launching Transformations and Jobs from the Command-Line
