Description

Book Synopsis
A complete guide to Pentaho Kettle, the Pentaho Data lntegration toolset for ETL

This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutionsbefore progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.

  • Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
  • Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
  • Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data wareho

    Table of Contents

    Introduction xxxi

    Part I Getting Started 1

    Chapter 1 ETL Primer 3

    OLTP versus Data Warehousing 3

    What Is ETL? 5

    The Evolution of ETL Solutions 5

    ETL Building Blocks 7

    ETL, ELT, and EII 8

    ELT 9

    EII: Virtual Data Integration 10

    Data Integration Challenges 11

    Methodology: Agile BI 12

    ETL Design 14

    Data Acquisition 14

    Beware of Spreadsheets 15

    Design for Failure 15

    Change Data Capture 16

    Data Quality 16

    Data Profiling 16

    Data Validation 17

    ETL Tool Requirements 17

    Connectivity 17

    Platform Independence 18

    Scalability 18

    Design Flexibility 19

    Reuse 19

    Extensibility 19

    Data Transformations 20

    Testing and Debugging 21

    Lineage and Impact Analysis 21

    Logging and Auditing 22

    Summary 22

    Chapter 2 Kettle Concepts 23

    Design Principles 23

    The Building Blocks of Kettle Design 25

    Transformations 25

    Steps 26

    Transformation Hops 26

    Parallelism 27

    Rows of Data 27

    Data Conversion 29

    Jobs 30

    Job Entries 31

    Job Hops 31

    Multiple Paths and Backtracking 32

    Parallel Execution 33

    Job Entry Results 34

    Transformation or Job Metadata 36

    Database Connections 37

    Special Options 38

    The Power of the Relational Database 39

    Connections and Transactions 39

    Database Clustering 40

    Tools and Utilities 41

    Repositories 41

    Virtual File Systems 42

    Parameters and Variables 43

    Defining Variables 43

    Named Parameters 44

    Using Variables 44

    Visual Programming 45

    Getting Started 46

    Creating New Steps 47

    Putting It All Together 49

    Summary 51

    Chapter 3 Installation and Configuration 53

    Kettle Software Overview 53

    Integrated Development Environment: Spoon 55

    Command-Line Launchers: Kitchen and Pan 57

    Job Server: Carte 57

    Encr.bat and encr.sh 58

    Installation 58

    Java Environment 58

    Installing Java Manually 58

    Using Your Linux Package Management System 59

    Installing Kettle 59

    Versions and Releases 59

    Archive Names and Formats 60

    Downloading and Uncompressing 60

    Running Kettle Programs 61

    Creating a Shortcut Icon or Launcher for Spoon 62

    Configuration 63

    Configuration Files and the .kettle Directory 63

    The Kettle Shell Scripts 69

    General Structure of the Startup Scripts 70

    Adding an Entry to the Classpath 70

    Changing the Maximum Heap Size 71

    Managing JDBC Drivers 72

    Summary 72

    Chapter 4 An Example ETL Solution—Sakila 73

    Sakila 73

    The Sakila Sample Database 74

    DVD Rental Business Process 74

    Sakila Database Schema Diagram 75

    Sakila Database Subject Areas 75

    General Design Considerations 77

    Installing the Sakila Sample Database 77

    The Rental Star Schema 78

    Rental Star Schema Diagram 78

    Rental Fact Table 79

    Dimension Tables 79

    Keys and Change Data Capture 80

    Installing the Rental Star Schema 81

    Prerequisites and Some Basic Spoon Skills 81

    Setting Up the ETL Solution 82

    Creating Database Accounts 82

    Working with Spoon 82

    Opening Transformation and Job Files 82

    Opening the Step’s Configuration Dialog 83

    Examining Streams 83

    Running Jobs and Transformations 83

    The Sample ETL Solution 84

    Static, Generated Dimensions 84

    Loading the dim_date Dimension Table 84

    Loading the dim_time Dimension Table 86

    Recurring Load 87

    The load_rentals Job 88

    The load_dim_staff Transformation 91

    Database Connections 91

    The load_dim_customer Transformation 95

    The load_dim_store Transformation 98

    The fetch_address Subtransformation 99

    The load_dim_actor Transformation 101

    The load_dim_film Transformation 102

    The load_fact_rental Transformation 107

    Summary 109

    Part II ETL 111

    Chapter 5 ETL Subsystems 113

    Introduction to the 34 Subsystems 114

    Extraction 114

    Subsystems 1–3: Data Profiling, Change Data Capture, and

    Extraction 115

    Cleaning and Conforming Data 116

    Subsystem 4: Data Cleaning and Quality Screen

    Handler System 116

    Subsystem 5: Error Event Handler 117

    Subsystem 6: Audit Dimension Assembler 117

    Subsystem 7: Deduplication System 117

    Subsystem 8: Data Conformer 118

    Data Delivery 118

    Subsystem 9: Slowly Changing Dimension Processor 118

    Subsystem 10: Surrogate Key Creation System 119

    Subsystem 11: Hierarchy Dimension Builder 119

    Subsystem 12: Special Dimension Builder 120

    Subsystem 13: Fact Table Loader 121

    Subsystem 14: Surrogate Key Pipeline 121

    Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121

    Subsystem 16: Late-Arriving Data Handler 122

    Subsystem 17: Dimension Manager System 122

    Subsystem 18: Fact Table Provider System 122

    Subsystem 19: Aggregate Builder 123

    Subsystem 20: Multidimensional (OLAP) Cube Builder 123

    Subsystem 21: Data Integration Manager 123

    Managing the ETL Environment 123

    Summary 126

    Chapter 6 Data Extraction 127

    Kettle Data Extraction Overview 128

    File-Based Extraction 128

    Working with Text Files 128

    Working with XML files 133

    Special File Types 134

    Database-Based Extraction 134

    Web-Based Extraction 137

    Text-Based Web Extraction 137

    HTTP Client 137

    Using SOAP 138

    Stream-Based and Real-Time Extraction 138

    Working with ERP and CRM Systems 138

    ERP Challenges 139

    Kettle ERP Plugins 140

    Working with SAP Data 140

    ERP and CDC Issues 146

    Data Profiling 146

    Using eobjects.org DataCleaner 147

    Adding Profile Tasks 149

    Adding Database Connections 149

    Doing an Initial Profile 151

    Working with Regular Expressions 151

    Profiling and Exploring Results 152

    Validating and Comparing Data 153

    Using a Dictionary for Column Dependency Checks 153

    Alternative Solutions 154

    Text Profiling with Kettle 154

    CDC: Change Data Capture 154

    Source Data–Based CDC 155

    Trigger-Based CDC 157

    Snapshot-Based CDC 158

    Log-Based CDC 162

    Which CDC Alternative Should You Choose? 163

    Delivering Data 164

    Summary 164

    Chapter 7 Cleansing and Conforming 167

    Data Cleansing 168

    Data-Cleansing Steps 169

    Using Reference Tables 172

    Conforming Data Using Lookup Tables 172

    Conforming Data Using Reference Tables 175

    Data Validation 179

    Applying Validation Rules 180

    Validating Dependency Constraints 183

    Error Handling 183

    Handling Process Errors 184

    Transformation Errors 186

    Handling Data (Validation) Errors 187

    Auditing Data and Process Quality 191

    Deduplicating Data 192

    Handling Exact Duplicates 193

    The Problem of Non-Exact Duplicates 194

    Building Deduplication Transforms 195

    Step 1: Fuzzy Match 197

    Step 2: Select Suspects 198

    Step 3: Lookup Validation Value 198

    Step 4: Filter Duplicates 199

    Scripting 200

    Formula 201

    JavaScript 202

    User-Defined Java Expressions 202

    Regular Expressions 203

    Summary 205

    Chapter 8 Handling Dimension Tables 207

    Managing Keys 208

    Managing Business Keys 209

    Keys in the Source System 209

    Keys in the Data Warehouse 209

    Business Keys 209

    Storing Business Keys 210

    Looking Up Keys with Kettle 210

    Generating Surrogate Keys 210

    The “Add sequence” Step 211

    Working with auto_increment or IDENTITY Columns 217

    Keys for Slowly Changing Dimensions 217

    Loading Dimension Tables 218

    Snowflaked Dimension Tables 218

    Top-Down Level-Wise Loading 219

    Sakila Snowflake Example 219

    Sample Transformation 221

    Database Lookup Configuration 222

    Sample Job 225

    Star Schema Dimension Tables 226

    Denormalization 226

    Denormalizing to 1NF with the “Database lookup” Step 226

    Change Data Capture 227

    Slowly Changing Dimensions 228

    Types of Slowly Changing Dimensions 228

    Type 1 Slowly Changing Dimensions 229

    The Insert / Update Step 229

    Type 2 Slowly Changing Dimensions 232

    The “Dimension lookup / update” Step 232

    Other Types of Slowly Changing Dimensions 237

    Type 3 Slowly Changing Dimensions 237

    Hybrid Slowly Changing Dimensions 238

    More Dimensions 239

    Generated Dimensions 239

    Date and Time Dimensions 239

    Generated Mini-Dimensions 239

    Junk Dimensions 241

    Recursive Hierarchies 242

    Summary 243

    Chapter 9 Loading Fact Tables 245

    Loading in Bulk 246

    STDIN and FIFO 247

    Kettle Bulk Loaders 248

    MySQL Bulk Loading 249

    LucidDB Bulk Loader 249

    Oracle Bulk Loader 249

    PostgreSQL Bulk Loader 250

    Table Output Step 250

    General Bulk Load Considerations 250

    Dimension Lookups 251

    Maintaining Referential Integrity 251

    The Surrogate Key Pipeline 252

    Using In-Memory Lookups 253

    Stream Lookups 253

    Late-Arriving Data 255

    Late-Arriving Facts 256

    Late-Arriving Dimensions 256

    Fact Table Handling 260

    Periodic and Accumulating Snapshots 260

    Introducing State-Oriented Fact Tables 261

    Loading Periodic Snapshots 263

    Loading Accumulating Snapshots 264

    Loading State-Oriented Fact Tables 265

    Loading Aggregate Tables 266

    Summary 267

    Chapter 10 Working with OLAP Data 269

    OLAP Benefits and Challenges 270

    OLAP Storage Types 272

    Positioning OLAP 272

    Kettle OLAP Options 273

    Working with Mondrian 274

    Working with XML/A Servers 277

    Working with Palo 282

    Setting Up the Palo Connection 283

    Palo Architecture 284

    Reading Palo Data 285

    Writing Palo Data 289

    Summary 291

    Part III Management and Deployment 293

    Chapter 11 ETL Development Lifecycle 295

    Solution Design 295

    Best and Bad Practices 296

    Data Mapping 297

    Naming and Commentary Conventions 298

    Common Pitfalls 299

    ETL Flow Design 300

    Reusability and Maintainability 300

    Agile Development 301

    Testing and Debugging 306

    Test Activities 307

    ETL Testing 308

    Test Data Requirements 308

    Testing for Completeness 309

    Testing Data Transformations 311

    Test Automation and Continuous Integration 311

    Upgrade Tests 312

    Debugging 312

    Documenting the Solution 315

    Why Isn’t There Any Documentation? 316

    Myth 1: My Software Is Self-Explanatory 316

    Myth 2: Documentation Is Always Outdated 316

    Myth 3: Who Reads Documentation Anyway? 317

    Kettle Documentation Features 317

    Generating Documentation 319

    Summary 320

    Chapter 12 Scheduling and Monitoring 321

    Scheduling 321

    Operating System–Level Scheduling 322

    Executing Kettle Jobs and Transformations from

    the Command Line 322

    UNIX-Based Systems: cron 326

    Windows: The at utility and the Task Scheduler 327

    Using Pentaho’s Built-in Scheduler 327

    Creating an Action Sequence to Run Kettle Jobs and

    Transformations 328

    Kettle Transformations in Action Sequences 329

    Creating and Maintaining Schedules with the

    Administration Console 330

    Attaching an Action Sequence to a Schedule 333

    Monitoring 333

    Logging 333

    Inspecting the Log 333

    Logging Levels 335

    Writing Custom Messages to the Log 336

    E‑mail Notifications 336

    Configuring the Mail Job Entry 337

    Summary 340

    Chapter 13 Versioning and Migration 341

    Version Control Systems 341

    File-Based Version Control Systems 342

    Organization 342

    Leading File-Based VCSs 343

    Content Management Systems 344

    Kettle Metadata 344

    Kettle XML Metadata 345

    Transformation XML 345

    Job XML 346

    Global Replace 347

    Kettle Repository Metadata 348

    The Kettle Database Repository Type 348

    The Kettle File Repository Type 349

    The Kettle Enterprise Repository Type 350

    Managing Repositories 350

    Exporting and Importing Repositories 350

    Upgrading Your Repository 351

    Version Migration System 352

    Managing XML Files 352

    Managing Repositories 352

    Parameterizing Your Solution 353

    Summary 356

    Chapter 14 Lineage and Auditing 357

    Batch-Level Lineage Extraction 358

    Lineage 359

    Lineage Information 359

    Impact Analysis Information 361

    Logging and Operational Metadata 363

    Logging Basics 363

    Logging Architecture 364

    Setting a Maximum Buffer Size 365

    Setting a Maximum Log Line Age 365

    Log Channels 366

    Log Text Capturing in a Job 366

    Logging Tables 367

    Transformation Logging Tables 367

    Job Logging Tables 373

    Summary 374

    Part IV Performance and Scalability 375

    Chapter 15 Performance Tuning 377

    Transformation Performance: Finding the Weakest Link 377

    Finding Bottlenecks by Simplifying 379

    Finding Bottlenecks by Measuring 380

    Copying Rows of Data 382

    Improving Transformation Performance 384

    Improving Performance in Reading Text Files 384

    Using Lazy Conversion for Reading Text Files 385

    Single-File Parallel Reading 385

    Multi-File Parallel Reading 386

    Configuring the NIO Block Size 386

    Changing Disks and Reading Text Files 386

    Improving Performance in Writing Text Files 387

    Using Lazy Conversion for Writing Text Files 387

    Parallel Files Writing 387

    Changing Disks and Writing Text Files 387

    Improving Database Performance 388

    Avoiding Dynamic SQL 388

    Handling Roundtrips 388

    Handling Relational Databases 390

    Sorting Data 392

    Sorting on the Database 393

    Sorting in Parallel 393

    Reducing CPU Usage 394

    Optimizing the Use of JavaScript 394

    Launching Multiple Copies of a Step 396

    Selecting and Removing Values 397

    Managing Thread Priorities 397

    Adding Static Data to Rows of Data 397

    Limiting the Number of Step Copies 398

    Avoiding Excessive Logging 398

    Improving Job Performance 399

    Loops in Jobs 399

    Database Connection Pools 400

    Summary 401

    Chapter 16 Parallelization, Clustering, and Partitioning 403

    Multi-Threading 403

    Row Distribution 404

    Row Merging 405

    Row Redistribution 406

    Data Pipelining 407

    Consequences of Multi-Threading 408

    Database Connections 408

    Order of Execution 409

    Parallel Execution in a Job 411

    Using Carte as a Slave Server 411

    The Configuration File 411

    Defining Slave Servers 412

    Remote Execution 413

    Monitoring Slave Servers 413

    Carte Security 414

    Services 414

    Clustering Transformations 417

    Defining a Cluster Schema 417

    Designing Clustered Transformations 418

    Execution and Monitoring 420

    Metadata Transformations 421

    Rules 422

    Data Pipelining 425

    Partitioning 425

    Defining a Partitioning Schema 425

    Objectives of Partitioning 427

    Implementing Partitioning 428

    Internal Variables 428

    Database Partitions 429

    Partitioning in a Clustered Transformation 430

    Summary 430

    Chapter 17 Dynamic Clustering in the Cloud 433

    Dynamic Clustering 433

    Setting Up a Dynamic Cluster 434

    Using the Dynamic Cluster 436

    Cloud Computing 437

    EC2 438

    Getting Started with EC2 438

    Costs 438

    Customizing an AMI 439

    Packaging a New AMI 442

    Terminating an AMI 442

    Running a Master 442

    Running the Slaves 443

    Using the EC2 Cluster 444

    Monitoring 445

    The Lightweight Principle and Persistence Options 446

    Summary 447

    Chapter 18 Real-Time Data Integration 449

    Introduction to Real-Time ETL 449

    Real-Time Challenges 450

    Requirements 451

    Transformation Streaming 452

    A Practical Example of Transformation Streaming 454

    Debugging 457

    Third-Party Software and Real-Time Integration 458

    Java Message Service 459

    Creating a JMS Connection and Session 459

    Consuming Messages 460

    Producing Messages 460

    Closing Shop 460

    Summary 461

    Part V Advanced Topics 463

    Chapter 19 Data Vault Management 465

    Introduction to Data Vault Modeling 466

    Do You Need a Data Vault? 466

    Data Vault Building Blocks 467

    Hubs 467

    Links 468

    Satellites 469

    Data Vault Characteristics 471

    Building a Data Vault 471

    Transforming Sakila to the Data Vault Model 472

    Sakila Hubs 472

    Sakila Links 473

    Sakila Satellites 474

    Loading the Data Vault: A Sample ETL Solution 477

    Installing the Sakila Data Vault 477

    Setting Up the ETL Solution 477

    Creating a Database Account 477

    The Sample ETL Data Vault Solution 478

    Sample Hub: hub_actor 478

    Sample Link: link_customer_store 480

    Sample Satellite: sat_actor 483

    Loading the Data Vault Tables 485

    Updating a Data Mart from a Data Vault 486

    The Sample ETL Solution 486

    The dim_actor Transformation 486

    The dim_customer Transformation 488

    The dim_film Transformation 492

    The dim_film_actor_bridge Transformation 492

    The fact_rental Transformation 493

    Loading the Star Schema Tables 495

    Summary 495

    Chapter 20 Handling Complex Data Formats 497

    Non-Relational and Non-Tabular Data Formats 498

    Non-Relational Tabular Formats 498

    Handling Multi-Valued Attributes 498

    Using the Split Field to Rows Step 499

    Handling Repeating Groups 500

    Using the Row Normaliser Step 500

    Semi- and Unstructured Data 501

    Kettle Regular Expression Example 503

    Configuring the Regex Evaluation Step 504

    Verifying the Match 507

    Key/Value Pairs 508

    Kettle Key/Value Pairs Example 509

    Text File Input 509

    Regex Evaluation 510

    Grouping Lines into Records 511

    Denormaliser: Turning Rows into Columns 512

    Summary 513

    Chapter 21 Web Services 515

    Web Pages and Web Services 515

    Kettle Web Features 516

    General HTTP Steps 516

    Simple Object Access Protocol 517

    Really Simple Syndication 517

    Apache Virtual File System Integration 517

    Data Formats 517

    XML 518

    Kettle Steps for Working with XML 518

    Kettle Job Entries for XML 519

    HTML 520

    JavaScript Object Notation 520

    Syntax 521

    JSON, Kettle, and ETL/DI 522

    XML Examples 523

    Example XML Document 523

    XML Document Structure 523

    Mapping to the Sakila Sample Database 524

    Extracting Data from XML 525

    Overall Design: The import_xml_into_db Transformation 526

    Using the XSD Validator Step 528

    Using the “Get Data from XML” Step 530

    Generating XML Documents 537

    Overall Design: The export_xml_from_db Transformation 537

    Generating XML with the Add XML Step 538

    Using the XML Join Step 541

    SOAP Examples 544

    Using the “Web services lookup” Step 544

    Configuring the “Web services lookup” Step 544

    Accessing SOAP Services Directly 546

    JSON Example 549

    The Freebase Project 549

    Freebase Versus Wikipedia 549

    Freebase Web Services 550

    The Freebase Read Service 550

    The Metaweb Query Language 551

    Extracting Freebase Data with Kettle 553

    Generate Rows 554

    Issuing a Freebase Read Request 555

    Processing the Freebase Result Envelope 556

    Filtering Out the Original Row 557

    Storing to File 558

    RSS 558

    RSS Structure 558

    Channel 558

    Item 559

    RSS Support in Kettle 560

    RSS Input 561

    RSS Output 562

    Summary 567

    Chapter 22 Kettle Integration 569

    The Kettle API 569

    The LGPL License 569

    The Kettle Java API 570

    Source Code 570

    Building Kettle 571

    Building javadoc 571

    Libraries and the Class Path 571

    Executing Existing Transformations and Jobs 571

    Executing a Transformation 572

    Executing a Job 573

    Embedding Kettle 574

    Pentaho Reporting 574

    Putting Data into a Transformation 576

    Dynamic Transformations 580

    Dynamic Template 583

    Dynamic Jobs 584

    Executing Dynamic ETL in Kettle 586

    Result 587

    Replacing Metadata 588

    Direct Changes with the API 589

    Using a Shared Objects File 589

    OEM Versions and Forks 590

    Creating an OEM Version of PDI 590

    Forking Kettle 591

    Summary 592

    Chapter 23 Extending Kettle 593

    Plugin Architecture Overview 593

    Plugin Types 594

    Architecture 595

    Prerequisites 596

    Kettle API Documentation 596

    Libraries 596

    Integrated Development Environment 596

    Eclipse Project Setup 597

    Examples 598

    Transformation Step Plugins 599

    StepMetaInterface 599

    Value Metadata 605

    Row Metadata 606

    StepDataInterface 607

    StepDialogInterface 607

    Eclipse SWT 607

    Form Layout 607

    Kettle UI Elements 609

    Hello World Example Dialog 609

    StepInterface 614

    Reading Rows from Specific Steps 616

    Writing Rows to Specific Steps 616

    Writing Rows to Error Handling 617

    Identifying a Step Copy 617

    Result Feedback 618

    Variable Substitution 618

    Apache VFS 619

    Step Plugin Deployment 619

    The User-Defined Java Class Step 620

    Passing Metadata 620

    Accessing Input and Fields 620

    Snippets 620

    Example 620

    Job Entry Plugins 621

    JobEntryInterface 622

    JobEntryDialogInterface 624

    Partitioning Method Plugins 624

    Partitioner 625

    Repository Type Plugins 626

    Database Type Plugins 627

    Summary 628

    Appendix A The Kettle Ecosystem 629

    Kettle Development and Versions 629

    The Pentaho Community Wiki 631

    Using the Forums 631

    Jira 632

    ##pentaho 633

    Appendix B Kettle Enterprise Edition Features 635

    Appendix C Built-in Variables and Properties Reference 637

    Internal Variables 637

    Kettle Variables 640

    Variables for Configuring VFS 641

    Noteworthy JRE Variables 642

    Index 643

Pentaho Kettle Solutions

    Product form

    £30.39

    Includes FREE delivery

    RRP £37.99 – you save £7.60 (20%)

    Order before 4pm today for delivery by Fri 3 Jul 2026.

    A Paperback / softback by Matt Casters, Roland Bouman, Jos van Dongen

      Trusted by thousands of customers. See 2,385+ Customer Reviews

      View other formats and editions of Pentaho Kettle Solutions by Matt Casters

      Publisher: John Wiley & Sons Inc
      Publication Date: 24/09/2010
      ISBN13: 9780470635179, 978-0470635179
      ISBN10: 0470635177

      Description

      Book Synopsis
      A complete guide to Pentaho Kettle, the Pentaho Data lntegration toolset for ETL

      This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutionsbefore progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.

      • Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
      • Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
      • Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data wareho

        Table of Contents

        Introduction xxxi

        Part I Getting Started 1

        Chapter 1 ETL Primer 3

        OLTP versus Data Warehousing 3

        What Is ETL? 5

        The Evolution of ETL Solutions 5

        ETL Building Blocks 7

        ETL, ELT, and EII 8

        ELT 9

        EII: Virtual Data Integration 10

        Data Integration Challenges 11

        Methodology: Agile BI 12

        ETL Design 14

        Data Acquisition 14

        Beware of Spreadsheets 15

        Design for Failure 15

        Change Data Capture 16

        Data Quality 16

        Data Profiling 16

        Data Validation 17

        ETL Tool Requirements 17

        Connectivity 17

        Platform Independence 18

        Scalability 18

        Design Flexibility 19

        Reuse 19

        Extensibility 19

        Data Transformations 20

        Testing and Debugging 21

        Lineage and Impact Analysis 21

        Logging and Auditing 22

        Summary 22

        Chapter 2 Kettle Concepts 23

        Design Principles 23

        The Building Blocks of Kettle Design 25

        Transformations 25

        Steps 26

        Transformation Hops 26

        Parallelism 27

        Rows of Data 27

        Data Conversion 29

        Jobs 30

        Job Entries 31

        Job Hops 31

        Multiple Paths and Backtracking 32

        Parallel Execution 33

        Job Entry Results 34

        Transformation or Job Metadata 36

        Database Connections 37

        Special Options 38

        The Power of the Relational Database 39

        Connections and Transactions 39

        Database Clustering 40

        Tools and Utilities 41

        Repositories 41

        Virtual File Systems 42

        Parameters and Variables 43

        Defining Variables 43

        Named Parameters 44

        Using Variables 44

        Visual Programming 45

        Getting Started 46

        Creating New Steps 47

        Putting It All Together 49

        Summary 51

        Chapter 3 Installation and Configuration 53

        Kettle Software Overview 53

        Integrated Development Environment: Spoon 55

        Command-Line Launchers: Kitchen and Pan 57

        Job Server: Carte 57

        Encr.bat and encr.sh 58

        Installation 58

        Java Environment 58

        Installing Java Manually 58

        Using Your Linux Package Management System 59

        Installing Kettle 59

        Versions and Releases 59

        Archive Names and Formats 60

        Downloading and Uncompressing 60

        Running Kettle Programs 61

        Creating a Shortcut Icon or Launcher for Spoon 62

        Configuration 63

        Configuration Files and the .kettle Directory 63

        The Kettle Shell Scripts 69

        General Structure of the Startup Scripts 70

        Adding an Entry to the Classpath 70

        Changing the Maximum Heap Size 71

        Managing JDBC Drivers 72

        Summary 72

        Chapter 4 An Example ETL Solution—Sakila 73

        Sakila 73

        The Sakila Sample Database 74

        DVD Rental Business Process 74

        Sakila Database Schema Diagram 75

        Sakila Database Subject Areas 75

        General Design Considerations 77

        Installing the Sakila Sample Database 77

        The Rental Star Schema 78

        Rental Star Schema Diagram 78

        Rental Fact Table 79

        Dimension Tables 79

        Keys and Change Data Capture 80

        Installing the Rental Star Schema 81

        Prerequisites and Some Basic Spoon Skills 81

        Setting Up the ETL Solution 82

        Creating Database Accounts 82

        Working with Spoon 82

        Opening Transformation and Job Files 82

        Opening the Step’s Configuration Dialog 83

        Examining Streams 83

        Running Jobs and Transformations 83

        The Sample ETL Solution 84

        Static, Generated Dimensions 84

        Loading the dim_date Dimension Table 84

        Loading the dim_time Dimension Table 86

        Recurring Load 87

        The load_rentals Job 88

        The load_dim_staff Transformation 91

        Database Connections 91

        The load_dim_customer Transformation 95

        The load_dim_store Transformation 98

        The fetch_address Subtransformation 99

        The load_dim_actor Transformation 101

        The load_dim_film Transformation 102

        The load_fact_rental Transformation 107

        Summary 109

        Part II ETL 111

        Chapter 5 ETL Subsystems 113

        Introduction to the 34 Subsystems 114

        Extraction 114

        Subsystems 1–3: Data Profiling, Change Data Capture, and

        Extraction 115

        Cleaning and Conforming Data 116

        Subsystem 4: Data Cleaning and Quality Screen

        Handler System 116

        Subsystem 5: Error Event Handler 117

        Subsystem 6: Audit Dimension Assembler 117

        Subsystem 7: Deduplication System 117

        Subsystem 8: Data Conformer 118

        Data Delivery 118

        Subsystem 9: Slowly Changing Dimension Processor 118

        Subsystem 10: Surrogate Key Creation System 119

        Subsystem 11: Hierarchy Dimension Builder 119

        Subsystem 12: Special Dimension Builder 120

        Subsystem 13: Fact Table Loader 121

        Subsystem 14: Surrogate Key Pipeline 121

        Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121

        Subsystem 16: Late-Arriving Data Handler 122

        Subsystem 17: Dimension Manager System 122

        Subsystem 18: Fact Table Provider System 122

        Subsystem 19: Aggregate Builder 123

        Subsystem 20: Multidimensional (OLAP) Cube Builder 123

        Subsystem 21: Data Integration Manager 123

        Managing the ETL Environment 123

        Summary 126

        Chapter 6 Data Extraction 127

        Kettle Data Extraction Overview 128

        File-Based Extraction 128

        Working with Text Files 128

        Working with XML files 133

        Special File Types 134

        Database-Based Extraction 134

        Web-Based Extraction 137

        Text-Based Web Extraction 137

        HTTP Client 137

        Using SOAP 138

        Stream-Based and Real-Time Extraction 138

        Working with ERP and CRM Systems 138

        ERP Challenges 139

        Kettle ERP Plugins 140

        Working with SAP Data 140

        ERP and CDC Issues 146

        Data Profiling 146

        Using eobjects.org DataCleaner 147

        Adding Profile Tasks 149

        Adding Database Connections 149

        Doing an Initial Profile 151

        Working with Regular Expressions 151

        Profiling and Exploring Results 152

        Validating and Comparing Data 153

        Using a Dictionary for Column Dependency Checks 153

        Alternative Solutions 154

        Text Profiling with Kettle 154

        CDC: Change Data Capture 154

        Source Data–Based CDC 155

        Trigger-Based CDC 157

        Snapshot-Based CDC 158

        Log-Based CDC 162

        Which CDC Alternative Should You Choose? 163

        Delivering Data 164

        Summary 164

        Chapter 7 Cleansing and Conforming 167

        Data Cleansing 168

        Data-Cleansing Steps 169

        Using Reference Tables 172

        Conforming Data Using Lookup Tables 172

        Conforming Data Using Reference Tables 175

        Data Validation 179

        Applying Validation Rules 180

        Validating Dependency Constraints 183

        Error Handling 183

        Handling Process Errors 184

        Transformation Errors 186

        Handling Data (Validation) Errors 187

        Auditing Data and Process Quality 191

        Deduplicating Data 192

        Handling Exact Duplicates 193

        The Problem of Non-Exact Duplicates 194

        Building Deduplication Transforms 195

        Step 1: Fuzzy Match 197

        Step 2: Select Suspects 198

        Step 3: Lookup Validation Value 198

        Step 4: Filter Duplicates 199

        Scripting 200

        Formula 201

        JavaScript 202

        User-Defined Java Expressions 202

        Regular Expressions 203

        Summary 205

        Chapter 8 Handling Dimension Tables 207

        Managing Keys 208

        Managing Business Keys 209

        Keys in the Source System 209

        Keys in the Data Warehouse 209

        Business Keys 209

        Storing Business Keys 210

        Looking Up Keys with Kettle 210

        Generating Surrogate Keys 210

        The “Add sequence” Step 211

        Working with auto_increment or IDENTITY Columns 217

        Keys for Slowly Changing Dimensions 217

        Loading Dimension Tables 218

        Snowflaked Dimension Tables 218

        Top-Down Level-Wise Loading 219

        Sakila Snowflake Example 219

        Sample Transformation 221

        Database Lookup Configuration 222

        Sample Job 225

        Star Schema Dimension Tables 226

        Denormalization 226

        Denormalizing to 1NF with the “Database lookup” Step 226

        Change Data Capture 227

        Slowly Changing Dimensions 228

        Types of Slowly Changing Dimensions 228

        Type 1 Slowly Changing Dimensions 229

        The Insert / Update Step 229

        Type 2 Slowly Changing Dimensions 232

        The “Dimension lookup / update” Step 232

        Other Types of Slowly Changing Dimensions 237

        Type 3 Slowly Changing Dimensions 237

        Hybrid Slowly Changing Dimensions 238

        More Dimensions 239

        Generated Dimensions 239

        Date and Time Dimensions 239

        Generated Mini-Dimensions 239

        Junk Dimensions 241

        Recursive Hierarchies 242

        Summary 243

        Chapter 9 Loading Fact Tables 245

        Loading in Bulk 246

        STDIN and FIFO 247

        Kettle Bulk Loaders 248

        MySQL Bulk Loading 249

        LucidDB Bulk Loader 249

        Oracle Bulk Loader 249

        PostgreSQL Bulk Loader 250

        Table Output Step 250

        General Bulk Load Considerations 250

        Dimension Lookups 251

        Maintaining Referential Integrity 251

        The Surrogate Key Pipeline 252

        Using In-Memory Lookups 253

        Stream Lookups 253

        Late-Arriving Data 255

        Late-Arriving Facts 256

        Late-Arriving Dimensions 256

        Fact Table Handling 260

        Periodic and Accumulating Snapshots 260

        Introducing State-Oriented Fact Tables 261

        Loading Periodic Snapshots 263

        Loading Accumulating Snapshots 264

        Loading State-Oriented Fact Tables 265

        Loading Aggregate Tables 266

        Summary 267

        Chapter 10 Working with OLAP Data 269

        OLAP Benefits and Challenges 270

        OLAP Storage Types 272

        Positioning OLAP 272

        Kettle OLAP Options 273

        Working with Mondrian 274

        Working with XML/A Servers 277

        Working with Palo 282

        Setting Up the Palo Connection 283

        Palo Architecture 284

        Reading Palo Data 285

        Writing Palo Data 289

        Summary 291

        Part III Management and Deployment 293

        Chapter 11 ETL Development Lifecycle 295

        Solution Design 295

        Best and Bad Practices 296

        Data Mapping 297

        Naming and Commentary Conventions 298

        Common Pitfalls 299

        ETL Flow Design 300

        Reusability and Maintainability 300

        Agile Development 301

        Testing and Debugging 306

        Test Activities 307

        ETL Testing 308

        Test Data Requirements 308

        Testing for Completeness 309

        Testing Data Transformations 311

        Test Automation and Continuous Integration 311

        Upgrade Tests 312

        Debugging 312

        Documenting the Solution 315

        Why Isn’t There Any Documentation? 316

        Myth 1: My Software Is Self-Explanatory 316

        Myth 2: Documentation Is Always Outdated 316

        Myth 3: Who Reads Documentation Anyway? 317

        Kettle Documentation Features 317

        Generating Documentation 319

        Summary 320

        Chapter 12 Scheduling and Monitoring 321

        Scheduling 321

        Operating System–Level Scheduling 322

        Executing Kettle Jobs and Transformations from

        the Command Line 322

        UNIX-Based Systems: cron 326

        Windows: The at utility and the Task Scheduler 327

        Using Pentaho’s Built-in Scheduler 327

        Creating an Action Sequence to Run Kettle Jobs and

        Transformations 328

        Kettle Transformations in Action Sequences 329

        Creating and Maintaining Schedules with the

        Administration Console 330

        Attaching an Action Sequence to a Schedule 333

        Monitoring 333

        Logging 333

        Inspecting the Log 333

        Logging Levels 335

        Writing Custom Messages to the Log 336

        E‑mail Notifications 336

        Configuring the Mail Job Entry 337

        Summary 340

        Chapter 13 Versioning and Migration 341

        Version Control Systems 341

        File-Based Version Control Systems 342

        Organization 342

        Leading File-Based VCSs 343

        Content Management Systems 344

        Kettle Metadata 344

        Kettle XML Metadata 345

        Transformation XML 345

        Job XML 346

        Global Replace 347

        Kettle Repository Metadata 348

        The Kettle Database Repository Type 348

        The Kettle File Repository Type 349

        The Kettle Enterprise Repository Type 350

        Managing Repositories 350

        Exporting and Importing Repositories 350

        Upgrading Your Repository 351

        Version Migration System 352

        Managing XML Files 352

        Managing Repositories 352

        Parameterizing Your Solution 353

        Summary 356

        Chapter 14 Lineage and Auditing 357

        Batch-Level Lineage Extraction 358

        Lineage 359

        Lineage Information 359

        Impact Analysis Information 361

        Logging and Operational Metadata 363

        Logging Basics 363

        Logging Architecture 364

        Setting a Maximum Buffer Size 365

        Setting a Maximum Log Line Age 365

        Log Channels 366

        Log Text Capturing in a Job 366

        Logging Tables 367

        Transformation Logging Tables 367

        Job Logging Tables 373

        Summary 374

        Part IV Performance and Scalability 375

        Chapter 15 Performance Tuning 377

        Transformation Performance: Finding the Weakest Link 377

        Finding Bottlenecks by Simplifying 379

        Finding Bottlenecks by Measuring 380

        Copying Rows of Data 382

        Improving Transformation Performance 384

        Improving Performance in Reading Text Files 384

        Using Lazy Conversion for Reading Text Files 385

        Single-File Parallel Reading 385

        Multi-File Parallel Reading 386

        Configuring the NIO Block Size 386

        Changing Disks and Reading Text Files 386

        Improving Performance in Writing Text Files 387

        Using Lazy Conversion for Writing Text Files 387

        Parallel Files Writing 387

        Changing Disks and Writing Text Files 387

        Improving Database Performance 388

        Avoiding Dynamic SQL 388

        Handling Roundtrips 388

        Handling Relational Databases 390

        Sorting Data 392

        Sorting on the Database 393

        Sorting in Parallel 393

        Reducing CPU Usage 394

        Optimizing the Use of JavaScript 394

        Launching Multiple Copies of a Step 396

        Selecting and Removing Values 397

        Managing Thread Priorities 397

        Adding Static Data to Rows of Data 397

        Limiting the Number of Step Copies 398

        Avoiding Excessive Logging 398

        Improving Job Performance 399

        Loops in Jobs 399

        Database Connection Pools 400

        Summary 401

        Chapter 16 Parallelization, Clustering, and Partitioning 403

        Multi-Threading 403

        Row Distribution 404

        Row Merging 405

        Row Redistribution 406

        Data Pipelining 407

        Consequences of Multi-Threading 408

        Database Connections 408

        Order of Execution 409

        Parallel Execution in a Job 411

        Using Carte as a Slave Server 411

        The Configuration File 411

        Defining Slave Servers 412

        Remote Execution 413

        Monitoring Slave Servers 413

        Carte Security 414

        Services 414

        Clustering Transformations 417

        Defining a Cluster Schema 417

        Designing Clustered Transformations 418

        Execution and Monitoring 420

        Metadata Transformations 421

        Rules 422

        Data Pipelining 425

        Partitioning 425

        Defining a Partitioning Schema 425

        Objectives of Partitioning 427

        Implementing Partitioning 428

        Internal Variables 428

        Database Partitions 429

        Partitioning in a Clustered Transformation 430

        Summary 430

        Chapter 17 Dynamic Clustering in the Cloud 433

        Dynamic Clustering 433

        Setting Up a Dynamic Cluster 434

        Using the Dynamic Cluster 436

        Cloud Computing 437

        EC2 438

        Getting Started with EC2 438

        Costs 438

        Customizing an AMI 439

        Packaging a New AMI 442

        Terminating an AMI 442

        Running a Master 442

        Running the Slaves 443

        Using the EC2 Cluster 444

        Monitoring 445

        The Lightweight Principle and Persistence Options 446

        Summary 447

        Chapter 18 Real-Time Data Integration 449

        Introduction to Real-Time ETL 449

        Real-Time Challenges 450

        Requirements 451

        Transformation Streaming 452

        A Practical Example of Transformation Streaming 454

        Debugging 457

        Third-Party Software and Real-Time Integration 458

        Java Message Service 459

        Creating a JMS Connection and Session 459

        Consuming Messages 460

        Producing Messages 460

        Closing Shop 460

        Summary 461

        Part V Advanced Topics 463

        Chapter 19 Data Vault Management 465

        Introduction to Data Vault Modeling 466

        Do You Need a Data Vault? 466

        Data Vault Building Blocks 467

        Hubs 467

        Links 468

        Satellites 469

        Data Vault Characteristics 471

        Building a Data Vault 471

        Transforming Sakila to the Data Vault Model 472

        Sakila Hubs 472

        Sakila Links 473

        Sakila Satellites 474

        Loading the Data Vault: A Sample ETL Solution 477

        Installing the Sakila Data Vault 477

        Setting Up the ETL Solution 477

        Creating a Database Account 477

        The Sample ETL Data Vault Solution 478

        Sample Hub: hub_actor 478

        Sample Link: link_customer_store 480

        Sample Satellite: sat_actor 483

        Loading the Data Vault Tables 485

        Updating a Data Mart from a Data Vault 486

        The Sample ETL Solution 486

        The dim_actor Transformation 486

        The dim_customer Transformation 488

        The dim_film Transformation 492

        The dim_film_actor_bridge Transformation 492

        The fact_rental Transformation 493

        Loading the Star Schema Tables 495

        Summary 495

        Chapter 20 Handling Complex Data Formats 497

        Non-Relational and Non-Tabular Data Formats 498

        Non-Relational Tabular Formats 498

        Handling Multi-Valued Attributes 498

        Using the Split Field to Rows Step 499

        Handling Repeating Groups 500

        Using the Row Normaliser Step 500

        Semi- and Unstructured Data 501

        Kettle Regular Expression Example 503

        Configuring the Regex Evaluation Step 504

        Verifying the Match 507

        Key/Value Pairs 508

        Kettle Key/Value Pairs Example 509

        Text File Input 509

        Regex Evaluation 510

        Grouping Lines into Records 511

        Denormaliser: Turning Rows into Columns 512

        Summary 513

        Chapter 21 Web Services 515

        Web Pages and Web Services 515

        Kettle Web Features 516

        General HTTP Steps 516

        Simple Object Access Protocol 517

        Really Simple Syndication 517

        Apache Virtual File System Integration 517

        Data Formats 517

        XML 518

        Kettle Steps for Working with XML 518

        Kettle Job Entries for XML 519

        HTML 520

        JavaScript Object Notation 520

        Syntax 521

        JSON, Kettle, and ETL/DI 522

        XML Examples 523

        Example XML Document 523

        XML Document Structure 523

        Mapping to the Sakila Sample Database 524

        Extracting Data from XML 525

        Overall Design: The import_xml_into_db Transformation 526

        Using the XSD Validator Step 528

        Using the “Get Data from XML” Step 530

        Generating XML Documents 537

        Overall Design: The export_xml_from_db Transformation 537

        Generating XML with the Add XML Step 538

        Using the XML Join Step 541

        SOAP Examples 544

        Using the “Web services lookup” Step 544

        Configuring the “Web services lookup” Step 544

        Accessing SOAP Services Directly 546

        JSON Example 549

        The Freebase Project 549

        Freebase Versus Wikipedia 549

        Freebase Web Services 550

        The Freebase Read Service 550

        The Metaweb Query Language 551

        Extracting Freebase Data with Kettle 553

        Generate Rows 554

        Issuing a Freebase Read Request 555

        Processing the Freebase Result Envelope 556

        Filtering Out the Original Row 557

        Storing to File 558

        RSS 558

        RSS Structure 558

        Channel 558

        Item 559

        RSS Support in Kettle 560

        RSS Input 561

        RSS Output 562

        Summary 567

        Chapter 22 Kettle Integration 569

        The Kettle API 569

        The LGPL License 569

        The Kettle Java API 570

        Source Code 570

        Building Kettle 571

        Building javadoc 571

        Libraries and the Class Path 571

        Executing Existing Transformations and Jobs 571

        Executing a Transformation 572

        Executing a Job 573

        Embedding Kettle 574

        Pentaho Reporting 574

        Putting Data into a Transformation 576

        Dynamic Transformations 580

        Dynamic Template 583

        Dynamic Jobs 584

        Executing Dynamic ETL in Kettle 586

        Result 587

        Replacing Metadata 588

        Direct Changes with the API 589

        Using a Shared Objects File 589

        OEM Versions and Forks 590

        Creating an OEM Version of PDI 590

        Forking Kettle 591

        Summary 592

        Chapter 23 Extending Kettle 593

        Plugin Architecture Overview 593

        Plugin Types 594

        Architecture 595

        Prerequisites 596

        Kettle API Documentation 596

        Libraries 596

        Integrated Development Environment 596

        Eclipse Project Setup 597

        Examples 598

        Transformation Step Plugins 599

        StepMetaInterface 599

        Value Metadata 605

        Row Metadata 606

        StepDataInterface 607

        StepDialogInterface 607

        Eclipse SWT 607

        Form Layout 607

        Kettle UI Elements 609

        Hello World Example Dialog 609

        StepInterface 614

        Reading Rows from Specific Steps 616

        Writing Rows to Specific Steps 616

        Writing Rows to Error Handling 617

        Identifying a Step Copy 617

        Result Feedback 618

        Variable Substitution 618

        Apache VFS 619

        Step Plugin Deployment 619

        The User-Defined Java Class Step 620

        Passing Metadata 620

        Accessing Input and Fields 620

        Snippets 620

        Example 620

        Job Entry Plugins 621

        JobEntryInterface 622

        JobEntryDialogInterface 624

        Partitioning Method Plugins 624

        Partitioner 625

        Repository Type Plugins 626

        Database Type Plugins 627

        Summary 628

        Appendix A The Kettle Ecosystem 629

        Kettle Development and Versions 629

        The Pentaho Community Wiki 631

        Using the Forums 631

        Jira 632

        ##pentaho 633

        Appendix B Kettle Enterprise Edition Features 635

        Appendix C Built-in Variables and Properties Reference 637

        Internal Variables 637

        Kettle Variables 640

        Variables for Configuring VFS 641

        Noteworthy JRE Variables 642

        Index 643

      Recently viewed products

      © 2026 Book Curl

        • American Express
        • Apple Pay
        • Diners Club
        • Discover
        • Google Pay
        • Maestro
        • Mastercard
        • PayPal
        • Shop Pay
        • Union Pay
        • Visa

        Login

        Forgot your password?

        Don't have an account yet?
        Create account