Description

Book Synopsis
A complete guide to Pentaho Kettle, the Pentaho Data lntegration toolset for ETL

This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutionsbefore progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.

  • Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
  • Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
  • Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data wareho

    Table of Contents

    Introduction xxxi

    Part I Getting Started 1

    Chapter 1 ETL Primer 3

    OLTP versus Data Warehousing 3

    What Is ETL? 5

    The Evolution of ETL Solutions 5

    ETL Building Blocks 7

    ETL, ELT, and EII 8

    ELT 9

    EII: Virtual Data Integration 10

    Data Integration Challenges 11

    Methodology: Agile BI 12

    ETL Design 14

    Data Acquisition 14

    Beware of Spreadsheets 15

    Design for Failure 15

    Change Data Capture 16

    Data Quality 16

    Data Profiling 16

    Data Validation 17

    ETL Tool Requirements 17

    Connectivity 17

    Platform Independence 18

    Scalability 18

    Design Flexibility 19

    Reuse 19

    Extensibility 19

    Data Transformations 20

    Testing and Debugging 21

    Lineage and Impact Analysis 21

    Logging and Auditing 22

    Summary 22

    Chapter 2 Kettle Concepts 23

    Design Principles 23

    The Building Blocks of Kettle Design 25

    Transformations 25

    Steps 26

    Transformation Hops 26

    Parallelism 27

    Rows of Data 27

    Data Conversion 29

    Jobs 30

    Job Entries 31

    Job Hops 31

    Multiple Paths and Backtracking 32

    Parallel Execution 33

    Job Entry Results 34

    Transformation or Job Metadata 36

    Database Connections 37

    Special Options 38

    The Power of the Relational Database 39

    Connections and Transactions 39

    Database Clustering 40

    Tools and Utilities 41

    Repositories 41

    Virtual File Systems 42

    Parameters and Variables 43

    Defining Variables 43

    Named Parameters 44

    Using Variables 44

    Visual Programming 45

    Getting Started 46

    Creating New Steps 47

    Putting It All Together 49

    Summary 51

    Chapter 3 Installation and Configuration 53

    Kettle Software Overview 53

    Integrated Development Environment: Spoon 55

    Command-Line Launchers: Kitchen and Pan 57

    Job Server: Carte 57

    Encr.bat and encr.sh 58

    Installation 58

    Java Environment 58

    Installing Java Manually 58

    Using Your Linux Package Management System 59

    Installing Kettle 59

    Versions and Releases 59

    Archive Names and Formats 60

    Downloading and Uncompressing 60

    Running Kettle Programs 61

    Creating a Shortcut Icon or Launcher for Spoon 62

    Configuration 63

    Configuration Files and the .kettle Directory 63

    The Kettle Shell Scripts 69

    General Structure of the Startup Scripts 70

    Adding an Entry to the Classpath 70

    Changing the Maximum Heap Size 71

    Managing JDBC Drivers 72

    Summary 72

    Chapter 4 An Example ETL Solution—Sakila 73

    Sakila 73

    The Sakila Sample Database 74

    DVD Rental Business Process 74

    Sakila Database Schema Diagram 75

    Sakila Database Subject Areas 75

    General Design Considerations 77

    Installing the Sakila Sample Database 77

    The Rental Star Schema 78

    Rental Star Schema Diagram 78

    Rental Fact Table 79

    Dimension Tables 79

    Keys and Change Data Capture 80

    Installing the Rental Star Schema 81

    Prerequisites and Some Basic Spoon Skills 81

    Setting Up the ETL Solution 82

    Creating Database Accounts 82

    Working with Spoon 82

    Opening Transformation and Job Files 82

    Opening the Step’s Configuration Dialog 83

    Examining Streams 83

    Running Jobs and Transformations 83

    The Sample ETL Solution 84

    Static, Generated Dimensions 84

    Loading the dim_date Dimension Table 84

    Loading the dim_time Dimension Table 86

    Recurring Load 87

    The load_rentals Job 88

    The load_dim_staff Transformation 91

    Database Connections 91

    The load_dim_customer Transformation 95

    The load_dim_store Transformation 98

    The fetch_address Subtransformation 99

    The load_dim_actor Transformation 101

    The load_dim_film Transformation 102

    The load_fact_rental Transformation 107

    Summary 109

    Part II ETL 111

    Chapter 5 ETL Subsystems 113

    Introduction to the 34 Subsystems 114

    Extraction 114

    Subsystems 1–3: Data Profiling, Change Data Capture, and

    Extraction 115

    Cleaning and Conforming Data 116

    Subsystem 4: Data Cleaning and Quality Screen

    Handler System 116

    Subsystem 5: Error Event Handler 117

    Subsystem 6: Audit Dimension Assembler 117

    Subsystem 7: Deduplication System 117

    Subsystem 8: Data Conformer 118

    Data Delivery 118

    Subsystem 9: Slowly Changing Dimension Processor 118

    Subsystem 10: Surrogate Key Creation System 119

    Subsystem 11: Hierarchy Dimension Builder 119

    Subsystem 12: Special Dimension Builder 120

    Subsystem 13: Fact Table Loader 121

    Subsystem 14: Surrogate Key Pipeline 121

    Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121

    Subsystem 16: Late-Arriving Data Handler 122

    Subsystem 17: Dimension Manager System 122

    Subsystem 18: Fact Table Provider System 122

    Subsystem 19: Aggregate Builder 123

    Subsystem 20: Multidimensional (OLAP) Cube Builder 123

    Subsystem 21: Data Integration Manager 123

    Managing the ETL Environment 123

    Summary 126

    Chapter 6 Data Extraction 127

    Kettle Data Extraction Overview 128

    File-Based Extraction 128

    Working with Text Files 128

    Working with XML files 133

    Special File Types 134

    Database-Based Extraction 134

    Web-Based Extraction 137

    Text-Based Web Extraction 137

    HTTP Client 137

    Using SOAP 138

    Stream-Based and Real-Time Extraction 138

    Working with ERP and CRM Systems 138

    ERP Challenges 139

    Kettle ERP Plugins 140

    Working with SAP Data 140

    ERP and CDC Issues 146

    Data Profiling 146

    Using eobjects.org DataCleaner 147

    Adding Profile Tasks 149

    Adding Database Connections 149

    Doing an Initial Profile 151

    Working with Regular Expressions 151

    Profiling and Exploring Results 152

    Validating and Comparing Data 153

    Using a Dictionary for Column Dependency Checks 153

    Alternative Solutions 154

    Text Profiling with Kettle 154

    CDC: Change Data Capture 154

    Source Data–Based CDC 155

    Trigger-Based CDC 157

    Snapshot-Based CDC 158

    Log-Based CDC 162

    Which CDC Alternative Should You Choose? 163

    Delivering Data 164

    Summary 164

    Chapter 7 Cleansing and Conforming 167

    Data Cleansing 168

    Data-Cleansing Steps 169

    Using Reference Tables 172

    Conforming Data Using Lookup Tables 172

    Conforming Data Using Reference Tables 175

    Data Validation 179

    Applying Validation Rules 180

    Validating Dependency Constraints 183

    Error Handling 183

    Handling Process Errors 184

    Transformation Errors 186

    Handling Data (Validation) Errors 187

    Auditing Data and Process Quality 191

    Deduplicating Data 192

    Handling Exact Duplicates 193

    The Problem of Non-Exact Duplicates 194

    Building Deduplication Transforms 195

    Step 1: Fuzzy Match 197

    Step 2: Select Suspects 198

    Step 3: Lookup Validation Value 198

    Step 4: Filter Duplicates 199

    Scripting 200

    Formula 201

    JavaScript 202

    User-Defined Java Expressions 202

    Regular Expressions 203

    Summary 205

    Chapter 8 Handling Dimension Tables 207

    Managing Keys 208

    Managing Business Keys 209

    Keys in the Source System 209

    Keys in the Data Warehouse 209

    Business Keys 209

    Storing Business Keys 210

    Looking Up Keys with Kettle 210

    Generating Surrogate Keys 210

    The “Add sequence” Step 211

    Working with auto_increment or IDENTITY Columns 217

    Keys for Slowly Changing Dimensions 217

    Loading Dimension Tables 218

    Snowflaked Dimension Tables 218

    Top-Down Level-Wise Loading 219

    Sakila Snowflake Example 219

    Sample Transformation 221

    Database Lookup Configuration 222

    Sample Job 225

    Star Schema Dimension Tables 226

    Denormalization 226

    Denormalizing to 1NF with the “Database lookup” Step 226

    Change Data Capture 227

    Slowly Changing Dimensions 228

    Types of Slowly Changing Dimensions 228

    Type 1 Slowly Changing Dimensions 229

    The Insert / Update Step 229

    Type 2 Slowly Changing Dimensions 232

    The “Dimension lookup / update” Step 232

    Other Types of Slowly Changing Dimensions 237

    Type 3 Slowly Changing Dimensions 237

    Hybrid Slowly Changing Dimensions 238

    More Dimensions 239

    Generated Dimensions 239

    Date and Time Dimensions 239

    Generated Mini-Dimensions 239

    Junk Dimensions 241

    Recursive Hierarchies 242

    Summary 243

    Chapter 9 Loading Fact Tables 245

    Loading in Bulk 246

    STDIN and FIFO 247

    Kettle Bulk Loaders 248

    MySQL Bulk Loading 249

    LucidDB Bulk Loader 249

    Oracle Bulk Loader 249

    PostgreSQL Bulk Loader 250

    Table Output Step 250

    General Bulk Load Considerations 250

    Dimension Lookups 251

    Maintaining Referential Integrity 251

    The Surrogate Key Pipeline 252

    Using In-Memory Lookups 253

    Stream Lookups 253

    Late-Arriving Data 255

    Late-Arriving Facts 256

    Late-Arriving Dimensions 256

    Fact Table Handling 260

    Periodic and Accumulating Snapshots 260

    Introducing State-Oriented Fact Tables 261

    Loading Periodic Snapshots 263

    Loading Accumulating Snapshots 264

    Loading State-Oriented Fact Tables 265

    Loading Aggregate Tables 266

    Summary 267

    Chapter 10 Working with OLAP Data 269

    OLAP Benefits and Challenges 270

    OLAP Storage Types 272

    Positioning OLAP 272

    Kettle OLAP Options 273

    Working with Mondrian 274

    Working with XML/A Servers 277

    Working with Palo 282

    Setting Up the Palo Connection 283

    Palo Architecture 284

    Reading Palo Data 285

    Writing Palo Data 289

    Summary 291

    Part III Management and Deployment 293

    Chapter 11 ETL Development Lifecycle 295

    Solution Design 295

    Best and Bad Practices 296

    Data Mapping 297

    Naming and Commentary Conventions 298

    Common Pitfalls 299

    ETL Flow Design 300

    Reusability and Maintainability 300

    Agile Development 301

    Testing and Debugging 306

    Test Activities 307

    ETL Testing 308

    Test Data Requirements 308

    Testing for Completeness 309

    Testing Data Transformations 311

    Test Automation and Continuous Integration 311

    Upgrade Tests 312

    Debugging 312

    Documenting the Solution 315

    Why Isn’t There Any Documentation? 316

    Myth 1: My Software Is Self-Explanatory 316

    Myth 2: Documentation Is Always Outdated 316

    Myth 3: Who Reads Documentation Anyway? 317

    Kettle Documentation Features 317

    Generating Documentation 319

    Summary 320

    Chapter 12 Scheduling and Monitoring 321

    Scheduling 321

    Operating System–Level Scheduling 322

    Executing Kettle Jobs and Transformations from

    the Command Line 322

    UNIX-Based Systems: cron 326

    Windows: The at utility and the Task Scheduler 327

    Using Pentaho’s Built-in Scheduler 327

    Creating an Action Sequence to Run Kettle Jobs and

    Transformations 328

    Kettle Transformations in Action Sequences 329

    Creating and Maintaining Schedules with the

    Administration Console 330

    Attaching an Action Sequence to a Schedule 333

    Monitoring 333

    Logging 333

    Inspecting the Log 333

    Logging Levels 335

    Writing Custom Messages to the Log 336

    E‑mail Notifications 336

    Configuring the Mail Job Entry 337

    Summary 340

    Chapter 13 Versioning and Migration 341

    Version Control Systems 341

    File-Based Version Control Systems 342

    Organization 342

    Leading File-Based VCSs 343

    Content Management Systems 344

    Kettle Metadata 344

    Kettle XML Metadata 345

    Transformation XML 345

    Job XML 346

    Global Replace 347

    Kettle Repository Metadata 348

    The Kettle Database Repository Type 348

    The Kettle File Repository Type 349

    The Kettle Enterprise Repository Type 350

    Managing Repositories 350

    Exporting and Importing Repositories 350

    Upgrading Your Repository 351

    Version Migration System 352

    Managing XML Files 352

    Managing Repositories 352

    Parameterizing Your Solution 353

    Summary 356

    Chapter 14 Lineage and Auditing 357

    Batch-Level Lineage Extraction 358

    Lineage 359

    Lineage Information 359

    Impact Analysis Information 361

    Logging and Operational Metadata 363

    Logging Basics 363

    Logging Architecture 364

    Setting a Maximum Buffer Size 365

    Setting a Maximum Log Line Age 365

    Log Channels 366

    Log Text Capturing in a Job 366

    Logging Tables 367

    Transformation Logging Tables 367

    Job Logging Tables 373

    Summary 374

    Part IV Performance and Scalability 375

    Chapter 15 Performance Tuning 377

    Transformation Performance: Finding the Weakest Link 377

    Finding Bottlenecks by Simplifying 379

    Finding Bottlenecks by Measuring 380

    Copying Rows of Data 382

    Improving Transformation Performance 384

    Improving Performance in Reading Text Files 384

    Using Lazy Conversion for Reading Text Files 385

    Single-File Parallel Reading 385

    Multi-File Parallel Reading 386

    Configuring the NIO Block Size 386

    Changing Disks and Reading Text Files 386

    Improving Performance in Writing Text Files 387

    Using Lazy Conversion for Writing Text Files 387

    Parallel Files Writing 387

    Changing Disks and Writing Text Files 387

    Improving Database Performance 388

    Avoiding Dynamic SQL 388

    Handling Roundtrips 388

    Handling Relational Databases 390

    Sorting Data 392

    Sorting on the Database 393

    Sorting in Parallel 393

    Reducing CPU Usage 394

    Optimizing the Use of JavaScript 394

    Launching Multiple Copies of a Step 396

    Selecting and Removing Values 397

    Managing Thread Priorities 397

    Adding Static Data to Rows of Data 397

    Limiting the Number of Step Copies 398

    Avoiding Excessive Logging 398

    Improving Job Performance 399

    Loops in Jobs 399

    Database Connection Pools 400

    Summary 401

    Chapter 16 Parallelization, Clustering, and Partitioning 403

    Multi-Threading 403

    Row Distribution 404

    Row Merging 405

    Row Redistribution 406

    Data Pipelining 407

    Consequences of Multi-Threading 408

    Database Connections 408

    Order of Execution 409

    Parallel Execution in a Job 411

    Using Carte as a Slave Server 411

    The Configuration File 411

    Defining Slave Servers 412

    Remote Execution 413

    Monitoring Slave Servers 413

    Carte Security 414

    Services 414

    Clustering Transformations 417

    Defining a Cluster Schema 417

    Designing Clustered Transformations 418

    Execution and Monitoring 420

    Metadata Transformations 421

    Rules 422

    Data Pipelining 425

    Partitioning 425

    Defining a Partitioning Schema 425

    Objectives of Partitioning 427

    Implementing Partitioning 428

    Internal Variables 428

    Database Partitions 429

    Partitioning in a Clustered Transformation 430

    Summary 430

    Chapter 17 Dynamic Clustering in the Cloud 433

    Dynamic Clustering 433

    Setting Up a Dynamic Cluster 434

    Using the Dynamic Cluster 436

    Cloud Computing 437

    EC2 438

    Getting Started with EC2 438

    Costs 438

    Customizing an AMI 439

    Packaging a New AMI 442

    Terminating an AMI 442

    Running a Master 442

    Running the Slaves 443

    Using the EC2 Cluster 444

    Monitoring 445

    The Lightweight Principle and Persistence Options 446

    Summary 447

    Chapter 18 Real-Time Data Integration 449

    Introduction to Real-Time ETL 449

    Real-Time Challenges 450

    Requirements 451

    Transformation Streaming 452

    A Practical Example of Transformation Streaming 454

    Debugging 457

    Third-Party Software and Real-Time Integration 458

    Java Message Service 459

    Creating a JMS Connection and Session 459

    Consuming Messages 460

    Producing Messages 460

    Closing Shop 460

    Summary 461

    Part V Advanced Topics 463

    Chapter 19 Data Vault Management 465

    Introduction to Data Vault Modeling 466

    Do You Need a Data Vault? 466

    Data Vault Building Blocks 467

    Hubs 467

    Links 468

    Satellites 469

    Data Vault Characteristics 471

    Building a Data Vault 471

    Transforming Sakila to the Data Vault Model 472

    Sakila Hubs 472

    Sakila Links 473

    Sakila Satellites 474

    Loading the Data Vault: A Sample ETL Solution 477

    Installing the Sakila Data Vault 477

    Setting Up the ETL Solution 477

    Creating a Database Account 477

    The Sample ETL Data Vault Solution 478

    Sample Hub: hub_actor 478

    Sample Link: link_customer_store 480

    Sample Satellite: sat_actor 483

    Loading the Data Vault Tables 485

    Updating a Data Mart from a Data Vault 486

    The Sample ETL Solution 486

    The dim_actor Transformation 486

    The dim_customer Transformation 488

    The dim_film Transformation 492

    The dim_film_actor_bridge Transformation 492

    The fact_rental Transformation 493

    Loading the Star Schema Tables 495

    Summary 495

    Chapter 20 Handling Complex Data Formats 497

    Non-Relational and Non-Tabular Data Formats 498

    Non-Relational Tabular Formats 498

    Handling Multi-Valued Attributes 498

    Using the Split Field to Rows Step 499

    Handling Repeating Groups 500

    Using the Row Normaliser Step 500

    Semi- and Unstructured Data 501

    Kettle Regular Expression Example 503

    Configuring the Regex Evaluation Step 504

    Verifying the Match 507

    Key/Value Pairs 508

    Kettle Key/Value Pairs Example 509

    Text File Input 509

    Regex Evaluation 510

    Grouping Lines into Records 511

    Denormaliser: Turning Rows into Columns 512

    Summary 513

    Chapter 21 Web Services 515

    Web Pages and Web Services 515

    Kettle Web Features 516

    General HTTP Steps 516

    Simple Object Access Protocol 517

    Really Simple Syndication 517

    Apache Virtual File System Integration 517

    Data Formats 517

    XML 518

    Kettle Steps for Working with XML 518

    Kettle Job Entries for XML 519

    HTML 520

    JavaScript Object Notation 520

    Syntax 521

    JSON, Kettle, and ETL/DI 522

    XML Examples 523

    Example XML Document 523

    XML Document Structure 523

    Mapping to the Sakila Sample Database 524

    Extracting Data from XML 525

    Overall Design: The import_xml_into_db Transformation 526

    Using the XSD Validator Step 528

    Using the “Get Data from XML” Step 530

    Generating XML Documents 537

    Overall Design: The export_xml_from_db Transformation 537

    Generating XML with the Add XML Step 538

    Using the XML Join Step 541

    SOAP Examples 544

    Using the “Web services lookup” Step 544

    Configuring the “Web services lookup” Step 544

    Accessing SOAP Services Directly 546

    JSON Example 549

    The Freebase Project 549

    Freebase Versus Wikipedia 549

    Freebase Web Services 550

    The Freebase Read Service 550

    The Metaweb Query Language 551

    Extracting Freebase Data with Kettle 553

    Generate Rows 554

    Issuing a Freebase Read Request 555

    Processing the Freebase Result Envelope 556

    Filtering Out the Original Row 557

    Storing to File 558

    RSS 558

    RSS Structure 558

    Channel 558

    Item 559

    RSS Support in Kettle 560

    RSS Input 561

    RSS Output 562

    Summary 567

    Chapter 22 Kettle Integration 569

    The Kettle API 569

    The LGPL License 569

    The Kettle Java API 570

    Source Code 570

    Building Kettle 571

    Building javadoc 571

    Libraries and the Class Path 571

    Executing Existing Transformations and Jobs 571

    Executing a Transformation 572

    Executing a Job 573

    Embedding Kettle 574

    Pentaho Reporting 574

    Putting Data into a Transformation 576

    Dynamic Transformations 580

    Dynamic Template 583

    Dynamic Jobs 584

    Executing Dynamic ETL in Kettle 586

    Result 587

    Replacing Metadata 588

    Direct Changes with the API 589

    Using a Shared Objects File 589

    OEM Versions and Forks 590

    Creating an OEM Version of PDI 590

    Forking Kettle 591

    Summary 592

    Chapter 23 Extending Kettle 593

    Plugin Architecture Overview 593

    Plugin Types 594

    Architecture 595

    Prerequisites 596

    Kettle API Documentation 596

    Libraries 596

    Integrated Development Environment 596

    Eclipse Project Setup 597

    Examples 598

    Transformation Step Plugins 599

    StepMetaInterface 599

    Value Metadata 605

    Row Metadata 606

    StepDataInterface 607

    StepDialogInterface 607

    Eclipse SWT 607

    Form Layout 607

    Kettle UI Elements 609

    Hello World Example Dialog 609

    StepInterface 614

    Reading Rows from Specific Steps 616

    Writing Rows to Specific Steps 616

    Writing Rows to Error Handling 617

    Identifying a Step Copy 617

    Result Feedback 618

    Variable Substitution 618

    Apache VFS 619

    Step Plugin Deployment 619

    The User-Defined Java Class Step 620

    Passing Metadata 620

    Accessing Input and Fields 620

    Snippets 620

    Example 620

    Job Entry Plugins 621

    JobEntryInterface 622

    JobEntryDialogInterface 624

    Partitioning Method Plugins 624

    Partitioner 625

    Repository Type Plugins 626

    Database Type Plugins 627

    Summary 628

    Appendix A The Kettle Ecosystem 629

    Kettle Development and Versions 629

    The Pentaho Community Wiki 631

    Using the Forums 631

    Jira 632

    ##pentaho 633

    Appendix B Kettle Enterprise Edition Features 635

    Appendix C Built-in Variables and Properties Reference 637

    Internal Variables 637

    Kettle Variables 640

    Variables for Configuring VFS 641

    Noteworthy JRE Variables 642

    Index 643

Pentaho Kettle Solutions

Product form

£30.39

Includes FREE delivery

RRP £37.99 – you save £7.60 (20%)

Order before 4pm today for delivery by Mon 29 Dec 2025.

A Paperback / softback by Matt Casters, Roland Bouman, Jos van Dongen

1 in stock


    View other formats and editions of Pentaho Kettle Solutions by Matt Casters

    Publisher: John Wiley & Sons Inc
    Publication Date: 24/09/2010
    ISBN13: 9780470635179, 978-0470635179
    ISBN10: 0470635177

    Description

    Book Synopsis
    A complete guide to Pentaho Kettle, the Pentaho Data lntegration toolset for ETL

    This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutionsbefore progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.

    • Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
    • Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
    • Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data wareho

      Table of Contents

      Introduction xxxi

      Part I Getting Started 1

      Chapter 1 ETL Primer 3

      OLTP versus Data Warehousing 3

      What Is ETL? 5

      The Evolution of ETL Solutions 5

      ETL Building Blocks 7

      ETL, ELT, and EII 8

      ELT 9

      EII: Virtual Data Integration 10

      Data Integration Challenges 11

      Methodology: Agile BI 12

      ETL Design 14

      Data Acquisition 14

      Beware of Spreadsheets 15

      Design for Failure 15

      Change Data Capture 16

      Data Quality 16

      Data Profiling 16

      Data Validation 17

      ETL Tool Requirements 17

      Connectivity 17

      Platform Independence 18

      Scalability 18

      Design Flexibility 19

      Reuse 19

      Extensibility 19

      Data Transformations 20

      Testing and Debugging 21

      Lineage and Impact Analysis 21

      Logging and Auditing 22

      Summary 22

      Chapter 2 Kettle Concepts 23

      Design Principles 23

      The Building Blocks of Kettle Design 25

      Transformations 25

      Steps 26

      Transformation Hops 26

      Parallelism 27

      Rows of Data 27

      Data Conversion 29

      Jobs 30

      Job Entries 31

      Job Hops 31

      Multiple Paths and Backtracking 32

      Parallel Execution 33

      Job Entry Results 34

      Transformation or Job Metadata 36

      Database Connections 37

      Special Options 38

      The Power of the Relational Database 39

      Connections and Transactions 39

      Database Clustering 40

      Tools and Utilities 41

      Repositories 41

      Virtual File Systems 42

      Parameters and Variables 43

      Defining Variables 43

      Named Parameters 44

      Using Variables 44

      Visual Programming 45

      Getting Started 46

      Creating New Steps 47

      Putting It All Together 49

      Summary 51

      Chapter 3 Installation and Configuration 53

      Kettle Software Overview 53

      Integrated Development Environment: Spoon 55

      Command-Line Launchers: Kitchen and Pan 57

      Job Server: Carte 57

      Encr.bat and encr.sh 58

      Installation 58

      Java Environment 58

      Installing Java Manually 58

      Using Your Linux Package Management System 59

      Installing Kettle 59

      Versions and Releases 59

      Archive Names and Formats 60

      Downloading and Uncompressing 60

      Running Kettle Programs 61

      Creating a Shortcut Icon or Launcher for Spoon 62

      Configuration 63

      Configuration Files and the .kettle Directory 63

      The Kettle Shell Scripts 69

      General Structure of the Startup Scripts 70

      Adding an Entry to the Classpath 70

      Changing the Maximum Heap Size 71

      Managing JDBC Drivers 72

      Summary 72

      Chapter 4 An Example ETL Solution—Sakila 73

      Sakila 73

      The Sakila Sample Database 74

      DVD Rental Business Process 74

      Sakila Database Schema Diagram 75

      Sakila Database Subject Areas 75

      General Design Considerations 77

      Installing the Sakila Sample Database 77

      The Rental Star Schema 78

      Rental Star Schema Diagram 78

      Rental Fact Table 79

      Dimension Tables 79

      Keys and Change Data Capture 80

      Installing the Rental Star Schema 81

      Prerequisites and Some Basic Spoon Skills 81

      Setting Up the ETL Solution 82

      Creating Database Accounts 82

      Working with Spoon 82

      Opening Transformation and Job Files 82

      Opening the Step’s Configuration Dialog 83

      Examining Streams 83

      Running Jobs and Transformations 83

      The Sample ETL Solution 84

      Static, Generated Dimensions 84

      Loading the dim_date Dimension Table 84

      Loading the dim_time Dimension Table 86

      Recurring Load 87

      The load_rentals Job 88

      The load_dim_staff Transformation 91

      Database Connections 91

      The load_dim_customer Transformation 95

      The load_dim_store Transformation 98

      The fetch_address Subtransformation 99

      The load_dim_actor Transformation 101

      The load_dim_film Transformation 102

      The load_fact_rental Transformation 107

      Summary 109

      Part II ETL 111

      Chapter 5 ETL Subsystems 113

      Introduction to the 34 Subsystems 114

      Extraction 114

      Subsystems 1–3: Data Profiling, Change Data Capture, and

      Extraction 115

      Cleaning and Conforming Data 116

      Subsystem 4: Data Cleaning and Quality Screen

      Handler System 116

      Subsystem 5: Error Event Handler 117

      Subsystem 6: Audit Dimension Assembler 117

      Subsystem 7: Deduplication System 117

      Subsystem 8: Data Conformer 118

      Data Delivery 118

      Subsystem 9: Slowly Changing Dimension Processor 118

      Subsystem 10: Surrogate Key Creation System 119

      Subsystem 11: Hierarchy Dimension Builder 119

      Subsystem 12: Special Dimension Builder 120

      Subsystem 13: Fact Table Loader 121

      Subsystem 14: Surrogate Key Pipeline 121

      Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121

      Subsystem 16: Late-Arriving Data Handler 122

      Subsystem 17: Dimension Manager System 122

      Subsystem 18: Fact Table Provider System 122

      Subsystem 19: Aggregate Builder 123

      Subsystem 20: Multidimensional (OLAP) Cube Builder 123

      Subsystem 21: Data Integration Manager 123

      Managing the ETL Environment 123

      Summary 126

      Chapter 6 Data Extraction 127

      Kettle Data Extraction Overview 128

      File-Based Extraction 128

      Working with Text Files 128

      Working with XML files 133

      Special File Types 134

      Database-Based Extraction 134

      Web-Based Extraction 137

      Text-Based Web Extraction 137

      HTTP Client 137

      Using SOAP 138

      Stream-Based and Real-Time Extraction 138

      Working with ERP and CRM Systems 138

      ERP Challenges 139

      Kettle ERP Plugins 140

      Working with SAP Data 140

      ERP and CDC Issues 146

      Data Profiling 146

      Using eobjects.org DataCleaner 147

      Adding Profile Tasks 149

      Adding Database Connections 149

      Doing an Initial Profile 151

      Working with Regular Expressions 151

      Profiling and Exploring Results 152

      Validating and Comparing Data 153

      Using a Dictionary for Column Dependency Checks 153

      Alternative Solutions 154

      Text Profiling with Kettle 154

      CDC: Change Data Capture 154

      Source Data–Based CDC 155

      Trigger-Based CDC 157

      Snapshot-Based CDC 158

      Log-Based CDC 162

      Which CDC Alternative Should You Choose? 163

      Delivering Data 164

      Summary 164

      Chapter 7 Cleansing and Conforming 167

      Data Cleansing 168

      Data-Cleansing Steps 169

      Using Reference Tables 172

      Conforming Data Using Lookup Tables 172

      Conforming Data Using Reference Tables 175

      Data Validation 179

      Applying Validation Rules 180

      Validating Dependency Constraints 183

      Error Handling 183

      Handling Process Errors 184

      Transformation Errors 186

      Handling Data (Validation) Errors 187

      Auditing Data and Process Quality 191

      Deduplicating Data 192

      Handling Exact Duplicates 193

      The Problem of Non-Exact Duplicates 194

      Building Deduplication Transforms 195

      Step 1: Fuzzy Match 197

      Step 2: Select Suspects 198

      Step 3: Lookup Validation Value 198

      Step 4: Filter Duplicates 199

      Scripting 200

      Formula 201

      JavaScript 202

      User-Defined Java Expressions 202

      Regular Expressions 203

      Summary 205

      Chapter 8 Handling Dimension Tables 207

      Managing Keys 208

      Managing Business Keys 209

      Keys in the Source System 209

      Keys in the Data Warehouse 209

      Business Keys 209

      Storing Business Keys 210

      Looking Up Keys with Kettle 210

      Generating Surrogate Keys 210

      The “Add sequence” Step 211

      Working with auto_increment or IDENTITY Columns 217

      Keys for Slowly Changing Dimensions 217

      Loading Dimension Tables 218

      Snowflaked Dimension Tables 218

      Top-Down Level-Wise Loading 219

      Sakila Snowflake Example 219

      Sample Transformation 221

      Database Lookup Configuration 222

      Sample Job 225

      Star Schema Dimension Tables 226

      Denormalization 226

      Denormalizing to 1NF with the “Database lookup” Step 226

      Change Data Capture 227

      Slowly Changing Dimensions 228

      Types of Slowly Changing Dimensions 228

      Type 1 Slowly Changing Dimensions 229

      The Insert / Update Step 229

      Type 2 Slowly Changing Dimensions 232

      The “Dimension lookup / update” Step 232

      Other Types of Slowly Changing Dimensions 237

      Type 3 Slowly Changing Dimensions 237

      Hybrid Slowly Changing Dimensions 238

      More Dimensions 239

      Generated Dimensions 239

      Date and Time Dimensions 239

      Generated Mini-Dimensions 239

      Junk Dimensions 241

      Recursive Hierarchies 242

      Summary 243

      Chapter 9 Loading Fact Tables 245

      Loading in Bulk 246

      STDIN and FIFO 247

      Kettle Bulk Loaders 248

      MySQL Bulk Loading 249

      LucidDB Bulk Loader 249

      Oracle Bulk Loader 249

      PostgreSQL Bulk Loader 250

      Table Output Step 250

      General Bulk Load Considerations 250

      Dimension Lookups 251

      Maintaining Referential Integrity 251

      The Surrogate Key Pipeline 252

      Using In-Memory Lookups 253

      Stream Lookups 253

      Late-Arriving Data 255

      Late-Arriving Facts 256

      Late-Arriving Dimensions 256

      Fact Table Handling 260

      Periodic and Accumulating Snapshots 260

      Introducing State-Oriented Fact Tables 261

      Loading Periodic Snapshots 263

      Loading Accumulating Snapshots 264

      Loading State-Oriented Fact Tables 265

      Loading Aggregate Tables 266

      Summary 267

      Chapter 10 Working with OLAP Data 269

      OLAP Benefits and Challenges 270

      OLAP Storage Types 272

      Positioning OLAP 272

      Kettle OLAP Options 273

      Working with Mondrian 274

      Working with XML/A Servers 277

      Working with Palo 282

      Setting Up the Palo Connection 283

      Palo Architecture 284

      Reading Palo Data 285

      Writing Palo Data 289

      Summary 291

      Part III Management and Deployment 293

      Chapter 11 ETL Development Lifecycle 295

      Solution Design 295

      Best and Bad Practices 296

      Data Mapping 297

      Naming and Commentary Conventions 298

      Common Pitfalls 299

      ETL Flow Design 300

      Reusability and Maintainability 300

      Agile Development 301

      Testing and Debugging 306

      Test Activities 307

      ETL Testing 308

      Test Data Requirements 308

      Testing for Completeness 309

      Testing Data Transformations 311

      Test Automation and Continuous Integration 311

      Upgrade Tests 312

      Debugging 312

      Documenting the Solution 315

      Why Isn’t There Any Documentation? 316

      Myth 1: My Software Is Self-Explanatory 316

      Myth 2: Documentation Is Always Outdated 316

      Myth 3: Who Reads Documentation Anyway? 317

      Kettle Documentation Features 317

      Generating Documentation 319

      Summary 320

      Chapter 12 Scheduling and Monitoring 321

      Scheduling 321

      Operating System–Level Scheduling 322

      Executing Kettle Jobs and Transformations from

      the Command Line 322

      UNIX-Based Systems: cron 326

      Windows: The at utility and the Task Scheduler 327

      Using Pentaho’s Built-in Scheduler 327

      Creating an Action Sequence to Run Kettle Jobs and

      Transformations 328

      Kettle Transformations in Action Sequences 329

      Creating and Maintaining Schedules with the

      Administration Console 330

      Attaching an Action Sequence to a Schedule 333

      Monitoring 333

      Logging 333

      Inspecting the Log 333

      Logging Levels 335

      Writing Custom Messages to the Log 336

      E‑mail Notifications 336

      Configuring the Mail Job Entry 337

      Summary 340

      Chapter 13 Versioning and Migration 341

      Version Control Systems 341

      File-Based Version Control Systems 342

      Organization 342

      Leading File-Based VCSs 343

      Content Management Systems 344

      Kettle Metadata 344

      Kettle XML Metadata 345

      Transformation XML 345

      Job XML 346

      Global Replace 347

      Kettle Repository Metadata 348

      The Kettle Database Repository Type 348

      The Kettle File Repository Type 349

      The Kettle Enterprise Repository Type 350

      Managing Repositories 350

      Exporting and Importing Repositories 350

      Upgrading Your Repository 351

      Version Migration System 352

      Managing XML Files 352

      Managing Repositories 352

      Parameterizing Your Solution 353

      Summary 356

      Chapter 14 Lineage and Auditing 357

      Batch-Level Lineage Extraction 358

      Lineage 359

      Lineage Information 359

      Impact Analysis Information 361

      Logging and Operational Metadata 363

      Logging Basics 363

      Logging Architecture 364

      Setting a Maximum Buffer Size 365

      Setting a Maximum Log Line Age 365

      Log Channels 366

      Log Text Capturing in a Job 366

      Logging Tables 367

      Transformation Logging Tables 367

      Job Logging Tables 373

      Summary 374

      Part IV Performance and Scalability 375

      Chapter 15 Performance Tuning 377

      Transformation Performance: Finding the Weakest Link 377

      Finding Bottlenecks by Simplifying 379

      Finding Bottlenecks by Measuring 380

      Copying Rows of Data 382

      Improving Transformation Performance 384

      Improving Performance in Reading Text Files 384

      Using Lazy Conversion for Reading Text Files 385

      Single-File Parallel Reading 385

      Multi-File Parallel Reading 386

      Configuring the NIO Block Size 386

      Changing Disks and Reading Text Files 386

      Improving Performance in Writing Text Files 387

      Using Lazy Conversion for Writing Text Files 387

      Parallel Files Writing 387

      Changing Disks and Writing Text Files 387

      Improving Database Performance 388

      Avoiding Dynamic SQL 388

      Handling Roundtrips 388

      Handling Relational Databases 390

      Sorting Data 392

      Sorting on the Database 393

      Sorting in Parallel 393

      Reducing CPU Usage 394

      Optimizing the Use of JavaScript 394

      Launching Multiple Copies of a Step 396

      Selecting and Removing Values 397

      Managing Thread Priorities 397

      Adding Static Data to Rows of Data 397

      Limiting the Number of Step Copies 398

      Avoiding Excessive Logging 398

      Improving Job Performance 399

      Loops in Jobs 399

      Database Connection Pools 400

      Summary 401

      Chapter 16 Parallelization, Clustering, and Partitioning 403

      Multi-Threading 403

      Row Distribution 404

      Row Merging 405

      Row Redistribution 406

      Data Pipelining 407

      Consequences of Multi-Threading 408

      Database Connections 408

      Order of Execution 409

      Parallel Execution in a Job 411

      Using Carte as a Slave Server 411

      The Configuration File 411

      Defining Slave Servers 412

      Remote Execution 413

      Monitoring Slave Servers 413

      Carte Security 414

      Services 414

      Clustering Transformations 417

      Defining a Cluster Schema 417

      Designing Clustered Transformations 418

      Execution and Monitoring 420

      Metadata Transformations 421

      Rules 422

      Data Pipelining 425

      Partitioning 425

      Defining a Partitioning Schema 425

      Objectives of Partitioning 427

      Implementing Partitioning 428

      Internal Variables 428

      Database Partitions 429

      Partitioning in a Clustered Transformation 430

      Summary 430

      Chapter 17 Dynamic Clustering in the Cloud 433

      Dynamic Clustering 433

      Setting Up a Dynamic Cluster 434

      Using the Dynamic Cluster 436

      Cloud Computing 437

      EC2 438

      Getting Started with EC2 438

      Costs 438

      Customizing an AMI 439

      Packaging a New AMI 442

      Terminating an AMI 442

      Running a Master 442

      Running the Slaves 443

      Using the EC2 Cluster 444

      Monitoring 445

      The Lightweight Principle and Persistence Options 446

      Summary 447

      Chapter 18 Real-Time Data Integration 449

      Introduction to Real-Time ETL 449

      Real-Time Challenges 450

      Requirements 451

      Transformation Streaming 452

      A Practical Example of Transformation Streaming 454

      Debugging 457

      Third-Party Software and Real-Time Integration 458

      Java Message Service 459

      Creating a JMS Connection and Session 459

      Consuming Messages 460

      Producing Messages 460

      Closing Shop 460

      Summary 461

      Part V Advanced Topics 463

      Chapter 19 Data Vault Management 465

      Introduction to Data Vault Modeling 466

      Do You Need a Data Vault? 466

      Data Vault Building Blocks 467

      Hubs 467

      Links 468

      Satellites 469

      Data Vault Characteristics 471

      Building a Data Vault 471

      Transforming Sakila to the Data Vault Model 472

      Sakila Hubs 472

      Sakila Links 473

      Sakila Satellites 474

      Loading the Data Vault: A Sample ETL Solution 477

      Installing the Sakila Data Vault 477

      Setting Up the ETL Solution 477

      Creating a Database Account 477

      The Sample ETL Data Vault Solution 478

      Sample Hub: hub_actor 478

      Sample Link: link_customer_store 480

      Sample Satellite: sat_actor 483

      Loading the Data Vault Tables 485

      Updating a Data Mart from a Data Vault 486

      The Sample ETL Solution 486

      The dim_actor Transformation 486

      The dim_customer Transformation 488

      The dim_film Transformation 492

      The dim_film_actor_bridge Transformation 492

      The fact_rental Transformation 493

      Loading the Star Schema Tables 495

      Summary 495

      Chapter 20 Handling Complex Data Formats 497

      Non-Relational and Non-Tabular Data Formats 498

      Non-Relational Tabular Formats 498

      Handling Multi-Valued Attributes 498

      Using the Split Field to Rows Step 499

      Handling Repeating Groups 500

      Using the Row Normaliser Step 500

      Semi- and Unstructured Data 501

      Kettle Regular Expression Example 503

      Configuring the Regex Evaluation Step 504

      Verifying the Match 507

      Key/Value Pairs 508

      Kettle Key/Value Pairs Example 509

      Text File Input 509

      Regex Evaluation 510

      Grouping Lines into Records 511

      Denormaliser: Turning Rows into Columns 512

      Summary 513

      Chapter 21 Web Services 515

      Web Pages and Web Services 515

      Kettle Web Features 516

      General HTTP Steps 516

      Simple Object Access Protocol 517

      Really Simple Syndication 517

      Apache Virtual File System Integration 517

      Data Formats 517

      XML 518

      Kettle Steps for Working with XML 518

      Kettle Job Entries for XML 519

      HTML 520

      JavaScript Object Notation 520

      Syntax 521

      JSON, Kettle, and ETL/DI 522

      XML Examples 523

      Example XML Document 523

      XML Document Structure 523

      Mapping to the Sakila Sample Database 524

      Extracting Data from XML 525

      Overall Design: The import_xml_into_db Transformation 526

      Using the XSD Validator Step 528

      Using the “Get Data from XML” Step 530

      Generating XML Documents 537

      Overall Design: The export_xml_from_db Transformation 537

      Generating XML with the Add XML Step 538

      Using the XML Join Step 541

      SOAP Examples 544

      Using the “Web services lookup” Step 544

      Configuring the “Web services lookup” Step 544

      Accessing SOAP Services Directly 546

      JSON Example 549

      The Freebase Project 549

      Freebase Versus Wikipedia 549

      Freebase Web Services 550

      The Freebase Read Service 550

      The Metaweb Query Language 551

      Extracting Freebase Data with Kettle 553

      Generate Rows 554

      Issuing a Freebase Read Request 555

      Processing the Freebase Result Envelope 556

      Filtering Out the Original Row 557

      Storing to File 558

      RSS 558

      RSS Structure 558

      Channel 558

      Item 559

      RSS Support in Kettle 560

      RSS Input 561

      RSS Output 562

      Summary 567

      Chapter 22 Kettle Integration 569

      The Kettle API 569

      The LGPL License 569

      The Kettle Java API 570

      Source Code 570

      Building Kettle 571

      Building javadoc 571

      Libraries and the Class Path 571

      Executing Existing Transformations and Jobs 571

      Executing a Transformation 572

      Executing a Job 573

      Embedding Kettle 574

      Pentaho Reporting 574

      Putting Data into a Transformation 576

      Dynamic Transformations 580

      Dynamic Template 583

      Dynamic Jobs 584

      Executing Dynamic ETL in Kettle 586

      Result 587

      Replacing Metadata 588

      Direct Changes with the API 589

      Using a Shared Objects File 589

      OEM Versions and Forks 590

      Creating an OEM Version of PDI 590

      Forking Kettle 591

      Summary 592

      Chapter 23 Extending Kettle 593

      Plugin Architecture Overview 593

      Plugin Types 594

      Architecture 595

      Prerequisites 596

      Kettle API Documentation 596

      Libraries 596

      Integrated Development Environment 596

      Eclipse Project Setup 597

      Examples 598

      Transformation Step Plugins 599

      StepMetaInterface 599

      Value Metadata 605

      Row Metadata 606

      StepDataInterface 607

      StepDialogInterface 607

      Eclipse SWT 607

      Form Layout 607

      Kettle UI Elements 609

      Hello World Example Dialog 609

      StepInterface 614

      Reading Rows from Specific Steps 616

      Writing Rows to Specific Steps 616

      Writing Rows to Error Handling 617

      Identifying a Step Copy 617

      Result Feedback 618

      Variable Substitution 618

      Apache VFS 619

      Step Plugin Deployment 619

      The User-Defined Java Class Step 620

      Passing Metadata 620

      Accessing Input and Fields 620

      Snippets 620

      Example 620

      Job Entry Plugins 621

      JobEntryInterface 622

      JobEntryDialogInterface 624

      Partitioning Method Plugins 624

      Partitioner 625

      Repository Type Plugins 626

      Database Type Plugins 627

      Summary 628

      Appendix A The Kettle Ecosystem 629

      Kettle Development and Versions 629

      The Pentaho Community Wiki 631

      Using the Forums 631

      Jira 632

      ##pentaho 633

      Appendix B Kettle Enterprise Edition Features 635

      Appendix C Built-in Variables and Properties Reference 637

      Internal Variables 637

      Kettle Variables 640

      Variables for Configuring VFS 641

      Noteworthy JRE Variables 642

      Index 643

    Recently viewed products

    © 2025 Book Curl

      • American Express
      • Apple Pay
      • Diners Club
      • Discover
      • Google Pay
      • Maestro
      • Mastercard
      • PayPal
      • Shop Pay
      • Union Pay
      • Visa

      Login

      Forgot your password?

      Don't have an account yet?
      Create account