Description

Book Synopsis

Learn methods of data analysis and their application to real-world data sets

This updated second edition serves as an introduction to data mining methods and models, including association rules, clustering, neural networks, logistic regression, and multivariate analysis. The authors apply a unified white-box approach to data mining methods and models. This approach walks readers through the operations and nuances of the various methods, using small data sets, so that readers can gain insight into the inner workings of the method under review. Chapters provide hands-on analysis problems that give readers an opportunity to apply their newly acquired data mining expertise to solving real problems using large, real-world data sets.

Data Mining and Predictive Analytics:

  • Offers comprehensive coverage of association rules, clustering, neural networks, logistic regression, multivariate analysis, and R statistical prog

    Table of Contents

    PREFACE xxi

    ACKNOWLEDGMENTS xxix

    PART I DATA PREPARATION 1

    CHAPTER 1 AN INTRODUCTION TO DATA MINING AND PREDICTIVE ANALYTICS 3

    1.1 What is Data Mining? What is Predictive Analytics? 3

    1.2 Wanted: Data Miners 5

    1.3 The Need for Human Direction of Data Mining 6

    1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM 6

    1.4.1 CRISP-DM: The Six Phases 7

    1.5 Fallacies of Data Mining 9

    1.6 What Tasks Can Data Mining Accomplish? 10

    CHAPTER 2 DATA PREPROCESSING 20

    2.1 Why do We Need to Preprocess the Data? 20

    2.2 Data Cleaning 21

    2.3 Handling Missing Data 22

    2.4 Identifying Misclassifications 25

    2.5 Graphical Methods for Identifying Outliers 26

    2.6 Measures of Center and Spread 27

    2.7 Data Transformation 30

    2.8 Min–Max Normalization 30

    2.9 Z-Score Standardization 31

    2.10 Decimal Scaling 32

    2.11 Transformations to Achieve Normality 32

    2.12 Numerical Methods for Identifying Outliers 38

    2.13 Flag Variables 39

    2.14 Transforming Categorical Variables into Numerical Variables 40

    2.15 Binning Numerical Variables 41

    2.16 Reclassifying Categorical Variables 42

    2.17 Adding an Index Field 43

    2.18 Removing Variables that are not Useful 43

    2.19 Variables that Should Probably not be Removed 43

    2.20 Removal of Duplicate Records 44

    2.21 A Word About ID Fields 45

    CHAPTER 3 EXPLORATORY DATA ANALYSIS 54

    3.1 Hypothesis Testing Versus Exploratory Data Analysis 54

    3.2 Getting to Know the Data Set 54

    3.3 Exploring Categorical Variables 56

    3.4 Exploring Numeric Variables 64

    3.5 Exploring Multivariate Relationships 69

    3.6 Selecting Interesting Subsets of the Data for Further Investigation 70

    3.7 Using EDA to Uncover Anomalous Fields 71

    3.8 Binning Based on Predictive Value 72

    3.9 Deriving New Variables: Flag Variables 75

    3.10 Deriving New Variables: Numerical Variables 77

    3.11 Using EDA to Investigate Correlated Predictor Variables 78

    3.12 Summary of Our EDA 81

    CHAPTER 4 DIMENSION-REDUCTION METHODS 92

    4.1 Need for Dimension-Reduction in Data Mining 92

    4.2 Principal Components Analysis 93

    4.3 Applying PCA to the Houses Data Set 96

    4.4 How Many Components Should We Extract? 102

    4.5 Profiling the Principal Components 105

    4.6 Communalities 108

    4.7 Validation of the Principal Components 110

    4.8 Factor Analysis 110

    4.9 Applying Factor Analysis to the Adult Data Set 111

    4.10 Factor Rotation 114

    4.11 User-Defined Composites 117

    4.12 An Example of a User-Defined Composite 118

    PART II STATISTICAL ANALYSIS 129

    CHAPTER 5 UNIVARIATE STATISTICAL ANALYSIS 131

    5.1 Data Mining Tasks in Discovering Knowledge in Data 131

    5.2 Statistical Approaches to Estimation and Prediction 131

    5.3 Statistical Inference 132

    5.4 How Confident are We in Our Estimates? 133

    5.5 Confidence Interval Estimation of the Mean 134

    5.6 How to Reduce the Margin of Error 136

    5.7 Confidence Interval Estimation of the Proportion 137

    5.8 Hypothesis Testing for the Mean 138

    5.9 Assessing the Strength of Evidence Against the Null Hypothesis 140

    5.10 Using Confidence Intervals to Perform Hypothesis Tests 141

    5.11 Hypothesis Testing for the Proportion 143

    CHAPTER 6 MULTIVARIATE STATISTICS 148

    6.1 Two-Sample t-Test for Difference in Means 148

    6.2 Two-Sample Z-Test for Difference in Proportions 149

    6.3 Test for the Homogeneity of Proportions 150

    6.4 Chi-Square Test for Goodness of Fit of Multinomial Data 152

    6.5 Analysis of Variance 153

    CHAPTER 7 PREPARING TO MODEL THE DATA 160

    7.1 Supervised Versus Unsupervised Methods 160

    7.2 Statistical Methodology and Data Mining Methodology 161

    7.3 Cross-Validation 161

    7.4 Overfitting 163

    7.5 Bias–Variance Trade-Off 164

    7.6 Balancing the Training Data Set 166

    7.7 Establishing Baseline Performance 167

    CHAPTER 8 SIMPLE LINEAR REGRESSION 171

    8.1 An Example of Simple Linear Regression 171

    8.2 Dangers of Extrapolation 177

    8.3 How Useful is the Regression? The Coefficient of Determination, r² 178

    8.4 Standard Error of the Estimate, s 183

    8.5 Correlation Coefficient r 184

    8.6 ANOVA Table for Simple Linear Regression 186

    8.7 Outliers, High Leverage Points, and Influential Observations 186

    8.8 Population Regression Equation 195

    8.9 Verifying the Regression Assumptions 198

    8.10 Inference in Regression 203

    8.11 t-Test for the Relationship Between x and y 204

    8.12 Confidence Interval for the Slope of the Regression Line 206

    8.13 Confidence Interval for the Correlation Coefficient ρ 208

    8.14 Confidence Interval for the Mean Value of y Given x 210

    8.15 Prediction Interval for a Randomly Chosen Value of y Given x 211

    8.16 Transformations to Achieve Linearity 213

    8.17 Box–Cox Transformations 220

    CHAPTER 9 MULTIPLE REGRESSION AND MODEL BUILDING 236

    9.1 An Example of Multiple Regression 236

    9.2 The Population Multiple Regression Equation 242

    9.3 Inference in Multiple Regression 243

    9.4 Regression with Categorical Predictors, Using Indicator Variables 249

    9.5 Adjusting R²: Penalizing Models for Including Predictors that are not Useful 256

    9.6 Sequential Sums of Squares 257

    9.7 Multicollinearity 258

    9.8 Variable Selection Methods 266

    9.9 Gas Mileage Data Set 270

    9.10 An Application of Variable Selection Methods 271

    9.11 Using the Principal Components as Predictors in Multiple Regression 279

    PART III CLASSIFICATION 299

    CHAPTER 10 k-NEAREST NEIGHBOR ALGORITHM 301

    10.1 Classification Task 301

    10.2 k-Nearest Neighbor Algorithm 302

    10.3 Distance Function 305

    10.4 Combination Function 307

    10.5 Quantifying Attribute Relevance: Stretching the Axes 309

    10.6 Database Considerations 310

    10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction 310

    10.8 Choosing k 311

    10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler 312

    CHAPTER 11 DECISION TREES 317

    11.1 What is a Decision Tree? 317

    11.2 Requirements for Using Decision Trees 319

    11.3 Classification and Regression Trees 319

    11.4 C4.5 Algorithm 326

    11.5 Decision Rules 332

    11.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data 332

    CHAPTER 12 NEURAL NETWORKS 339

    12.1 Input and Output Encoding 339

    12.2 Neural Networks for Estimation and Prediction 342

    12.3 Simple Example of a Neural Network 342

    12.4 Sigmoid Activation Function 344

    12.5 Back-Propagation 345

    12.6 Gradient-Descent Method 346

    12.7 Back-Propagation Rules 347

    12.8 Example of Back-Propagation 347

    12.9 Termination Criteria 349

    12.10 Learning Rate 350

    12.11 Momentum Term 351

    12.12 Sensitivity Analysis 353

    12.13 Application of Neural Network Modeling 353

    CHAPTER 13 LOGISTIC REGRESSION 359

    13.1 Simple Example of Logistic Regression 359

    13.2 Maximum Likelihood Estimation 361

    13.3 Interpreting Logistic Regression Output 362

    13.4 Inference: are the Predictors Significant? 363

    13.5 Odds Ratio and Relative Risk 365

    13.6 Interpreting Logistic Regression for a Dichotomous Predictor 367

    13.7 Interpreting Logistic Regression for a Polychotomous Predictor 370

    13.8 Interpreting Logistic Regression for a Continuous Predictor 374

    13.9 Assumption of Linearity 378

    13.10 Zero-Cell Problem 382

    13.11 Multiple Logistic Regression 384

    13.12 Introducing Higher Order Terms to Handle Nonlinearity 388

    13.13 Validating the Logistic Regression Model 395

    13.14 WEKA: Hands-On Analysis Using Logistic Regression 399

    CHAPTER 14 NAÏVE BAYES AND BAYESIAN NETWORKS 414

    14.1 Bayesian Approach 414

    14.2 Maximum a Posteriori (MAP) Classification 416

    14.3 Posterior Odds Ratio 420

    14.4 Balancing the Data 422

    14.5 Naïve Bayes Classification 423

    14.6 Interpreting the Log Posterior Odds Ratio 426

    14.7 Zero-Cell Problem 428

    14.8 Numeric Predictors for Naïve Bayes Classification 429

    14.9 WEKA: Hands-on Analysis Using Naïve Bayes 432

    14.10 Bayesian Belief Networks 436

    14.11 Clothing Purchase Example 436

    14.12 Using the Bayesian Network to Find Probabilities 439

    CHAPTER 15 MODEL EVALUATION TECHNIQUES 451

    15.1 Model Evaluation Techniques for the Description Task 451

    15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks 452

    15.3 Model Evaluation Measures for the Classification Task 454

    15.4 Accuracy and Overall Error Rate 456

    15.5 Sensitivity and Specificity 457

    15.6 False-Positive Rate and False-Negative Rate 458

    15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives 458

    15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns 460

    15.9 Decision Cost/Benefit Analysis 462

    15.10 Lift Charts and Gains Charts 463

    15.11 Interweaving Model Evaluation with Model Building 466

    15.12 Confluence of Results: Applying a Suite of Models 466

    CHAPTER 16 COST-BENEFIT ANALYSIS USING DATA-DRIVEN COSTS 471

    16.1 Decision Invariance Under Row Adjustment 471

    16.2 Positive Classification Criterion 473

    16.3 Demonstration of the Positive Classification Criterion 474

    16.4 Constructing the Cost Matrix 474

    16.5 Decision Invariance Under Scaling 476

    16.6 Direct Costs and Opportunity Costs 478

    16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs 478

    16.8 Rebalancing as a Surrogate for Misclassification Costs 483

    CHAPTER 17 COST-BENEFIT ANALYSIS FOR TRINARY AND k-NARY CLASSIFICATION MODELS 491

    17.1 Classification Evaluation Measures for a Generic Trinary Target 491

    17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem 494

    17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem 498

    17.4 Comparing CART Models with and without Data-Driven Misclassification Costs 500

    17.5 Classification Evaluation Measures for a Generic k-Nary Target 503

    17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification 504

    CHAPTER 18 GRAPHICAL EVALUATION OF CLASSIFICATION MODELS 510

    18.1 Review of Lift Charts and Gains Charts 510

    18.2 Lift Charts and Gains Charts Using Misclassification Costs 510

    18.3 Response Charts 511

    18.4 Profits Charts 512

    18.5 Return on Investment (ROI) Charts 514

    PART IV CLUSTERING 521

    CHAPTER 19 HIERARCHICAL AND k-MEANS CLUSTERING 523

    19.1 The Clustering Task 523

    19.2 Hierarchical Clustering Methods 525

    19.3 Single-Linkage Clustering 526

    19.4 Complete-Linkage Clustering 527

    19.5 k-Means Clustering 529

    19.6 Example of k-Means Clustering at Work 530

    19.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds 533

    19.8 Application of k-Means Clustering Using SAS Enterprise Miner 534

    19.9 Using Cluster Membership to Predict Churn 537

    CHAPTER 20 KOHONEN NETWORKS 542

    20.1 Self-Organizing Maps 542

    20.2 Kohonen Networks 544

    20.3 Example of a Kohonen Network Study 545

    20.4 Cluster Validity 549

    20.5 Application of Clustering Using Kohonen Networks 549

    20.6 Interpreting the Clusters 551

    20.7 Using Cluster Membership as Input to Downstream Data Mining Models 556

    CHAPTER 21 BIRCH CLUSTERING 560

    21.1 Rationale for BIRCH Clustering 560

    21.2 Cluster Features 561

    21.3 Cluster Feature Tree 562

    21.4 Phase 1: Building the CF Tree 562

    21.5 Phase 2: Clustering the Sub-Clusters 564

    21.6 Example of BIRCH Clustering, Phase 1: Building the CF Tree 565

    21.7 Example of BIRCH Clustering, Phase 2: Clustering the Sub-Clusters 570

    21.8 Evaluating the Candidate Cluster Solutions 571

    21.9 Case Study: Applying BIRCH Clustering to the Bank Loans Data Set 571

    CHAPTER 22 MEASURING CLUSTER GOODNESS 582

    22.1 Rationale for Measuring Cluster Goodness 582

    22.2 The Silhouette Method 583

    22.3 Silhouette Example 584

    22.4 Silhouette Analysis of the IRIS Data Set 585

    22.5 The Pseudo-F Statistic 590

    22.6 Example of the Pseudo-F Statistic 591

    22.7 Pseudo-F Statistic Applied to the IRIS Data Set 592

    22.8 Cluster Validation 593

    22.9 Cluster Validation Applied to the Loans Data Set 594

    PART V ASSOCIATION RULES 601

    CHAPTER 23 ASSOCIATION RULES 603

    23.1 Affinity Analysis and Market Basket Analysis 603

    23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property 605

    23.3 How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets 607

    23.4 How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules 608

    23.5 Extension from Flag Data to General Categorical Data 611

    23.6 Information-Theoretic Approach: Generalized Rule Induction Method 612

    23.7 Association Rules are Easy to do Badly 614

    23.8 How can we Measure the Usefulness of Association Rules? 615

    23.9 Do Association Rules Represent Supervised or Unsupervised Learning? 616

    23.10 Local Patterns Versus Global Models 617

    PART VI ENHANCING MODEL PERFORMANCE 623

    CHAPTER 24 SEGMENTATION MODELS 625

    24.1 The Segmentation Modeling Process 625

    24.2 Segmentation Modeling Using EDA to Identify the Segments 627

    24.3 Segmentation Modeling using Clustering to Identify the Segments 629

    CHAPTER 25 ENSEMBLE METHODS: BAGGING AND BOOSTING 637

    25.1 Rationale for Using an Ensemble of Classification Models 637

    25.2 Bias, Variance, and Noise 639

    25.3 When to Apply, and Not to Apply, Bagging 640

    25.4 Bagging 641

    25.5 Boosting 643

    25.6 Application of Bagging and Boosting Using IBM/SPSS Modeler 647

    CHAPTER 26 MODEL VOTING AND PROPENSITY AVERAGING 653

    26.1 Simple Model Voting 653

    26.2 Alternative Voting Methods 654

    26.3 Model Voting Process 655

    26.4 An Application of Model Voting 656

    26.5 What is Propensity Averaging? 660

    26.6 Propensity Averaging Process 661

    26.7 An Application of Propensity Averaging 661

    PART VII FURTHER TOPICS 669

    CHAPTER 27 GENETIC ALGORITHMS 671

    27.1 Introduction to Genetic Algorithms 671

    27.2 Basic Framework of a Genetic Algorithm 672

    27.3 Simple Example of a Genetic Algorithm at Work 673

    27.4 Modifications and Enhancements: Selection 676

    27.5 Modifications and Enhancements: Crossover 678

    27.6 Genetic Algorithms for Real-Valued Variables 679

    27.7 Using Genetic Algorithms to Train a Neural Network 681

    27.8 WEKA: Hands-On Analysis Using Genetic Algorithms 684

    CHAPTER 28 IMPUTATION OF MISSING DATA 695

    28.1 Need for Imputation of Missing Data 695

    28.2 Imputation of Missing Data: Continuous Variables 696

    28.3 Standard Error of the Imputation 699

    28.4 Imputation of Missing Data: Categorical Variables 700

    28.5 Handling Patterns in Missingness 701

    PART VIII CASE STUDY: PREDICTING RESPONSE TO DIRECT-MAIL MARKETING 705

    CHAPTER 29 CASE STUDY, PART 1: BUSINESS UNDERSTANDING, DATA PREPARATION, AND EDA 707

    29.1 Cross-Industry Standard Practice for Data Mining 707

    29.2 Business Understanding Phase 709

    29.3 Data Understanding Phase, Part 1: Getting a Feel for the Data Set 710

    29.4 Data Preparation Phase 714

    29.5 Data Understanding Phase, Part 2: Exploratory Data Analysis 721

    CHAPTER 30 CASE STUDY, PART 2: CLUSTERING AND PRINCIPAL COMPONENTS ANALYSIS 732

    30.1 Partitioning the Data 732

    30.2 Developing the Principal Components 733

    30.3 Validating the Principal Components 737

    30.4 Profiling the Principal Components 737

    30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering 742

    30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering 744

    30.7 Application of k-Means Clustering 745

    30.8 Validating the Clusters 745

    30.9 Profiling the Clusters 745

    CHAPTER 31 CASE STUDY, PART 3: MODELING AND EVALUATION FOR PERFORMANCE AND INTERPRETABILITY 749

    31.1 Do you Prefer the Best Model Performance, or a Combination of Performance and Interpretability? 749

    31.2 Modeling and Evaluation Overview 750

    31.3 Cost-Benefit Analysis Using Data-Driven Costs 751

    31.4 Variables to be Input to the Models 753

    31.5 Establishing the Baseline Model Performance 754

    31.6 Models that use Misclassification Costs 755

    31.7 Models that Need Rebalancing as a Surrogate for Misclassification Costs 756

    31.8 Combining Models Using Voting and Propensity Averaging 757

    31.9 Interpreting the Most Profitable Model 758

    CHAPTER 32 CASE STUDY, PART 4: MODELING AND EVALUATION FOR HIGH PERFORMANCE ONLY 762

    32.1 Variables to be Input to the Models 762

    32.2 Models that use Misclassification Costs 762

    32.3 Models that Need Rebalancing as a Surrogate for Misclassification Costs 764

    32.4 Combining Models using Voting and Propensity Averaging 765

    32.5 Lessons Learned 766

    32.6 Conclusions 766

    APPENDIX A DATA SUMMARIZATION AND VISUALIZATION 768

    Part 1: Summarization 1: Building Blocks of Data Analysis 768

    Part 2: Visualization: Graphs and Tables for Summarizing and Organizing Data 770

    Part 3: Summarization 2: Measures of Center, Variability, and Position 774

    Part 4: Summarization and Visualization of Bivariate Relationships 777

    INDEX 781

Data Mining and Predictive Analytics

    Product form

    £107.06

    RRP £118.95 – you save £11.89 (9%)

    A Hardback by Daniel T. Larose

      Publisher: John Wiley & Sons Inc
      Publication Date: 24/04/2015
      ISBN13: 978-1118116197
      ISBN10: 1118116194

      Description

      Book Synopsis

      Learn methods of data analysis and their application to real-world data sets

      This updated second edition serves as an introduction to data mining methods and models, including association rules, clustering, neural networks, logistic regression, and multivariate analysis. The authors apply a unified white box approach to data mining methods and models. This approach is designed to walk readers through the operations and nuances of the various methods, using small data sets, so readers can gain an insight into the inner workings of the method under review. Chapters provide readers with hands-on analysis problems, representing an opportunity for readers to apply their newly-acquired data mining expertise to solving real problems using large, real-world data sets.

      Data Mining and Predictive Analytics:

      • Offers comprehensive coverage of association rules, clustering, neural networks, logistic regression, multivariate analysis, and R statistical prog

        Table of Contents

        PREFACE xxi

        ACKNOWLEDGMENTS xxix

        PART I DATA PREPARATION 1

        CHAPTER 1 AN INTRODUCTION TO DATA MINING AND PREDICTIVE ANALYTICS 3

        1.1 What is Data Mining? What is Predictive Analytics? 3

        1.2 Wanted: Data Miners 5

        1.3 The Need for Human Direction of Data Mining 6

        1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM 6

        1.4.1 CRISP-DM: The Six Phases 7

        1.5 Fallacies of Data Mining 9

        1.6 What Tasks Can Data Mining Accomplish 10

        CHAPTER 2 DATA PREPROCESSING 20

        2.1 Why do We Need to Preprocess the Data? 20

        2.2 Data Cleaning 21

        2.3 Handling Missing Data 22

        2.4 Identifying Misclassifications 25

        2.5 Graphical Methods for Identifying Outliers 26

        2.6 Measures of Center and Spread 27

        2.7 Data Transformation 30

        2.8 Min–Max Normalization 30

        2.9 Z-Score Standardization 31

        2.10 Decimal Scaling 32

        2.11 Transformations to Achieve Normality 32

        2.12 Numerical Methods for Identifying Outliers 38

        2.13 Flag Variables 39

        2.14 Transforming Categorical Variables into Numerical Variables 40

        2.15 Binning Numerical Variables 41

        2.16 Reclassifying Categorical Variables 42

        2.17 Adding an Index Field 43

        2.18 Removing Variables that are not Useful 43

        2.19 Variables that Should Probably not be Removed 43

        2.20 Removal of Duplicate Records 44

        2.21 A Word About ID Fields 45

        CHAPTER 3 EXPLORATORY DATA ANALYSIS 54

        3.1 Hypothesis Testing Versus Exploratory Data Analysis 54

        3.2 Getting to Know the Data Set 54

        3.3 Exploring Categorical Variables 56

        3.4 Exploring Numeric Variables 64

        3.5 Exploring Multivariate Relationships 69

        3.6 Selecting Interesting Subsets of the Data for Further Investigation 70

        3.7 Using EDA to Uncover Anomalous Fields 71

        3.8 Binning Based on Predictive Value 72

        3.9 Deriving New Variables: Flag Variables 75

        3.10 Deriving New Variables: Numerical Variables 77

        3.11 Using EDA to Investigate Correlated Predictor Variables 78

        3.12 Summary of Our EDA 81

        CHAPTER 4 DIMENSION-REDUCTION METHODS 92

        4.1 Need for Dimension-Reduction in Data Mining 92

        4.2 Principal Components Analysis 93

        4.3 Applying PCA to the Houses Data Set 96

        4.4 How Many Components Should We Extract? 102

        4.5 Profiling the Principal Components 105

        4.6 Communalities 108

        4.7 Validation of the Principal Components 110

        4.8 Factor Analysis 110

        4.9 Applying Factor Analysis to the Adult Data Set 111

        4.10 Factor Rotation 114

        4.11 User-Defined Composites 117

        4.12 An Example of a User-Defined Composite 118

        PART II STATISTICAL ANALYSIS 129

        CHAPTER 5 UNIVARIATE STATISTICAL ANALYSIS 131

        5.1 Data Mining Tasks in Discovering Knowledge in Data 131

        5.2 Statistical Approaches to Estimation and Prediction 131

        5.3 Statistical Inference 132

        5.4 How Confident are We in Our Estimates? 133

        5.5 Confidence Interval Estimation of the Mean 134

        5.6 How to Reduce the Margin of Error 136

        5.7 Confidence Interval Estimation of the Proportion 137

        5.8 Hypothesis Testing for the Mean 138

        5.9 Assessing the Strength of Evidence Against the Null Hypothesis 140

        5.10 Using Confidence Intervals to Perform Hypothesis Tests 141

        5.11 Hypothesis Testing for the Proportion 143

        CHAPTER 6 MULTIVARIATE STATISTICS 148

        6.1 Two-Sample t-Test for Difference in Means 148

        6.2 Two-Sample Z-Test for Difference in Proportions 149

        6.3 Test for the Homogeneity of Proportions 150

        6.4 Chi-Square Test for Goodness of Fit of Multinomial Data 152

        6.5 Analysis of Variance 153

        CHAPTER 7 PREPARING TO MODEL THE DATA 160

        7.1 Supervised Versus Unsupervised Methods 160

        7.2 Statistical Methodology and Data Mining Methodology 161

        7.3 Cross-Validation 161

        7.4 Overfitting 163

        7.5 Bias–Variance Trade-Off 164

        7.6 Balancing the Training Data Set 166

        7.7 Establishing Baseline Performance 167

        CHAPTER 8 SIMPLE LINEAR REGRESSION 171

        8.1 An Example of Simple Linear Regression 171

        8.2 Dangers of Extrapolation 177

        8.3 How Useful is the Regression? The Coefficient of Determination, r2 178

        8.4 Standard Error of the Estimate, s 183

        8.5 Correlation Coefficient r 184

        8.6 Anova Table for Simple Linear Regression 186

        8.7 Outliers, High Leverage Points, and Influential Observations 186

        8.8 Population Regression Equation 195

        8.9 Verifying the Regression Assumptions 198

        8.10 Inference in Regression 203

        8.11 t-Test for the Relationship Between x and y 204

        8.12 Confidence Interval for the Slope of the Regression Line 206

        8.13 Confidence Interval for the Correlation Coefficient p 208

        8.14 Confidence Interval for the Mean Value of y Given x 210

        8.15 Prediction Interval for a Randomly Chosen Value of y Given x 211

        8.16 Transformations to Achieve Linearity 213

        8.17 Box–Cox Transformations 220

        CHAPTER 9 MULTIPLE REGRESSION AND MODEL BUILDING 236

        9.1 An Example of Multiple Regression 236

        9.2 The Population Multiple Regression Equation 242

        9.3 Inference in Multiple Regression 243

        9.4 Regression with Categorical Predictors, Using Indicator Variables 249

        9.5 Adjusting R2: Penalizing Models for Including Predictors that are not Useful 256

        9.6 Sequential Sums of Squares 257

        9.7 Multicollinearity 258

        9.8 Variable Selection Methods 266

        9.9 Gas Mileage Data Set 270

        9.10 An Application of Variable Selection Methods 271

        9.11 Using the Principal Components as Predictors in Multiple Regression 279

        PART III CLASSIFICATION 299

        CHAPTER 10 k-NEAREST NEIGHBOR ALGORITHM 301

        10.1 Classification Task 301

        10.2 k-Nearest Neighbor Algorithm 302

        10.3 Distance Function 305

        10.4 Combination Function 307

        10.5 Quantifying Attribute Relevance: Stretching the Axes 309

        10.6 Database Considerations 310

        10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction 310

        10.8 Choosing k 311

        10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler 312

        CHAPTER 11 DECISION TREES 317

        11.1 What is a Decision Tree? 317

        11.2 Requirements for Using Decision Trees 319

        11.3 Classification and Regression Trees 319

        11.4 C4.5 Algorithm 326

        11.5 Decision Rules 332

        11.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data 332

        CHAPTER 12 NEURAL NETWORKS 339

        12.1 Input and Output Encoding 339

        12.2 Neural Networks for Estimation and Prediction 342

        12.3 Simple Example of a Neural Network 342

        12.4 Sigmoid Activation Function 344

        12.5 Back-Propagation 345

        12.6 Gradient-Descent Method 346

        12.7 Back-Propagation Rules 347

        12.8 Example of Back-Propagation 347

        12.9 Termination Criteria 349

        12.10 Learning Rate 350

        12.11 Momentum Term 351

        12.12 Sensitivity Analysis 353

        12.13 Application of Neural Network Modeling 353

        CHAPTER 13 LOGISTIC REGRESSION 359

        13.1 Simple Example of Logistic Regression 359

        13.2 Maximum Likelihood Estimation 361

        13.3 Interpreting Logistic Regression Output 362

        13.4 Inference: are the Predictors Significant? 363

        13.5 Odds Ratio and Relative Risk 365

        13.6 Interpreting Logistic Regression for a Dichotomous Predictor 367

        13.7 Interpreting Logistic Regression for a Polychotomous Predictor 370

        13.8 Interpreting Logistic Regression for a Continuous Predictor 374

        13.9 Assumption of Linearity 378

        13.10 Zero-Cell Problem 382

        13.11 Multiple Logistic Regression 384

        13.12 Introducing Higher Order Terms to Handle Nonlinearity 388

        13.13 Validating the Logistic Regression Model 395

        13.14 WEKA: Hands-On Analysis Using Logistic Regression 399

        CHAPTER 14 NAÏVE BAYES AND BAYESIAN NETWORKS 414

        14.1 Bayesian Approach 414

        14.2 Maximum a Posteriori (Map) Classification 416

        14.3 Posterior Odds Ratio 420

        14.4 Balancing the Data 422

        14.5 Naïve Bayes Classification 423

        14.6 Interpreting the Log Posterior Odds Ratio 426

        14.7 Zero-Cell Problem 428

        14.8 Numeric Predictors for Naïve Bayes Classification 429

        14.9 WEKA: Hands-on Analysis Using Naïve Bayes 432

        14.10 Bayesian Belief Networks 436

        14.11 Clothing Purchase Example 436

        14.12 Using the Bayesian Network to Find Probabilities 439

        CHAPTER 15 MODEL EVALUATION TECHNIQUES 451

        15.1 Model Evaluation Techniques for the Description Task 451

        15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks 452

        15.3 Model Evaluation Measures for the Classification Task 454

        15.4 Accuracy and Overall Error Rate 456

        15.5 Sensitivity and Specificity 457

        15.6 False-Positive Rate and False-Negative Rate 458

        15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives 458

        15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns 460

        15.9 Decision Cost/Benefit Analysis 462

        15.10 Lift Charts and Gains Charts 463

        15.11 Interweaving Model Evaluation with Model Building 466

        15.12 Confluence of Results: Applying a Suite of Models 466

        CHAPTER 16 COST-BENEFIT ANALYSIS USING DATA-DRIVEN COSTS 471

        16.1 Decision Invariance Under Row Adjustment 471

        16.2 Positive Classification Criterion 473

        16.3 Demonstration of the Positive Classification Criterion 474

        16.4 Constructing the Cost Matrix 474

        16.5 Decision Invariance Under Scaling 476

        16.6 Direct Costs and Opportunity Costs 478

        16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs 478

        16.8 Rebalancing as a Surrogate for Misclassification Costs 483

        CHAPTER 17 COST-BENEFIT ANALYSIS FOR TRINARY AND k-NARY CLASSIFICATION MODELS 491

        17.1 Classification Evaluation Measures for a Generic Trinary Target 491

        17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem 494

        17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem 498

        17.4 Comparing Cart Models with and without Data-Driven Misclassification Costs 500

        17.5 Classification Evaluation Measures for a Generic k-Nary Target 503

        17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification 504

        CHAPTER 18 GRAPHICAL EVALUATION OF CLASSIFICATION MODELS

        18.1 Review of Lift Charts and Gains Charts

        18.2 Lift Charts and Gains Charts Using Misclassification Costs

        18.3 Response Charts

        18.4 Profits Charts

        18.5 Return on Investment (ROI) Charts

        PART IV CLUSTERING

        CHAPTER 19 HIERARCHICAL AND k-MEANS CLUSTERING

        19.1 The Clustering Task

        19.2 Hierarchical Clustering Methods

        19.3 Single-Linkage Clustering

        19.4 Complete-Linkage Clustering

        19.5 k-Means Clustering

        19.6 Example of k-Means Clustering at Work

        19.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds

        19.8 Application of k-Means Clustering Using SAS Enterprise Miner

        19.9 Using Cluster Membership to Predict Churn

        CHAPTER 20 KOHONEN NETWORKS

        20.1 Self-Organizing Maps

        20.2 Kohonen Networks

        20.3 Example of a Kohonen Network Study

        20.4 Cluster Validity

        20.5 Application of Clustering Using Kohonen Networks

        20.6 Interpreting the Clusters

        20.7 Using Cluster Membership as Input to Downstream Data Mining Models

        CHAPTER 21 BIRCH CLUSTERING

        21.1 Rationale for BIRCH Clustering

        21.2 Cluster Features

        21.3 Cluster Feature Tree

        21.4 Phase 1: Building the CF Tree

        21.5 Phase 2: Clustering the Sub-Clusters

        21.6 Example of BIRCH Clustering, Phase 1: Building the CF Tree

        21.7 Example of BIRCH Clustering, Phase 2: Clustering the Sub-Clusters

        21.8 Evaluating the Candidate Cluster Solutions

        21.9 Case Study: Applying BIRCH Clustering to the Bank Loans Data Set

        CHAPTER 22 MEASURING CLUSTER GOODNESS

        22.1 Rationale for Measuring Cluster Goodness

        22.2 The Silhouette Method

        22.3 Silhouette Example

        22.4 Silhouette Analysis of the IRIS Data Set

        22.5 The Pseudo-F Statistic

        22.6 Example of the Pseudo-F Statistic

        22.7 Pseudo-F Statistic Applied to the IRIS Data Set

        22.8 Cluster Validation

        22.9 Cluster Validation Applied to the Loans Data Set

        PART V ASSOCIATION RULES

        CHAPTER 23 ASSOCIATION RULES

        23.1 Affinity Analysis and Market Basket Analysis

        23.2 Support, Confidence, Frequent Itemsets, and the A Priori Property

        23.3 How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets

        23.4 How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules

        23.5 Extension from Flag Data to General Categorical Data

        23.6 Information-Theoretic Approach: Generalized Rule Induction Method

        23.7 Association Rules Are Easy to Do Badly

        23.8 How Can We Measure the Usefulness of Association Rules?

        23.9 Do Association Rules Represent Supervised or Unsupervised Learning?

        23.10 Local Patterns Versus Global Models

        PART VI ENHANCING MODEL PERFORMANCE

        CHAPTER 24 SEGMENTATION MODELS

        24.1 The Segmentation Modeling Process

        24.2 Segmentation Modeling Using EDA to Identify the Segments

        24.3 Segmentation Modeling Using Clustering to Identify the Segments

        CHAPTER 25 ENSEMBLE METHODS: BAGGING AND BOOSTING

        25.1 Rationale for Using an Ensemble of Classification Models

        25.2 Bias, Variance, and Noise

        25.3 When to Apply, and Not to Apply, Bagging

        25.4 Bagging

        25.5 Boosting

        25.6 Application of Bagging and Boosting Using IBM/SPSS Modeler

        CHAPTER 26 MODEL VOTING AND PROPENSITY AVERAGING

        26.1 Simple Model Voting

        26.2 Alternative Voting Methods

        26.3 Model Voting Process

        26.4 An Application of Model Voting

        26.5 What is Propensity Averaging?

        26.6 Propensity Averaging Process

        26.7 An Application of Propensity Averaging

        PART VII FURTHER TOPICS

        CHAPTER 27 GENETIC ALGORITHMS

        27.1 Introduction to Genetic Algorithms

        27.2 Basic Framework of a Genetic Algorithm

        27.3 Simple Example of a Genetic Algorithm at Work

        27.4 Modifications and Enhancements: Selection

        27.5 Modifications and Enhancements: Crossover

        27.6 Genetic Algorithms for Real-Valued Variables

        27.7 Using Genetic Algorithms to Train a Neural Network

        27.8 WEKA: Hands-On Analysis Using Genetic Algorithms

        CHAPTER 28 IMPUTATION OF MISSING DATA

        28.1 Need for Imputation of Missing Data

        28.2 Imputation of Missing Data: Continuous Variables

        28.3 Standard Error of the Imputation

        28.4 Imputation of Missing Data: Categorical Variables

        28.5 Handling Patterns in Missingness

        PART VIII CASE STUDY: PREDICTING RESPONSE TO DIRECT-MAIL MARKETING

        CHAPTER 29 CASE STUDY, PART 1: BUSINESS UNDERSTANDING, DATA PREPARATION, AND EDA

        29.1 Cross-Industry Standard Practice for Data Mining

        29.2 Business Understanding Phase

        29.3 Data Understanding Phase, Part 1: Getting a Feel for the Data Set

        29.4 Data Preparation Phase

        29.5 Data Understanding Phase, Part 2: Exploratory Data Analysis

        CHAPTER 30 CASE STUDY, PART 2: CLUSTERING AND PRINCIPAL COMPONENTS ANALYSIS

        30.1 Partitioning the Data

        30.2 Developing the Principal Components

        30.3 Validating the Principal Components

        30.4 Profiling the Principal Components

        30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering

        30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering

        30.7 Application of k-Means Clustering

        30.8 Validating the Clusters

        30.9 Profiling the Clusters

        CHAPTER 31 CASE STUDY, PART 3: MODELING AND EVALUATION FOR PERFORMANCE AND INTERPRETABILITY

        31.1 Do You Prefer the Best Model Performance, or a Combination of Performance and Interpretability?

        31.2 Modeling and Evaluation Overview

        31.3 Cost-Benefit Analysis Using Data-Driven Costs

        31.4 Variables to Be Input to the Models

        31.5 Establishing the Baseline Model Performance

        31.6 Models That Use Misclassification Costs

        31.7 Models That Need Rebalancing as a Surrogate for Misclassification Costs

        31.8 Combining Models Using Voting and Propensity Averaging

        31.9 Interpreting the Most Profitable Model

        CHAPTER 32 CASE STUDY, PART 4: MODELING AND EVALUATION FOR HIGH PERFORMANCE ONLY

        32.1 Variables to Be Input to the Models

        32.2 Models That Use Misclassification Costs

        32.3 Models That Need Rebalancing as a Surrogate for Misclassification Costs

        32.4 Combining Models Using Voting and Propensity Averaging

        32.5 Lessons Learned

        32.6 Conclusions

        APPENDIX A DATA SUMMARIZATION AND VISUALIZATION

        Part 1: Summarization 1: Building Blocks of Data Analysis

        Part 2: Visualization: Graphs and Tables for Summarizing and Organizing Data

        Part 3: Summarization 2: Measures of Center, Variability, and Position

        Part 4: Summarization and Visualization of Bivariate Relationships

        INDEX

      © 2026 Book Curl