Table of Contents

Introduction xxi

Chapter 1 The Two Essential Algorithms for Making Predictions 1

Why are These Two Algorithms So Useful? 2

What are Penalized Regression Methods? 7

What are Ensemble Methods? 9

How to Decide Which Algorithm to Use 11

The Process Steps for Building a Predictive Model 13

Framing a Machine Learning Problem 15

Feature Extraction and Feature Engineering 17

Determining Performance of a Trained Model 18

Chapter Contents and Dependencies 18

Summary 20

Chapter 2 Understand the Problem by Understanding the Data 23

The Anatomy of a New Problem 24

Different Types of Attributes and Labels Drive Modeling Choices 26

Things to Notice about Your New Data Set 27

Classification Problems: Detecting Unexploded Mines Using Sonar 28

Physical Characteristics of the Rocks Versus Mines Data Set 29

Statistical Summaries of the Rocks Versus Mines Data Set 32

Visualization of Outliers Using a Quantile-Quantile Plot 34

Statistical Characterization of Categorical Attributes 35

How to Use Python Pandas to Summarize the Rocks Versus Mines Data Set 36

Visualizing Properties of the Rocks Versus Mines Data Set 39

Visualizing with Parallel Coordinates Plots 39

Visualizing Interrelationships between Attributes and Labels 41

Visualizing Attribute and Label Correlations Using a Heat Map 48

Summarizing the Process for Understanding the Rocks Versus Mines Data Set 50

Real-Valued Predictions with Factor Variables: How Old is Your Abalone? 50

Parallel Coordinates for Regression Problems—Visualize Variable Relationships for the Abalone Problem 55

How to Use a Correlation Heat Map for Regression—Visualize Pair-Wise Correlations for the Abalone Problem 59

Real-Valued Predictions Using Real-Valued Attributes: Calculate How Your Wine Tastes 61

Multiclass Classification Problem: What Type of Glass is That? 67

Using PySpark to Understand Large Data Sets 72

Summary 75

Chapter 3 Predictive Model Building: Balancing Performance, Complexity, and Big Data 77

The Basic Problem: Understanding Function Approximation 78

Working with Training Data 79

Assessing Performance of Predictive Models 81

Factors Driving Algorithm Choices and Performance—Complexity and Data 82

Contrast between a Simple Problem and a Complex Problem 82

Contrast between a Simple Model and a Complex Model 85

Factors Driving Predictive Algorithm Performance 89

Choosing an Algorithm: Linear or Nonlinear? 90

Measuring the Performance of Predictive Models 91

Performance Measures for Different Types of Problems 91

Simulating Performance of Deployed Models 105

Achieving Harmony between Model and Data 107

Choosing a Model to Balance Problem Complexity, Model Complexity, and Data Set Size 107

Using Forward Stepwise Regression to Control Overfitting 109

Evaluating and Understanding Your Predictive Model 114

Control Overfitting by Penalizing Regression Coefficients—Ridge Regression 116

Using PySpark for Training Penalized Regression Models on Extremely Large Data Sets 124

Summary 127

Chapter 4 Penalized Linear Regression 129

Why Penalized Linear Regression Methods are So Useful 130

Extremely Fast Coefficient Estimation 130

Variable Importance Information 131

Extremely Fast Evaluation When Deployed 131

Reliable Performance 131

Sparse Solutions 132

Problem May Require Linear Model 132

When to Use Ensemble Methods 132

Penalized Linear Regression: Regulating Linear Regression for Optimum Performance 132

Training Linear Models: Minimizing Errors and More 135

Adding a Coefficient Penalty to the OLS Formulation 136

Other Useful Coefficient Penalties—Manhattan and ElasticNet 137

Why Lasso Penalty Leads to Sparse Coefficient Vectors 138

ElasticNet Penalty Includes Both Lasso and Ridge 140

Solving the Penalized Linear Regression Problem 141

Understanding Least Angle Regression and Its Relationship to Forward Stepwise Regression 141

How LARS Generates Hundreds of Models of Varying Complexity 145

Choosing the Best Model from the Hundreds LARS Generates 147

Using Glmnet: Very Fast and Very General 152

Comparison of the Mechanics of Glmnet and LARS Algorithms 153

Initializing and Iterating the Glmnet Algorithm 153

Extension of Linear Regression to Classification Problems 157

Solving Classification Problems with Penalized Regression 157

Working with Classification Problems Having More Than Two Outcomes 161

Understanding Basis Expansion: Using Linear Methods on Nonlinear Problems 161

Incorporating Non-Numeric Attributes into Linear Methods 163

Summary 166

Chapter 5 Building Predictive Models Using Penalized Linear Methods 169

Python Packages for Penalized Linear Regression 170

Multivariable Regression: Predicting Wine Taste 171

Building and Testing a Model to Predict Wine Taste 172

Training on the Whole Data Set before Deployment 175

Basis Expansion: Improving Performance by Creating New Variables from Old Ones 179

Binary Classification: Using Penalized Linear Regression to Detect Unexploded Mines 182

Build a Rocks Versus Mines Classifier for Deployment 191

Multiclass Classification: Classifying Crime Scene Glass Samples 200

Linear Regression and Classification Using PySpark 203

Using PySpark to Predict Wine Taste 204

Logistic Regression with PySpark: Rocks Versus Mines 208

Incorporating Categorical Variables in a PySpark Model: Predicting Abalone Rings 213

Multiclass Logistic Regression with Meta Parameter Optimization 217

Summary 219

Chapter 6 Ensemble Methods 221

Binary Decision Trees 222

How a Binary Decision Tree Generates Predictions 224

How to Train a Binary Decision Tree 225

Tree Training Equals Split Point Selection 227

How Split Point Selection Affects Predictions 228

Algorithm for Selecting Split Points 229

Multivariable Tree Training—Which Attribute to Split? 229

Recursive Splitting for More Tree Depth 230

Overfitting Binary Trees 231

Measuring Overfit with Binary Trees 231

Balancing Binary Tree Complexity for Best Performance 232

Modifications for Classification and Categorical Features 235

Bootstrap Aggregation: “Bagging” 235

How Does the Bagging Algorithm Work? 236

Bagging Performance—Bias Versus Variance 239

How Bagging Behaves on Multivariable Problem 241

Bagging Needs Tree Depth for Performance 245

Summary of Bagging 246

Gradient Boosting 246

Basic Principle of Gradient Boosting Algorithm 246

Parameter Settings for Gradient Boosting 249

How Gradient Boosting Iterates toward a Predictive Model 249

Getting the Best Performance from Gradient Boosting 250

Gradient Boosting on a Multivariable Problem 253

Summary for Gradient Boosting 256

Random Forests 256

Random Forests: Bagging Plus Random Attribute Subsets 259

Random Forests Performance Drivers 260

Random Forests Summary 261

Summary 262

Chapter 7 Building Ensemble Models with Python 265

Solving Regression Problems with Python Ensemble Packages 265

Using Gradient Boosting to Predict Wine Taste 266

Using the Class Constructor for GradientBoostingRegressor 266

Using GradientBoostingRegressor to Implement a Regression Model 268

Assessing the Performance of a Gradient Boosting Model 271

Building a Random Forest Model to Predict Wine Taste 272

Constructing a RandomForestRegressor Object 273

Modeling Wine Taste with RandomForestRegressor 275

Visualizing the Performance of a Random Forest Regression Model 279

Incorporating Non-Numeric Attributes in Python Ensemble Models 279

Coding the Sex of Abalone for Gradient Boosting Regression in Python 280

Assessing Performance and the Importance of Coded Variables with Gradient Boosting 282

Coding the Sex of Abalone for Input to Random Forest Regression in Python 284

Assessing Performance and the Importance of Coded Variables 287

Solving Binary Classification Problems with Python Ensemble Methods 288

Detecting Unexploded Mines with Python Gradient Boosting 288

Determining the Performance of a Gradient Boosting Classifier 291

Detecting Unexploded Mines with Python Random Forest 292

Constructing a Random Forest Model to Detect Unexploded Mines 294

Determining the Performance of a Random Forest Classifier 298

Solving Multiclass Classification Problems with Python Ensemble Methods 300

Dealing with Class Imbalances 301

Classifying Glass Using Gradient Boosting 301

Determining the Performance of the Gradient Boosting Model on Glass Classification 306

Classifying Glass with Random Forests 307

Determining the Performance of the Random Forest Model on Glass Classification 310

Solving Regression Problems with PySpark Ensemble Packages 311

Predicting Wine Taste with PySpark Ensemble Methods 312

Predicting Abalone Age with PySpark Ensemble Methods 317

Distinguishing Mines from Rocks with PySpark Ensemble Methods 321

Identifying Glass Types with PySpark Ensemble Methods 325

Summary 327

Index 329

Machine Learning with Spark and Python

    Product form

    A Paperback / softback by Michael Bowles

      Publisher: John Wiley & Sons Inc
      Publication Date: 05/12/2019
      ISBN13: 9781119561934
      ISBN10: 1119561930

