Description

Book Synopsis
PEEK UNDER THE HOOD OF BIG DATA ANALYTICS The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance. The authors discuss foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You???ll learn how to recognize problems in your applications resulting in performance and distributed operation issues, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within. Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system. Ideal for data scientists, data architects, dev-ops engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to: Identify the foundations of large-scale, distributed data processing systemsMake major software design decisions that optimize performanceDiagnose performance problems and distributed operation issuesUnderstand state-of-the-art research in big dataExplain and use the major big data frameworks and understand what underpins themUse big data analytics in the real world to solve practical problems

Table of Contents

Introduction xxvii

Chapter 1 Data Intensive Applications 1

Anatomy of a Data-Intensive Application 1

A Histogram Example 2

Program 2

Process Management 3

Communication 4

Execution 5

Data Structures 6

Putting It Together 6

Application 6

Resource Management 6

Messaging 7

Data Structures 7

Tasks and Execution 8

Fault Tolerance 8

Remote Execution 8

Parallel Applications 9

Serial Applications 9

Lloyd’s K-Means Algorithm 9

Parallelizing Algorithms 11

Decomposition 11

Task Assignment 12

Orchestration 12

Mapping 13

K-Means

Algorithm 13

Parallel and Distributed Computing 15

Memory Abstractions 16

Shared Memory 16

Distributed Memory 18

Hybrid (Shared + Distributed) Memory 20

Partitioned Global Address Space Memory 21

Application Classes and Frameworks 22

Parallel Interaction Patterns 22

Pleasingly Parallel 23

Dataflow 23

Iterative 23

Irregular 23

Data Abstractions 24

Data-Intensive

Frameworks 24

Components 24

Workflows 25

An Example 25

What Makes It Difficult? 26

Developing Applications 27

Concurrency 27

Data Partitioning 28

Debugging 28

Diverse Environments 28

Computer Networks 29

Synchronization 29

Thread Synchronization 29

Data Synchronization 30

Ordering of Events 31

Faults 31

Consensus 31

Summary 32

References 32

Chapter 2 Data and Storage 35

Storage Systems 35

Storage for Distributed Systems 36

Direct-Attached Storage 37

Storage Area Network 37

Network-Attached Storage 38

DAS or SAN or NAS? 38

Storage Abstractions 39

Block Storage 39

File Systems 40

Object Storage 41

Data Formats 41

XML 42

JSON 43

CSV 44

Apache Parquet 45

Apache Avro 47

Avro Data Definitions (Schema) 48

Code Generation 49

Without Code Generation 49

Avro File 49

Schema Evolution 49

Protocol Buffers, Flat Buffers, and Thrift 50

Data Replication 51

Synchronous and Asynchronous Replication 52

Single-Leader and Multileader Replication 52

Data Locality 53

Disadvantages of Replication 54

Data Partitioning 54

Vertical Partitioning 55

Horizontal Partitioning (Sharding) 55

Hybrid Partitioning 56

Considerations for Partitioning 57

NoSQL Databases 58

Data Models 58

Key-Value Databases 58

Document Databases 59

Wide Column Databases 59

Graph Databases 59

CAP Theorem 60

Message Queuing 61

Message Processing Guarantees 63

Durability of Messages 64

Acknowledgments 64

Storage First Brokers and Transient Brokers 65

Summary 66

References 66

Chapter 3 Computing Resources 69

A Demonstration 71

Computer Clusters 72

Anatomy of a Computer Cluster 73

Data Analytics in Clusters 74

Dedicated Clusters 76

Classic Parallel Systems 76

Big Data Systems 77

Shared Clusters 79

OpenMPI on a Slurm Cluster 79

Spark on a Yarn Cluster 80

Distributed Application Life Cycle 80

Life Cycle Steps 80

Step 1: Preparation of the Job Package 81

Step 2: Resource Acquisition 81

Step 3: Distributing the Application (Job) Artifacts 81

Step 4: Bootstrapping the Distributed Environment 82

Step 5: Monitoring 82

Step 6: Termination 83

Computing Resources 83

Data Centers 83

Physical Machines 85

Network 85

Virtual Machines 87

Containers 87

Processor, Random Access Memory, and Cache 88

Cache 89

Multiple Processors in a Computer 90

Nonuniform Memory Access 90

Uniform Memory Access 91

Hard Disk 92

GPUs 92

Mapping Resources to Applications 92

Cluster Resource Managers 93

Kubernetes 94

Kubernetes Architecture 94

Kubernetes Application Concepts 96

Data-Intensive Applications on Kubernetes 96

Slurm 98

Yarn 99

Job Scheduling 99

Scheduling Policy 101

Objective Functions 101

Throughput and Latency 101

Priorities 102

Lowering Distance Among the Processes 102

Data Locality 102

Completion Deadline 102

Algorithms 103

First in First Out 103

Gang Scheduling 103

List Scheduling 103

Backfill Scheduling 104

Summary 104

References 104

Chapter 4 Data Structures 107

Virtual Memory 108

Paging and TLB 109

Cache 111

The Need for Data Structures 112

Cache and Memory Layout 112

Memory Fragmentation 114

Data Transfer 115

Data Transfer Between Frameworks 115

Cross-Language Data Transfer 115

Object and Text Data 116

Serialization 116

Vectors and Matrices 117

1D Vectors 118

Matrices 118

Row-Major and Column-Major Formats 119

N-Dimensional Arrays/Tensors 122

NumPy 123

Memory Representation 125

K-means with NumPy 126

Sparse Matrices 127

Table 128

Table Formats 129

Column Data Format 129

Row Data Format 130

Apache Arrow 130

Arrow Data Format 131

Primitive Types 131

Variable-Length Data 132

Arrow Serialization 133

Arrow Example 133

Pandas DataFrame 134

Column vs. Row Tables 136

Summary 136

References 136

Chapter 5 Programming Models 139

Introduction 139

Parallel Programming Models 140

Parallel Process Interaction 140

Problem Decomposition 140

Data Structures 140

Data Structures and Operations 141

Data Types 141

Local Operations 143

Distributed Operations 143

Array 144

Tensor 145

Indexing 145

Slicing 146

Broadcasting 146

Table 146

Graph Data 148

Message Passing Model 150

Model 151

Message Passing Frameworks 151

Message Passing Interface 151

Bulk Synchronous Parallel 153

K-Means 154

Distributed Data Model 157

Eager Model 157

Dataflow Model 158

Data Frames, Datasets, and Tables 159

Input and Output 160

Task Graphs (Dataflow Graphs) 160

Model 161

User Program to Task Graph 161

Tasks and Functions 162

Source Task 162

Compute Task 163

Implicit vs. Explicit Parallel Models 163

Remote Execution 163

Components 164

Batch Dataflow 165

Data Abstractions 165

Table Abstraction 165

Matrix/Tensors 165

Functions 166

Source 166

Compute 167

Sink 168

An Example 168

Caching State 169

Evaluation Strategy 170

Lazy Evaluation 171

Eager Evaluation 171

Iterative Computations 172

DOALL Parallel 172

DOACROSS Parallel 172

Pipeline Parallel 173

Task Graph Models for Iterative Computations 173

K-Means Algorithm 174

Streaming Dataflow 176

Data Abstractions 177

Streams 177

Distributed Operations 178

Streaming Functions 178

Sources 178

Compute 179

Sink 179

An Example 179

Windowing 180

Windowing Strategies 181

Operations on Windows 182

Handling Late Events 182

SQL 182

Queries 183

Summary 184

References 184

Chapter 6 Messaging 187

Network Services 188

TCP/IP 188

RDMA 189

Messaging for Data Analytics 189

Anatomy of a Message 190

Data Packing 190

Protocol 191

Message Types 192

Control Messages 192

External Data Sources 192

Data Transfer Messages 192

Distributed Operations 194

How Are They Used? 194

Task Graph 194

Parallel Processes 195

Anatomy of a Distributed Operation 198

Data Abstractions 198

Distributed Operation API 198

Streaming and Batch Operations 199

Streaming Operations 199

Batch Operations 199

Distributed Operations on Arrays 200

Broadcast 200

Reduce and AllReduce 201

Gather and AllGather 202

Scatter 203

AllToAll 204

Optimized Operations 204

Broadcast 205

Reduce 206

AllReduce 206

Gather and AllGather Collective Algorithms 208

Scatter and AllToAll Collective Algorithms 208

Distributed Operations on Tables 209

Shuffle 209

Partitioning Data 211

Handling Large Data 212

Fetch-Based Algorithm (Asynchronous Algorithm) 213

Distributed Synchronization Algorithm 214

GroupBy 214

Aggregate 215

Join 216

Join Algorithms 219

Distributed Joins 221

Performance of Joins 223

More Operations 223

Advanced Topics 224

Data Packing 224

Memory Considerations 224

Message Coalescing 224

Compression 225

Stragglers 225

Nonblocking vs. Blocking Operations 225

Blocking Operations 226

Nonblocking Operations 226

Summary 227

References 227

Chapter 7 Parallel Tasks 229

CPUs 229

Cache 229

False Sharing 230

Vectorization 231

Threads and Processes 234

Concurrency and Parallelism 234

Context Switches and Scheduling 234

Mutual Exclusion 235

User-Level Threads 236

Process Affinity 236

NUMA-Aware Programming 237

Accelerators 237

Task Execution 238

Scheduling 240

Static Scheduling 240

Dynamic Scheduling 240

Loosely Synchronous and Asynchronous Execution 241

Loosely Synchronous Parallel System 242

Asynchronous Parallel System (Fully Distributed) 243

Actor Model 244

Actor 244

Asynchronous Messages 244

Actor Frameworks 245

Execution Models 245

Process Model 246

Thread Model 246

Remote Execution 246

Tasks for Data Analytics 248

SPMD and MPMD Execution 248

Batch Tasks 249

Data Partitions 249

Operations 251

Task Graph Scheduling 253

Threads, CPU Cores, and Partitions 254

Data Locality 255

Execution 257

Streaming Execution 257

State 257

Immutable Data 258

State in Driver 258

Distributed State 259

Streaming Tasks 259

Streams and Data Partitioning 260

Partitions 260

Operations 261

Scheduling 262

Uniform Resources 263

Resource-Aware Scheduling 264

Execution 264

Dynamic Scaling 264

Back Pressure (Flow Control) 265

Rate-Based Flow Control 266

Credit-Based Flow Control 266

State 267

Summary 268

References 268

Chapter 8 Case Studies 271

Apache Hadoop 271

Programming Model 272

Architecture 274

Cluster Resource Management 275

Apache Spark 275

Programming Model 275

RDD API 276

SQL, DataFrames, and DataSets 277

Architecture 278

Resource Managers 278

Task Schedulers 279

Executors 279

Communication Operations 280

Apache Spark Streaming 280

Apache Storm 282

Programming Model 282

Architecture 284

Cluster Resource Managers 285

Communication Operations 286

Kafka Streams 286

Programming Model 286

Architecture 287

PyTorch 288

Programming Model 288

Execution 292

Cylon 295

Programming Model 296

Architecture 296

Execution 297

Communication Operations 298

Rapids cuDF 298

Programming Model 298

Architecture 299

Summary 300

References 300

Chapter 9 Fault Tolerance 303

Dependable Systems and Failures 303

Fault Tolerance is Not Free 304

Dependable Systems 305

Failures 306

Process Failures 306

Network Failures 307

Node Failures 307

Byzantine Faults 307

Failure Models 308

Failure Detection 308

Recovering from Faults 309

Recovery Methods 310

Stateless Programs 310

Batch Systems 311

Streaming Systems 311

Processing Guarantees 311

Role of Cluster Resource Managers 312

Checkpointing 313

State 313

Consistent Global State 313

Uncoordinated Checkpointing 314

Coordinated Checkpointing 315

Chandy-Lamport Algorithm 315

Batch Systems 316

When to Checkpoint? 317

Snapshot Data 318

Streaming Systems 319

Case Study: Apache Storm 319

Message Tracking 320

Failure Recovery 321

Case Study: Apache Flink 321

Checkpointing 322

Failure Recovery 324

Batch Systems 324

Iterative Programs 324

Case Study: Apache Spark 325

RDD Recomputing 326

Checkpointing 326

Recovery from Failures 327

Summary 327

References 327

Chapter 10 Performance and Productivity 329

Performance Metrics 329

System Performance Metrics 330

Parallel Performance Metrics 330

Speedup 330

Strong Scaling 331

Weak Scaling 332

Parallel Efficiency 332

Amdahl’s Law 333

Gustafson’s Law 334

Throughput 334

Latency 335

Benchmarks 336

LINPACK Benchmark 336

NAS Parallel Benchmark 336

BigDataBench 336

TPC Benchmarks 337

HiBench 337

Performance Factors 337

Memory 337

Execution 338

Distributed Operators 338

Disk I/O 339

Garbage Collection 339

Finding Issues 342

Serial Programs 342

Profiling 342

Scaling 343

Strong Scaling 343

Weak Scaling 344

Debugging Distributed Applications 344

Programming Languages 345

C/C++ 346

Java 346

Memory Management 347

Data Structures 348

Interfacing with Python 348

Python 350

C/C++ Code integration 350

Productivity 351

Choice of Frameworks 351

Operating Environment 353

CPUs and GPUs 353

Public Clouds 355

Future of Data-Intensive Applications 358

Summary 358

References 359

Index 361

Foundations of Data Intensive Applications

    Product form

    £38.25

    Includes FREE delivery

    RRP £42.50 – you save £4.25 (10%)

    Order before 4pm today for delivery by Mon 13 Jul 2026.

    A Paperback / softback by Supun Kamburugamuve, Saliya Ekanayake

    10 in stock

      Trusted by thousands of customers. See 2,385+ Customer Reviews

      View other formats and editions of Foundations of Data Intensive Applications by Supun Kamburugamuve

      Publisher: John Wiley & Sons Inc
      Publication Date: 15/11/2021
      ISBN13: 9781119713029, 978-1119713029
      ISBN10: 1119713021
      Also in:
      Data mining

      Description

      Book Synopsis
      PEEK UNDER THE HOOD OF BIG DATA ANALYTICS The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance. The authors discuss foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You???ll learn how to recognize problems in your applications resulting in performance and distributed operation issues, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within. Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system. Ideal for data scientists, data architects, dev-ops engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to: Identify the foundations of large-scale, distributed data processing systemsMake major software design decisions that optimize performanceDiagnose performance problems and distributed operation issuesUnderstand state-of-the-art research in big dataExplain and use the major big data frameworks and understand what underpins themUse big data analytics in the real world to solve practical problems

      Table of Contents

      Introduction xxvii

      Chapter 1 Data Intensive Applications 1

      Anatomy of a Data-Intensive Application 1

      A Histogram Example 2

      Program 2

      Process Management 3

      Communication 4

      Execution 5

      Data Structures 6

      Putting It Together 6

      Application 6

      Resource Management 6

      Messaging 7

      Data Structures 7

      Tasks and Execution 8

      Fault Tolerance 8

      Remote Execution 8

      Parallel Applications 9

      Serial Applications 9

      Lloyd’s K-Means Algorithm 9

      Parallelizing Algorithms 11

      Decomposition 11

      Task Assignment 12

      Orchestration 12

      Mapping 13

      K-Means

      Algorithm 13

      Parallel and Distributed Computing 15

      Memory Abstractions 16

      Shared Memory 16

      Distributed Memory 18

      Hybrid (Shared + Distributed) Memory 20

      Partitioned Global Address Space Memory 21

      Application Classes and Frameworks 22

      Parallel Interaction Patterns 22

      Pleasingly Parallel 23

      Dataflow 23

      Iterative 23

      Irregular 23

      Data Abstractions 24

      Data-Intensive

      Frameworks 24

      Components 24

      Workflows 25

      An Example 25

      What Makes It Difficult? 26

      Developing Applications 27

      Concurrency 27

      Data Partitioning 28

      Debugging 28

      Diverse Environments 28

      Computer Networks 29

      Synchronization 29

      Thread Synchronization 29

      Data Synchronization 30

      Ordering of Events 31

      Faults 31

      Consensus 31

      Summary 32

      References 32

      Chapter 2 Data and Storage 35

      Storage Systems 35

      Storage for Distributed Systems 36

      Direct-Attached Storage 37

      Storage Area Network 37

      Network-Attached Storage 38

      DAS or SAN or NAS? 38

      Storage Abstractions 39

      Block Storage 39

      File Systems 40

      Object Storage 41

      Data Formats 41

      XML 42

      JSON 43

      CSV 44

      Apache Parquet 45

      Apache Avro 47

      Avro Data Definitions (Schema) 48

      Code Generation 49

      Without Code Generation 49

      Avro File 49

      Schema Evolution 49

      Protocol Buffers, Flat Buffers, and Thrift 50

      Data Replication 51

      Synchronous and Asynchronous Replication 52

      Single-Leader and Multileader Replication 52

      Data Locality 53

      Disadvantages of Replication 54

      Data Partitioning 54

      Vertical Partitioning 55

      Horizontal Partitioning (Sharding) 55

      Hybrid Partitioning 56

      Considerations for Partitioning 57

      NoSQL Databases 58

      Data Models 58

      Key-Value Databases 58

      Document Databases 59

      Wide Column Databases 59

      Graph Databases 59

      CAP Theorem 60

      Message Queuing 61

      Message Processing Guarantees 63

      Durability of Messages 64

      Acknowledgments 64

      Storage First Brokers and Transient Brokers 65

      Summary 66

      References 66

      Chapter 3 Computing Resources 69

      A Demonstration 71

      Computer Clusters 72

      Anatomy of a Computer Cluster 73

      Data Analytics in Clusters 74

      Dedicated Clusters 76

      Classic Parallel Systems 76

      Big Data Systems 77

      Shared Clusters 79

      OpenMPI on a Slurm Cluster 79

      Spark on a Yarn Cluster 80

      Distributed Application Life Cycle 80

      Life Cycle Steps 80

      Step 1: Preparation of the Job Package 81

      Step 2: Resource Acquisition 81

      Step 3: Distributing the Application (Job) Artifacts 81

      Step 4: Bootstrapping the Distributed Environment 82

      Step 5: Monitoring 82

      Step 6: Termination 83

      Computing Resources 83

      Data Centers 83

      Physical Machines 85

      Network 85

      Virtual Machines 87

      Containers 87

      Processor, Random Access Memory, and Cache 88

      Cache 89

      Multiple Processors in a Computer 90

      Nonuniform Memory Access 90

      Uniform Memory Access 91

      Hard Disk 92

      GPUs 92

      Mapping Resources to Applications 92

      Cluster Resource Managers 93

      Kubernetes 94

      Kubernetes Architecture 94

      Kubernetes Application Concepts 96

      Data-Intensive Applications on Kubernetes 96

      Slurm 98

      Yarn 99

      Job Scheduling 99

      Scheduling Policy 101

      Objective Functions 101

      Throughput and Latency 101

      Priorities 102

      Lowering Distance Among the Processes 102

      Data Locality 102

      Completion Deadline 102

      Algorithms 103

      First in First Out 103

      Gang Scheduling 103

      List Scheduling 103

      Backfill Scheduling 104

      Summary 104

      References 104

      Chapter 4 Data Structures 107

      Virtual Memory 108

      Paging and TLB 109

      Cache 111

      The Need for Data Structures 112

      Cache and Memory Layout 112

      Memory Fragmentation 114

      Data Transfer 115

      Data Transfer Between Frameworks 115

      Cross-Language Data Transfer 115

      Object and Text Data 116

      Serialization 116

      Vectors and Matrices 117

      1D Vectors 118

      Matrices 118

      Row-Major and Column-Major Formats 119

      N-Dimensional Arrays/Tensors 122

      NumPy 123

      Memory Representation 125

      K-means with NumPy 126

      Sparse Matrices 127

      Table 128

      Table Formats 129

      Column Data Format 129

      Row Data Format 130

      Apache Arrow 130

      Arrow Data Format 131

      Primitive Types 131

      Variable-Length Data 132

      Arrow Serialization 133

      Arrow Example 133

      Pandas DataFrame 134

      Column vs. Row Tables 136

      Summary 136

      References 136

      Chapter 5 Programming Models 139

      Introduction 139

      Parallel Programming Models 140

      Parallel Process Interaction 140

      Problem Decomposition 140

      Data Structures 140

      Data Structures and Operations 141

      Data Types 141

      Local Operations 143

      Distributed Operations 143

      Array 144

      Tensor 145

      Indexing 145

      Slicing 146

      Broadcasting 146

      Table 146

      Graph Data 148

      Message Passing Model 150

      Model 151

      Message Passing Frameworks 151

      Message Passing Interface 151

      Bulk Synchronous Parallel 153

      K-Means 154

      Distributed Data Model 157

      Eager Model 157

      Dataflow Model 158

      Data Frames, Datasets, and Tables 159

      Input and Output 160

      Task Graphs (Dataflow Graphs) 160

      Model 161

      User Program to Task Graph 161

      Tasks and Functions 162

      Source Task 162

      Compute Task 163

      Implicit vs. Explicit Parallel Models 163

      Remote Execution 163

      Components 164

      Batch Dataflow 165

      Data Abstractions 165

      Table Abstraction 165

      Matrix/Tensors 165

      Functions 166

      Source 166

      Compute 167

      Sink 168

      An Example 168

      Caching State 169

      Evaluation Strategy 170

      Lazy Evaluation 171

      Eager Evaluation 171

      Iterative Computations 172

      DOALL Parallel 172

      DOACROSS Parallel 172

      Pipeline Parallel 173

      Task Graph Models for Iterative Computations 173

      K-Means Algorithm 174

      Streaming Dataflow 176

      Data Abstractions 177

      Streams 177

      Distributed Operations 178

      Streaming Functions 178

      Sources 178

      Compute 179

      Sink 179

      An Example 179

      Windowing 180

      Windowing Strategies 181

      Operations on Windows 182

      Handling Late Events 182

      SQL 182

      Queries 183

      Summary 184

      References 184

      Chapter 6 Messaging 187

      Network Services 188

      TCP/IP 188

      RDMA 189

      Messaging for Data Analytics 189

      Anatomy of a Message 190

      Data Packing 190

      Protocol 191

      Message Types 192

      Control Messages 192

      External Data Sources 192

      Data Transfer Messages 192

      Distributed Operations 194

      How Are They Used? 194

      Task Graph 194

      Parallel Processes 195

      Anatomy of a Distributed Operation 198

      Data Abstractions 198

      Distributed Operation API 198

      Streaming and Batch Operations 199

      Streaming Operations 199

      Batch Operations 199

      Distributed Operations on Arrays 200

      Broadcast 200

      Reduce and AllReduce 201

      Gather and AllGather 202

      Scatter 203

      AllToAll 204

      Optimized Operations 204

      Broadcast 205

      Reduce 206

      AllReduce 206

      Gather and AllGather Collective Algorithms 208

      Scatter and AllToAll Collective Algorithms 208

      Distributed Operations on Tables 209

      Shuffle 209

      Partitioning Data 211

      Handling Large Data 212

      Fetch-Based Algorithm (Asynchronous Algorithm) 213

      Distributed Synchronization Algorithm 214

      GroupBy 214

      Aggregate 215

      Join 216

      Join Algorithms 219

      Distributed Joins 221

      Performance of Joins 223

      More Operations 223

      Advanced Topics 224

      Data Packing 224

      Memory Considerations 224

      Message Coalescing 224

      Compression 225

      Stragglers 225

      Nonblocking vs. Blocking Operations 225

      Blocking Operations 226

      Nonblocking Operations 226

      Summary 227

      References 227

      Chapter 7 Parallel Tasks 229

      CPUs 229

      Cache 229

      False Sharing 230

      Vectorization 231

      Threads and Processes 234

      Concurrency and Parallelism 234

      Context Switches and Scheduling 234

      Mutual Exclusion 235

      User-Level Threads 236

      Process Affinity 236

      NUMA-Aware Programming 237

      Accelerators 237

      Task Execution 238

      Scheduling 240

      Static Scheduling 240

      Dynamic Scheduling 240

      Loosely Synchronous and Asynchronous Execution 241

      Loosely Synchronous Parallel System 242

      Asynchronous Parallel System (Fully Distributed) 243

      Actor Model 244

      Actor 244

      Asynchronous Messages 244

      Actor Frameworks 245

      Execution Models 245

      Process Model 246

      Thread Model 246

      Remote Execution 246

      Tasks for Data Analytics 248

      SPMD and MPMD Execution 248

      Batch Tasks 249

      Data Partitions 249

      Operations 251

      Task Graph Scheduling 253

      Threads, CPU Cores, and Partitions 254

      Data Locality 255

      Execution 257

      Streaming Execution 257

      State 257

      Immutable Data 258

      State in Driver 258

      Distributed State 259

      Streaming Tasks 259

      Streams and Data Partitioning 260

      Partitions 260

      Operations 261

      Scheduling 262

      Uniform Resources 263

      Resource-Aware Scheduling 264

      Execution 264

      Dynamic Scaling 264

      Back Pressure (Flow Control) 265

      Rate-Based Flow Control 266

      Credit-Based Flow Control 266

      State 267

      Summary 268

      References 268

      Chapter 8 Case Studies 271

      Apache Hadoop 271

      Programming Model 272

      Architecture 274

      Cluster Resource Management 275

      Apache Spark 275

      Programming Model 275

      RDD API 276

      SQL, DataFrames, and DataSets 277

      Architecture 278

      Resource Managers 278

      Task Schedulers 279

      Executors 279

      Communication Operations 280

      Apache Spark Streaming 280

      Apache Storm 282

      Programming Model 282

      Architecture 284

      Cluster Resource Managers 285

      Communication Operations 286

      Kafka Streams 286

      Programming Model 286

      Architecture 287

      PyTorch 288

      Programming Model 288

      Execution 292

      Cylon 295

      Programming Model 296

      Architecture 296

      Execution 297

      Communication Operations 298

      Rapids cuDF 298

      Programming Model 298

      Architecture 299

      Summary 300

      References 300

      Chapter 9 Fault Tolerance 303

      Dependable Systems and Failures 303

      Fault Tolerance is Not Free 304

      Dependable Systems 305

      Failures 306

      Process Failures 306

      Network Failures 307

      Node Failures 307

      Byzantine Faults 307

      Failure Models 308

      Failure Detection 308

      Recovering from Faults 309

      Recovery Methods 310

      Stateless Programs 310

      Batch Systems 311

      Streaming Systems 311

      Processing Guarantees 311

      Role of Cluster Resource Managers 312

      Checkpointing 313

      State 313

      Consistent Global State 313

      Uncoordinated Checkpointing 314

      Coordinated Checkpointing 315

      Chandy-Lamport Algorithm 315

      Batch Systems 316

      When to Checkpoint? 317

      Snapshot Data 318

      Streaming Systems 319

      Case Study: Apache Storm 319

      Message Tracking 320

      Failure Recovery 321

      Case Study: Apache Flink 321

      Checkpointing 322

      Failure Recovery 324

      Batch Systems 324

      Iterative Programs 324

      Case Study: Apache Spark 325

      RDD Recomputing 326

      Checkpointing 326

      Recovery from Failures 327

      Summary 327

      References 327

      Chapter 10 Performance and Productivity 329

      Performance Metrics 329

      System Performance Metrics 330

      Parallel Performance Metrics 330

      Speedup 330

      Strong Scaling 331

      Weak Scaling 332

      Parallel Efficiency 332

      Amdahl’s Law 333

      Gustafson’s Law 334

      Throughput 334

      Latency 335

      Benchmarks 336

      LINPACK Benchmark 336

      NAS Parallel Benchmark 336

      BigDataBench 336

      TPC Benchmarks 337

      HiBench 337

      Performance Factors 337

      Memory 337

      Execution 338

      Distributed Operators 338

      Disk I/O 339

      Garbage Collection 339

      Finding Issues 342

      Serial Programs 342

      Profiling 342

      Scaling 343

      Strong Scaling 343

      Weak Scaling 344

      Debugging Distributed Applications 344

      Programming Languages 345

      C/C++ 346

      Java 346

      Memory Management 347

      Data Structures 348

      Interfacing with Python 348

      Python 350

      C/C++ Code integration 350

      Productivity 351

      Choice of Frameworks 351

      Operating Environment 353

      CPUs and GPUs 353

      Public Clouds 355

      Future of Data-Intensive Applications 358

      Summary 358

      References 359

      Index 361

      Recently viewed products

      © 2026 Book Curl

        • American Express
        • Apple Pay
        • Diners Club
        • Discover
        • Google Pay
        • Maestro
        • Mastercard
        • PayPal
        • Shop Pay
        • Union Pay
        • Visa

        Login

        Forgot your password?

        Don't have an account yet?
        Create account