Description

Book Synopsis
Professional CUDA C Programming provides down-to-earth coverage of the complex topic of parallel computing, a topic increasingly essential in everyday computing. This entry-level programming book for professionals turns complex subjects into easy-to-comprehend concepts and easy-to-follow steps.

Table of Contents

Foreword xvii

Preface xix

Introduction xxi

Chapter 1: Heterogeneous Parallel Computing with CUDA 1

Parallel Computing 2

Sequential and Parallel Programming 3

Parallelism 4

Computer Architecture 6

Heterogeneous Computing 8

Heterogeneous Architecture 9

Paradigm of Heterogeneous Computing 12

CUDA: A Platform for Heterogeneous Computing 14

Hello World from GPU 17

Is CUDA C Programming Difficult? 20

Summary 21

Chapter 2: CUDA Programming Model 23

Introducing the CUDA Programming Model 23

CUDA Programming Structure 25

Managing Memory 26

Organizing Threads 30

Launching a CUDA Kernel 36

Writing Your Kernel 37

Verifying Your Kernel 39

Handling Errors 40

Compiling and Executing 40

Timing Your Kernel 43

Timing with CPU Timer 44

Timing with nvprof 47

Organizing Parallel Threads 49

Indexing Matrices with Blocks and Threads 49

Summing Matrices with a 2D Grid and 2D Blocks 53

Summing Matrices with a 1D Grid and 1D Blocks 57

Summing Matrices with a 2D Grid and 1D Blocks 58

Managing Devices 60

Using the Runtime API to Query GPU Information 61

Determining the Best GPU 63

Using nvidia-smi to Query GPU Information 63

Setting Devices at Runtime 64

Summary 65

Chapter 3: CUDA Execution Model 67

Introducing the CUDA Execution Model 67

GPU Architecture Overview 68

The Fermi Architecture 71

The Kepler Architecture 73

Profile-Driven Optimization 78

Understanding the Nature of Warp Execution 80

Warps and Thread Blocks 80

Warp Divergence 82

Resource Partitioning 87

Latency Hiding 90

Occupancy 93

Synchronization 97

Scalability 98

Exposing Parallelism 98

Checking Active Warps with nvprof 100

Checking Memory Operations with nvprof 100

Exposing More Parallelism 101

Avoiding Branch Divergence 104

The Parallel Reduction Problem 104

Divergence in Parallel Reduction 106

Improving Divergence in Parallel Reduction 110

Reducing with Interleaved Pairs 112

Unrolling Loops 114

Reducing with Unrolling 115

Reducing with Unrolled Warps 117

Reducing with Complete Unrolling 119

Reducing with Template Functions 120

Dynamic Parallelism 122

Nested Execution 123

Nested Hello World on the GPU 124

Nested Reduction 128

Summary 132

Chapter 4: Global Memory 135

Introducing the CUDA Memory Model 136

Benefits of a Memory Hierarchy 136

CUDA Memory Model 137

Memory Management 145

Memory Allocation and Deallocation 146

Memory Transfer 146

Pinned Memory 148

Zero-Copy Memory 150

Unified Virtual Addressing 156

Unified Memory 157

Memory Access Patterns 158

Aligned and Coalesced Access 158

Global Memory Reads 160

Global Memory Writes 169

Array of Structures versus Structure of Arrays 171

Performance Tuning 176

What Bandwidth Can a Kernel Achieve? 179

Memory Bandwidth 179

Matrix Transpose Problem 180

Matrix Addition with Unified Memory 195

Summary 199

Chapter 5: Shared Memory and Constant Memory 203

Introducing CUDA Shared Memory 204

Shared Memory 204

Shared Memory Allocation 206

Shared Memory Banks and Access Mode 206

Configuring the Amount of Shared Memory 212

Synchronization 214

Checking the Data Layout of Shared Memory 216

Square Shared Memory 217

Rectangular Shared Memory 225

Reducing Global Memory Access 232

Parallel Reduction with Shared Memory 232

Parallel Reduction with Unrolling 236

Parallel Reduction with Dynamic Shared Memory 238

Effective Bandwidth 239

Coalescing Global Memory Accesses 239

Baseline Transpose Kernel 240

Matrix Transpose with Shared Memory 241

Matrix Transpose with Padded Shared Memory 245

Matrix Transpose with Unrolling 246

Exposing More Parallelism 249

Constant Memory 250

Implementing a 1D Stencil with Constant Memory 250

Comparing with the Read-Only Cache 253

The Warp Shuffle Instruction 255

Variants of the Warp Shuffle Instruction 256

Sharing Data within a Warp 258

Parallel Reduction Using the Warp Shuffle Instruction 262

Summary 264

Chapter 6: Streams and Concurrency 267

Introducing Streams and Events 268

CUDA Streams 269

Stream Scheduling 271

Stream Priorities 273

CUDA Events 273

Stream Synchronization 275

Concurrent Kernel Execution 279

Concurrent Kernels in Non-NULL Streams 279

False Dependencies on Fermi GPUs 281

Dispatching Operations with OpenMP 283

Adjusting Stream Behavior Using Environment Variables 284

Concurrency-Limiting GPU Resources 286

Blocking Behavior of the Default Stream 287

Creating Inter-Stream Dependencies 288

Overlapping Kernel Execution and Data Transfer 289

Overlap Using Depth-First Scheduling 289

Overlap Using Breadth-First Scheduling 293

Overlapping GPU and CPU Execution 294

Stream Callbacks 295

Summary 297

Chapter 7: Tuning Instruction-Level Primitives 299

Introducing CUDA Instructions 300

Floating-Point Instructions 301

Intrinsic and Standard Functions 303

Atomic Instructions 304

Optimizing Instructions for Your Application 306

Single-Precision vs. Double-Precision 306

Standard vs. Intrinsic Functions 309

Understanding Atomic Instructions 315

Bringing It All Together 322

Summary 324

Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327

Introducing the CUDA Libraries 328

Supported Domains for CUDA Libraries 329

A Common Library Workflow 330

The cuSPARSE Library 332

cuSPARSE Data Storage Formats 333

Formatting Conversion with cuSPARSE 337

Demonstrating cuSPARSE 338

Important Topics in cuSPARSE Development 340

cuSPARSE Summary 341

The cuBLAS Library 341

Managing cuBLAS Data 342

Demonstrating cuBLAS 343

Important Topics in cuBLAS Development 345

cuBLAS Summary 346

The cuFFT Library 346

Using the cuFFT API 347

Demonstrating cuFFT 348

cuFFT Summary 349

The cuRAND Library 349

Choosing Pseudo- or Quasi-Random Numbers 349

Overview of the cuRAND Library 350

Demonstrating cuRAND 354

Important Topics in cuRAND Development 357

CUDA Library Features Introduced in CUDA 6 358

Drop-In CUDA Libraries 358

Multi-GPU Libraries 359

A Survey of CUDA Library Performance 361

cuSPARSE versus MKL 361

cuBLAS versus MKL BLAS 362

cuFFT versus FFTW versus MKL 363

CUDA Library Performance Summary 364

Using OpenACC 365

Using OpenACC Compute Directives 367

Using OpenACC Data Directives 375

The OpenACC Runtime API 380

Combining OpenACC and the CUDA Libraries 382

Summary of OpenACC 384

Summary 384

Chapter 9: Multi-GPU Programming 387

Moving to Multiple GPUs 388

Executing on Multiple GPUs 389

Peer-to-Peer Communication 391

Synchronizing across Multiple GPUs 392

Subdividing Computation across Multiple GPUs 393

Allocating Memory on Multiple Devices 393

Distributing Work from a Single Host Thread 394

Compiling and Executing 395

Peer-to-Peer Communication on Multiple GPUs 396

Enabling Peer-to-Peer Access 396

Peer-to-Peer Memory Copy 396

Peer-to-Peer Memory Access with Unified Virtual Addressing 398

Finite Difference on Multiple GPUs 400

Stencil Calculation for 2D Wave Equation 400

Typical Patterns for Multi-GPU Programs 401

2D Stencil Computation with Multiple GPUs 403

Overlapping Computation and Communication 405

Compiling and Executing 406

Scaling Applications across GPU Clusters 409

CPU-to-CPU Data Transfer 410

GPU-to-GPU Data Transfer Using Traditional MPI 413

GPU-to-GPU Data Transfer with CUDA-Aware MPI 416

Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417

Adjusting Message Chunk Size 418

GPU-to-GPU Data Transfer with GPUDirect RDMA 419

Summary 422

Chapter 10: Implementation Considerations 425

The CUDA C Development Process 426

APOD Development Cycle 426

Optimization Opportunities 429

CUDA Code Compilation 432

CUDA Error Handling 437

Profile-Driven Optimization 438

Finding Optimization Opportunities Using nvprof 439

Guiding Optimization Using nvvp 443

NVIDIA Tools Extension 446

CUDA Debugging 448

Kernel Debugging 448

Memory Debugging 456

Debugging Summary 462

A Case Study in Porting C Programs to CUDA C 462

Assessing crypt 463

Parallelizing crypt 464

Optimizing crypt 465

Deploying crypt 472

Summary of Porting crypt 475

Summary 476

Appendix: Suggested Readings 477

Index 481

Professional CUDA C Programming


Paperback by John Cheng, Max Grossman, Ty McKercher


Publisher: John Wiley & Sons Inc
Publication Date: 07/10/2014
ISBN13: 9781118739327
ISBN10: 1118739329

