{"product_id":"foundations-of-data-intensive-applications-9781119713029","title":"Foundations of Data Intensive Applications","description":"\u003cb\u003eBook Synopsis\u003c\/b\u003e\u003cbr\u003ePEEK UNDER THE HOOD OF BIG DATA ANALYTICS\u003cbr\u003eThe world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance. The authors discuss foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You’ll learn how to recognize problems in your applications resulting in performance and distributed operation issues, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within. Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system. 
Ideal for data scientists, data architects, dev-ops engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to:\u003cbr\u003eIdentify the foundations of large-scale, distributed data processing systems\u003cbr\u003eMake major software design decisions that optimize performance\u003cbr\u003eDiagnose performance problems and distributed operation issues\u003cbr\u003eUnderstand state-of-the-art research in big data\u003cbr\u003eExplain and use the major big data frameworks and understand what underpins them\u003cbr\u003eUse big data analytics in the real world to solve practical problems\u003cbr\u003e\u003cbr\u003e\u003cb\u003eTable of Contents\u003c\/b\u003e\u003cbr\u003e\u003cp\u003eIntroduction xxvii\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 1 Data Intensive Applications 1\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eAnatomy of a Data-Intensive Application 1\u003c\/p\u003e \u003cp\u003eA Histogram Example 2\u003c\/p\u003e \u003cp\u003eProgram 2\u003c\/p\u003e \u003cp\u003eProcess Management 3\u003c\/p\u003e \u003cp\u003eCommunication 4\u003c\/p\u003e \u003cp\u003eExecution 5\u003c\/p\u003e \u003cp\u003eData Structures 6\u003c\/p\u003e \u003cp\u003ePutting It Together 6\u003c\/p\u003e \u003cp\u003eApplication 6\u003c\/p\u003e \u003cp\u003eResource Management 6\u003c\/p\u003e \u003cp\u003eMessaging 7\u003c\/p\u003e \u003cp\u003eData Structures 7\u003c\/p\u003e \u003cp\u003eTasks and Execution 8\u003c\/p\u003e \u003cp\u003eFault Tolerance 8\u003c\/p\u003e \u003cp\u003eRemote Execution 8\u003c\/p\u003e \u003cp\u003eParallel Applications 9\u003c\/p\u003e \u003cp\u003eSerial Applications 9\u003c\/p\u003e \u003cp\u003eLloyd’s K-Means Algorithm 9\u003c\/p\u003e \u003cp\u003eParallelizing Algorithms 11\u003c\/p\u003e \u003cp\u003eDecomposition 11\u003c\/p\u003e \u003cp\u003eTask Assignment 12\u003c\/p\u003e \u003cp\u003eOrchestration 12\u003c\/p\u003e \u003cp\u003eMapping 13\u003c\/p\u003e \u003cp\u003eK-Means Algorithm 13\u003c\/p\u003e 
\u003cp\u003eParallel and Distributed Computing 15\u003c\/p\u003e \u003cp\u003eMemory Abstractions 16\u003c\/p\u003e \u003cp\u003eShared Memory 16\u003c\/p\u003e \u003cp\u003eDistributed Memory 18\u003c\/p\u003e \u003cp\u003eHybrid (Shared + Distributed) Memory 20\u003c\/p\u003e \u003cp\u003ePartitioned Global Address Space Memory 21\u003c\/p\u003e \u003cp\u003eApplication Classes and Frameworks 22\u003c\/p\u003e \u003cp\u003eParallel Interaction Patterns 22\u003c\/p\u003e \u003cp\u003ePleasingly Parallel 23\u003c\/p\u003e \u003cp\u003eDataflow 23\u003c\/p\u003e \u003cp\u003eIterative 23\u003c\/p\u003e \u003cp\u003eIrregular 23\u003c\/p\u003e \u003cp\u003eData Abstractions 24\u003c\/p\u003e \u003cp\u003eData-Intensive Frameworks 24\u003c\/p\u003e \u003cp\u003eComponents 24\u003c\/p\u003e \u003cp\u003eWorkflows 25\u003c\/p\u003e \u003cp\u003eAn Example 25\u003c\/p\u003e \u003cp\u003eWhat Makes It Difficult? 26\u003c\/p\u003e \u003cp\u003eDeveloping Applications 27\u003c\/p\u003e \u003cp\u003eConcurrency 27\u003c\/p\u003e \u003cp\u003eData Partitioning 28\u003c\/p\u003e \u003cp\u003eDebugging 28\u003c\/p\u003e \u003cp\u003eDiverse Environments 28\u003c\/p\u003e \u003cp\u003eComputer Networks 29\u003c\/p\u003e \u003cp\u003eSynchronization 29\u003c\/p\u003e \u003cp\u003eThread Synchronization 29\u003c\/p\u003e \u003cp\u003eData Synchronization 30\u003c\/p\u003e \u003cp\u003eOrdering of Events 31\u003c\/p\u003e \u003cp\u003eFaults 31\u003c\/p\u003e \u003cp\u003eConsensus 31\u003c\/p\u003e \u003cp\u003eSummary 32\u003c\/p\u003e \u003cp\u003eReferences 32\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 2 Data and Storage 35\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eStorage Systems 35\u003c\/p\u003e \u003cp\u003eStorage for Distributed Systems 36\u003c\/p\u003e \u003cp\u003eDirect-Attached Storage 37\u003c\/p\u003e \u003cp\u003eStorage Area Network 37\u003c\/p\u003e \u003cp\u003eNetwork-Attached Storage 38\u003c\/p\u003e \u003cp\u003eDAS or 
SAN or NAS? 38\u003c\/p\u003e \u003cp\u003eStorage Abstractions 39\u003c\/p\u003e \u003cp\u003eBlock Storage 39\u003c\/p\u003e \u003cp\u003eFile Systems 40\u003c\/p\u003e \u003cp\u003eObject Storage 41\u003c\/p\u003e \u003cp\u003eData Formats 41\u003c\/p\u003e \u003cp\u003eXML 42\u003c\/p\u003e \u003cp\u003eJSON 43\u003c\/p\u003e \u003cp\u003eCSV 44\u003c\/p\u003e \u003cp\u003eApache Parquet 45\u003c\/p\u003e \u003cp\u003eApache Avro 47\u003c\/p\u003e \u003cp\u003eAvro Data Definitions (Schema) 48\u003c\/p\u003e \u003cp\u003eCode Generation 49\u003c\/p\u003e \u003cp\u003eWithout Code Generation 49\u003c\/p\u003e \u003cp\u003eAvro File 49\u003c\/p\u003e \u003cp\u003eSchema Evolution 49\u003c\/p\u003e \u003cp\u003eProtocol Buffers, Flat Buffers, and Thrift 50\u003c\/p\u003e \u003cp\u003eData Replication 51\u003c\/p\u003e \u003cp\u003eSynchronous and Asynchronous Replication 52\u003c\/p\u003e \u003cp\u003eSingle-Leader and Multileader Replication 52\u003c\/p\u003e \u003cp\u003eData Locality 53\u003c\/p\u003e \u003cp\u003eDisadvantages of Replication 54\u003c\/p\u003e \u003cp\u003eData Partitioning 54\u003c\/p\u003e \u003cp\u003eVertical Partitioning 55\u003c\/p\u003e \u003cp\u003eHorizontal Partitioning (Sharding) 55\u003c\/p\u003e \u003cp\u003eHybrid Partitioning 56\u003c\/p\u003e \u003cp\u003eConsiderations for Partitioning 57\u003c\/p\u003e \u003cp\u003eNoSQL Databases 58\u003c\/p\u003e \u003cp\u003eData Models 58\u003c\/p\u003e \u003cp\u003eKey-Value Databases 58\u003c\/p\u003e \u003cp\u003eDocument Databases 59\u003c\/p\u003e \u003cp\u003eWide Column Databases 59\u003c\/p\u003e \u003cp\u003eGraph Databases 59\u003c\/p\u003e \u003cp\u003eCAP Theorem 60\u003c\/p\u003e \u003cp\u003eMessage Queuing 61\u003c\/p\u003e \u003cp\u003eMessage Processing Guarantees 63\u003c\/p\u003e \u003cp\u003eDurability of Messages 64\u003c\/p\u003e \u003cp\u003eAcknowledgments 64\u003c\/p\u003e \u003cp\u003eStorage First Brokers and Transient Brokers 65\u003c\/p\u003e 
\u003cp\u003eSummary 66\u003c\/p\u003e \u003cp\u003eReferences 66\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 3 Computing Resources 69\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eA Demonstration 71\u003c\/p\u003e \u003cp\u003eComputer Clusters 72\u003c\/p\u003e \u003cp\u003eAnatomy of a Computer Cluster 73\u003c\/p\u003e \u003cp\u003eData Analytics in Clusters 74\u003c\/p\u003e \u003cp\u003eDedicated Clusters 76\u003c\/p\u003e \u003cp\u003eClassic Parallel Systems 76\u003c\/p\u003e \u003cp\u003eBig Data Systems 77\u003c\/p\u003e \u003cp\u003eShared Clusters 79\u003c\/p\u003e \u003cp\u003eOpenMPI on a Slurm Cluster 79\u003c\/p\u003e \u003cp\u003eSpark on a Yarn Cluster 80\u003c\/p\u003e \u003cp\u003eDistributed Application Life Cycle 80\u003c\/p\u003e \u003cp\u003eLife Cycle Steps 80\u003c\/p\u003e \u003cp\u003eStep 1: Preparation of the Job Package 81\u003c\/p\u003e \u003cp\u003eStep 2: Resource Acquisition 81\u003c\/p\u003e \u003cp\u003eStep 3: Distributing the Application (Job) Artifacts 81\u003c\/p\u003e \u003cp\u003eStep 4: Bootstrapping the Distributed Environment 82\u003c\/p\u003e \u003cp\u003eStep 5: Monitoring 82\u003c\/p\u003e \u003cp\u003eStep 6: Termination 83\u003c\/p\u003e \u003cp\u003eComputing Resources 83\u003c\/p\u003e \u003cp\u003eData Centers 83\u003c\/p\u003e \u003cp\u003ePhysical Machines 85\u003c\/p\u003e \u003cp\u003eNetwork 85\u003c\/p\u003e \u003cp\u003eVirtual Machines 87\u003c\/p\u003e \u003cp\u003eContainers 87\u003c\/p\u003e \u003cp\u003eProcessor, Random Access Memory, and Cache 88\u003c\/p\u003e \u003cp\u003eCache 89\u003c\/p\u003e \u003cp\u003eMultiple Processors in a Computer 90\u003c\/p\u003e \u003cp\u003eNonuniform Memory Access 90\u003c\/p\u003e \u003cp\u003eUniform Memory Access 91\u003c\/p\u003e \u003cp\u003eHard Disk 92\u003c\/p\u003e \u003cp\u003eGPUs 92\u003c\/p\u003e \u003cp\u003eMapping Resources to Applications 92\u003c\/p\u003e \u003cp\u003eCluster Resource Managers 93\u003c\/p\u003e \u003cp\u003eKubernetes 
94\u003c\/p\u003e \u003cp\u003eKubernetes Architecture 94\u003c\/p\u003e \u003cp\u003eKubernetes Application Concepts 96\u003c\/p\u003e \u003cp\u003eData-Intensive Applications on Kubernetes 96\u003c\/p\u003e \u003cp\u003eSlurm 98\u003c\/p\u003e \u003cp\u003eYarn 99\u003c\/p\u003e \u003cp\u003eJob Scheduling 99\u003c\/p\u003e \u003cp\u003eScheduling Policy 101\u003c\/p\u003e \u003cp\u003eObjective Functions 101\u003c\/p\u003e \u003cp\u003eThroughput and Latency 101\u003c\/p\u003e \u003cp\u003ePriorities 102\u003c\/p\u003e \u003cp\u003eLowering Distance Among the Processes 102\u003c\/p\u003e \u003cp\u003eData Locality 102\u003c\/p\u003e \u003cp\u003eCompletion Deadline 102\u003c\/p\u003e \u003cp\u003eAlgorithms 103\u003c\/p\u003e \u003cp\u003eFirst in First Out 103\u003c\/p\u003e \u003cp\u003eGang Scheduling 103\u003c\/p\u003e \u003cp\u003eList Scheduling 103\u003c\/p\u003e \u003cp\u003eBackfill Scheduling 104\u003c\/p\u003e \u003cp\u003eSummary 104\u003c\/p\u003e \u003cp\u003eReferences 104\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 4 Data Structures 107\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eVirtual Memory 108\u003c\/p\u003e \u003cp\u003ePaging and TLB 109\u003c\/p\u003e \u003cp\u003eCache 111\u003c\/p\u003e \u003cp\u003eThe Need for Data Structures 112\u003c\/p\u003e \u003cp\u003eCache and Memory Layout 112\u003c\/p\u003e \u003cp\u003eMemory Fragmentation 114\u003c\/p\u003e \u003cp\u003eData Transfer 115\u003c\/p\u003e \u003cp\u003eData Transfer Between Frameworks 115\u003c\/p\u003e \u003cp\u003eCross-Language Data Transfer 115\u003c\/p\u003e \u003cp\u003eObject and Text Data 116\u003c\/p\u003e \u003cp\u003eSerialization 116\u003c\/p\u003e \u003cp\u003eVectors and Matrices 117\u003c\/p\u003e \u003cp\u003e1D Vectors 118\u003c\/p\u003e \u003cp\u003eMatrices 118\u003c\/p\u003e \u003cp\u003eRow-Major and Column-Major Formats 119\u003c\/p\u003e \u003cp\u003e\u003ci\u003eN\u003c\/i\u003e-Dimensional Arrays\/Tensors 122\u003c\/p\u003e \u003cp\u003eNumPy 
123\u003c\/p\u003e \u003cp\u003eMemory Representation 125\u003c\/p\u003e \u003cp\u003eK-means with NumPy 126\u003c\/p\u003e \u003cp\u003eSparse Matrices 127\u003c\/p\u003e \u003cp\u003eTable 128\u003c\/p\u003e \u003cp\u003eTable Formats 129\u003c\/p\u003e \u003cp\u003eColumn Data Format 129\u003c\/p\u003e \u003cp\u003eRow Data Format 130\u003c\/p\u003e \u003cp\u003eApache Arrow 130\u003c\/p\u003e \u003cp\u003eArrow Data Format 131\u003c\/p\u003e \u003cp\u003ePrimitive Types 131\u003c\/p\u003e \u003cp\u003eVariable-Length Data 132\u003c\/p\u003e \u003cp\u003eArrow Serialization 133\u003c\/p\u003e \u003cp\u003eArrow Example 133\u003c\/p\u003e \u003cp\u003ePandas DataFrame 134\u003c\/p\u003e \u003cp\u003eColumn vs. Row Tables 136\u003c\/p\u003e \u003cp\u003eSummary 136\u003c\/p\u003e \u003cp\u003eReferences 136\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 5 Programming Models 139\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eIntroduction 139\u003c\/p\u003e \u003cp\u003eParallel Programming Models 140\u003c\/p\u003e \u003cp\u003eParallel Process Interaction 140\u003c\/p\u003e \u003cp\u003eProblem Decomposition 140\u003c\/p\u003e \u003cp\u003eData Structures 140\u003c\/p\u003e \u003cp\u003eData Structures and Operations 141\u003c\/p\u003e \u003cp\u003eData Types 141\u003c\/p\u003e \u003cp\u003eLocal Operations 143\u003c\/p\u003e \u003cp\u003eDistributed Operations 143\u003c\/p\u003e \u003cp\u003eArray 144\u003c\/p\u003e \u003cp\u003eTensor 145\u003c\/p\u003e \u003cp\u003eIndexing 145\u003c\/p\u003e \u003cp\u003eSlicing 146\u003c\/p\u003e \u003cp\u003eBroadcasting 146\u003c\/p\u003e \u003cp\u003eTable 146\u003c\/p\u003e \u003cp\u003eGraph Data 148\u003c\/p\u003e \u003cp\u003eMessage Passing Model 150\u003c\/p\u003e \u003cp\u003eModel 151\u003c\/p\u003e \u003cp\u003eMessage Passing Frameworks 151\u003c\/p\u003e \u003cp\u003eMessage Passing Interface 151\u003c\/p\u003e \u003cp\u003eBulk Synchronous Parallel 153\u003c\/p\u003e \u003cp\u003eK-Means 154\u003c\/p\u003e 
\u003cp\u003eDistributed Data Model 157\u003c\/p\u003e \u003cp\u003eEager Model 157\u003c\/p\u003e \u003cp\u003eDataflow Model 158\u003c\/p\u003e \u003cp\u003eData Frames, Datasets, and Tables 159\u003c\/p\u003e \u003cp\u003eInput and Output 160\u003c\/p\u003e \u003cp\u003eTask Graphs (Dataflow Graphs) 160\u003c\/p\u003e \u003cp\u003eModel 161\u003c\/p\u003e \u003cp\u003eUser Program to Task Graph 161\u003c\/p\u003e \u003cp\u003eTasks and Functions 162\u003c\/p\u003e \u003cp\u003eSource Task 162\u003c\/p\u003e \u003cp\u003eCompute Task 163\u003c\/p\u003e \u003cp\u003eImplicit vs. Explicit Parallel Models 163\u003c\/p\u003e \u003cp\u003eRemote Execution 163\u003c\/p\u003e \u003cp\u003eComponents 164\u003c\/p\u003e \u003cp\u003eBatch Dataflow 165\u003c\/p\u003e \u003cp\u003eData Abstractions 165\u003c\/p\u003e \u003cp\u003eTable Abstraction 165\u003c\/p\u003e \u003cp\u003eMatrix\/Tensors 165\u003c\/p\u003e \u003cp\u003eFunctions 166\u003c\/p\u003e \u003cp\u003eSource 166\u003c\/p\u003e \u003cp\u003eCompute 167\u003c\/p\u003e \u003cp\u003eSink 168\u003c\/p\u003e \u003cp\u003eAn Example 168\u003c\/p\u003e \u003cp\u003eCaching State 169\u003c\/p\u003e \u003cp\u003eEvaluation Strategy 170\u003c\/p\u003e \u003cp\u003eLazy Evaluation 171\u003c\/p\u003e \u003cp\u003eEager Evaluation 171\u003c\/p\u003e \u003cp\u003eIterative Computations 172\u003c\/p\u003e \u003cp\u003eDOALL Parallel 172\u003c\/p\u003e \u003cp\u003eDOACROSS Parallel 172\u003c\/p\u003e \u003cp\u003ePipeline Parallel 173\u003c\/p\u003e \u003cp\u003eTask Graph Models for Iterative Computations 173\u003c\/p\u003e \u003cp\u003eK-Means Algorithm 174\u003c\/p\u003e \u003cp\u003eStreaming Dataflow 176\u003c\/p\u003e \u003cp\u003eData Abstractions 177\u003c\/p\u003e \u003cp\u003eStreams 177\u003c\/p\u003e \u003cp\u003eDistributed Operations 178\u003c\/p\u003e \u003cp\u003eStreaming Functions 178\u003c\/p\u003e \u003cp\u003eSources 178\u003c\/p\u003e \u003cp\u003eCompute 179\u003c\/p\u003e \u003cp\u003eSink 
179\u003c\/p\u003e \u003cp\u003eAn Example 179\u003c\/p\u003e \u003cp\u003eWindowing 180\u003c\/p\u003e \u003cp\u003eWindowing Strategies 181\u003c\/p\u003e \u003cp\u003eOperations on Windows 182\u003c\/p\u003e \u003cp\u003eHandling Late Events 182\u003c\/p\u003e \u003cp\u003eSQL 182\u003c\/p\u003e \u003cp\u003eQueries 183\u003c\/p\u003e \u003cp\u003eSummary 184\u003c\/p\u003e \u003cp\u003eReferences 184\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 6 Messaging 187\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eNetwork Services 188\u003c\/p\u003e \u003cp\u003eTCP\/IP 188\u003c\/p\u003e \u003cp\u003eRDMA 189\u003c\/p\u003e \u003cp\u003eMessaging for Data Analytics 189\u003c\/p\u003e \u003cp\u003eAnatomy of a Message 190\u003c\/p\u003e \u003cp\u003eData Packing 190\u003c\/p\u003e \u003cp\u003eProtocol 191\u003c\/p\u003e \u003cp\u003eMessage Types 192\u003c\/p\u003e \u003cp\u003eControl Messages 192\u003c\/p\u003e \u003cp\u003eExternal Data Sources 192\u003c\/p\u003e \u003cp\u003eData Transfer Messages 192\u003c\/p\u003e \u003cp\u003eDistributed Operations 194\u003c\/p\u003e \u003cp\u003eHow Are They Used? 
194\u003c\/p\u003e \u003cp\u003eTask Graph 194\u003c\/p\u003e \u003cp\u003eParallel Processes 195\u003c\/p\u003e \u003cp\u003eAnatomy of a Distributed Operation 198\u003c\/p\u003e \u003cp\u003eData Abstractions 198\u003c\/p\u003e \u003cp\u003eDistributed Operation API 198\u003c\/p\u003e \u003cp\u003eStreaming and Batch Operations 199\u003c\/p\u003e \u003cp\u003eStreaming Operations 199\u003c\/p\u003e \u003cp\u003eBatch Operations 199\u003c\/p\u003e \u003cp\u003eDistributed Operations on Arrays 200\u003c\/p\u003e \u003cp\u003eBroadcast 200\u003c\/p\u003e \u003cp\u003eReduce and AllReduce 201\u003c\/p\u003e \u003cp\u003eGather and AllGather 202\u003c\/p\u003e \u003cp\u003eScatter 203\u003c\/p\u003e \u003cp\u003eAllToAll 204\u003c\/p\u003e \u003cp\u003eOptimized Operations 204\u003c\/p\u003e \u003cp\u003eBroadcast 205\u003c\/p\u003e \u003cp\u003eReduce 206\u003c\/p\u003e \u003cp\u003eAllReduce 206\u003c\/p\u003e \u003cp\u003eGather and AllGather Collective Algorithms 208\u003c\/p\u003e \u003cp\u003eScatter and AllToAll Collective Algorithms 208\u003c\/p\u003e \u003cp\u003eDistributed Operations on Tables 209\u003c\/p\u003e \u003cp\u003eShuffle 209\u003c\/p\u003e \u003cp\u003ePartitioning Data 211\u003c\/p\u003e \u003cp\u003eHandling Large Data 212\u003c\/p\u003e \u003cp\u003eFetch-Based Algorithm (Asynchronous Algorithm) 213\u003c\/p\u003e \u003cp\u003eDistributed Synchronization Algorithm 214\u003c\/p\u003e \u003cp\u003eGroupBy 214\u003c\/p\u003e \u003cp\u003eAggregate 215\u003c\/p\u003e \u003cp\u003eJoin 216\u003c\/p\u003e \u003cp\u003eJoin Algorithms 219\u003c\/p\u003e \u003cp\u003eDistributed Joins 221\u003c\/p\u003e \u003cp\u003ePerformance of Joins 223\u003c\/p\u003e \u003cp\u003eMore Operations 223\u003c\/p\u003e \u003cp\u003eAdvanced Topics 224\u003c\/p\u003e \u003cp\u003eData Packing 224\u003c\/p\u003e \u003cp\u003eMemory Considerations 224\u003c\/p\u003e \u003cp\u003eMessage Coalescing 224\u003c\/p\u003e \u003cp\u003eCompression 225\u003c\/p\u003e 
\u003cp\u003eStragglers 225\u003c\/p\u003e \u003cp\u003eNonblocking vs. Blocking Operations 225\u003c\/p\u003e \u003cp\u003eBlocking Operations 226\u003c\/p\u003e \u003cp\u003eNonblocking Operations 226\u003c\/p\u003e \u003cp\u003eSummary 227\u003c\/p\u003e \u003cp\u003eReferences 227\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 7 Parallel Tasks 229\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eCPUs 229\u003c\/p\u003e \u003cp\u003eCache 229\u003c\/p\u003e \u003cp\u003eFalse Sharing 230\u003c\/p\u003e \u003cp\u003eVectorization 231\u003c\/p\u003e \u003cp\u003eThreads and Processes 234\u003c\/p\u003e \u003cp\u003eConcurrency and Parallelism 234\u003c\/p\u003e \u003cp\u003eContext Switches and Scheduling 234\u003c\/p\u003e \u003cp\u003eMutual Exclusion 235\u003c\/p\u003e \u003cp\u003eUser-Level Threads 236\u003c\/p\u003e \u003cp\u003eProcess Affinity 236\u003c\/p\u003e \u003cp\u003eNUMA-Aware Programming 237\u003c\/p\u003e \u003cp\u003eAccelerators 237\u003c\/p\u003e \u003cp\u003eTask Execution 238\u003c\/p\u003e \u003cp\u003eScheduling 240\u003c\/p\u003e \u003cp\u003eStatic Scheduling 240\u003c\/p\u003e \u003cp\u003eDynamic Scheduling 240\u003c\/p\u003e \u003cp\u003eLoosely Synchronous and Asynchronous Execution 241\u003c\/p\u003e \u003cp\u003eLoosely Synchronous Parallel System 242\u003c\/p\u003e \u003cp\u003eAsynchronous Parallel System (Fully Distributed) 243\u003c\/p\u003e \u003cp\u003eActor Model 244\u003c\/p\u003e \u003cp\u003eActor 244\u003c\/p\u003e \u003cp\u003eAsynchronous Messages 244\u003c\/p\u003e \u003cp\u003eActor Frameworks 245\u003c\/p\u003e \u003cp\u003eExecution Models 245\u003c\/p\u003e \u003cp\u003eProcess Model 246\u003c\/p\u003e \u003cp\u003eThread Model 246\u003c\/p\u003e \u003cp\u003eRemote Execution 246\u003c\/p\u003e \u003cp\u003eTasks for Data Analytics 248\u003c\/p\u003e \u003cp\u003eSPMD and MPMD Execution 248\u003c\/p\u003e \u003cp\u003eBatch Tasks 249\u003c\/p\u003e \u003cp\u003eData Partitions 249\u003c\/p\u003e 
\u003cp\u003eOperations 251\u003c\/p\u003e \u003cp\u003eTask Graph Scheduling 253\u003c\/p\u003e \u003cp\u003eThreads, CPU Cores, and Partitions 254\u003c\/p\u003e \u003cp\u003eData Locality 255\u003c\/p\u003e \u003cp\u003eExecution 257\u003c\/p\u003e \u003cp\u003eStreaming Execution 257\u003c\/p\u003e \u003cp\u003eState 257\u003c\/p\u003e \u003cp\u003eImmutable Data 258\u003c\/p\u003e \u003cp\u003eState in Driver 258\u003c\/p\u003e \u003cp\u003eDistributed State 259\u003c\/p\u003e \u003cp\u003eStreaming Tasks 259\u003c\/p\u003e \u003cp\u003eStreams and Data Partitioning 260\u003c\/p\u003e \u003cp\u003ePartitions 260\u003c\/p\u003e \u003cp\u003eOperations 261\u003c\/p\u003e \u003cp\u003eScheduling 262\u003c\/p\u003e \u003cp\u003eUniform Resources 263\u003c\/p\u003e \u003cp\u003eResource-Aware Scheduling 264\u003c\/p\u003e \u003cp\u003eExecution 264\u003c\/p\u003e \u003cp\u003eDynamic Scaling 264\u003c\/p\u003e \u003cp\u003eBack Pressure (Flow Control) 265\u003c\/p\u003e \u003cp\u003eRate-Based Flow Control 266\u003c\/p\u003e \u003cp\u003eCredit-Based Flow Control 266\u003c\/p\u003e \u003cp\u003eState 267\u003c\/p\u003e \u003cp\u003eSummary 268\u003c\/p\u003e \u003cp\u003eReferences 268\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 8 Case Studies 271\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eApache Hadoop 271\u003c\/p\u003e \u003cp\u003eProgramming Model 272\u003c\/p\u003e \u003cp\u003eArchitecture 274\u003c\/p\u003e \u003cp\u003eCluster Resource Management 275\u003c\/p\u003e \u003cp\u003eApache Spark 275\u003c\/p\u003e \u003cp\u003eProgramming Model 275\u003c\/p\u003e \u003cp\u003eRDD API 276\u003c\/p\u003e \u003cp\u003eSQL, DataFrames, and DataSets 277\u003c\/p\u003e \u003cp\u003eArchitecture 278\u003c\/p\u003e \u003cp\u003eResource Managers 278\u003c\/p\u003e \u003cp\u003eTask Schedulers 279\u003c\/p\u003e \u003cp\u003eExecutors 279\u003c\/p\u003e \u003cp\u003eCommunication Operations 280\u003c\/p\u003e \u003cp\u003eApache Spark Streaming 280\u003c\/p\u003e 
\u003cp\u003eApache Storm 282\u003c\/p\u003e \u003cp\u003eProgramming Model 282\u003c\/p\u003e \u003cp\u003eArchitecture 284\u003c\/p\u003e \u003cp\u003eCluster Resource Managers 285\u003c\/p\u003e \u003cp\u003eCommunication Operations 286\u003c\/p\u003e \u003cp\u003eKafka Streams 286\u003c\/p\u003e \u003cp\u003eProgramming Model 286\u003c\/p\u003e \u003cp\u003eArchitecture 287\u003c\/p\u003e \u003cp\u003ePyTorch 288\u003c\/p\u003e \u003cp\u003eProgramming Model 288\u003c\/p\u003e \u003cp\u003eExecution 292\u003c\/p\u003e \u003cp\u003eCylon 295\u003c\/p\u003e \u003cp\u003eProgramming Model 296\u003c\/p\u003e \u003cp\u003eArchitecture 296\u003c\/p\u003e \u003cp\u003eExecution 297\u003c\/p\u003e \u003cp\u003eCommunication Operations 298\u003c\/p\u003e \u003cp\u003eRapids cuDF 298\u003c\/p\u003e \u003cp\u003eProgramming Model 298\u003c\/p\u003e \u003cp\u003eArchitecture 299\u003c\/p\u003e \u003cp\u003eSummary 300\u003c\/p\u003e \u003cp\u003eReferences 300\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 9 Fault Tolerance 303\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003eDependable Systems and Failures 303\u003c\/p\u003e \u003cp\u003eFault Tolerance is Not Free 304\u003c\/p\u003e \u003cp\u003eDependable Systems 305\u003c\/p\u003e \u003cp\u003eFailures 306\u003c\/p\u003e \u003cp\u003eProcess Failures 306\u003c\/p\u003e \u003cp\u003eNetwork Failures 307\u003c\/p\u003e \u003cp\u003eNode Failures 307\u003c\/p\u003e \u003cp\u003eByzantine Faults 307\u003c\/p\u003e \u003cp\u003eFailure Models 308\u003c\/p\u003e \u003cp\u003eFailure Detection 308\u003c\/p\u003e \u003cp\u003eRecovering from Faults 309\u003c\/p\u003e \u003cp\u003eRecovery Methods 310\u003c\/p\u003e \u003cp\u003eStateless Programs 310\u003c\/p\u003e \u003cp\u003eBatch Systems 311\u003c\/p\u003e \u003cp\u003eStreaming Systems 311\u003c\/p\u003e \u003cp\u003eProcessing Guarantees 311\u003c\/p\u003e \u003cp\u003eRole of Cluster Resource Managers 312\u003c\/p\u003e \u003cp\u003eCheckpointing 313\u003c\/p\u003e 
\u003cp\u003eState 313\u003c\/p\u003e \u003cp\u003eConsistent Global State 313\u003c\/p\u003e \u003cp\u003eUncoordinated Checkpointing 314\u003c\/p\u003e \u003cp\u003eCoordinated Checkpointing 315\u003c\/p\u003e \u003cp\u003eChandy-Lamport Algorithm 315\u003c\/p\u003e \u003cp\u003eBatch Systems 316\u003c\/p\u003e \u003cp\u003eWhen to Checkpoint? 317\u003c\/p\u003e \u003cp\u003eSnapshot Data 318\u003c\/p\u003e \u003cp\u003eStreaming Systems 319\u003c\/p\u003e \u003cp\u003eCase Study: Apache Storm 319\u003c\/p\u003e \u003cp\u003eMessage Tracking 320\u003c\/p\u003e \u003cp\u003eFailure Recovery 321\u003c\/p\u003e \u003cp\u003eCase Study: Apache Flink 321\u003c\/p\u003e \u003cp\u003eCheckpointing 322\u003c\/p\u003e \u003cp\u003eFailure Recovery 324\u003c\/p\u003e \u003cp\u003eBatch Systems 324\u003c\/p\u003e \u003cp\u003eIterative Programs 324\u003c\/p\u003e \u003cp\u003eCase Study: Apache Spark 325\u003c\/p\u003e \u003cp\u003eRDD Recomputing 326\u003c\/p\u003e \u003cp\u003eCheckpointing 326\u003c\/p\u003e \u003cp\u003eRecovery from Failures 327\u003c\/p\u003e \u003cp\u003eSummary 327\u003c\/p\u003e \u003cp\u003eReferences 327\u003c\/p\u003e \u003cp\u003e\u003cb\u003eChapter 10 Performance and Productivity 329\u003c\/b\u003e\u003c\/p\u003e \u003cp\u003ePerformance Metrics 329\u003c\/p\u003e \u003cp\u003eSystem Performance Metrics 330\u003c\/p\u003e \u003cp\u003eParallel Performance Metrics 330\u003c\/p\u003e \u003cp\u003eSpeedup 330\u003c\/p\u003e \u003cp\u003eStrong Scaling 331\u003c\/p\u003e \u003cp\u003eWeak Scaling 332\u003c\/p\u003e \u003cp\u003eParallel Efficiency 332\u003c\/p\u003e \u003cp\u003eAmdahl’s Law 333\u003c\/p\u003e \u003cp\u003eGustafson’s Law 334\u003c\/p\u003e \u003cp\u003eThroughput 334\u003c\/p\u003e \u003cp\u003eLatency 335\u003c\/p\u003e \u003cp\u003eBenchmarks 336\u003c\/p\u003e \u003cp\u003eLINPACK Benchmark 336\u003c\/p\u003e \u003cp\u003eNAS Parallel Benchmark 336\u003c\/p\u003e \u003cp\u003eBigDataBench 336\u003c\/p\u003e \u003cp\u003eTPC 
Benchmarks 337\u003c\/p\u003e \u003cp\u003eHiBench 337\u003c\/p\u003e \u003cp\u003ePerformance Factors 337\u003c\/p\u003e \u003cp\u003eMemory 337\u003c\/p\u003e \u003cp\u003eExecution 338\u003c\/p\u003e \u003cp\u003eDistributed Operators 338\u003c\/p\u003e \u003cp\u003eDisk I\/O 339\u003c\/p\u003e \u003cp\u003eGarbage Collection 339\u003c\/p\u003e \u003cp\u003eFinding Issues 342\u003c\/p\u003e \u003cp\u003eSerial Programs 342\u003c\/p\u003e \u003cp\u003eProfiling 342\u003c\/p\u003e \u003cp\u003eScaling 343\u003c\/p\u003e \u003cp\u003eStrong Scaling 343\u003c\/p\u003e \u003cp\u003eWeak Scaling 344\u003c\/p\u003e \u003cp\u003eDebugging Distributed Applications 344\u003c\/p\u003e \u003cp\u003eProgramming Languages 345\u003c\/p\u003e \u003cp\u003eC\/C++ 346\u003c\/p\u003e \u003cp\u003eJava 346\u003c\/p\u003e \u003cp\u003eMemory Management 347\u003c\/p\u003e \u003cp\u003eData Structures 348\u003c\/p\u003e \u003cp\u003eInterfacing with Python 348\u003c\/p\u003e \u003cp\u003ePython 350\u003c\/p\u003e \u003cp\u003eC\/C++ Code integration 350\u003c\/p\u003e \u003cp\u003eProductivity 351\u003c\/p\u003e \u003cp\u003eChoice of Frameworks 351\u003c\/p\u003e \u003cp\u003eOperating Environment 353\u003c\/p\u003e \u003cp\u003eCPUs and GPUs 353\u003c\/p\u003e \u003cp\u003ePublic Clouds 355\u003c\/p\u003e \u003cp\u003eFuture of Data-Intensive Applications 358\u003c\/p\u003e \u003cp\u003eSummary 358\u003c\/p\u003e \u003cp\u003eReferences 359\u003c\/p\u003e \u003cp\u003eIndex 361\u003c\/p\u003e","brand":"John Wiley \u0026 Sons Inc","offers":[{"title":"Default Title","offer_id":49407127978327,"sku":"9781119713029","price":38.25,"currency_code":"GBP","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0817\/1739\/5799\/files\/9781119713029.jpg?v=1730498278","url":"https:\/\/bookcurl.com\/products\/foundations-of-data-intensive-applications-9781119713029","provider":"Book Curl","version":"1.0","type":"link"}