Description
Book Synopsis
This book explains how to scale Apache Spark 3 to handle massive amounts of data, via either batch or streaming processing. It covers how to use Spark's structured APIs to perform complex data transformations and analyses that you can use to implement end-to-end analytics workflows.

This book covers Spark 3's new features, theoretical foundations, and application architecture. The first section introduces the Apache Spark ecosystem as a unified engine for large-scale data analytics and shows you how to run and fine-tune your first application in Spark. The second section centers on batch processing suited to end-of-cycle processing, and on data ingestion through files and databases. It explains the Spark DataFrame API as well as working with structured and unstructured data in Apache Spark. The last section deals with scalable, high-throughput, fault-tolerant streaming workloads for processing real-time data. Here you'll learn about Apache Spark Streaming's execution model and the architecture of Spark Streaming: discretized streams.
Table of Contents

Part I. Apache Spark Batch Data Processing

Chapter 1: Introduction to Apache Spark for Large-Scale Data Analytics
1.1. What is Apache Spark?
1.2. Spark Unified Analytics
1.3. Batch vs Streaming Data
1.4. Spark Ecosystem
Chapter 2: Getting Started with Apache Spark
2.2. Scala and PySpark Interfaces
2.3. Spark Application Concepts
2.4. Transformations and Actions in Apache Spark
2.5. Lazy Evaluation in Apache Spark
2.6. First Application in Spark
2.7. Apache Spark Web UI
Chapter 3: Spark DataFrame API
Chapter 4: Spark Dataset API
Chapter 5: Structured and Unstructured Data with Apache Spark
5.1. Data Sources
5.2. Generic Load/Save Functions
5.3. Generic File Source Options
5.4. Parquet Files
5.5. ORC Files
5.6. JSON Files
5.7. CSV Files
5.8. Text Files
5.9. Hive Tables
5.10. JDBC to Other Databases
Chapter 6: Spark Machine Learning with MLlib
Part II. Spark Data Streaming

Chapter 7: Introduction to Apache Spark Streaming
7.1. Apache Spark Streaming’s Execution Model
7.2. Stream Processing Architectures
7.3. Architecture of Spark Streaming: Discretized Streams
7.4. Benefits of Discretized Stream Processing
7.4.1. Dynamic Load Balancing
7.4.2. Fast Failure and Straggler Recovery
Chapter 8: Structured Streaming
8.1. Streaming Analytics
8.2. Connecting to a Stream
8.3. Preparing the Data in a Stream
8.4. Operations on a Streaming Dataset
Chapter 9: Structured Streaming Sources
9.1. File Sources
9.2. Apache Kafka Source
9.3. A Rate Source
Chapter 10: Structured Streaming Sinks
10.1. Output Modes
10.2. Output Sinks
10.3. File Sink
10.4. The Kafka Sink
10.5. The Memory Sink
10.6. Streaming Table APIs
10.7. Triggers
10.8. Managing Streaming Queries
10.9. Monitoring Streaming Queries
10.9.1. Reading Metrics Interactively
10.9.2. Reporting Metrics Programmatically Using Asynchronous APIs
10.9.3. Reporting Metrics Using Dropwizard
10.9.4. Recovering from Failures with Checkpointing
10.9.5. Recovery Semantics after Changes in a Streaming Query
Chapter 11: Future Directions for Spark Streaming
11.1. Backpressure
11.2. Dynamic Scaling
11.3. Event Time and Out-of-Order Data
11.4. UI Enhancements
11.5. Continuous Processing
Chapter 12: Watermarks: A Deep Survey of Temporal Progress Metrics