Spark: Syllabus

Mastering Apache Spark: Scalable Distributed Data Processing with Python

Apache Spark is a powerful open-source engine for large-scale distributed data processing. This book takes a hands-on approach to mastering Spark, covering batch processing, real-time streaming, performance tuning, and integration with modern data lake architectures built on Delta Lake and MinIO.

Module 1: Introduction to Apache Spark 3.5.5

  • Evolution of Apache Spark and key features in version 3.5.5
  • Understanding Spark’s distributed architecture (driver, executors, tasks, RDDs, and DAGs)
  • Setting up Apache Spark with Python (PySpark)
  • Running Spark in local mode vs. cluster mode
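
A minimal sketch of the setup above, assuming the pyspark 3.5.x package installed via pip (the app name is illustrative):

    from pyspark.sql import SparkSession

    # local[*] runs Spark in local mode, using all CPU cores as task slots;
    # drop .master() and let spark-submit supply it when targeting a cluster.
    spark = (
        SparkSession.builder
        .appName("intro-to-spark")
        .master("local[*]")
        .getOrCreate()
    )

    print(spark.version)  # e.g. 3.5.5
    spark.stop()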

Module 2: Spark Core and RDDs (Resilient Distributed Datasets)

  • Understanding the fundamentals of RDDs
  • RDD transformations and actions
  • Optimizing RDD operations for performance
  • Working with RDD persistence and caching
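
A short sketch contrasting lazy transformations with actions, and showing caching (assumes a local SparkSession as in Module 1):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("rdd-basics").getOrCreate()
    sc = spark.sparkContext

    # Transformations (filter, map) are lazy; actions (count, take) run the job.
    rdd = sc.parallelize(range(1, 1001), numSlices=8)
    squares = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

    squares.cache()           # keep the computed partitions in executor memory
    print(squares.count())    # first action computes and caches the RDD
    print(squares.take(5))    # second action is served from the cache
    spark.stop()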

Module 3: DataFrames and Spark SQL

  • Introduction to DataFrames and their advantages over RDDs
  • Querying structured data using Spark SQL
  • Working with Parquet, JSON, and Avro file formats
  • Optimizing DataFrame operations with the Catalyst optimizer
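
A sketch of the same query expressed through both the DataFrame API and Spark SQL, plus a Parquet round trip (paths are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("df-sql").getOrCreate()

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)], ["name", "age"]
    )

    # DataFrame API and SQL are equivalent; Catalyst optimizes both into one plan.
    df.filter(F.col("age") > 30).show()
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    # Columnar Parquet round trip.
    df.write.mode("overwrite").parquet("/tmp/people.parquet")
    spark.read.parquet("/tmp/people.parquet").printSchema()
    spark.stop()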

Module 4: Spark Performance Optimization and Tuning

  • Partitioning strategies for large-scale data processing
  • Configuring Spark memory management and shuffle optimization
  • Using broadcast variables and accumulators for efficiency
  • Monitoring Spark jobs with the Spark UI and structured logging
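
A sketch of two common levers, a broadcast join and explicit repartitioning (the partition count and table sizes are illustrative starting points, not recommendations):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.master("local[*]").appName("tuning")
        .config("spark.sql.shuffle.partitions", "64")   # default is 200; tune per job
        .getOrCreate()
    )

    facts = spark.range(1_000_000).withColumn("key", F.col("id") % 100)
    dims = spark.createDataFrame([(i, f"dim_{i}") for i in range(100)], ["key", "label"])

    # Broadcasting the small table ships it to every executor and avoids a shuffle.
    joined = facts.join(F.broadcast(dims), "key")
    joined.explain()   # the plan should show a BroadcastHashJoin

    # Repartitioning by the hot key controls shuffle parallelism for later stages.
    by_key = facts.repartition(64, "key")
    spark.stop()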

Module 5: Streaming Data Processing with Spark Structured Streaming

  • Introduction to real-time streaming and event-driven architecture
  • Processing data streams from Kafka and MinIO
  • Stateful aggregations and windowed operations
  • Fault tolerance and checkpointing in Structured Streaming
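
A sketch of a windowed, watermarked count over a Kafka topic (requires the spark-sql-kafka-0-10 package on the classpath; broker address, topic, and checkpoint path are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "events")
        .load()
        .select(F.col("value").cast("string").alias("payload"), "timestamp")
    )

    # The watermark bounds both late data and the size of the aggregation state.
    counts = (
        events.withWatermark("timestamp", "10 minutes")
        .groupBy(F.window("timestamp", "5 minutes"))
        .count()
    )

    # The checkpoint directory is what makes the query recoverable after failure.
    query = (
        counts.writeStream.outputMode("update").format("console")
        .option("checkpointLocation", "/tmp/checkpoints/stream-demo")
        .start()
    )
    query.awaitTermination()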

Module 6: Machine Learning with Spark MLlib

  • Overview of Spark MLlib and its ML Pipelines API
  • Feature engineering and transformations in Spark MLlib
  • Training and evaluating models with distributed ML algorithms
  • Hyperparameter tuning and cross-validation in Spark
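
A compact sketch of a Pipeline with feature assembly, logistic regression, and cross-validated grid search (the toy data and parameter grid are illustrative):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("mllib").getOrCreate()
    train = spark.createDataFrame(
        [(float(i), float(i % 3), int(i >= 6)) for i in range(12)],
        ["f1", "f2", "label"],
    )

    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[assembler, lr])

    # 3-fold cross-validation over a small regularization grid.
    cv = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build(),
        evaluator=BinaryClassificationEvaluator(labelCol="label"),
        numFolds=3,
    )
    model = cv.fit(train)
    model.transform(train).select("label", "prediction").show()
    spark.stop()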

Module 7: Graph Processing with GraphX

  • Introduction to GraphX and graph-based computations (accessed from PySpark via GraphFrames)
  • Building and analyzing social network graphs
  • PageRank and community detection algorithms
  • Optimizing large-scale graph processing workloads
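
GraphX itself exposes Scala/Java APIs only; from PySpark the usual route is the separate GraphFrames package, assumed installed here (e.g. via --packages graphframes:graphframes:0.8.3-spark3.5-s_2.12; coordinates vary by Spark version). A minimal sketch:

    from graphframes import GraphFrame
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("graphs").getOrCreate()

    # Vertices need an "id" column; edges need "src" and "dst" columns.
    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"]
    )
    edges = spark.createDataFrame(
        [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"]
    )
    g = GraphFrame(vertices, edges)

    # PageRank ranks vertices by link structure; label propagation is a
    # lightweight community-detection heuristic.
    g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()
    g.labelPropagation(maxIter=5).show()
    spark.stop()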

Module 8: Integrating Apache Spark with Delta Lake

  • Understanding Delta Lake’s ACID transactions and schema evolution
  • Using Delta Lake for scalable batch and streaming ingestion
  • Time travel and versioning in Delta tables
  • Optimizing Delta Lake performance with compaction (OPTIMIZE) and Z-order indexing
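
A sketch using the delta-spark pip package (the helper below wires the matching Delta JARs and catalog into the session; paths are illustrative):

    from delta import configure_spark_with_delta_pip
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.master("local[*]").appName("delta")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    path = "/tmp/delta/events"
    spark.range(100).write.format("delta").mode("overwrite").save(path)
    spark.range(100, 200).write.format("delta").mode("append").save(path)

    # Time travel: read the table as of an earlier commit version.
    print(spark.read.format("delta").option("versionAsOf", 0).load(path).count())  # 100

    # Compaction: rewrite many small files into fewer large ones.
    DeltaTable.forPath(spark, path).optimize().executeCompaction()
    spark.stop()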

Module 9: Object Storage Integration with MinIO

  • Understanding MinIO as an S3-compatible storage solution
  • Reading and writing Spark DataFrames to MinIO
  • Implementing data lake solutions with Spark and MinIO
  • Configuring security and access control for MinIO storage
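
A sketch of the S3A settings Spark needs to talk to MinIO (endpoint, credentials, and bucket are placeholders; hadoop-aws and its AWS SDK dependency must be on the classpath):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("minio")
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
        .config("spark.hadoop.fs.s3a.access.key", "MINIO_ACCESS_KEY")
        .config("spark.hadoop.fs.s3a.secret.key", "MINIO_SECRET_KEY")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")   # required for MinIO
        .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
        .getOrCreate()
    )

    df = spark.range(1000)
    df.write.mode("overwrite").parquet("s3a://my-bucket/demo/")
    print(spark.read.parquet("s3a://my-bucket/demo/").count())
    spark.stop()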

Module 10: Deploying Apache Spark in Production

  • Running Spark on Kubernetes, AWS EMR, and Databricks
  • Automating Spark job workflows with Apache Airflow
  • Implementing CI/CD pipelines for Spark applications
  • Monitoring, scaling, and troubleshooting production Spark jobs
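
A sketch of Airflow orchestration using SparkSubmitOperator from the apache-airflow-providers-apache-spark package (DAG id, connection, schedule, and application path are illustrative; Airflow 2.x assumed):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="nightly_spark_etl",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        run_etl = SparkSubmitOperator(
            task_id="run_etl",
            conn_id="spark_default",        # Airflow connection to the cluster master
            application="/opt/jobs/etl_job.py",
            conf={"spark.dynamicAllocation.enabled": "true"},
        )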

Hands-On Projects

Project 1: Batch Data Processing Pipeline with Spark and Delta Lake

  • Load large-scale datasets into Delta Lake
  • Perform batch transformations and aggregations with Spark
  • Implement time travel and schema enforcement
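
A sketch of schema enforcement, reusing the Delta-enabled session from the Module 8 sketch (paths are illustrative): an append with a mismatched schema fails unless evolution is explicitly enabled.

    from pyspark.sql import functions as F

    base = spark.range(10)
    base.write.format("delta").mode("overwrite").save("/tmp/delta/enforced")

    extra = spark.range(10).withColumn("source", F.lit("batch_2"))
    # A plain append would raise AnalysisException: the new column violates the schema.
    (extra.write.format("delta").mode("append")
        .option("mergeSchema", "true")     # opt-in schema evolution: the column is added
        .save("/tmp/delta/enforced"))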

Project 2: Real-Time Streaming with Kafka, Spark, and MinIO

  • Stream data from Kafka into Spark Structured Streaming
  • Store streaming data in MinIO as an object store
  • Process and visualize real-time analytics

Project 3: Machine Learning Pipeline with Spark MLlib

  • Train a large-scale ML model on distributed data
  • Perform feature engineering using Spark transformations
  • Deploy the trained model for batch inference
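
A sketch of batch inference, assuming a fitted pipeline was previously persisted with model.write().save(...) (model and data paths are hypothetical):

    from pyspark.ml import PipelineModel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batch-inference").getOrCreate()

    model = PipelineModel.load("/models/churn_pipeline")   # hypothetical path
    incoming = spark.read.parquet("/data/incoming/")
    scored = model.transform(incoming)                     # adds prediction columns
    scored.write.mode("overwrite").parquet("/data/scored/")
    spark.stop()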

Project 4: Building a Scalable ETL Pipeline with Apache Spark

  • Extract data from multiple sources and transform using PySpark
  • Load processed data into Delta Lake and MinIO
  • Automate ETL workflow with Apache Airflow

Project 5: Deploying and Monitoring Apache Spark on Kubernetes

  • Set up a Spark cluster on Kubernetes
  • Run distributed Spark jobs with dynamic resource allocation
  • Monitor job execution with Prometheus and Grafana
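
Spark 3.x exposes native Prometheus endpoints; a sketch of the relevant settings (treat these as starting points, and point Prometheus at the driver UI port):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("monitored-job")
        # Executor metrics served at /metrics/executors/prometheus on the driver UI.
        .config("spark.ui.prometheus.enabled", "true")
        # Driver metrics via the PrometheusServlet sink.
        .config("spark.metrics.conf.*.sink.prometheusServlet.class",
                "org.apache.spark.metrics.sink.PrometheusServlet")
        .config("spark.metrics.conf.*.sink.prometheusServlet.path",
                "/metrics/prometheus")
        .getOrCreate()
    )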
