Kafka - Syllabus

Mastering Apache Kafka: Real-Time Data Streaming with Python

Apache Kafka is a distributed event streaming platform used for building real-time data pipelines and applications. This book provides a hands-on approach to mastering Kafka, covering setup, message processing, fault tolerance, and integrations with MinIO, Apache Spark, and Airflow.

Module 1: Introduction to Apache Kafka and Event Streaming

  • What is Apache Kafka? Key features and use cases
  • Kafka’s architecture: Brokers, Topics, Producers, Consumers
  • Understanding partitions, offsets, and replication
  • Installing and setting up Kafka on local and cloud environments
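
As a preview of the partition and offset concepts listed above, here is a minimal pure-Python sketch. It does not talk to Kafka: the ToyTopic class is hypothetical, and it uses CRC32 for brevity, whereas Kafka's default partitioner hashes keys with murmur2. The point is only that records with the same key land in the same partition and receive increasing offsets there.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a topic (hypothetical, not a Kafka API)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        # partition id -> ordered list of (key, value) records
        self.partitions = defaultdict(list)

    def send(self, key, value):
        # The same key always hashes to the same partition,
        # so ordering is preserved per key.
        partition = zlib.crc32(key.encode("utf-8")) % self.num_partitions
        log = self.partitions[partition]
        log.append((key, value))
        # The offset is the record's position within its partition's log.
        offset = len(log) - 1
        return partition, offset


topic = ToyTopic(num_partitions=3)
p1, o1 = topic.send("user-1", "login")
p2, o2 = topic.send("user-1", "logout")
```

Both sends return the same partition, and the second offset is one higher than the first.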

Module 2: Producing and Consuming Data in Kafka

  • Writing Kafka Producers with Python
  • Writing Kafka Consumers with Python
  • Understanding message serialization (JSON, Avro, Protobuf)
  • Optimizing producer-consumer performance

Module 3: Kafka Topics, Partitions, and Message Retention

  • Creating and managing Kafka topics
  • Configuring partitions for scalability
  • Message retention policies and log compaction
  • Handling duplicate and out-of-order messages
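
The ideas behind log compaction and duplicate handling can be sketched without a broker. The two helpers below, compact and deduplicate, are hypothetical illustrations rather than Kafka APIs. The first keeps only the latest record per key, which is the guarantee compaction aims for. The second drops records whose ID was already seen, assuming each message carries a unique ID chosen by the producer.

```python
def compact(log):
    """Keep only the latest record per key, preserving the order of survivors."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [
        record
        for offset, record in enumerate(log)
        if latest_offset[record[0]] == offset
    ]


def deduplicate(messages):
    """Drop messages whose ID has already been seen (first occurrence wins)."""
    seen = set()
    unique = []
    for msg_id, payload in messages:
        if msg_id in seen:
            continue
        seen.add(msg_id)
        unique.append((msg_id, payload))
    return unique


compacted = compact([("a", 1), ("b", 2), ("a", 3)])
deduped = deduplicate([(1, "x"), (2, "y"), (1, "x")])
```

In a real consumer the set of seen IDs would need to be bounded or persisted; this sketch keeps it in memory for clarity.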

Module 4: Kafka Connect and Data Integration

  • Introduction to Kafka Connect for external system integration
  • Connecting Kafka to databases, APIs, and object storage (MinIO, PostgreSQL)
  • Configuring Source and Sink Connectors
  • Custom connector development: the Java-based Connect API and Python-based alternatives

Module 5: Stream Processing with Kafka Streams and PySpark

  • Introduction to Kafka Streams API
  • Processing Kafka messages in real time with Spark Structured Streaming
  • Stateful transformations, windowing, and joins in Kafka Streams
  • Handling backpressure and stream optimization
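
Windowing, listed above, can be illustrated with a small pure-Python sketch of a tumbling window count. The function name tumbling_window_counts and the (timestamp, key) event shape are assumptions made for illustration. Neither Kafka Streams nor Spark is used here, though both offer the same grouping idea through their own windowing APIs.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) using fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "a"), (5, "a"), (12, "a"), (14, "b")]
result = tumbling_window_counts(events, window_seconds=10)
```

With a 10-second window, the first two events fall into the window starting at 0 and the last two into the window starting at 10.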

Module 6: Data Pipeline Orchestration with Apache Airflow

  • Automating Kafka workflows with Airflow DAGs
  • Using the Airflow Apache Kafka provider's operators and sensors (e.g. ProduceToTopicOperator, ConsumeFromTopicOperator) for event-driven workflows
  • Managing ETL pipelines with Kafka, Spark, and Airflow
  • Implementing failure recovery and monitoring

Module 7: Kafka Security and Fault Tolerance

  • Securing Kafka with SSL/TLS and SASL authentication
  • Implementing ACLs and Role-Based Access Control (RBAC, available in Confluent Platform)
  • Kafka disaster recovery and multi-cluster replication
  • Monitoring Kafka clusters with Prometheus and Grafana

Module 8: Deploying Kafka in Production

  • Running Kafka on Kubernetes
  • Deploying Kafka in AWS, GCP, and Azure
  • Scaling Kafka clusters for high availability
  • Best practices for Kafka performance tuning

Hands-On Projects

Project 1: Real-Time Log Processing with Kafka and MinIO

  • Stream logs into Kafka from multiple sources
  • Store and retrieve event logs in MinIO
  • Analyze log patterns in real time with Spark Structured Streaming

Project 2: Fraud Detection System Using Kafka and Spark

  • Ingest real-time transaction data into Kafka
  • Process transactions for fraud detection using PySpark
  • Deploy an alerting system for anomalies

Project 3: Building a Real-Time ETL Pipeline with Kafka and Airflow

  • Automate data ingestion from APIs using Kafka Producers
  • Process and store data using Kafka Connect and PostgreSQL
  • Schedule and monitor ETL jobs with Apache Airflow

Project 4: IoT Sensor Data Streaming with Kafka

  • Simulate IoT sensors producing real-time data
  • Process and visualize IoT data using Kafka and Grafana
  • Implement real-time anomaly detection for sensor failures

Project 5: Deploying a Scalable Kafka Cluster on Kubernetes

  • Set up Kafka in a Kubernetes environment
  • Implement Kafka Streams for data transformation
  • Secure and monitor Kafka with industry best practices
