Iceberg: Syllabus

Mastering Apache Iceberg: Scalable Data Lakes with MinIO

Apache Iceberg is an open-source table format for large-scale data lake storage. It brings SQL table semantics to data lakes, with ACID transactions, schema evolution, and high query performance. This book takes a hands-on approach to building, managing, and optimizing Iceberg-based data lakes with MinIO as the storage backend.

Module 1: Introduction to Data Lake Architecture

Understanding the evolution of data lakes
Challenges of traditional data lakes and how Apache Iceberg solves them
Comparison of Apache Iceberg, Delta Lake, and Apache Hudi
Introduction to MinIO as an S3-compatible storage backend

Module 2: Setting Up Apache Iceberg with MinIO

Installing and configuring MinIO for Iceberg storage
Deploying Apache Iceberg with Apache Spark, Trino, and Flink
Connecting Iceberg to MinIO using the S3 API
Creating and managing Iceberg tables on MinIO
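
As a preview of the configuration this module covers, the sketch below collects the Spark properties usually involved when an Iceberg catalog stores its data on MinIO. It assumes a REST catalog and Iceberg's S3FileIO. The helper name, catalog name, bucket, and both URLs are hypothetical placeholders, and credentials (normally supplied through the standard AWS environment variables) are left out.

```python
def iceberg_minio_spark_conf(catalog, warehouse, s3_endpoint, catalog_uri):
    """Build Spark properties for an Iceberg REST catalog backed by MinIO.

    All argument values are supplied by the caller; nothing here contacts
    Spark, MinIO, or the catalog service.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg SQL extensions such as MERGE INTO and CALL.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",                # assumption: a REST catalog
        f"{prefix}.uri": catalog_uri,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,    # point S3FileIO at MinIO
        # MinIO is commonly addressed by path rather than by bucket subdomain.
        f"{prefix}.s3.path-style-access": "true",
    }


# Hypothetical local setup: MinIO on port 9000, REST catalog on port 8181.
conf = iceberg_minio_spark_conf(
    catalog="lake",
    warehouse="s3://warehouse/",
    s3_endpoint="http://localhost:9000",
    catalog_uri="http://localhost:8181",
)
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each pair can be passed to Spark with `--conf key=value` or through `SparkSession.builder.config(key, value)`. A different catalog type (Hive, JDBC, Nessie) would change the `type` and `uri` entries but not the S3 settings.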

Module 3: Apache Iceberg Table Format and Transactions

Deep dive into Iceberg’s table format and metadata handling
Implementing ACID transactions in a data lake
Understanding snapshots and time travel queries
Concurrent reads and writes with optimistic concurrency control
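
The snapshot and concurrency topics above can be pictured with a toy model. This is a simplified, single-process sketch and not Iceberg's implementation: `ToyTable`, `CommitConflict`, and `append_with_retry` are hypothetical names, and a real catalog performs the compare-and-swap atomically on the table's metadata pointer.

```python
class CommitConflict(Exception):
    """Raised when the table moved on after the writer read its base snapshot."""


class ToyTable:
    """Append-only table whose history is a list of immutable snapshots."""

    def __init__(self):
        self.snapshots = [frozenset()]  # snapshot 0 is the empty table

    def current_id(self):
        return len(self.snapshots) - 1

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or an older one for time travel."""
        if snapshot_id is None:
            snapshot_id = self.current_id()
        return self.snapshots[snapshot_id]

    def commit(self, base_id, added_files):
        """Commit only if nobody else committed since base_id was read."""
        if base_id != self.current_id():
            raise CommitConflict(f"base {base_id} is stale")
        self.snapshots.append(self.snapshots[base_id] | frozenset(added_files))
        return self.current_id()


def append_with_retry(table, files, max_attempts=3):
    """Optimistic writer: re-read the current snapshot and retry on conflict."""
    for _ in range(max_attempts):
        try:
            return table.commit(table.current_id(), files)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


table = ToyTable()
base_a = table.current_id()  # writer A reads snapshot 0
base_b = table.current_id()  # writer B reads snapshot 0 as well
table.commit(base_a, ["a.parquet"])  # A commits first, creating snapshot 1
try:
    table.commit(base_b, ["b.parquet"])  # B's base snapshot is now stale
except CommitConflict:
    append_with_retry(table, ["b.parquet"])  # B retries against snapshot 1

print(sorted(table.read()))   # latest snapshot
print(sorted(table.read(1)))  # time travel to snapshot 1
```

Readers never block in this model because every snapshot is immutable; only the writer that loses the race pays the cost of a retry.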

Module 4: Data Ingestion and ETL with Apache Iceberg

Batch vs. streaming ingestion with Iceberg
Writing efficient ETL pipelines using Apache Spark and Flink
Handling schema evolution and updates in Iceberg tables
Partitioning and sorting strategies for optimized queries

Module 5: Performance Optimization in Apache Iceberg

Optimizing queries with hidden partitioning and pruning
Compacting small files and optimizing table layouts
Implementing data skipping and vectorized reads
Best practices for scaling Apache Iceberg with MinIO
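
Hidden partitioning and pruning can be illustrated without a query engine. The sketch below mimics the idea behind a day-based partition transform: the partition value is derived from a timestamp column, so a filter on the timestamp alone is enough to skip files. The file list, paths, and function names are hypothetical, and real Iceberg reads partition values from manifest metadata rather than from a Python list.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts):
    """Map a timezone-aware timestamp to whole days since the Unix epoch."""
    return (ts - EPOCH).days


def prune(files, start, end):
    """Keep only files whose partition day overlaps the [start, end] filter."""
    first_day, last_day = day_transform(start), day_transform(end)
    return [path for path, day in files if first_day <= day <= last_day]


def utc(year, month, day, hour=0):
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


# Hypothetical data files, each tagged with the partition day it was written to.
files = [
    ("events/day1.parquet", day_transform(utc(2024, 1, 1, 9))),
    ("events/day2.parquet", day_transform(utc(2024, 1, 2, 14))),
    ("events/day3.parquet", day_transform(utc(2024, 1, 3, 23))),
]

# The query filters on the timestamp only; no partition column is mentioned.
selected = prune(files, utc(2024, 1, 2, 0), utc(2024, 1, 2, 23))
print(selected)
```

Because the transform is applied by the table format, users filter on the original column and do not need to know how the table is partitioned.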

Module 6: Data Governance and Security

Implementing fine-grained access control with IAM policies
Auditing Iceberg table changes and maintaining data integrity
Encryption at rest and in transit for Iceberg data on MinIO
Compliance and regulatory considerations (GDPR, HIPAA, SOC 2)

Module 7: Analytics and Machine Learning with Apache Iceberg

Querying Iceberg tables with Apache Spark, Trino, and Presto
Using Iceberg for feature engineering and ML workflows
Implementing real-time analytics with Apache Flink and Iceberg
Data versioning and rollback for reproducible ML models

Module 8: Real-Time Streaming and Change Data Capture (CDC)

Using Apache Iceberg for incremental data ingestion
Stream processing with Spark Structured Streaming and Apache Flink
Implementing Change Data Capture (CDC) in Iceberg tables
Managing upserts and deletes efficiently with MERGE operations
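
The upsert and delete semantics of a MERGE over CDC events can be sketched in a few lines. This is an in-memory illustration only: the event shape (`op`, `id`, `row`) and the function name are assumptions, and an engine such as Spark would express the same logic with a `MERGE INTO` statement against the Iceberg table.

```python
def merge_changes(table, changes):
    """Apply CDC events in order to rows keyed by primary key.

    Mirrors a MERGE with three branches: delete when matched and the event
    is a delete, update when matched, insert when not matched.
    Returns a new dict and leaves the input table untouched.
    """
    merged = dict(table)
    for event in changes:
        op, key = event["op"], event["id"]
        if op == "delete":
            merged.pop(key, None)  # deleting a missing key is a no-op
        elif op in ("insert", "update"):
            merged[key] = event["row"]  # upsert: insert or overwrite
        else:
            raise ValueError(f"unknown op: {op!r}")
    return merged


# Hypothetical target table and change feed.
customers = {
    1: {"name": "Ada", "city": "London"},
    2: {"name": "Linus", "city": "Helsinki"},
}
changes = [
    {"op": "update", "id": 1, "row": {"name": "Ada", "city": "Paris"}},
    {"op": "delete", "id": 2},
    {"op": "insert", "id": 3, "row": {"name": "Grace", "city": "New York"}},
]

result = merge_changes(customers, changes)
print(result)
```

Events are applied in arrival order here; a production pipeline typically deduplicates to the latest event per key before merging so that each target row is matched at most once.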

Module 9: Cloud-Native Deployment and Scaling

Deploying Apache Iceberg on Kubernetes with MinIO
Running Iceberg in multi-cloud environments (AWS, Azure, GCP)
Automating Iceberg table management with Terraform and Ansible
Best practices for high availability and fault tolerance

Hands-On Projects

Project 1: Building a Data Lake with Apache Iceberg and MinIO

Set up MinIO as an object storage backend
Create and manage Iceberg tables for structured and semi-structured data
Implement data ingestion, transformations, and querying

Project 2: Real-Time Data Processing Pipeline with Iceberg and Spark Streaming

Stream data from Kafka into Apache Iceberg
Implement time travel and rollback features
Perform real-time analytics with Spark SQL

Project 3: Feature Store for Machine Learning Using Iceberg

Build a feature store using Iceberg tables
Integrate Iceberg with ML models in Databricks or PyTorch
Implement snapshot-based versioning for ML datasets

Project 4: Secure and Scalable Iceberg-Based Data Lake

Implement IAM policies for MinIO and Iceberg access control
Encrypt data at rest and in transit using TLS and encryption keys
Deploy Iceberg with Kubernetes for high availability and security

Project 5: Cloud-Native Data Lake with Serverless Processing

Deploy Apache Iceberg with AWS Lambda for serverless processing
Automate ETL workflows with Apache Airflow and Iceberg
Optimize storage costs with tiering and lifecycle management



Rizki Sasri Dwitama
