Apache Spark
Process big data at lightning speed with Apache Spark and PySpark.
4 Days | Intermediate | TensorNova Certificate of Completion
Course Curriculum
Module 1: Spark Architecture
- Driver, executors, and cluster managers
- DAG execution model
- Running Spark locally, on YARN, and on Kubernetes
- Spark UI for job monitoring
Module 2: Core APIs
- RDD operations: map, filter, reduce
- DataFrame and Dataset APIs
- Spark SQL and views
- User-defined functions (UDFs)
Module 3: Structured Streaming
- Streaming DataFrames
- Kafka as a streaming source
- Windowing and watermarking
- Output sinks: Delta Lake, Kafka, HDFS
Module 4: Performance Tuning
- Partitioning and shuffles
- Caching and persistence levels
- Broadcast joins
- Adaptive Query Execution
Module 5: MLlib & Delta Lake
- Spark MLlib pipelines
- Classification and clustering with MLlib
- Delta Lake ACID transactions
- Time travel and data versioning
Prerequisites
- Python or Scala programming
- Basic SQL knowledge
- Understanding of Hadoop concepts is helpful
Who Should Attend
- Data engineers building streaming pipelines
- Data scientists scaling ML workflows
- Hadoop developers migrating to Spark
Get Started
Interested in Apache Spark?
Our training advisors will help you choose the right batch format, dates, and pricing for your team or individual goals.