Apache Spark

Process big data at lightning speed with Apache Spark and PySpark.

4 Days | Intermediate | TensorNova Certificate of Completion

Course Curriculum

Module 1: Spark Architecture
  • Driver, executors, and cluster managers
  • DAG execution model
  • Spark deployment modes: local, YARN, Kubernetes
  • Spark UI for job monitoring
Module 2: Core APIs
  • RDD operations: map, filter, reduce
  • DataFrame and Dataset APIs
  • Spark SQL and views
  • User-defined functions (UDFs)
Module 3: Structured Streaming
  • Streaming DataFrames
  • Kafka as a streaming source
  • Windowing and watermarking
  • Output sinks: Delta Lake, Kafka, HDFS
Module 4: Performance Tuning
  • Partitioning and shuffles
  • Caching and persistence levels
  • Broadcast joins
  • Adaptive Query Execution
Module 5: MLlib & Delta Lake
  • Spark MLlib pipelines
  • Classification and clustering with MLlib
  • Delta Lake ACID transactions
  • Time travel and data versioning

Prerequisites

  • Python or Scala programming
  • Basic SQL knowledge
  • Understanding of Hadoop concepts is helpful

Who Should Attend

  • Data engineers building streaming pipelines
  • Data scientists scaling ML workflows
  • Hadoop developers migrating to Spark

Interested in Apache Spark?

Our training advisors will help you choose the right batch format, dates, and pricing for your team or individual goals.