
Apache Spark

Process big data at lightning speed with Apache Spark and PySpark.

4 Days | Intermediate | TensorNova Certificate of Completion

Course Curriculum

Module 1: Spark Architecture

Driver, executors, and cluster managers
DAG execution model
Spark deployment modes: local, YARN, Kubernetes
Spark UI for job monitoring

Module 2: Core APIs

RDD operations: map, filter, reduce
DataFrames and Dataset API
Spark SQL and views
User-defined functions (UDFs)

Module 3: Structured Streaming

Streaming DataFrames
Kafka as a streaming source
Windowing and watermarking
Output sinks: Delta Lake, Kafka, HDFS

Module 4: Performance Tuning

Partitioning and shuffles
Caching and persistence levels
Broadcast joins
Adaptive Query Execution

Module 5: MLlib & Delta Lake

Spark MLlib pipelines
Classification and clustering with MLlib
Delta Lake ACID transactions
Time travel and data versioning

Prerequisites

Python or Scala programming
Basic SQL knowledge
Understanding of Hadoop concepts is helpful

Who Should Attend

Data engineers building streaming pipelines
Data scientists scaling ML workflows
Hadoop developers migrating to Spark

Get Started

Interested in Apache Spark?

Our training advisors will help you choose the right batch format, dates, and pricing for your team or individual goals.
