Section 1: Databricks Tooling
This section describes how Delta Lake uses the transaction log and cloud object storage to make data changes atomic and durable; explains how Delta Lake handles multiple concurrent writers and which operations can conflict; outlines the basic uses of Delta clone; and covers common ways to speed up Delta Lake, including partitioning, Z-ordering, Bloom filter indexes, and file size tuning.
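A minimal sketch of those optimization levers, assuming a hypothetical table `events` with `event_date` and `user_id` columns (the table, column names, and tuning values are illustrative, not exam-prescribed settings):

```python
# Partition on a low-cardinality column that queries frequently filter on.
(spark.table("raw_events")            # hypothetical source table
     .write.format("delta")
     .partitionBy("event_date")
     .saveAsTable("events"))

# Z-ordering co-locates data files by a high-cardinality column to improve data skipping.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")

# A Bloom filter index speeds up selective point lookups on the indexed column.
spark.sql("""
    CREATE BLOOMFILTER INDEX ON TABLE events
    FOR COLUMNS (user_id OPTIONS (fpp = 0.1, numItems = 50000000))
""")

# The target file size property influences how OPTIMIZE and auto-compaction size files.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.targetFileSize' = '128mb')")
```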
Section 2: Data Processing (Batch processing, Incremental processing, and Optimization)
This section explains and compares different ways to redistribute data: coalesce, repartition, repartition by range, and rebalance; compares partitioning strategies (for example, choosing the right partition columns); explains how to write PySpark DataFrames while controlling individual file sizes; describes multiple ways to update one or more records in a Spark table (Type 1); covers common patterns enabled by Structured Streaming and Delta Lake; and shows how to implement and tune stream-static joins with Delta Lake.
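A minimal sketch of these patterns, using hypothetical table and column names (sales, sales_silver, updates, orders_bronze, dim_customers):

```python
df = spark.table("sales")

# coalesce only merges partitions (no shuffle); repartition triggers a full shuffle;
# repartitionByRange range-partitions on a column, which helps downstream sorting and skew.
fewer  = df.coalesce(8)
even   = df.repartition(64, "region")
ranged = df.repartitionByRange(64, "order_date")

# The REBALANCE hint asks adaptive query execution to even out partition sizes.
balanced = spark.sql("SELECT /*+ REBALANCE(region) */ * FROM sales")

# Control output file sizes by capping the records written per file.
(ranged.write.format("delta")
       .option("maxRecordsPerFile", 1_000_000)
       .mode("overwrite")
       .saveAsTable("sales_silver"))

# Type 1 update: overwrite matching records in place with MERGE.
spark.sql("""
    MERGE INTO sales_silver AS t
    USING updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Stream-static join: a streaming source enriched with a static Delta dimension table;
# the latest version of the static Delta table is used for each micro-batch.
orders_stream = spark.readStream.table("orders_bronze")
customers     = spark.table("dim_customers")
enriched      = orders_stream.join(customers, "customer_id", "left")
```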
Section 3: Data Modeling
This section explains the purpose of transforming data as it moves from bronze to silver; discusses how Change Data Feed (CDF) propagates updates and deletes through Lakehouse pipelines; uses Delta Lake clone to show how shallow and deep clones interact with source and target tables; shows how to design a multiplex bronze table to avoid common issues when productionizing streaming workloads; applies best practices when streaming from multiplex bronze tables; and applies incremental processing, quality enforcement, and de-duplication when moving data from bronze to silver.
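A minimal sketch of these patterns, assuming a hypothetical multiplex bronze table with `topic` and `value` columns and hypothetical downstream table names:

```python
import pyspark.sql.functions as F

# Stream from the multiplex bronze table: filter by topic, parse the payload,
# enforce basic quality rules, and de-duplicate within a watermark.
orders_schema = "order_id BIGINT, customer_id BIGINT, amount DOUBLE, order_ts TIMESTAMP"

silver_query = (
    spark.readStream.table("bronze")
         .filter("topic = 'orders'")
         .select(F.from_json(F.col("value").cast("string"), orders_schema).alias("v"))
         .select("v.*")
         .filter("order_id IS NOT NULL AND amount >= 0")     # quality checks
         .withWatermark("order_ts", "30 minutes")
         .dropDuplicates(["order_id", "order_ts"])           # de-duplication
         .writeStream
         .option("checkpointLocation", "/mnt/checkpoints/orders_silver")
         .trigger(availableNow=True)                         # incremental batch
         .toTable("orders_silver")
)

# Change Data Feed: read row-level inserts, updates, and deletes to propagate them
# downstream (requires delta.enableChangeDataFeed = true on the source table).
changes = (spark.read.format("delta")
                .option("readChangeFeed", "true")
                .option("startingVersion", 5)
                .table("orders_silver"))

# Shallow clone copies only metadata and shares the source's data files;
# deep clone copies the data as well and is fully independent of the source.
spark.sql("CREATE TABLE orders_dev SHALLOW CLONE orders_silver")
spark.sql("CREATE TABLE orders_backup DEEP CLONE orders_silver")
```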
Section 4: Security & Governance
This section discusses how to create dynamic views to mask sensitive data and how to use dynamic views to control row- and column-level access.
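A minimal sketch of a dynamic view that masks a column and filters rows based on group membership; the table, view, column, and group names are hypothetical:

```python
spark.sql("""
    CREATE OR REPLACE VIEW customers_redacted AS
    SELECT
      customer_id,
      CASE WHEN is_account_group_member('pii_readers') THEN email
           ELSE 'REDACTED' END AS email,        -- column-level masking
      region,
      total_spend
    FROM customers
    WHERE is_account_group_member('admins')      -- row-level filtering
       OR region = 'US'
""")
```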
Section 5: Monitoring & Logging
This section focuses on the parts of the Spark UI that help you tune, debug, and troubleshoot Spark applications; timelines and metrics for stages and jobs on a cluster; using the Spark UI, Ganglia UI, and Cluster UI to diagnose performance problems and failing applications; designing systems that balance cost and latency for production streaming jobs; and deploying and monitoring streaming and batch workloads.
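A minimal sketch of programmatic streaming observability to complement the Spark UI, assuming `query` is a handle to a running StreamingQuery (hypothetical name):

```python
import json

# Metrics for the most recent micro-batch (None if no batch has completed yet).
progress = query.lastProgress
if progress:
    print(json.dumps({
        "batchId": progress["batchId"],
        "inputRowsPerSecond": progress["inputRowsPerSecond"],
        "durationMs": progress["durationMs"],
    }, indent=2))

# recentProgress keeps a rolling window of past micro-batch metrics,
# useful for spotting growing batch durations or falling throughput.
for p in query.recentProgress[-5:]:
    print(p["batchId"], p["processedRowsPerSecond"])
```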
Section 6: Testing & Deployment
This exam section focuses on converting a notebook dependency pattern to Python file dependencies; adapting Python code packaged in wheels to use direct, relative-path imports; repairing and rerunning failed jobs; creating Jobs for common needs and patterns; and building a multi-task job with multiple dependencies.
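A minimal sketch of moving notebook-style code into a Python package that a wheel or job task can import; the package, module, and table names are hypothetical:

```python
# Package layout (each file shown below as a labeled snippet):
#
# my_project/
# ├── __init__.py
# ├── transforms.py
# └── jobs/
#     ├── __init__.py
#     └── daily_load.py

# my_project/transforms.py
def add_ingest_metadata(df):
    """Tag a DataFrame with ingestion metadata."""
    from pyspark.sql import functions as F
    return df.withColumn("ingested_at", F.current_timestamp())

# my_project/jobs/daily_load.py
from ..transforms import add_ingest_metadata   # relative import instead of %run

def run(spark):
    df = spark.read.table("bronze.daily_raw")
    add_ingest_metadata(df).write.mode("append").saveAsTable("silver.daily")
```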
Databricks Tooling
Master the Databricks workspace, including notebooks, clusters, and data storage, to efficiently develop and manage Spark workloads.
Data Processing
This section is about leveraging Spark Core, Spark SQL, Delta Lake, and structured streaming for efficient and scalable data processing and transformation.
Data Modeling
Understand the art of data modeling with Delta Lake, including schema design, data partitioning, and optimization techniques for query performance.
Security and Governance
This section covers data security and compliance with authentication, authorization, access controls, and governance practices in Databricks.
Monitoring and Logging
This section deals with the skills needed to monitor and optimize Databricks workloads using metrics, logs, and performance-tuning techniques.
Testing and Deployment
This section covers how to implement best practices for testing and deploying Spark applications, including CI/CD pipelines and version control.
Official Information
https://www.databricks.com/learn/certification/data-engineer-professional |