The Fastest Way to Become a Backend Developer Online
Learn Excel & Financial Modeling the Way Finance Teams Actually Use Them
Overview
Google, IBM & Meta Certificates — All 10,000+ Courses at 40% Off
One annual plan covers every course and certificate on Coursera. 40% off for a limited time.
Get Full Access
Explore data validation techniques for large-scale data pipelines in this 22-minute Databricks conference talk. Learn about the importance of data validation in interconnected data pipelines and compare popular frameworks like Great Expectations with lightweight alternatives. Discover how to extend Pandas-based validation libraries to Spark workflows using Fugue, an open-source framework. Gain insights into applying different validation rules for each partition in big data scenarios, addressing a common deficiency in current frameworks. Follow along with an interactive demo that combines Fugue and Pandera to create a flexible and efficient data validation solution for Spark. Understand the trade-offs between robust features and performance, and learn how to tailor your validation approach to your specific needs.
Syllabus
Intro
Case Study
Data Validation
Common Validations
Great Expectations - Detailed Results
Great Expectations - Data Documentation
Pandera-Sample Code
Comparison of Validation Frameworks
Fugue - Basic Code
Combining Fugue and Pandera
Example Data - Food Sloth's Pricing
Validation by Partition
Taught by
Databricks