Apache Arrow and the Future of Data Frames with Wes McKinney
Association for Computing Machinery (ACM) via YouTube
Overview
Syllabus
Apache Arrow and the Future of Data Frames
Career Theme: Programming interfaces for data preparation, analytics, and feature engineering
What exactly is a data frame?
A data frame is ... a programming interface ... for expressing data manipulations
Data frames address many analytical workloads that are either not possible or not well-served by traditional SQL-based systems
In R, the "data frame" data structure is part of the language • Other projects implement their own (e.g. pandas) • Some projects may not use any data structures (e.g. compiling operations to SQL)
Most data frames are effectively "Islands" with a hard serialization barrier • Many non-reusable implementations of the same algorithms • Limited collaboration across projects and programming languages
Apache Arrow: Open source community project launched in 2016 • Intersection of database systems, big data, and data science tools • Purpose: Language-independent open standards and libraries to accelerate and simplify in-memory computing
Improve interoperability with other data processing systems • Standardize data structures used in data frame implementations • Promote collaboration and code reuse across libraries and programming languages
Limited data types • Excessive memory consumption • Poor processing efficiency for non-numeric types • Accommodate larger-than-memory datasets
Apache Arrow Project Overview: Language-agnostic in-memory columnar format for analytical query engines, data frames • Binary protocol for IPC/RPC • "Batteries included" development platform for building data processing applications
Arrow and the Future of Data Frames: As more data sources offer Arrow-based data access, it will make sense to process Arrow in situ rather than converting to some other data structure • Analytical systems will generally grow more efficient the more "Arrow-native" they become
Runtime memory format for analytical query processing • Ideal companion to columnar storage like Apache Parquet • Fully shredded columnar, supports flat and nested schemas • Organized for cache-efficient access on CPU/GPUs • Optimized for data locality, SIMD, parallel processing • Accommodates both random access and scan workloads
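The columnar layout named in the last syllabus item can be illustrated with a simplified sketch that uses only the Python standard library. It is not taken from the talk and does not use the Arrow libraries. The helper names `to_columnar_strings` and `get_string` and the sample records are hypothetical. Real Arrow packs validity flags into a bitmap and adds alignment and padding rules that are not modeled here.

```python
from array import array


def to_columnar_strings(values):
    """Encode a list of str/None as (validity, offsets, data).

    Loosely mirrors a variable-length string column: one contiguous
    byte buffer plus an offsets buffer with n_rows + 1 entries.
    """
    validity = []               # one flag per row (a packed bitmap in real Arrow)
    offsets = array("i", [0])   # start/end positions into the data buffer
    data = bytearray()          # all UTF-8 bytes stored back to back
    for v in values:
        validity.append(v is not None)
        if v is not None:
            data.extend(v.encode("utf-8"))
        offsets.append(len(data))  # a null row repeats the previous offset
    return validity, offsets, bytes(data)


def get_string(column, i):
    """Random access: read row i without touching any other row."""
    validity, offsets, data = column
    if not validity[i]:
        return None
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


# Hypothetical row-oriented input, "shredded" into one buffer set per column.
records = [
    {"city": "Oslo", "temp_c": 4},
    {"city": None, "temp_c": 11},
    {"city": "Lima", "temp_c": 19},
]
city = to_columnar_strings([r["city"] for r in records])
temp = array("i", [r["temp_c"] for r in records])  # fixed-width numeric column

print(get_string(city, 2))    # random access into the string column
print(sum(temp) / len(temp))  # scan over one contiguous numeric buffer
```

Keeping each column in its own contiguous buffer is what allows a scan such as the mean above to read only the bytes it needs, while the offsets buffer keeps random access to a single string at constant cost.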
Taught by
Association for Computing Machinery (ACM)