Nikolai Nikolaev
Let's Get Aggregated: Custom UDAFs in Spark
#1 about 2 minutes
Going beyond standard aggregations in Spark
Standard functions like sum and count cover the common cases, but custom aggregations are needed when the business logic is specific or when performance on large datasets matters.
#2 about 4 minutes
Understanding the Spark Aggregator interface
The Aggregator interface requires implementing zero, reduce, merge, and finish methods to support Spark's distributed execution model of pre-aggregation and shuffling.
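The four methods can be pictured with a small model in plain Python. This is only a sketch of the contract, not the Spark API: the names `AggregatorModel` and `run`, and the choice of a mean as the example, are assumptions made for illustration. Each partition folds its rows into a buffer that starts at `zero` (pre-aggregation), the partial buffers are combined with `merge` (what happens after the shuffle), and `finish` turns the final buffer into the result.

```python
from dataclasses import dataclass
from functools import reduce as fold
from typing import Callable, Generic, List, TypeVar

IN = TypeVar("IN")    # type of one input row
BUF = TypeVar("BUF")  # type of the intermediate buffer
OUT = TypeVar("OUT")  # type of the final result


@dataclass(frozen=True)
class AggregatorModel(Generic[IN, BUF, OUT]):
    """Hypothetical stand-in for the four-method contract."""
    zero: Callable[[], BUF]            # empty buffer
    reduce: Callable[[BUF, IN], BUF]   # fold one row into a buffer
    merge: Callable[[BUF, BUF], BUF]   # combine two partial buffers
    finish: Callable[[BUF], OUT]       # buffer -> final value


def run(agg: AggregatorModel, partitions: List[list]):
    # Pre-aggregation: every partition builds its own partial buffer.
    partials = [fold(agg.reduce, rows, agg.zero()) for rows in partitions]
    # After the shuffle: partial buffers are merged into one.
    merged = fold(agg.merge, partials, agg.zero())
    return agg.finish(merged)


# Example aggregation: a mean, with (sum, count) as the buffer.
mean = AggregatorModel(
    zero=lambda: (0.0, 0),
    reduce=lambda b, x: (b[0] + x, b[1] + 1),
    merge=lambda a, b: (a[0] + b[0], a[1] + b[1]),
    finish=lambda b: b[0] / b[1] if b[1] else None,
)

result = run(mean, [[1.0, 2.0], [3.0], []])
```

The empty third partition shows why `zero` must be a neutral element for `merge`: a partition with no rows contributes a buffer that leaves the result unchanged.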
#3 about 2 minutes
Analyzing a standard word count solution's inefficiency
A typical word count solution using built-in functions and window functions results in an inefficient execution plan with two separate data shuffles.
#4 about 2 minutes
Designing a UDAF for an efficient word count
The high-level design for a custom word count aggregator involves using a local frequency map as a buffer to pre-aggregate data before a single shuffle and merge step.
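A rough illustration of this design in plain Python, under stated assumptions: the input is lines of text, words are split on whitespace and lowercased, and the final result is the full frequency map. None of these details are confirmed by the summary, and `word_count` is a hypothetical driver that simulates partitions with nested lists rather than running on Spark.

```python
from collections import Counter
from typing import Dict, List


def zero() -> Counter:
    # Empty local frequency map.
    return Counter()


def reduce(buffer: Counter, line: str) -> Counter:
    # Pre-aggregation: count the words of one row into the local map.
    buffer.update(line.lower().split())
    return buffer


def merge(left: Counter, right: Counter) -> Counter:
    # Combine two partial maps by adding counts per word.
    left.update(right)
    return left


def finish(buffer: Counter) -> Dict[str, int]:
    # Assumed output shape: a plain word -> count mapping.
    return dict(buffer)


def word_count(partitions: List[List[str]]) -> Dict[str, int]:
    partials = []
    for lines in partitions:          # one buffer per partition
        buf = zero()
        for line in lines:
            buf = reduce(buf, line)
        partials.append(buf)
    total = zero()
    for partial in partials:          # single merge step
        total = merge(total, partial)
    return finish(total)


partitions = [["spark is fast", "spark scales"], ["fast spark"]]
counts = word_count(partitions)
```

Only the compact per-partition maps cross the merge step here, which mirrors the idea of moving pre-aggregated buffers rather than raw rows.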
#5 about 5 minutes
A step-by-step implementation of the UDAF methods
This walkthrough covers the Scala implementation of the Aggregator interface, including defining type parameters and coding the zero, reduce, merge, and finish logic.
#6 about 1 minute
Analyzing the performance gains of the custom UDAF
Applying the custom aggregator and inspecting its execution plan shows a single data shuffle instead of two, which accounts for the performance improvement.
#7 about 2 minutes
Leveraging complex data structures in UDAFs
UDAFs can handle complex, structured data types like case classes for both intermediate buffers and final outputs, enabling sophisticated, multi-part aggregations.
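As a sketch of what a structured buffer and a structured output can look like, the plain-Python model below uses dataclasses where the Scala version would use case classes. The names `StatsBuffer`, `StatsSummary`, and `summarize`, and the choice of count, mean, and range as the outputs, are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StatsBuffer:
    """Intermediate state carried between reduce and merge calls."""
    count: int = 0
    total: float = 0.0
    minimum: Optional[float] = None
    maximum: Optional[float] = None


@dataclass(frozen=True)
class StatsSummary:
    """Final multi-part result produced by finish."""
    count: int
    mean: Optional[float]
    value_range: Optional[float]


def _combine(fn, a, b):
    # Apply fn to two optional values, treating None as "no value yet".
    if a is None:
        return b
    if b is None:
        return a
    return fn(a, b)


def reduce(buf: StatsBuffer, x: float) -> StatsBuffer:
    return StatsBuffer(buf.count + 1, buf.total + x,
                       _combine(min, buf.minimum, x),
                       _combine(max, buf.maximum, x))


def merge(a: StatsBuffer, b: StatsBuffer) -> StatsBuffer:
    return StatsBuffer(a.count + b.count, a.total + b.total,
                       _combine(min, a.minimum, b.minimum),
                       _combine(max, a.maximum, b.maximum))


def finish(buf: StatsBuffer) -> StatsSummary:
    if buf.count == 0:
        return StatsSummary(0, None, None)
    return StatsSummary(buf.count, buf.total / buf.count,
                        buf.maximum - buf.minimum)


def summarize(partitions: List[List[float]]) -> StatsSummary:
    partials = []
    for rows in partitions:
        buf = StatsBuffer()           # zero value
        for x in rows:
            buf = reduce(buf, x)
        partials.append(buf)
    merged = StatsBuffer()
    for partial in partials:
        merged = merge(merged, partial)
    return finish(merged)


summary = summarize([[2.0, 8.0], [], [5.0]])
```

One pass over the data yields several related outputs at once, because the buffer tracks all the pieces that `finish` needs.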
#8 about 3 minutes
Key use cases for custom aggregation functions
Custom aggregators are ideal for complex business logic, performance optimization, code reusability across teams, and integration with the Spark SQL API.