Go beyond standard functions in Spark. Build custom, type-safe aggregations to execute complex logic with only a single data shuffle.
#1about 2 minutes
Going beyond standard aggregations in Spark
Standard functions like sum and count are useful, but custom aggregations are required for specific business logic and performance on large datasets.
#2about 4 minutes
Understanding the Spark Aggregator interface
The Aggregator interface requires implementing zero, reduce, merge, and finish methods to support Spark's distributed execution model of pre-aggregation and shuffling.
#3about 2 minutes
Analyzing a standard word count solution's inefficiency
A typical word count solution using built-in functions and window functions results in an inefficient execution plan with two separate data shuffles.
#4about 2 minutes
Designing a UDAF for an efficient word count
The high-level design for a custom word count aggregator involves using a local frequency map as a buffer to pre-aggregate data before a single shuffle and merge step.
#5about 5 minutes
A step-by-step implementation of the UDAF methods
This walkthrough covers the Scala implementation of the Aggregator interface, including defining type parameters and coding the zero, reduce, merge, and finish logic.
#6about 1 minute
Analyzing the performance gains of the custom UDAF
Applying the custom aggregator and inspecting its execution plan reveals a significant performance improvement, reducing the process to a single data shuffle.
#7about 2 minutes
Leveraging complex data structures in UDAFs
UDAFs can handle complex, structured data types like case classes for both intermediate buffers and final outputs, enabling sophisticated, multi-part aggregations.
#8about 3 minutes
Key use cases for custom aggregation functions
Custom aggregators are ideal for complex business logic, performance optimization, code reusability across teams, and integration with the Spark SQL API.
Related jobs
Jobs that call for the skills explored in this talk.
Dev Digest 213: Petrol Prices, Agentic Workflows, AI Skills and CODE100!Inside last week’s Dev Digest 213 .
🤫 Don’t tell your LLM that it is an expert
👻 AI generated code is invisible
🔄 Learn about agentic workflows
🛡️ Linux Foundation sponsors fight against AI slop
🦠 1M users infected by Chrome extension
🫃 The why of J...
Christina Schaireiter
Why Attend a Developer Event?Modern software engineering moves too fast for documentation alone. Attending a world-class event is about shifting from tactical execution to strategic leadership.
Skill Diversification: Break out of your specific tech stack to see how the industry...
Daniel Cranney
Dev Digest 160: Graphs and RAGs Explained and VS Code Extension Hacks Inside last week’s Dev Digest 160 .
🤖 How AI is reshaping UI and work
🚀 Tips on how to use Cursor most efficiently
🔒 How VS Code extensions can be a massive security issue
👩💻 What the move to Go for Typescript means for developers
👎 What a possible...
From learning to earning
Jobs that call for the skills explored in this talk.