Nikolai Nikolaev

Aug 20, 2025 • World Congress 2025

Let's Get Aggregated: Custom UDAFs in Spark

Go beyond standard functions in Spark. Build custom, type-safe aggregations to execute complex logic with only a single data shuffle.

#1 about 2 minutes

Going beyond standard aggregations in Spark

Standard functions like sum and count are useful, but custom aggregations are required for specific business logic and performance on large datasets.

#2 about 4 minutes

Understanding the Spark Aggregator interface

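The Aggregator interface requires implementing the zero, reduce, merge, and finish methods, along with encoders for the buffer and output types. Together they support Spark's distributed execution model: rows are pre-aggregated within each partition, the partial results are shuffled, and the partials are then merged.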
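As a rough illustration of how these methods fit together, here is a minimal sketch of an averaging aggregator written against Spark's `org.apache.spark.sql.expressions.Aggregator`. The names `AvgBuffer` and `Average` are hypothetical and chosen only for this example; they are not taken from the talk.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical buffer type: running sum and row count for one group.
case class AvgBuffer(sum: Double, count: Long)

// Type parameters are Aggregator[IN, BUF, OUT].
object Average extends Aggregator[Double, AvgBuffer, Double] {
  // Empty buffer; merging it into another buffer must leave that buffer unchanged.
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)

  // Fold one input value into a buffer (the per-partition pre-aggregation step).
  def reduce(buf: AvgBuffer, value: Double): AvgBuffer =
    AvgBuffer(buf.sum + value, buf.count + 1)

  // Combine two partial buffers that were built independently.
  def merge(a: AvgBuffer, b: AvgBuffer): AvgBuffer =
    AvgBuffer(a.sum + b.sum, a.count + b.count)

  // Convert the fully merged buffer into the final result.
  def finish(buf: AvgBuffer): Double =
    if (buf.count == 0) Double.NaN else buf.sum / buf.count

  // Encoders tell Spark how to serialize the buffer and the output.
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```

Because `reduce` and `merge` are plain functions, they can be exercised without a cluster, for example by folding two small sequences into separate buffers and merging the results.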

#3 about 2 minutes

Analyzing a standard word count solution's inefficiency

A typical word count solution using built-in functions and window functions results in an inefficient execution plan with two separate data shuffles.
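The summary does not state the exact problem from the talk, so the sketch below assumes one plausible reading: finding the most frequent word per author. The column names `author` and `word` and the object name `BaselineTopWord` are hypothetical. It combines a built-in `groupBy` count with a `row_number` window, which is the kind of plan that typically needs two exchanges.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, desc, row_number}

object BaselineTopWord {
  // Assumed input schema: (author: String, word: String).
  def topWordPerAuthor(words: DataFrame): DataFrame = {
    // First shuffle: rows are repartitioned by (author, word) to count them.
    val counts = words.groupBy("author", "word").agg(count("*").as("cnt"))

    // Second shuffle: the window needs all rows of one author together,
    // which the (author, word) partitioning does not guarantee.
    val byAuthor = Window.partitionBy("author").orderBy(desc("cnt"), col("word"))

    counts
      .withColumn("rn", row_number().over(byAuthor))
      .where(col("rn") === 1) // keep the top-ranked word per author
      .select("author", "word", "cnt")
  }
}
```

Calling `explain()` on the returned DataFrame should show the exchanges in the physical plan; ties are broken alphabetically here only to keep the result deterministic.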

#4 about 2 minutes

Designing a UDAF for an efficient word count

The high-level design for a custom word count aggregator involves using a local frequency map as a buffer to pre-aggregate data before a single shuffle and merge step.
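The design can be tried out in plain Scala before involving Spark at all. The sketch below simulates two partitions as ordinary sequences; `FrequencyMapDesign`, `localCounts`, and `combine` are hypothetical names used only for illustration.

```scala
object FrequencyMapDesign {
  type Freq = Map[String, Long]

  // Step 1 (per partition): count words locally, with no network traffic.
  def localCounts(words: Seq[String]): Freq =
    words.foldLeft(Map.empty[String, Long]) { (freq, w) =>
      freq.updated(w, freq.getOrElse(w, 0L) + 1L)
    }

  // Step 2 (after the single shuffle): add up two partial frequency maps.
  def combine(a: Freq, b: Freq): Freq =
    b.foldLeft(a) { case (acc, (w, n)) =>
      acc.updated(w, acc.getOrElse(w, 0L) + n)
    }

  def main(args: Array[String]): Unit = {
    // Two simulated partitions holding words for the same group.
    val partitions = Seq(Seq("spark", "scala", "spark"), Seq("scala", "udaf"))
    val merged = partitions.map(localCounts).reduce(combine)
    println(merged)
  }
}
```

Only the compact maps would cross the network in this scheme, not the individual rows, which is where the saving over a row-level shuffle is expected to come from.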

#5 about 5 minutes

A step-by-step implementation of the UDAF methods

This walkthrough covers the Scala implementation of the Aggregator interface, including defining type parameters and coding the zero, reduce, merge, and finish logic.
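The talk's own code is not reproduced in this summary, so the following is only one way such an implementation might look. It assumes the aggregator returns the most frequent word; `WordFreq` and `MostFrequentWord` are hypothetical names, and the buffer map is wrapped in a case class so that the public `Encoders.product` encoder can be used.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical buffer wrapper around the frequency map.
case class WordFreq(counts: Map[String, Long])

// IN = String (one word), BUF = WordFreq, OUT = String (most frequent word).
object MostFrequentWord extends Aggregator[String, WordFreq, String] {
  // Start with an empty frequency map.
  def zero: WordFreq = WordFreq(Map.empty)

  // Count one more occurrence of the word; null words are skipped.
  def reduce(buf: WordFreq, word: String): WordFreq =
    if (word == null) buf
    else WordFreq(buf.counts.updated(word, buf.counts.getOrElse(word, 0L) + 1L))

  // Add the counts of two partial maps key by key.
  def merge(a: WordFreq, b: WordFreq): WordFreq =
    WordFreq(b.counts.foldLeft(a.counts) { case (acc, (w, n)) =>
      acc.updated(w, acc.getOrElse(w, 0L) + n)
    })

  // Pick the highest count; ties go to the alphabetically first word.
  // An empty buffer yields an empty string (an assumption of this sketch).
  def finish(buf: WordFreq): String =
    if (buf.counts.isEmpty) ""
    else buf.counts.minBy { case (w, n) => (-n, w) }._1

  def bufferEncoder: Encoder[WordFreq] = Encoders.product[WordFreq]
  def outputEncoder: Encoder[String] = Encoders.STRING
}
```

The explicit tie-break keeps `finish` deterministic regardless of the order in which partial buffers arrive.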

#6 about 1 minute

Analyzing the performance gains of the custom UDAF

Applying the custom aggregator and inspecting its execution plan shows where the performance gain comes from: the job now needs only a single data shuffle instead of two.
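To make this concrete, here is a compact sketch that applies a typed aggregator through `groupByKey` and prints the physical plan. It assumes a hypothetical `Entry(author, word)` record and a "most frequent word per author" goal; `Counts`, `TopWord`, and `ApplyTopWord` are likewise illustrative names, not the talk's code.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Entry(author: String, word: String) // hypothetical input record
case class Counts(byWord: Map[String, Long])   // hypothetical buffer

object TopWord extends Aggregator[Entry, Counts, String] {
  def zero: Counts = Counts(Map.empty)
  def reduce(c: Counts, e: Entry): Counts =
    Counts(c.byWord.updated(e.word, c.byWord.getOrElse(e.word, 0L) + 1L))
  def merge(a: Counts, b: Counts): Counts =
    Counts(b.byWord.foldLeft(a.byWord) { case (m, (w, n)) =>
      m.updated(w, m.getOrElse(w, 0L) + n)
    })
  // Highest count wins; ties go to the alphabetically first word.
  def finish(c: Counts): String =
    if (c.byWord.isEmpty) "" else c.byWord.minBy { case (w, n) => (-n, w) }._1
  def bufferEncoder: Encoder[Counts] = Encoders.product[Counts]
  def outputEncoder: Encoder[String] = Encoders.STRING
}

object ApplyTopWord {
  // Group by author and run the aggregator once per group.
  def topWords(entries: Dataset[Entry]): Dataset[(String, String)] =
    entries.groupByKey(_.author)(Encoders.STRING).agg(TopWord.toColumn)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("top-word").getOrCreate()
    import spark.implicits._
    val entries = Seq(
      Entry("ann", "spark"), Entry("ann", "spark"), Entry("ann", "scala"),
      Entry("bob", "udaf")
    ).toDS()
    val result = topWords(entries)
    result.explain() // the physical plan should contain a single Exchange
    result.show()
    spark.stop()
  }
}
```

The exact operator names in the plan vary between Spark versions, so it is worth counting the `Exchange` nodes rather than matching the output literally.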

#7 about 2 minutes

Leveraging complex data structures in UDAFs

UDAFs can handle complex, structured data types like case classes for both intermediate buffers and final outputs, enabling sophisticated, multi-part aggregations.
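As an illustration, the sketch below uses one case class for the intermediate buffer and another for the output, so a single pass produces several related statistics about word lengths. `LengthBuffer`, `WordSummary`, and `WordLengthSummary` are hypothetical names, and the choice of statistics is an assumption made for this example.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical multi-field buffer and output types.
case class LengthBuffer(words: Long, totalLength: Long, minLength: Int, maxLength: Int)
case class WordSummary(words: Long, avgLength: Double, minLength: Int, maxLength: Int)

object WordLengthSummary extends Aggregator[String, LengthBuffer, WordSummary] {
  // Int.MaxValue acts as the neutral element for the running minimum.
  def zero: LengthBuffer = LengthBuffer(0L, 0L, Int.MaxValue, 0)

  // Update every field of the buffer from one word; null words are skipped.
  def reduce(b: LengthBuffer, word: String): LengthBuffer =
    if (word == null) b
    else {
      val n = word.length
      LengthBuffer(b.words + 1, b.totalLength + n,
        math.min(b.minLength, n), math.max(b.maxLength, n))
    }

  // Merge two partial buffers field by field.
  def merge(a: LengthBuffer, b: LengthBuffer): LengthBuffer =
    LengthBuffer(a.words + b.words, a.totalLength + b.totalLength,
      math.min(a.minLength, b.minLength), math.max(a.maxLength, b.maxLength))

  // Derive the structured result; an empty group yields all zeros.
  def finish(b: LengthBuffer): WordSummary =
    if (b.words == 0) WordSummary(0L, 0.0, 0, 0)
    else WordSummary(b.words, b.totalLength.toDouble / b.words, b.minLength, b.maxLength)

  def bufferEncoder: Encoder[LengthBuffer] = Encoders.product[LengthBuffer]
  def outputEncoder: Encoder[WordSummary] = Encoders.product[WordSummary]
}
```

When used in a query, the case-class output should surface as a struct column whose fields can be selected individually.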

#8 about 3 minutes

Key use cases for custom aggregation functions

Custom aggregators are ideal for complex business logic, performance optimization, code reusability across teams, and integration with the Spark SQL API.
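For the Spark SQL integration, one approach available since Spark 3.0 is to wrap an `Aggregator` with `functions.udaf` and register it by name. The sketch below assumes a hypothetical `LongestWord` aggregator, a SQL function name `longest_word`, and a `words` view with `author` and `word` columns; none of these come from the talk.

```scala
import org.apache.spark.sql.{DataFrame, Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Hypothetical aggregator: keeps the longest word seen in a group.
object LongestWord extends Aggregator[String, String, String] {
  // Longer word wins; equal lengths fall back to alphabetical order.
  private def longer(a: String, b: String): String =
    if (a.length != b.length) { if (a.length > b.length) a else b }
    else if (a.compareTo(b) <= 0) a else b

  def zero: String = ""
  def reduce(best: String, word: String): String =
    if (word == null) best else longer(best, word)
  def merge(a: String, b: String): String = longer(a, b)
  def finish(best: String): String = best
  def bufferEncoder: Encoder[String] = Encoders.STRING
  def outputEncoder: Encoder[String] = Encoders.STRING
}

object RegisterLongestWord {
  // Assumed input schema: (author: String, word: String).
  def longestPerAuthor(spark: SparkSession, words: DataFrame): DataFrame = {
    // Register once under a SQL-visible name so other teams can reuse it.
    spark.udf.register("longest_word", udaf(LongestWord))
    words.createOrReplaceTempView("words")
    spark.sql("SELECT author, longest_word(word) AS longest FROM words GROUP BY author")
  }
}
```

Once registered, the function can be called from SQL strings like any built-in aggregate, which is what makes a shared aggregator library practical across teams.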
