Lakehouse Performance Engineer

IBM

Austin, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Austin, United States of America

Tech stack

Java

Adobe InDesign

API

Artificial Intelligence

Analysis of Variance (ANOVA)

Code Review

Nvidia CUDA

Computer Programming

Databases

System Configuration

Continuous Integration

Data Files

Data Infrastructure

Data Stores

Linux

Distributed Systems

IBM Cloud Computing

Job Scheduling

Python

Linux Servers

Networking Basics

PCI Express

Ansible

Prometheus

Search Technologies

Shell Script

Software Engineering

Datadog

Pulumi

Data Processing

Grafana

Spark

Generative AI

Perf (Linux)

Kubernetes

Presto

Vertica

Terraform

Job description

IBM is building the next generation of watsonx.data: a GPU-accelerated, open data lakehouse engineered to deliver category-leading price-performance for analytics and AI workloads. We are hiring a hands-on Performance Engineer focused on measuring, defending, and improving the performance and cost-per-performance of the platform across every release.

You will run the benchmarks, build the harnesses, and operate the backing infrastructure that the entire watsonx.data organization relies on to characterize performance. That includes the dedicated benchmark labs, GPU and CPU test fleets, dataset stores, result warehouses, and the automation that ties them together. Engineering, product, field, and competitive intelligence will all consume what you produce: regression signals in CI, executive scorecards, customer-facing dashboards, and the data behind claims that we are the market-leading open lakehouse.

Benchmarking & Workload Engineering

Industry-standard benchmarks: Run, maintain, and continuously improve reproducible benchmarks across watsonx.data configurations and against competitive offerings.
Customer-representative workloads: Build and curate workload suites that reflect real customer query mixes, data volumes, concurrency profiles, and freshness requirements, not just synthetic benchmarks.
Reproducibility & rigor: Ensure every published result is reproducible end-to-end: controlled environments, pinned versions, locked datasets, documented methodology, variance analysis, and statistically defensible reporting.
Cost-per-performance metrics: Operationalize the canonical price-performance KPIs ($/query, $/TB scanned, $/training-token, queries/sec/$, TCO at workload mix); instrument workloads, collect data, and produce repeatable scorecards.
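
As a rough illustration of how the price-performance KPIs above might be derived from a single benchmark run, here is a minimal Python sketch. The `RunSummary` fields, the sample numbers, and the reading of queries/sec/$ as throughput divided by the run's total cost are all assumptions made for the example; none of them come from an actual watsonx.data harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical summary of one benchmark run (all fields are assumptions)."""
    queries: int          # number of queries completed
    bytes_scanned: int    # total bytes read by the engine
    wall_seconds: float   # elapsed wall-clock time of the run
    cost_usd: float       # infrastructure cost attributed to the run


def price_performance(run: RunSummary) -> dict[str, float]:
    """Derive per-run price-performance KPIs from raw counters."""
    tb_scanned = run.bytes_scanned / 1e12   # decimal terabytes
    qps = run.queries / run.wall_seconds    # throughput in queries per second
    return {
        "usd_per_query": run.cost_usd / run.queries,
        "usd_per_tb_scanned": run.cost_usd / tb_scanned,
        # One possible reading of queries/sec/$: throughput per dollar spent.
        "queries_per_sec_per_usd": qps / run.cost_usd,
    }


if __name__ == "__main__":
    # Made-up sample run, for illustration only.
    sample = RunSummary(queries=1000, bytes_scanned=5 * 10**12,
                        wall_seconds=500.0, cost_usd=20.0)
    for name, value in price_performance(sample).items():
        print(f"{name}: {value:.4f}")
```

A real scorecard would likely aggregate many runs and document how cost is attributed (for example, per-hour instance pricing versus amortized hardware), since that choice changes every KPI.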

Performance Observability & Analysis

Telemetry pipeline: Build and maintain the metrics, traces, profiles, GPU/CPU utilization, query plan, and IO telemetry that flow from benchmark runs into the performance data store.
Dashboards & scorecards: Develop dashboards that surface trends, regressions, and competitive position to engineering, leadership, and external audiences.
Regression gates: Operate performance regression gates in CI/CD; triage failures, file and drive issues with engine, storage, and GPU teams, and verify fixes.
Root-cause analysis: Drill into slow queries and GPU/CPU bottlenecks using profilers (Nsight, perf, async-profiler, pprof, flamegraphs) and query plan inspection to pinpoint regressions and improvement opportunities.
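
To make the idea of a performance regression gate concrete, here is a minimal Python sketch that compares repeated latency measurements from a candidate build against a baseline. The function name, the 5% relative tolerance, and the three-sigma noise allowance are illustrative assumptions, not a description of the gates IBM operates; a production gate would likely use a more rigorous statistical test.

```python
import statistics


def is_regression(baseline: list[float], candidate: list[float],
                  rel_tolerance: float = 0.05,
                  noise_sigmas: float = 3.0) -> bool:
    """Return True if the candidate's median latency exceeds the baseline's
    median by more than both a relative tolerance and the baseline's
    run-to-run noise (sample standard deviation times noise_sigmas)."""
    base_median = statistics.median(baseline)
    cand_median = statistics.median(candidate)
    # stdev needs at least two samples; treat a single run as noise-free.
    noise = statistics.stdev(baseline) if len(baseline) > 1 else 0.0
    allowance = max(rel_tolerance * base_median, noise_sigmas * noise)
    return cand_median > base_median + allowance


if __name__ == "__main__":
    # Made-up latencies in seconds, for illustration only.
    baseline_runs = [10.0, 10.2, 9.8, 10.1, 9.9]
    print(is_regression(baseline_runs, [10.1, 10.3, 10.2]))
    print(is_regression(baseline_runs, [11.5, 11.8, 11.6]))
```

Using the larger of the two allowances keeps the gate from failing on ordinary run-to-run variance while still catching slowdowns that clearly exceed it.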

Backing Infrastructure for Performance

Performance environment ownership: Own the lifecycle of the dedicated performance environment(s) supporting watsonx.data: GPU and CPU clusters, networking, storage, and the orchestration that schedules workloads onto them.
Test fleet automation: Build and maintain infrastructure-as-code (Terraform/Ansible/Helm) for provisioning, configuring, and resetting test environments deterministically across on-prem hardware and cloud (IBM Cloud and partner clouds).
Benchmark harness platform: Develop and operate the benchmark harness itself: job scheduler, run orchestration, dataset provisioning, result capture, artifact storage, and the API/CLI other teams use to launch runs.
Dataset & result warehouse: Own the curated datasets used for benchmarking and the warehouse of historical results that powers trend analysis, regression detection, and competitive comparisons.
Capacity & utilization: Manage capacity and utilization of the performance lab so concurrent campaigns from different teams (query engine, storage, GPU acceleration, AI) run cleanly and without interference.
Self-service for engineers: Provide engineers across watsonx.data with self-service paths to run standardized perf experiments against well-known baselines, lowering the cost of evidence-based engineering decisions.

Collaboration & Reporting

Pair with engineers on the query engine, storage, GPU acceleration, catalog, and AI/RAG paths to land performance improvements and verify their impact.
Produce data, charts, and write-ups that feed internal quarterly scorecards and external performance whitepapers, blog posts, and analyst briefings.
Participate in design reviews and code reviews where performance is at stake; flag risks early and propose measurable acceptance criteria.
Document workloads, harnesses, lab usage, and results so the next engineer, internal or external, can reproduce what you ran.

Requirements

8+ years of professional software engineering experience, with at least 2 years focused on performance engineering, benchmarking, or SRE for a data platform, database, or distributed system.
Strong programming skills in at least one of Python, Go, or Java, plus comfort with shell scripting and modern automation tooling.
Working knowledge of at least one modern analytics engine (Presto/Trino, Spark, DuckDB, ClickHouse, or comparable) and at least one open table format (Iceberg, Delta, or Hudi).
Hands-on experience with at least some of: Linux performance tooling (perf, ftrace, eBPF), profilers (Nsight, async-profiler, pprof), and query plan analysis.
Infrastructure-as-code fluency in at least one of Terraform, Ansible, Pulumi, or Helm; comfort writing and maintaining the automation, not just consuming it.

Preferred technical and professional experience

Hands-on experience with GPU-accelerated data processing (RAPIDS/cuDF, Velox/Theseus-class engines, CUDA) and the GPU memory hierarchy (HBM, NVLink, PCIe trade-offs).
Experience publishing or co-authoring peer-reviewed or industry-recognized performance results (TPC, MLPerf, ClickBench, LST-Bench, or similar).
Experience operating a multi-tenant performance lab or shared test fleet where multiple teams ran experiments concurrently.
Experience building bespoke benchmark harnesses or workload generators, including dataset generation at TB+ scale.
Familiarity with vector search, retrieval-augmented generation (RAG), and AI inference/training performance characterization.
Familiarity with FinOps and cloud unit economics: translating raw performance numbers into $/performance and TCO conclusions.
Contributions to relevant open-source projects (Iceberg, Trino, Spark, Arrow, Velox, RAPIDS, OpenTelemetry, perf-tooling, etc.).
Hands-on experience designing and running performance experiments: controlling for variance, isolating variables, and producing clear, defensible results.
Experience operating real infrastructure: Linux servers, Kubernetes, container runtimes, networking basics, and object storage.
Comfort with observability tooling: metrics (Prometheus), tracing/telemetry (OpenTelemetry), and dashboards (Grafana or equivalent).

About the company

At IBM Software, we transform client challenges into solutions, building the world's leading AI-powered, cloud-native products that shape the future of business and society. Our legacy of innovation creates endless opportunities for IBMers to learn, grow, and make an impact on a global scale. Working in Software means joining a team fueled by curiosity and collaboration. You'll work with diverse technologies, partners, and industries to design, develop, and deliver solutions that power digital transformation. With a culture that values innovation, growth, and continuous learning, IBM Software places you at the heart of IBM's product and technology landscape. Here, you'll have the tools and opportunities to advance your career while creating software that changes the world.
