Big Data Architect 2

SLAC National Accelerator Laboratory

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$221K

Job location

Tech stack

API

Artificial Intelligence

Airflow

Computing Platforms

Big Data

Cloud Computing

Cloud Engineering

Continuous Integration

Data Infrastructure

Distributed Systems

Ethernet

Experimental Data

InfiniBand

Job Scheduling

Python

Lightweight Directory Access Protocol (LDAP)

NetCDF

OAuth

Open Source Technology

OpenID

Performance Tuning

Role-Based Access Control

Remote Direct Memory Access

Prometheus

JSON Web Token

Security Assertion Markup Language (SAML)

Scientific Computing

Software Deployment

Data Streaming

Workflow Management Systems

AI Infrastructure

Reinforcement Learning

Data Logging

Model-Driven Development

Cloud Platform System

Data Ingestion

Istio

Large Language Models

Grafana

Multi-Agent Systems

Spark

Hybrid Cloud

AI Platforms

Kubernetes

Information Technology

Dask

Codebase

Slurm

Machine Learning Operations

Stream Processing

Data Pipelines

Dynatrace

Job description

Platform Architecture & Engineering

Design, build, and operate highly available Kubernetes-based platforms optimized for scientific and agentic workloads
Architect scalable solutions for high-throughput data pipelines, real-time streaming, and batch scientific computing
Design and implement platform primitives for agentic workflow orchestration, enabling autonomous, multi-step AI-driven pipelines that support experimental science
Develop cloud-native architectures supporting on-premises, hybrid cloud, and multi-cluster deployments
Build and maintain Infrastructure-as-Code using tools such as Helm, Kustomize, and GitOps workflows
Evaluate and introduce new technologies and patterns that advance the platform's capabilities for the scientific community

Agentic & AI Workflow Enablement

Lead platform design for agentic scientific workflows: systems where AI agents autonomously orchestrate data acquisition, analysis, and experimental feedback loops
Collaborate with researchers and data scientists to define platform requirements for running large language model-driven and reinforcement learning agents at scale
Implement infrastructure patterns for agent orchestration frameworks (e.g., multi-agent pipelines, tool-use APIs, memory and state management) within Kubernetes
Ensure the platform supports the latency, throughput, and accelerator requirements of agentic workloads
Build guardrails, observability, and governance tooling suited to autonomous scientific agents operating on sensitive experimental data

Scientific Project Support

Partner with scientists and researchers at SLAC and across DOE labs and universities to design and implement solutions for major scientific programs, including:

Vera C. Rubin Observatory / LSST: Petabyte-scale nightly sky surveys requiring real-time alert pipelines and long-running batch analysis for dark matter and dark energy research
LCLS (Linac Coherent Light Source): Real-time analysis infrastructure for the world's brightest X-ray laser, capturing femtosecond-scale dynamics of matter
Cryo-EM: High-throughput 3D reconstruction pipelines for structural biology at near-atomic resolution
Accelerator Operations: Monitoring, control, and data acquisition infrastructure for particle accelerators
American Science Cloud: National-scale scientific data infrastructure to democratize access to computing resources across National Laboratories
Emerging Initiatives: Co-design of infrastructure for next-generation scientific computing programs not yet fully defined

Support the full project lifecycle from initial technical consultation through production deployment and ongoing operations

User Collaboration & Support

Act as a senior technical partner to the scientific user community, translating complex experimental requirements into scalable platform solutions
Lead requirements-gathering sessions and technical consultations with research groups
Provide hands-on guidance and training to help users adopt platform capabilities effectively
Gather user feedback and advocate for user needs in platform planning and roadmap prioritization
Develop documentation, runbooks, and reference architectures to empower scientific teams

Reliability & Operations

Define and maintain SLOs/SLAs for platform services supporting scientific workflows
Implement comprehensive monitoring, logging, and observability (Prometheus, Grafana, OpenTelemetry, Loki)
Design and implement CI/CD pipelines for scientific software, data processing workflows, and platform components
Lead incident response and post-mortem processes; participate in on-call rotation
Drive capacity planning and performance tuning for compute-intensive and data-intensive workloads
Optimize GPU and accelerator resource scheduling and utilization across the platform

Collaboration & Community

Build and maintain strong relationships with scientific user communities across SLAC, Stanford, and the broader research ecosystem
Collaborate with counterparts at DOE National Laboratories (LBNL, Fermilab, Argonne, etc.) to share architectures and best practices
Lead technical workshops, training sessions, and working groups to advance cloud-native and agentic workflow adoption
Contribute to and represent SLAC in open-source communities relevant to scientific computing and Kubernetes
Mentor junior team members and support a culture of technical excellence within AUS

Fuel Discovery: Your infrastructure decisions directly shape the success of experiments that advance human understanding of the universe
Work at the Frontier of Agentic Science: Help define what it means to run autonomous AI-driven scientific pipelines at scale; this is genuinely new territory
Collaborate with Exceptional People: Work alongside world-renowned scientists, engineers, and computing professionals solving problems no one has solved before
Shape the Field: Influence scientific computing platform strategy at a U.S. National Laboratory and contribute to standards adopted across the DOE complex

Professional Development

Access to cutting-edge technology, research facilities, and DOE computing infrastructure
Support for attending conferences (KubeCon, SC, ISC, and domain-specific scientific conferences) and pursuing certifications
Collaborative environment with experts across computing, physics, photon science, and cosmology
Opportunities to publish and present in both computing and scientific venues

Work Environment

Hybrid work arrangements available
State-of-the-art facilities on the Stanford University campus in the San Francisco Bay Area

The Application and User Services (AUS) group is the bridge between SLAC's scientific ambitions and the computing platforms that make them real. We don't just provide support; we are deeply embedded partners in science. Our team tackles some of the most demanding computing challenges on the planet: millisecond-latency requirements for X-ray laser experiments, petabyte-scale nightly data ingestion from a sky survey, and, increasingly, the orchestration of intelligent agents that can autonomously steer complex experimental workflows. We value deep technical craft, genuine collaboration with our scientific users, and the kind of creative thinking that only emerges when computing meets frontier science.

SLAC Employee Competencies

Effective Decisions: Uses job knowledge and sound judgment to make quality decisions in a timely manner
Self-Development: Pursues a variety of venues and opportunities to continue learning and growing
Dependability: Can be counted on to deliver results with a sense of personal responsibility for expected outcomes
Initiative: Pursues work proactively with optimism, positive energy, and motivation to move things forward
Adaptability: Flexes as needed when change occurs; maintains an open outlook while adjusting to new circumstances
Communication: Ensures effective information flow to diverse audiences; creates and delivers clear, appropriate written and spoken messages
Relationships: Builds relationships to foster trust, collaboration, and a positive climate in pursuit of common goals

Physical Requirements and Working Conditions

Consistent with its obligations under the law, the University will provide reasonable accommodation to any employee with a disability who requires accommodation to perform the essential functions of the job. May work extended hours during peak business cycles.
Given the nature of this position, SLAC is open to on-site and hybrid work options.

Work Standards

Interpersonal Skills: Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations
Promote Culture of Safety: Demonstrates commitment to personal responsibility and value for environment, safety, and security; communicates related concerns; uses and promotes safe behaviors based on training and lessons learned. Meets the applicable roles and responsibilities as described in the ESH Manual, Chapter 1 General Policy and Responsibilities:
Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in the University's Administrative Guide:
As a national laboratory, SLAC National Accelerator Laboratory is responsible for adhering to the Homeland Security Presidential Directive 12 (HSPD-12) and Department of Energy (DOE) Order 473.1A, which require employees to obtain and maintain an HSPD-12 Personal Identity Verification (PIV) Credential. To obtain this credential, employees must successfully complete the applicable tier of federal background investigation post hire and receive a favorable federal adjudication. The tier of federal background investigation will be determined by job duties and national security or public trust responsibilities associated with the job. All tiers of investigation include a declaration of illegal drug activities, including use, supply, possession, or manufacture within the last 1 to 7 years (depending on the applicable tier of investigation). Illegal drug activities include marijuana and cannabis derivatives, which are still considered illegal under federal law, regardless of state laws.

Requirements

Minimum 8 years of software or infrastructure engineering experience with demonstrated expertise in distributed systems
Minimum 4 years of hands-on experience designing, deploying, and operating Kubernetes in production environments
Strong proficiency in Python and/or Go; comfort reading and contributing to multi-language codebases
Deep experience with container orchestration, networking, storage, and security in Kubernetes environments
Hands-on experience with Infrastructure-as-Code and GitOps tooling
Demonstrated ability to design and operate high-throughput or real-time data processing pipelines at scale
Experience with CI/CD pipeline design and implementation
Solid understanding of observability practices and tooling (Prometheus, Grafana, distributed tracing)

Agentic / AI Platform Experience

Familiarity with AI/ML infrastructure: GPU scheduling, model serving, workflow orchestration for ML pipelines
Awareness of agentic workflow frameworks and patterns (e.g., multi-agent orchestration, LLM tool use, agent state management)
Understanding of the infrastructure requirements that distinguish agentic workloads from traditional batch or streaming pipelines

Soft Skills

Exceptional communication skills with the ability to engage credibly with both scientists and engineers
Demonstrated ability to build trust and translate ambiguous research requirements into concrete technical designs
Strong customer-service orientation; patience and empathy when supporting users with diverse technical backgrounds
Self-directed and able to manage multiple high-priority projects in a fast-moving research environment
Collaborative and collegial team member who actively lifts others

Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience

Master's degree or Ph.D. in Computer Science, Engineering, Physics, or a related field
Experience building or operating platforms specifically for agentic AI workflows (multi-agent systems, LLM pipelines, reinforcement learning environments)
Background in scientific research or experience embedded in a scientific computing or HPC environment
Experience with large-scale data processing frameworks: Dask, Ray, Spark, MPI
Knowledge of scientific data formats and ecosystems: HDF5, FITS, NetCDF
Experience integrating HPC job schedulers (Slurm, HTCondor) with Kubernetes-native workloads
Deep understanding of Kubernetes CNI and CSI
Knowledge of network fabrics relevant to HPC/science (InfiniBand, RDMA, high-bandwidth Ethernet)
Contributions to open-source projects in cloud-native computing, scientific computing, or AI infrastructure
Familiarity with workflow orchestration tools: Argo Workflows, Prefect, Dagster, Airflow, Kubeflow Pipelines
Experience operating large-scale federated or multi-cluster Kubernetes environments
Experience with authentication and authorization systems and patterns as applied to cloud-native platforms: OAuth2, OIDC, JWT, RBAC, and related identity federation technologies (LDAP, SAML2, COmanage, Grouper)
Hands-on experience with Kubernetes-native ingress and auth integration: Traefik ForwardAuth, OAuth2 Proxy, Kubernetes Gateway API, and/or service mesh auth policies (Istio, Envoy)

Benefits & conditions

Competitive salary commensurate with experience
Comprehensive health, dental, and vision insurance
Retirement plans with employer contributions
Generous vacation and paid time off
Professional development and conference funding
Tuition reimbursement programs
On-site amenities and wellness programs

About the company

SLAC National Accelerator Laboratory is a U.S. Department of Energy laboratory operated by Stanford University. For over 60 years, SLAC has been at the forefront of scientific discovery, exploring how the universe works at the biggest, smallest, and fastest scales. From particle physics to astrophysics, materials science to biology, SLAC's world-class research facilities and scientific expertise drive innovation and push the boundaries of human knowledge.

Do you want your Kubernetes clusters to do more than serve web traffic? At SLAC, our infrastructure powers the discovery of new materials, the mapping of the universe, and the understanding of fundamental physics. The Application and User Services (AUS) group within the Scientific Computing Services Division manages the platforms that underpin science at SLAC. We build and operate the systems that let researchers focus on discovery rather than infrastructure.

We are now seeking a Senior Kubernetes Engineer to help design and implement a scalable, next-generation platform purpose-built for scientific and agentic workflows. This role is not just about managing pods and nodes; it is about building the computational engines that allow scientists to peer into atomic structure, catalog billions of galaxies, and, increasingly, deploy intelligent, autonomous agents that drive the next generation of experimental science. You will stand at the intersection of cloud-native engineering and Nobel Prize-caliber research, collaborating within SLAC and across the broader Department of Energy (DOE) complex, Stanford University, and partner institutions worldwide.

Scientific experiments like the Vera C. Rubin Observatory and LCLS generate data at rates that challenge the limits of modern infrastructure. AI-driven agentic workflows, pipelines where autonomous agents orchestrate complex, multi-step scientific analyses, are rapidly becoming a core part of how experiments are designed, run, and interpreted. You will help us build and maintain the platform that makes all of this possible.
