Big Data Architect 2
SLAC National Accelerator Laboratory
yesterday
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Compensation: $221K
Job location
Tech stack
API
Artificial Intelligence
Airflow
Computing Platforms
Big Data
Cloud Computing
Cloud Engineering
Continuous Integration
Data Infrastructure
Distributed Systems
Ethernet
Experimental Data
InfiniBand
Job Scheduling
Python
Lightweight Directory Access Protocols (LDAP)
NetCDF
OAuth
Open Source Technology
OpenID
Performance Tuning
Role-Based Access Control
Remote Direct Memory Access
Prometheus
JSON Web Token
Security Assertion Markup Language (SAML)
Scientific Computing
Software Deployment
Data Streaming
Workflow Management Systems
AI Infrastructure
Reinforcement Learning
Data Logging
Model-Driven Development
Cloud Platform System
Data Ingestion
Istio
Large Language Models
Grafana
Multi-Agent Systems
Spark
Hybrid Cloud
AI Platforms
Kubernetes
Information Technology
Dask
Codebase
Slurm
Machine Learning Operations
Stream Processing
Data Pipelines
Dynatrace
Job description
Platform Architecture & Engineering
- Design, build, and operate highly available Kubernetes-based platforms optimized for scientific and agentic workloads
- Architect scalable solutions for high-throughput data pipelines, real-time streaming, and batch scientific computing
- Design and implement platform primitives for agentic workflow orchestration, enabling autonomous, multi-step AI-driven pipelines that support experimental science
- Develop cloud-native architectures supporting on-premises, hybrid cloud, and multi-cluster deployments
- Build and maintain Infrastructure-as-Code using tools such as Helm, Kustomize, and GitOps workflows
- Evaluate and introduce new technologies and patterns that advance the platform's capabilities for the scientific community
Agentic & AI Workflow Enablement
- Lead platform design for agentic scientific workflows: systems where AI agents autonomously orchestrate data acquisition, analysis, and experimental feedback loops
- Collaborate with researchers and data scientists to define platform requirements for running large language model-driven and reinforcement learning agents at scale
- Implement infrastructure patterns for agent orchestration frameworks (e.g., multi-agent pipelines, tool-use APIs, memory and state management) within Kubernetes
- Ensure the platform supports the latency, throughput, and accelerator requirements of agentic workloads
- Build guardrails, observability, and governance tooling suited to autonomous scientific agents operating on sensitive experimental data
Scientific Project Support
- Partner with scientists and researchers at SLAC and across DOE labs and universities to design and implement solutions for major scientific programs, including:
  - Vera C. Rubin Observatory / LSST: Petabyte-scale nightly sky surveys requiring real-time alert pipelines and long-running batch analysis for dark matter and dark energy research
  - LCLS (Linac Coherent Light Source): Real-time analysis infrastructure for the world's brightest X-ray laser, capturing femtosecond-scale dynamics of matter
  - Cryo-EM: High-throughput 3D reconstruction pipelines for structural biology at near-atomic resolution
  - Accelerator Operations: Monitoring, control, and data acquisition infrastructure for particle accelerators
  - American Science Cloud: National-scale scientific data infrastructure to democratize access to computing resources across National Laboratories
  - Emerging Initiatives: Co-design of infrastructure for next-generation scientific computing programs not yet fully defined
- Support the full project lifecycle from initial technical consultation through production deployment and ongoing operations
User Collaboration & Support
- Act as a senior technical partner to the scientific user community, translating complex experimental requirements into scalable platform solutions
- Lead requirements-gathering sessions and technical consultations with research groups
- Provide hands-on guidance and training to help users adopt platform capabilities effectively
- Gather user feedback and advocate for user needs in platform planning and roadmap prioritization
- Develop documentation, runbooks, and reference architectures to empower scientific teams
Reliability & Operations
- Define and maintain SLOs/SLAs for platform services supporting scientific workflows
- Implement comprehensive monitoring, logging, and observability (Prometheus, Grafana, OpenTelemetry, Loki)
- Design and implement CI/CD pipelines for scientific software, data processing workflows, and platform components
- Lead incident response and post-mortem processes; participate in on-call rotation
- Drive capacity planning and performance tuning for compute-intensive and data-intensive workloads
- Optimize GPU and accelerator resource scheduling and utilization across the platform
Collaboration & Community
- Build and maintain strong relationships with scientific user communities across SLAC, Stanford, and the broader research ecosystem
- Collaborate with counterparts at DOE National Laboratories (LBNL, Fermilab, Argonne, etc.) to share architectures and best practices
- Lead technical workshops, training sessions, and working groups to advance cloud-native and agentic workflow adoption
- Contribute to and represent SLAC in open-source communities relevant to scientific computing and Kubernetes
- Mentor junior team members and support a culture of technical excellence within AUS
- Fuel Discovery: Your infrastructure decisions directly shape the success of experiments that advance human understanding of the universe
- Work at the Frontier of Agentic Science: Help define what it means to run autonomous AI-driven scientific pipelines at scale; this is genuinely new territory
- Collaborate with Exceptional People: Work alongside world-renowned scientists, engineers, and computing professionals solving problems no one has solved before
- Shape the Field: Influence scientific computing platform strategy at a U.S. National Laboratory and contribute to standards adopted across the DOE complex
Professional Development
- Access to cutting-edge technology, research facilities, and DOE computing infrastructure
- Support for attending conferences (KubeCon, SC, ISC, and domain-specific scientific conferences) and pursuing certifications
- Collaborative environment with experts across computing, physics, photon science, and cosmology
- Opportunities to publish and present in both computing and scientific venues
Work Environment
- Hybrid work arrangements available
- State-of-the-art facilities on the Stanford University campus in the San Francisco Bay Area
The Application and User Services (AUS) group is the bridge between SLAC's scientific ambitions and the computing platforms that make them real. We don't just provide support; we are deeply embedded partners in science. Our team tackles some of the most demanding computing challenges on the planet: millisecond-latency requirements for X-ray laser experiments, petabyte-scale nightly data ingestion from a sky survey, and, increasingly, the orchestration of intelligent agents that can autonomously steer complex experimental workflows. We value deep technical craft, genuine collaboration with our scientific users, and the kind of creative thinking that only emerges when computing meets frontier science.
SLAC Employee Competencies
- Effective Decisions: Uses job knowledge and sound judgment to make quality decisions in a timely manner
- Self-Development: Pursues a variety of venues and opportunities to continue learning and growing
- Dependability: Can be counted on to deliver results with a sense of personal responsibility for expected outcomes
- Initiative: Pursues work proactively with optimism, positive energy, and motivation to move things forward
- Adaptability: Flexes as needed when change occurs; maintains an open outlook while adjusting to new circumstances
- Communication: Ensures effective information flow to diverse audiences; creates and delivers clear, appropriate written and spoken messages
- Relationships: Builds relationships to foster trust, collaboration, and a positive climate in pursuit of common goals
Physical Requirements and Working Conditions
- Consistent with its obligations under the law, the University will provide reasonable accommodation to any employee with a disability who requires accommodation to perform the essential functions of the job.
- May work extended hours during peak business cycles.
- Given the nature of this position, SLAC is open to on-site and hybrid work options.
Work Standards
- Interpersonal Skills: Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations
- Promote Culture of Safety: Demonstrates commitment to personal responsibility and value for environment, safety, and security; communicates related concerns; uses and promotes safe behaviors based on training and lessons learned. Meets the applicable roles and responsibilities as described in the ESH Manual, Chapter 1 General Policy and Responsibilities:
- Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in the University's Administrative Guide:
- As a national laboratory, SLAC National Accelerator Laboratory is responsible for adhering to the Homeland Security Presidential Directive 12 (HSPD-12) and Department of Energy (DOE) Order 473.1A, which require employees to obtain and maintain an HSPD-12 Personal Identity Verification (PIV) Credential. To obtain this credential, employees must successfully complete the applicable tier of federal background investigation post hire and receive a favorable federal adjudication. The tier of federal background investigation will be determined by job duties and national security or public trust responsibilities associated with the job. All tiers of investigation include a declaration of illegal drug activities, including use, supply, possession, or manufacture within the last 1 to 7 years (depending on the applicable tier of investigation). Illegal drug activities include marijuana and cannabis derivatives, which are still considered illegal under federal law, regardless of state laws.
Requirements
- Minimum 8 years of software or infrastructure engineering experience with demonstrated expertise in distributed systems
- Minimum 4 years of hands-on experience designing, deploying, and operating Kubernetes in production environments
- Strong proficiency in Python and/or Go; comfort reading and contributing to multi-language codebases
- Deep experience with container orchestration, networking, storage, and security in Kubernetes environments
- Hands-on experience with Infrastructure-as-Code and GitOps tooling
- Demonstrated ability to design and operate high-throughput or real-time data processing pipelines at scale
- Experience with CI/CD pipeline design and implementation
- Solid understanding of observability practices and tooling (Prometheus, Grafana, distributed tracing)
Agentic / AI Platform Experience
- Familiarity with AI/ML infrastructure: GPU scheduling, model serving, workflow orchestration for ML pipelines
- Awareness of agentic workflow frameworks and patterns (e.g., multi-agent orchestration, LLM tool use, agent state management)
- Understanding of the infrastructure requirements that distinguish agentic workloads from traditional batch or streaming pipelines
Soft Skills
- Exceptional communication skills with the ability to engage credibly with both scientists and engineers
- Demonstrated ability to build trust and translate ambiguous research requirements into concrete technical designs
- Strong customer-service orientation; patience and empathy when supporting users with diverse technical backgrounds
- Self-directed and able to manage multiple high-priority projects in a fast-moving research environment
- Collaborative and collegial team member who actively lifts others
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience
- Master's degree or Ph.D. in Computer Science, Engineering, Physics, or a related field
- Experience building or operating platforms specifically for agentic AI workflows (multi-agent systems, LLM pipelines, reinforcement learning environments)
- Background in scientific research or experience embedded in a scientific computing or HPC environment
- Experience with large-scale data processing frameworks: Dask, Ray, Spark, MPI
- Knowledge of scientific data formats and ecosystems: HDF5, FITS, NetCDF
- Experience integrating HPC job schedulers (Slurm, HTCondor) with Kubernetes-native workloads
- Deep understanding of Kubernetes CNI and CSI
- Knowledge of network fabrics relevant to HPC/science (InfiniBand, RDMA, high-bandwidth Ethernet)
- Contributions to open-source projects in cloud-native computing, scientific computing, or AI infrastructure
- Familiarity with workflow orchestration tools: Argo Workflows, Prefect, Dagster, Airflow, Kubeflow Pipelines
- Experience operating large-scale federated or multi-cluster Kubernetes environments
- Experience with authentication and authorization systems and patterns as applied to cloud-native platforms: OAuth2, OIDC, JWT, RBAC, and related identity federation technologies (LDAP, SAML2, COManage, Grouper)
- Hands-on experience with Kubernetes-native ingress and auth integration: Traefik ForwardAuth, OAuth2 Proxy, Kubernetes Gateway API, and/or service mesh auth policies (Istio, Envoy)
Benefits & conditions
- Competitive salary commensurate with experience
- Comprehensive health, dental, and vision insurance
- Retirement plans with employer contributions
- Generous vacation and paid time off
- Professional development and conference funding
- Tuition reimbursement programs
- On-site amenities and wellness programs
About the company
SLAC National Accelerator Laboratory is a U.S. Department of Energy laboratory operated by Stanford University. For over 60 years, SLAC has been at the forefront of scientific discovery, exploring how the universe works at the biggest, smallest, and fastest scales. From particle physics to astrophysics, materials science to biology, SLAC's world-class research facilities and scientific expertise drive innovation and push the boundaries of human knowledge.
Do you want your Kubernetes clusters to do more than serve web traffic? At SLAC, our infrastructure powers the discovery of new materials, the mapping of the universe, and the understanding of fundamental physics.
The Application and User Services (AUS) group within the Scientific Computing Services Division manages the platforms that underpin science at SLAC. We build and operate the systems that let researchers focus on discovery rather than infrastructure. We are now seeking a Senior Kubernetes Engineer to help design and implement a scalable, next-generation platform purpose-built for scientific and agentic workflows.
This role is not just about managing pods and nodes; it is about building the computational engines that allow scientists to peer into atomic structure, catalog billions of galaxies, and, increasingly, deploy intelligent, autonomous agents that drive the next generation of experimental science. You will stand at the intersection of cloud-native engineering and Nobel Prize-caliber research, collaborating within SLAC and across the broader Department of Energy (DOE) complex, Stanford University, and partner institutions worldwide.
Scientific experiments like the Vera C. Rubin Observatory and LCLS generate data at rates that challenge the limits of modern infrastructure. AI-driven agentic workflows (pipelines where autonomous agents orchestrate complex, multi-step scientific analyses) are rapidly becoming a core part of how experiments are designed, run, and interpreted. You will help us build and maintain the platform that makes all of this possible.