Site Reliability Engineer

Lawrence Berkeley National Laboratory

Berkeley, United States of America

16 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Shift work

Languages

English

Experience level

Senior

Compensation

$ 161K

Job location

Remote

Berkeley, United States of America

Tech stack

Java

API

Big Data

C++

Command-Line Interface

Configuration Management Databases

System Configuration

Data Centers

Data Center Infrastructure Management (DCIM)

Perl

Issue Tracking Systems

Python

Network Security

Shell

Network Protocols

Reliability Engineering

Prometheus

Software Engineering

Diagnostic Tools

Scripting (Bash/Python/Go/Ruby)

Reliability of Systems

Firewalls (Computer Science)

Kubernetes

Information Technology

Virtual Agents

ServiceNow

Programming Languages

Job description

Work a 5-day schedule with 2-3 onsite operations shifts and 2-3 project days, rotating across day, swing, and overnight shifts as needed to monitor the NERSC HPC facility.
Monitor and respond to system, storage, network, and facility alerts, escalating issues when necessary.
Improve reliability through automation, process optimization, monitoring enhancements, and root-cause prevention.
Develop and maintain monitoring, alerting, and diagnostic tools, including integrations with HPC system APIs and ServiceNow.
Support 24/7 data collection and real-time diagnostics across critical infrastructure.
Contribute to Agentic AI solutions that automate workflows and improve operational efficiency.
Coordinate with NERSC teams on maintenance, workflows, and incident management.
Perform physical and logical data center inspections to ensure environmental and infrastructure health.
Maintain accurate incident and maintenance records in the ticketing system.
Analyze and resolve complex operational issues using sound technical judgment and collaboration with internal and external experts.

Appointment type: This is a full-time, career appointment, exempt (monthly paid) from overtime pay.
Salary range: The expected salary for this position is $131,760 - $161,064, which falls within the full salary range of $117,132 - $197,676, depending upon the candidate's skills, knowledge, and abilities, including education, certifications, and years of experience.
Background check: This position is subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
Work modality: This position requires substantial on-site presence, but is eligible for a flexible work mode, and hybrid schedules may be considered. Hybrid work is a combination of performing work on-site at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA and some telework. Individuals working a hybrid schedule must reside within 150 miles of Berkeley Lab. Work schedules are dependent on business needs. In rare cases, full-time telework or remote work modes may be considered.

Requirements

Typically requires a minimum of 5 years of related experience with a Bachelor's degree; or 3 years and a Master's degree; or equivalent work experience.
Experience in or willingness to work within a 24/7 onsite team environment to support large-scale data centers or critical installations.
Experience with the Linux shell and working in a command-line environment (e.g., over SSH).
Experience developing tools in programming languages such as C, C++, Perl, Java, or Python, or in a scripting language, with knowledge of standard software development practices.
Motivated self-starter who can learn technologies that improve data center management in areas like Kubernetes, Prometheus/VictoriaMetrics, Alertmanager, building management software, evaporative cooling, and power utilization.
Experience with network security: configuring and maintaining ACLs, and knowledge of firewalls.
Experience collaborating across technical teams to resolve operational bottlenecks and ensure system reliability and alignment with service-level objectives.
Knowledge of and ability to work on large data communications networks, network protocols, and IT infrastructure supporting highly available systems and applications.

Desired skills/knowledge:

Experience with ServiceNow implementation is a plus, particularly in architecting or deploying solutions for Incident Management, Change Management, or CMDB to improve IT workflows.
Practical experience in developing and deploying Agentic AI or autonomous automation tools to streamline technical tasks.
Familiarity with ITSM best practices and an understanding of how to align service lifecycles with business goals is preferred.
A certification in system administration (platforms or software), or other advanced education in computer science.
ServiceNow certifications.
ITIL certifications.

Benefits & conditions

We invest in our employees by offering a total rewards package you can count on:

Exceptional health and retirement benefits, including pension or 401(k)-style plans
Opportunities to grow in your career - check out our Tuition Assistance Program
A culture where you'll belong - we are invested in our teams!
In addition to accruing vacation and sick time, we also have a Winter Holiday Shutdown every year.
Parental bonding leave (for both mothers and fathers)
Pet insurance
