Staff Site Reliability Engineer (Production Engineer

Zscaler, Inc.

Denver, United States of America

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$ 170K

Job location

Remote

Denver, United States of America

Tech stack

Artificial Intelligence

Application Release Automation

Bash

BSD Operating System

Code Review

Continuous Integration

Software Debugging

Linux

File Systems

Distributed Systems

Internet Control Message Protocol

Subnetting

OSI Models

Python

Machine Learning

Packet Analyzer

Network Protocols

Performance Tuning

Cloud Services

TensorFlow

Prometheus

Software Engineering

TCP/IP

Wide Area Networks

Scripting (Bash/Python/Go/Ruby)

Load Balancing

Perf (Linux)

Data Lake

Kubernetes

Production Code

Bare Metal

Job description

We are looking for a Staff Site Reliability Engineer (Production Engineer) to join our team. This is a hybrid role (onsite three days a week in San Jose, CA or another Zscaler office; remote can be considered for exceptional candidates) reporting to the Senior Manager, Site Reliability Engineering in the Zero Trust Exchange department. As a key member of the Zero Trust Exchange team, you will be responsible for all aspects of the Zscaler production data center services, including servers, operating systems, storage, and supporting systems. You will be an instrumental part of the Site Reliability Engineering team, ensuring the availability, latency, performance, efficiency, and scalability of a cloud that processes tens of billions of transactions daily., * Own the reliability of a large-scale cloud service (Linux/BSD, bare metal, Kubernetes, custom load balancing, SD-WAN) by partnering with Engineering and Network teams to define requirements early, conduct operability reviews, and contribute code/design docs for platform resilience

Develop and operate end-to-end observability (metrics/logs/traces, dashboards, alerting) and incident tooling to manage SLOs/error budgets, reduce noise, and improve system detection and diagnosis
Participate in an on-call rotation to lead full-cycle incident response; perform deep cross-stack troubleshooting (OS, networking, distributed systems, packet captures, core dumps) to drive permanent software fixes and codify learnings into runbooks and tests
Build and maintain everything-as-code for fleet and service lifecycle, driving provisioning, configuration, release automation, canary deployments, and complex rollout/rollback workflows
Continuously improve platform hygiene through consistent OS/app upgrades, dependency/vulnerability patching, capacity and performance tuning, and strict CI/CD validation prior to production rollouts

Requirements

You thrive in ambiguity. You're comfortable building the path as you walk it. You thrive in a dynamic environment, seeing ambiguity not as a hindrance, but as the raw material to build something meaningful.
You act like an owner. Your passion for the mission fuels your bias for action. You operate with integrity because you genuinely care about the outcome. True ownership involves leveraging dynamic range: the ability to navigate seamlessly between high-level strategy and hands-on execution.
You are a problem-solver. You love running towards the challenges because you are laser-focused on finding the solution, knowing that solving the hard problems delivers the biggest impact.
You are a high-trust collaborator. You are ambitious for the team, not just yourself. You embrace our challenge culture by giving and receiving ongoing feedback-knowing that candor delivered with clarity and respect is the truest form of teamwork and the fastest way to earn trust.
You are a learner. You have a true growth mindset and are obsessed with your own development, actively seeking feedback to become a better partner and a stronger teammate. You love what you do and you do it with purpose., * Foundational understanding of AI/ML technologies and experience leveraging, securing, or positioning AI-driven solutions to optimize outcomes within your functional domain
US Citizenship is required (due to the nature of assigned customers)
5+ years industry experience in software engineering, infrastructure software, and/or platform engineering
Proficiency in at least one programming language (such as Python, Bash, or Go) with demonstrated ability to write production-quality code (testing, code reviews, CI, maintainable design, scripting for diagnostics)
Strong Linux/Unix systems fundamentals (process/memory, filesystems, networking stack basics, debugging/perf troubleshooting) and solid understanding of networking protocols and components (e.g., HTTP, DNS, TCP/IP, ICMP, OSI model, subnetting, and load balancing/traffic concepts)
Proven experience operating production services (including incident response, troubleshooting, reducing toil) and managing BSD in production to drive systemic fixes through platform engineering, * Experience leveraging AI/ML frameworks or AIOps tools to build predictive anomaly detection, automate root-cause analysis, and optimize large-scale infrastructure reliability
Proven expertise in operating Kubernetes at scale
Deep experience with the Prometheus/OpenTelemetry ecosystems, including instrumenting golden signals, defining SLOs, and performing alert tuning to ensure high-availability environments

Benefits & conditions

The base salary range listed for this full-time position excludes commission/ bonus/ equity (if applicable) + benefits. Base Pay Range $119,000-$170,000 USD

At Zscaler, we are committed to building a team that reflects the communities we serve and the customers we work with. We foster an inclusive environment that values all backgrounds and perspectives, emphasizing collaboration and belonging. Join us in our mission to make doing business seamless and secure.

Our Benefits program is one of the most important ways we support our employees. Zscaler proudly offers comprehensive and inclusive benefits to meet the diverse needs of our employees and their families throughout their life stages, including:

Various health plans
Time off plans for vacation and sick time
Parental leave options
Retirement options
Education reimbursement
In-office perks, and more!

Learn more about Zscaler's hybrid working model and benefits here.

About the company

As a Staff Site Reliability Engineer, you'll oversee Zscaler production data center services, optimize code, and ensure cloud service availability and performance. Collaborate with cross-functional teams to improve processes and resolve escalated issues. The summary above was generated by AI About Zscaler Zscaler accelerates digital transformation to ensure our customers can be more agile, efficient, resilient, and secure. As an AI-forward enterprise, we are constantly pushing the envelope, leveraging the world's largest security data lake to power our cloud-native Zero Trust Exchange platform. This innovation protects our customers from cyberattacks and data loss by securely connecting users, devices, and applications in any location. Here, impact in your role matters more than title and trust is built on results. We say, impact over activity. We seek innovators who actively use AI to amplify their impact and who thrive in an environment where we leverage intelligent systems to stay ahead of evolving threats. We believe in transparency and value constructive, honest debate-we're focused on getting to the best ideas, faster. We build high-performing teams that can make an impact quickly and with high quality. To do this, we are building a culture of execution centered on customer obsession, collaboration, ownership, and accountability. We value high-impact, high-accountability with a sense of urgency where you're enabled to do your best work and embrace your potential. If you're driven by purpose, thrive on solving complex challenges, and want to be part of the team that's helping to secure the AI age, we invite you to bring your talents to Zscaler and help shape the future of cybersecurity.

Role details

Job location

Tech stack

Job description

Requirements

Benefits & conditions

About the company

Apply for this position

Good distractions

Moments

Videos View all