Senior Staff Site Reliability Engineer

Palo Alto Networks
Santa Clara, United States of America
7 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$205K

Job location

Santa Clara, United States of America

Tech stack

Amazon Web Services (AWS)
ARM
Cloud Computing
Computer Security
Computer Programming
DevOps
Monitoring of Systems
Shell
Reliability Engineering
Ansible
Prometheus
Google Cloud Platform
Autoscaling
Grafana
Kubernetes
Cortex XSOAR Platform
Terraform
Docker
PagerDuty

Job description

The Cortex team builds and delivers the industry's most advanced SecOps platform, consisting of XDR, XSIAM, XSOAR, and XPANSE. As a member of the Cortex DevOps team, you will operate and maintain a large-scale GCP environment, including the design, implementation, and continuous enhancement of our comprehensive observability systems. To succeed in this role, you will need deep knowledge of modern observability and monitoring tools and practices, with experience managing high-cardinality metrics, implementing tracing, and operationalizing large-scale logging solutions. You will collaborate closely with our engineering teams to develop innovative solutions that provide clear and actionable insights into our systems' performance and health.

  • Utilize your expertise in monitoring cloud platforms, particularly GCP, to optimize our infrastructure, leveraging cloud-native technologies.

  • Improve monitoring processes, alerts, and metrics, and work with development teams to ensure that all of our services have the right monitoring and metrics in place to detect problems before our customers do.

  • Leverage incident management processes to ensure efficient resolution of system issues and minimal impact on services.

  • Automate complex monitoring and alerting tasks by building tools for cloud operations, such as automated remediation of known issues and auto-scaling.

  • Stay up-to-date with cutting-edge technologies, evaluate their potential impact on our operations, and implement them when appropriate.

  • Provide follow-the-sun operational coverage for our production Observability infrastructure.

  • Work with our Engineering team to influence the operability of the product and ensure the reliability and availability of our services.

Requirements

  • 5+ years of experience as a DevOps/SRE engineer with a passion for technology and a strong motivation for high reliability at the service level.

  • High proficiency with Thanos, Prometheus, Grafana, OpenTelemetry, and other monitoring tools.

  • Clear understanding of incident and alert management using tools like PagerDuty and Prometheus Alertmanager.

  • High proficiency in either Google Cloud Platform or Amazon Web Services.

  • High proficiency with Kubernetes and Docker for container orchestration.

  • High proficiency in Python programming and Linux Shell commands. Experience with Ansible and Terraform for infrastructure as code.

Preferred Qualifications

  • Effective communication and interpersonal skills, with the ability to work and coordinate between multiple teams in different time zones.

  • Ability to effectively troubleshoot and address emerging and complex problems.

  • Ability to operate independently, make decisions, take action, and take responsibility.

Benefits & conditions

The compensation offered for this position will depend on qualifications, experience, and work location. For candidates who receive an offer at the posted level, the starting base salary (for non-sales roles) or base salary + commission target (for sales/commissioned roles) is expected to be the annual range listed below. The offered compensation may also include restricted stock units and a bonus. A description of our employee benefits may be found here (https://benefits.paloaltonetworks.com/).

$126,000.00 - $204,500.00/yr

Our Commitment

We're trailblazers that dream big, take risks, and challenge cybersecurity's status quo. It's simple: we can't accomplish our mission without diverse teams innovating, together.
