Principal Site Reliability Engineer
Job description
Palo Alto Networks runs a large hybrid infrastructure and is one of the largest Google Cloud Platform customers. As a Site Reliability Engineer, you will be part of a team supporting the services running on this infrastructure. This includes automation, architecture, performance, metrics, troubleshooting, security, and reliability.
Our stack includes Kubernetes, Docker, Google Cloud Platform, AWS, Ansible, Terraform, Vault, GitLab, Spinnaker, Pub/Sub, Bigtable, Memorystore, BigQuery, RabbitMQ, Kafka, MySQL, Python, and Go. We don't expect you to know all of these, but we do expect you to learn the ones needed for this role.
Your Impact
- Contribute to the success of SRE and DevOps
- Develop expertise in new technologies
- Work with developers, researchers, data scientists, and security experts
- Design, build, and operate reliable, secure Cloud infrastructure
- Ensure that applications are production-ready, scalable, and reliable
- Develop tools and automation frameworks
- Automate the deployment of robust services
- Orchestrate end-to-end monitoring and alerting
- Participate with SRE and Dev teams in the on-call rotation
- Lead root cause analysis of critical business and production issues
- Mentor and champion SRE culture
- Participate in design reviews
The Team
WildFire is the industry's largest cloud-based malware protection engine. It uses machine learning and crowdsourced intelligence to instantly prevent up to 95% of unknown malware variants inline without compromising business productivity. The WildFire infrastructure team supports the scalability and high availability of WildFire clouds.
Requirements
- BS or MS in Computer Science or a related field, or equivalent professional or military experience
- Expertise in configuration management with a framework such as Ansible, Terraform, Helm, or Kubernetes
- Proficient in Python and/or Go
- Expertise in managing applications in Kubernetes clusters with autoscaling enabled
- Experience in Production Engineering, DevOps, or Site Reliability
- Expertise in public cloud platforms (Google Cloud Platform or AWS), especially Google Cloud Platform
- Strong skills in Linux administration, Linux internals, and network troubleshooting
- Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
- Experience with CI/CD pipelines, GitLab, and GitHub preferred
- Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
- Excellent written and verbal communication, able to collaborate and rally support
- Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive
- Passion for infrastructure and monitoring as code
- Ready to understand and dissect new technology stacks quickly
Benefits & conditions
The compensation offered for this position will depend on qualifications, experience, and work location. For candidates who receive an offer at the posted level, the starting base salary (for non-sales roles) or base salary + commission target (for sales/commissioned roles) is expected to be the annual range listed below. The offered compensation may also include restricted stock units and a bonus. A description of our employee benefits may be found here.
$151,600.00 - $245,300.00/yr
Our Commitment
We're trailblazers that dream big, take risks, and challenge cybersecurity's status quo. It's simple: we can't accomplish our mission without diverse teams innovating, together.