Senior Site Reliability Engineer
Job description
The Wikimedia Foundation is looking for a Senior SRE to join our team, reporting to the Engineering Manager of the Data Platform Engineering SRE team. As a Senior SRE, you will be responsible for operating the systems that support our data-oriented teams (Kubernetes, >6 PB Hadoop, OpenSearch, Airflow, Superset, Kafka, etc.), helping design and implement new systems and solutions, and ensuring that our systems scale to meet demand. In this role, you will interact with our client teams, support them in whatever adventure they are on, investigate incidents, migrate services to Kubernetes, …
You are responsible for:
- Simplifying our operations by standardizing how we deploy services and how we benefit from virtualizing and containerizing our applications
- Supporting our users, removing roadblocks, and making them more productive!
- Monitoring systems and services, and optimizing performance and resource utilization
- Proactively identifying sources of instability in distributed systems and analyzing how complex systems fail from a reliability and resilience perspective
- Automating and streamlining tasks, and identifying process gaps
- Collaborating with a global and asynchronously communicating team (don't worry if you have never worked remotely, we'll help you get used to it)
- Mentoring peers in your areas of technical and operational strength
- Traveling domestically or potentially internationally 2-3 times a year for team gatherings and conferences
Requirements
- 5+ years of experience in an SRE/Operations/DevOps or software engineering role
- Experience with running applications and services at scale
- Proficiency with shell and a programming language used in an SRE/Operations engineering context (Python, Go, Ruby, etc.)
- Comfort with open source configuration management and orchestration tools (Puppet, Ansible, Terraform, etc.)
- Communicative technical English
- Experience with virtualization of data and compute
Qualities that are important to us:
- Share our values, appreciate our code of conduct, support our team norms, and work in accordance with all three
- Customer-oriented. We're here to help, not to block.
- Strong English language skills and the ability to work independently as an effective part of a globally distributed team
- Comfortable working in the open
- Passionate about supporting our communities
Additionally, we'd love it if you have:
- Experience with Kubernetes and Ceph
- Experience with operating a data platform