Network / System Engineer
Job description
The Site Reliability Engineer (SRE) will join a team that owns the reliability and operational health of a large-scale platform. This role will operate hands-on across the stack to improve platform and application observability, drive reliability improvements, and deliver measurable gains in operational efficiency. The position involves working closely with core teams to execute platform modernization, harden production systems, and evolve support tooling to ensure the platform continues to meet its reliability and performance objectives.
- Collaborate with engineers, architects, and teams to design, develop, test, and implement secure, robust, highly available, and scalable solutions for applications and platforms
- Design and implement deployment approaches using highly scalable, automated, continuous integration and continuous delivery pipelines
- Take responsibility for all aspects of reliability, collaborating with technical experts to resolve complex problems and ensure they do not recur
- Utilize a deep understanding of SRE practices, service level indicators, and service level objectives to proactively resolve issues
- Gather and analyze large, diverse data sets, and develop visualizations and reports from them to drive continuous improvement
- Identify opportunities to eliminate toil and automate the triage of issues to improve operational stability
- Collaborate with a global team to identify, analyze, and resolve platform vulnerabilities
- Proactively promote the adoption of site reliability engineering best practices within the team and organization
Requirements
- At least 5 years of combined experience in SRE, software development, or infrastructure engineering
- Strong experience in implementing, monitoring, and maintaining highly scalable and resilient application services and platforms
- Strong experience with monitoring tools such as OpenTelemetry (OTel), ELK (Elasticsearch, Logstash, Kibana), Splunk, and Dynatrace
- Knowledge of Python, Shell, or Perl scripting
- Proficiency in implementing CI/CD pipelines with tools such as Git and Jenkins
- Advanced knowledge of networking (firewalls, DNS, load balancing, proxies, etc.)
- Advanced understanding of the Linux operating system, including shell scripting and core commands
- Experience with Ansible for writing playbooks and using core modules
- Excellent interpersonal, organizational, and communication (written, verbal, and presentation) skills
- Self-motivated and results-oriented with excellent analytical and problem-solving skills
Desired skills
- UI/UX experience to provide oversight on best practices for tooling used by production support teams
- Hands-on experience with Infrastructure as Code (IaC) tools like Terraform for automating infrastructure deployment
- Background in a large enterprise environment
- Ability to correlate information across systems and resolve complex infrastructure issues quickly
- Able to work in a fast-paced environment while meeting deadlines