Senior Engineer, Site Reliability Engineer
Job description
- Lead Incident Recovery: Direct the recovery of incidents, analyze facts quickly, perform troubleshooting activities, and coordinate actions through incident recovery meetings.
- Incident Reporting and Customer Advocacy: Write detailed incident reports, advocate for customers in post-incident reviews, and review and approve customer statements.
- Production Environment Stability: Ensure a stable production environment by safely delivering changes and thoroughly assessing deployment risks.
- Operational Process Improvement: Automate operational processes to reduce manual work. Ensure all alerts are actionable and collaborate with development teams to eliminate unnecessary alerts.
- Subject Matter Expertise: Develop expertise in application dataflows and networking topologies and maintain comprehensive troubleshooting knowledge bases.
- Collaboration with Development Teams: Collaborate closely with development teams to prioritize iterative improvements to production environments in the product backlog.
- Project Intake and Delivery: Serve as the SRE point of contact for new projects. Collaborate with project delivery teams to produce high-quality project artifacts, ensure application designs meet supportability standards, conduct operational acceptance testing, and lead Game Day activities.
Requirements
- Extensive experience in UNIX administration and scripting, including shell scripting and automation.
- Practical experience supporting cloud-native applications, with a preference for AWS or Azure.
- Advanced understanding of networking concepts such as TCP/IP, HTTP, and DNS resolution; experience in real-time or low-latency environments is a plus.
- Experience in configuring and using Kubernetes, Docker, and container-based development and applications.
- Proven experience in troubleshooting large distributed systems; experience with market data systems is a plus.
- Expertise in working with and maintaining observability tooling (Datadog and BigPanda experience is highly desirable).
- Strong grasp of version control systems, particularly Git.
- A bachelor's degree in computer science or a related technical field involving software/systems engineering, or equivalent practical experience.
- A minimum of 8 years of work experience in the industry; customer support experience is a plus.