Software Engineer II - Site Reliability Engineering
Job description
The IT Player Experience Engineering team builds and operates platforms that support millions of players worldwide. As a Software Engineer II - SRE, you will focus on improving the reliability, scalability, and operational excellence of Java-based, microservices-driven systems that power player experiences. This role is critical to delivering FY26 goals by embedding SRE best practices across design, development, and operations.
- Drive SRE initiatives to improve system availability, performance, and resilience across Java microservices
- Define and track SLOs, SLIs, and error budgets for critical services
- Lead incident response, root cause analysis (RCA), and postmortems to prevent recurrence
- Automate operational tasks to reduce toil and improve system reliability
Observability
- Design and implement monitoring, alerting, and logging strategies using industry-standard tools
- Build end-to-end observability with metrics, distributed tracing, and logs for microservices
- Tune alerts to reduce noise and ensure actionable signals during incidents
Engineering & Platform Enablement
- Collaborate with development teams to build reliability into Java/Spring Boot services from design through production
- Review service architecture for scalability, fault tolerance, and operability
- Improve CI/CD pipelines with reliability, testing, and deployment safety checks
- Support cloud-native deployments on AWS and containerized platforms (Docker/Kubernetes)
Best Practices & Enablement
- Champion SRE best practices including automation, capacity planning, and resiliency testing
- Contribute to runbooks, operational documentation, and knowledge sharing
- Partner with engineers, product managers, and leadership to balance feature velocity with system reliability
Requirements
Core Skills
- Strong experience with Java, Spring Boot, and microservices architectures
- Hands-on experience with monitoring, alerting, logging, and distributed tracing
- Experience supporting production systems with high availability and scale requirements
Cloud & Infrastructure
- Experience with AWS services and cloud-native architectures
- Familiarity with Docker, Kubernetes, and CI/CD pipelines
Reliability Mindset
- Experience with incident management, on-call rotations, and post-incident analysis
- Strong troubleshooting skills across application, infrastructure, and network layers
Collaboration
- Ability to work closely with application engineers to influence design for reliability
- Clear communication skills to explain operational risks and trade-offs
Benefits & conditions
We're proud to have an extensive portfolio of games and experiences, locations around the world, and opportunities across EA. We value adaptability, resilience, creativity, and curiosity. From leadership that brings out your potential, to creating space for learning and experimenting, we empower you to do great work and pursue opportunities for growth.
We adopt a holistic approach to our benefits programs, emphasizing physical, emotional, financial, career, and community wellness to support a balanced life. Our packages are tailored to meet local needs and may include healthcare coverage, mental well-being support, retirement savings, paid time off, family leave, complimentary games, and more. We nurture environments where our teams can always bring their best to what they do.