Site Reliability Engineer
Job description
As a Site Reliability Engineer at JPMorgan Chase within the Commercial & Investment Banking division, you will solve complex and broad business problems using simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform.

- Guide and assist others in building appropriate level designs and gaining consensus from peers where appropriate
- Collaborate with other software engineers and teams to design and implement deployment approaches using automated continuous integration and continuous delivery pipelines
- Work with software engineers and teams to design, develop, test, and implement availability, reliability, and scalability solutions for applications
- Implement infrastructure, configuration, and networks as code for applications and platforms in your remit
- Collaborate with technical experts, key stakeholders, and team members to resolve complex problems
- Understand service level indicators and utilize service level objectives to proactively resolve issues before they impact customers
- Support adoption of site reliability engineering best practices within your team
Requirements
- Formal training or certification in software engineering concepts and 3+ years of applied experience
- Proven experience in reliability, scalability, performance, security, enterprise system architecture, tool reduction, and SRE best practices
- Proficient in at least one programming language (Python, Java/Spring Boot, etc.)
- Experience with observability tools such as Prometheus, Grafana, Datadog, Splunk, and others
- Strong understanding of monitoring, alerting, telemetry, and service level objectives
- Knowledge of CI/CD tools such as Jenkins, GitLab, or Terraform
- Hands-on experience with container orchestration (EKS, Kubernetes, Docker)
- Experience troubleshooting common networking issues
- Experience implementing and maintaining SLO/SLA frameworks and chaos engineering (Gremlin, Chaos Monkey)
- Comfortable working with traditional metrics (latency, availability)
- Knowledge of infrastructure components such as routers, load balancers, cloud products, container systems, compute, storage, and networks
- Hands-on experience with tools like Jira, Confluence, ServiceNow, and Netcool
- Ability to identify new technologies and tools to improve operations through monitoring and logging analysis