Senior Site Reliability Engineer

Fmr LLC

Westlake, United States of America

2 months ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Westlake, United States of America

Tech stack

Java

Artificial Intelligence

Amazon Web Services (AWS)

Relational Databases

Disaster Recovery

Fault Tolerance

Oracle Applications

Performance Tuning

Reliability Engineering

Prometheus

Software Engineering

SQL Databases

Systems Integration

Datadog

Data Logging

System Availability

Grafana

Generative AI

Kubernetes

Information Technology

Splunk

Dynatrace

Job description

This position is for a Sr. Site Reliability Engineer within the R4 Responsive OpsWorX Team covering multiple products in the Brokerage Recordkeeping Technology organization.

This Engineer will be responsible for responding to production incidents. You will work closely with our business partners, responding to application-specific questions, and work with the product teams to promote availability, resilience, and stability.

Lead and execute cloud migration initiatives, ensuring minimal downtime, performance optimization, and adherence to architectural best practices.

Implement and maintain CI/CD pipelines to enable reliable, automated, and secure application deployments.
Ensure platforms meet high availability, scalability, fault tolerance, and disaster recovery requirements.
Design, implement, and continuously improve observability solutions, including:

Monitoring
Logging
Alerting
Distributed tracing

These solutions use tools such as Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, and Splunk.

Instrument applications and infrastructure to provide end-to-end visibility into system health, performance, and reliability.
Proactively identify performance bottlenecks, capacity risks, and failure points; recommend and implement remediation strategies.
Lead incident response, providing rapid triage and resolution during production outages or performance degradation.
Conduct root cause analysis (RCA) for critical incidents and drive corrective and preventive actions.
Collaborate closely with development, infrastructure, security, and business teams to ensure alignment with operational and business objectives.
Analyze and reverse-engineer existing applications to understand system behavior, integrations, and dependencies.
Continuously evaluate emerging technologies, tools, and industry trends to improve platform reliability and operational efficiency.
Demonstrate adaptability and a strong learning mindset in a fast-paced, evolving environment.

Nice to Have Skills

AI. Apply Generative AI tools responsibly to improve productivity, including assisting with analysis, documentation, summarization, and ideation activities.
SQL. Utilize SQL and relational databases (Oracle or other RDBMS) to support application troubleshooting, reporting, and performance analysis.
Certification in public Cloud (AWS) or Kubernetes is a plus.

Requirements

Bachelor's degree or higher in a technology-related field (such as Engineering, Computer Science, or Information Technology) required; a master's degree is a plus.
Minimum 5 years of combined experience across Production Support, Application Development (Java), and Site Reliability Engineering (SRE) to ensure system stability, scalability, and performance.
3 years of hands-on experience with Amazon EKS and RDS, building, managing, and optimizing resilient, scalable cloud platforms using AWS-native services.
