Senior Site Reliability Engineer

Fmr LLC
Westlake, United States of America
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Westlake, United States of America

Tech stack

Java
Artificial Intelligence
Amazon Web Services (AWS)
Relational Databases
Disaster Recovery
Fault Tolerance
Oracle Applications
Performance Tuning
Reliability Engineering
Prometheus
Software Engineering
SQL Databases
Systems Integration
Datadog
Data Logging
System Availability
Grafana
Generative AI
Kubernetes
Information Technology
Splunk
Dynatrace

Job description

This position is for a Sr. Site Reliability Engineer within the R4 Responsive OpsWorX Team covering multiple products in the Brokerage Recordkeeping Technology organization.

This engineer will be responsible for responding to production incidents. You will work closely with our business partners to answer application-specific questions, and with the product teams to promote availability, resilience, and stability.

  • Lead and execute cloud migration initiatives, ensuring minimal downtime, performance optimization, and adherence to architectural best practices.
  • Implement and maintain CI/CD pipelines to enable reliable, automated, and secure application deployments.
  • Ensure platforms meet high availability, scalability, fault tolerance, and disaster recovery requirements.
  • Design, implement, and continuously improve observability solutions, including monitoring, logging, alerting, and distributed tracing, using tools such as Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, and Splunk.
  • Instrument applications and infrastructure to provide end-to-end visibility into system health, performance, and reliability.
  • Proactively identify performance bottlenecks, capacity risks, and failure points; recommend and implement remediation strategies.
  • Lead incident response, providing rapid triage and resolution during production outages or performance degradation.
  • Conduct root cause analysis (RCA) for critical incidents and drive corrective and preventive actions.
  • Collaborate closely with development, infrastructure, security, and business teams to ensure alignment with operational and business objectives.
  • Analyze and reverse-engineer existing applications to understand system behavior, integrations, and dependencies.
  • Continuously evaluate emerging technologies, tools, and industry trends to improve platform reliability and operational efficiency.
  • Demonstrate adaptability and a strong learning mindset in a fast-paced, evolving environment.

Nice to Have Skills

  • AI. Apply Generative AI tools responsibly to improve productivity, including assisting with analysis, documentation, summarization, and ideation activities.
  • SQL. Utilize SQL and relational databases (Oracle or other RDBMS) to support application troubleshooting, reporting, and performance analysis.
  • Certification in a public cloud (AWS) or Kubernetes is a plus.

Requirements

  • Bachelor's degree or higher in a technology-related field (such as Engineering, Computer Science, or Information Technology) required; a master's degree is a plus.
  • Minimum 5 years of combined experience across Production Support, Application Development (Java), and Site Reliability Engineering (SRE) to ensure system stability, scalability, and performance.
  • 3 years of hands-on experience with Amazon EKS and RDS, building, managing, and optimizing resilient, scalable cloud platforms using AWS-native services.
