Senior Site Reliability Engineer
Fmr LLC
Westlake, United States of America
2 days ago
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Job location: Westlake, United States of America
Tech stack
Java
Artificial Intelligence
Amazon Web Services (AWS)
Relational Databases
Disaster Recovery
Fault Tolerance
Oracle Applications
Performance Tuning
Reliability Engineering
Prometheus
Software Engineering
SQL Databases
Systems Integration
Datadog
Data Logging
System Availability
Grafana
Generative AI
Kubernetes
Information Technology
Splunk
Dynatrace
Job description
This position is for a Sr. Site Reliability Engineer within the R4 Responsive OpsWorX Team covering multiple products in the Brokerage Recordkeeping Technology organization.
This engineer will be responsible for responding to production incidents. You will work closely with our business partners, responding to application-specific questions, and with the product teams to promote availability, resilience, and stability.
- Lead and execute cloud migration initiatives, ensuring minimal downtime, performance optimization, and adherence to architectural best practices.
- Implement and maintain CI/CD pipelines to enable reliable, automated, and secure application deployments.
- Ensure platforms meet high availability, scalability, fault tolerance, and disaster recovery requirements.
- Design, implement, and continuously improve observability solutions, including monitoring, logging, alerting, and distributed tracing, using tools such as Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, and Splunk.
- Instrument applications and infrastructure to provide end-to-end visibility into system health, performance, and reliability.
- Proactively identify performance bottlenecks, capacity risks, and failure points; recommend and implement remediation strategies.
- Lead incident response, providing rapid triage and resolution during production outages or performance degradation.
- Conduct root cause analysis (RCA) for critical incidents and drive corrective and preventive actions.
- Collaborate closely with development, infrastructure, security, and business teams to ensure alignment with operational and business objectives.
- Analyze and reverse-engineer existing applications to understand system behavior, integrations, and dependencies.
- Continuously evaluate emerging technologies, tools, and industry trends to improve platform reliability and operational efficiency.
- Demonstrate adaptability and a strong learning mindset in a fast-paced, evolving environment.
Nice to Have Skills
- AI. Apply Generative AI tools responsibly to improve productivity, including assisting with analysis, documentation, summarization, and ideation activities.
- SQL. Utilize SQL and relational databases (Oracle or other RDBMS) to support application troubleshooting, reporting, and performance analysis.
- Certification in public Cloud (AWS) or Kubernetes is a plus.
Requirements
- Bachelor's degree or higher in a technology-related field (such as Engineering, Computer Science, or Information Technology) required; a master's degree is a plus.
- Minimum 5 years of combined experience across Production Support, Application Development (Java), and Site Reliability Engineering (SRE) to ensure system stability, scalability, and performance.
- 3 years of hands-on experience with Amazon EKS and RDS, used to build, manage, and optimize resilient, scalable cloud platforms with AWS-native services.