Senior Site Reliability Engineer

Intercontinental Exchange

Jacksonville, United States of America

31 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Jacksonville, United States of America

Tech stack

Java

.NET

PHP

Microsoft Windows

Application Services

Automation of Tests

Azure

Client Server Models

Software as a Service

Cloud Computing

Cloud Engineering

Computer Engineering

Continuous Delivery

Continuous Integration

Linux

File Systems

Distributed Systems

Perl

Fault Tolerance

Python

Windows Server

Scrum

Red Hat Enterprise Linux - RHEL

Reliability Engineering

Ruby

Scripting (Bash/Python/Go/Ruby)

Containerization

Kubernetes

Information Technology

Deployment Automation

Performance Monitor

Job description

This new SRE position assists with day-to-day activities supporting ST Application services related to deployment and incident management. Build actionable alerts and automation for preventing incidents, detecting performance bottlenecks, and identifying maintenance activities.

Employ deep troubleshooting skills to improve the availability, performance, and security of IMT Services
Coding and Automation of Applications on Linux, Windows, Cloud Platforms
Implement automated tests, automated deployments, and operational tools
Collaborate with Product and Support teams to plan and deploy product releases
Work with Linux, Windows, Cloud Platforms, and Operations leaders to develop narratives and the backlog grooming, epic planning, and overall sprint planning processes
Work with Engineering leadership to build shared services that meet the requirements and needs of the platform and application teams
Ensure services are designed with 24/7 availability and operational readiness and rigor
Implement proactive monitoring, alerting, trend analysis, and self-healing systems
Define non-functional requirements as part of the product lifecycle to influence the new designs, standards, and methods for scalable, highly available distributed systems
Contribute to product development / engineering as needed to ensure Quality of Service of Highly Available services
Identify, evaluate, and execute preventive measures to minimize or avoid impact to the customer's experience, so that issues are caught proactively rather than escalated by customers
Resolve product/service defects through design changes, infrastructure changes, or operational changes
Partner with other SREs and lead by example: be a contributor more than a delegator
Develop partnership-oriented relationships with business executives and functional leaders, especially as it relates to operations and technology

Requirements

7+ years of Systems/Applications automation in 24x7 Production support services environments
BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
Fluency with one or more current-generation scripting languages (Python/Shell/Perl/PHP/Ruby) and/or Java development and/or .NET
7+ years of experience managing Red Hat Enterprise Linux required
Excellent troubleshooting skills, utilizing a systematic problem-solving approach
Demonstrated experience in designing, analysing, and diagnosing large-scale distributed systems, plus Windows Server and/or Linux system internals (system libraries, file systems, client-server protocols)
Experience with elastic scalability, fault tolerance, and other cloud architecture patterns
Experience with Continuous Integration and Continuous Delivery concepts
Nice to have: experience with containerization technologies such as Kubernetes
Proven strength in SaaS, with experience in massive-scale web operations
Must be able to multitask in a fast-paced environment with a focus on timeliness, documentation, and communication with peers and business users alike
Expertise with monitoring, alerting, and incident response tools, and with performing root cause analysis
Experience with deployment automation tools like UCD
Experience with Azure DevOps (ADO)