Site Reliability Engineer (SRE) / Application Production Support
Carlin Shayn
Alpharetta, United States of America
2 days ago
Role details
Contract type: Temporary contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Job location: Alpharetta, United States of America
Tech stack
Java
JIRA
CA Workload Automation AE
Batch Processing
Cloud Computing
IBM DB2
Monitoring of Systems
Spring
Python
Shell
Log Analysis
Oracle Applications
Reliability Engineering
Standard SQL
Shell Script
SQL Databases
Enterprise Software Applications
System Availability
Grafana
Spring Boot
SAP Sybase ASE
Software Troubleshooting
Kibana
Splunk
PagerDuty
ServiceNow
Control-M
Job description
Morgan Stanley is seeking an experienced Site Reliability Engineer (SRE) / Application Production Support professional to support mission-critical financial applications. The ideal candidate will have strong expertise in production support, incident management, monitoring, troubleshooting, automation, and SRE best practices. This role requires hands-on experience with application support, log analysis, scheduling tools, and cloud technologies within enterprise environments.
Responsibilities
- Monitor and support production applications to ensure high availability and reliability.
- Investigate, troubleshoot, and resolve application, infrastructure, and batch processing issues.
- Analyze application logs and identify root causes for incidents and service disruptions.
- Perform log analysis using Splunk, Kibana, and related monitoring tools.
- Manage and support batch processing using scheduling tools such as Control-M and AutoSys.
- Handle incidents, service requests, and problem management activities using ITSM processes.
- Work with ticketing platforms such as ServiceNow, Jira, or Remedy.
- Collaborate with development, infrastructure, and business teams to ensure timely issue resolution.
- Participate in on-call support and incident response activities.
- Develop automation scripts using Unix Shell or Python to improve operational efficiency.
- Support observability, monitoring, and alert management initiatives.
Tools and Technologies
- Splunk
- Kibana
- SQL
- DB2
- Java
- Unix/Linux
- Python
- Control-M
- AutoSys
- ServiceNow
- Jira
- Grafana
- Loki
- PagerDuty
- BigPanda
- Cloud Technologies
Requirements
- Strong understanding of Site Reliability Engineering (SRE) principles.
- Experience with Production Support in enterprise environments.
- Experience in Incident Management, ITIL, and ITSM processes.
- Hands-on experience with:
  - Splunk
  - Kibana
  - SQL
  - DB2
  - Java
  - Unix/Linux
  - Python or Shell Scripting
- Experience with scheduling tools:
  - Control-M
  - AutoSys
- Experience using ticketing tools:
  - ServiceNow
  - Jira
  - Remedy
- Knowledge of Cloud Fundamentals.
- Strong troubleshooting and root cause analysis skills.
- Excellent communication and stakeholder management abilities.
Preferred Skills
- 5+ years of experience supporting enterprise applications.
- Experience with:
  - Grafana
  - Loki
  - Oracle
  - Sybase
- Knowledge of alert management platforms:
  - BigPanda
  - PagerDuty
- Familiarity with:
  - Core Java
  - Spring Framework
  - Spring Boot
  - Spring Integration
- Exposure to Synthetic Monitoring tools such as Apica.
- Experience supporting applications within the Financial Services domain.
Experience
- 5+ years of Production Support, SRE, or Application Support experience.
- Experience supporting mission-critical enterprise applications.
- Financial Services or Banking domain experience preferred.