Software Engineer
Job description
We are seeking a senior, hands-on Software Engineer to support and evolve large-scale application and middleware platforms with a Site Reliability Engineering (SRE) mindset. This role focuses on production reliability, observability, and automation, shifting operations from reactive support to proactive, engineered reliability.
You will serve as an L2/L3 escalation point for mission-critical systems, owning incident response, problem management, and runbook-driven operations. You'll also build automation, infrastructure-as-code, and observability solutions that reduce toil, improve MTTR, and increase platform stability across VM-based and container-adjacent environments, including OpenShift (OCP).
This role supports a fast-growing platform portfolio (200+ applications, scaling rapidly) and requires strong architectural understanding, technical depth, and the ability to adapt across technologies.
What You'll Do
- Act as a senior escalation point for L2/L3 production incidents, leading troubleshooting, recovery, and stabilization of application and middleware services.
- Apply SRE practices daily: define and improve reliability signals, enhance alert quality, conduct blameless post-incident reviews, and prioritize systemic fixes over manual work.
- Design and operate observability solutions (logs, metrics, traces, dashboards, and actionable alerts) to improve detection, diagnosis, and recovery times.
- Build and maintain automation and infrastructure-as-code to support repeatable, audited, and resilient operations across VM and container-adjacent platforms.
- Develop standardized operational automation (status checks, start/stop/restart patterns) to reduce dependency bottlenecks and enable safe self-service.
- Implement intelligent automation (including AI-assisted operations where appropriate) with strong guardrails for accuracy, security, and compliance.
- Monitor and remediate configuration drift; support automated compliance validation aligned with enterprise risk and change management.
- Integrate infrastructure and operational automation into CI/CD pipelines for safer, consistent rollouts.
- Support shared platform components such as ingress, load balancing integrations, and common middleware services.
- Create and maintain runbooks, operational documentation, and validation procedures to ensure consistent execution and operational readiness.
- Participate in an on-call rotation supporting 24x7 production operations.

- Reduced incident frequency and faster recovery times through better observability and automation.
- Measurable reduction in operational toil and manual intervention.
- Reliable, auditable, and repeatable platform operations at scale.
- Clear, maintainable documentation and runbooks that enable consistent execution.
- Strong partnership with application, infrastructure, and security teams.
Requirements
- 5+ years of experience in software engineering, systems engineering, or production operations, or equivalent practical experience.
- Hands-on experience supporting production applications or middleware in complex, highly available environments.
- Strong troubleshooting skills with the ability to understand system architecture, capacity constraints, and failure modes.
- Experience with automation or scripting (e.g., Python, Bash, PowerShell, or similar).
- Experience working in Linux and/or Windows Server production environments.
- Familiarity with Git-based workflows and infrastructure or configuration as code.
- Ability to learn new technologies quickly and adapt across diverse platforms.

- Experience supporting container-adjacent or Kubernetes-based platforms, including OpenShift (OCP).
- Experience implementing SRE operating practices (reliability metrics, alert engineering, toil reduction).
- Experience with more than one observability platform (e.g., Splunk, Elastic, or similar).
- Experience with automation frameworks (Ansible or equivalent).
- Experience integrating operations with CI/CD pipelines.
- Exposure to responsible AI usage in operations (automation assistance, predictive signals, guarded remediation).
- Strong communication skills and experience working in regulated or enterprise environments.