TechOps / Support Engineer
Job description
We are seeking a highly motivated TechOps / Support Engineer to join our Technology Operations team. The role is responsible for maintaining platform reliability, managing production incidents, coordinating Major Incident Management (MIM), driving Root Cause Analysis (RCA), and ensuring timely resolution of Severity (Sev) incidents. The engineer will participate in an active on-call rotation and work closely with engineering, infrastructure, product, and business teams to minimize service disruptions and improve operational excellence.
Incident Management & Operations
- Participate in a 24x7 on-call rotation and provide production support for critical business applications and services.
- Act as Incident Commander or coordinator during Severity (Sev) incidents and major outages.
- Lead Major Incident Management (MIM) activities, including stakeholder communication, bridge coordination, escalation management, and service restoration.
- Drive incidents through to resolution while ensuring adherence to defined SLAs and operational procedures.
- Monitor application, infrastructure, and platform health using observability and monitoring tools.
- Perform proactive issue detection, troubleshooting, and remediation.
Root Cause Analysis (RCA)
- Lead and coordinate post-incident reviews and Root Cause Analysis (RCA) activities.
- Identify underlying causes of recurring issues and collaborate with engineering teams to implement permanent fixes.
- Track corrective and preventive actions to closure.
- Maintain detailed incident documentation, timelines, and lessons learned.
Problem Management & Continuous Improvement
- Analyze incident trends and recommend operational improvements.
- Develop and enhance runbooks, knowledge base articles, and operational procedures.
- Drive automation initiatives to reduce manual effort and improve response times.
- Contribute to operational readiness reviews for new releases and platform changes.
Stakeholder Communication
- Provide timely updates to internal stakeholders during critical incidents.
- Coordinate across engineering, infrastructure, cloud, security, and vendor teams during issue resolution.
- Ensure clear communication throughout the incident lifecycle.
Requirements
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
- 3-7+ years of experience in Technical Operations, Production Support, Site Reliability Engineering (SRE), or IT Operations.
- Strong experience managing production incidents and Sev1/Sev2 issues.
- Hands-on experience with Major Incident Management (MIM) processes.
- Proven experience conducting Root Cause Analysis (RCA) and driving corrective actions.
- Strong troubleshooting skills across applications, APIs, databases, and infrastructure.
- Experience with monitoring and observability tools such as Splunk, Datadog, Dynatrace, New Relic, Grafana, Prometheus, or similar.
- Knowledge of Linux/Unix systems and cloud environments (AWS, Azure, or Google Cloud Platform).
- Familiarity with ticketing and ITSM platforms such as ServiceNow, Jira Service Management, or similar.
- Excellent communication and stakeholder management skills.

- Experience supporting cloud-native and microservices-based architectures.
- Knowledge of DevOps, CI/CD pipelines, and automation scripting (Python, Shell, PowerShell, etc.).
- ITIL Foundation or relevant operational certifications.
- Experience working in high-availability, mission-critical production environments.

- Incident Leadership
- Major Incident Management (MIM)
- Root Cause Analysis (RCA)
- Production Support
- Operational Excellence
- Problem Solving
- Stakeholder Communication
- Escalation Management
- Automation Mindset
- Team Collaboration