Lead Integration & Observability Specialist (SRE Lead)
Job description
We are seeking a Lead Integration & Observability Specialist to design, implement, and lead enterprise observability and reliability solutions, while supporting cloud-based integration platforms on AWS/Azure. The role focuses on monitoring, automation, and operational readiness of applications, APIs, data pipelines, and messaging systems.
- Lead the implementation of enterprise observability for applications, APIs, services, batch jobs, and data pipelines.
- Design and standardize monitoring, alerting, logging, metrics, and health checks across distributed systems.
- Integrate observability platforms with incident management and automation tools to support proactive issue detection and remediation.
- Support the reliability and availability of integration platforms built on AWS/Azure.
- Perform advanced troubleshooting using logs, metrics, and traces to resolve production issues.
- Define operational readiness standards and non-functional requirements.
- Mentor engineers on observability best practices and platform usage.
- Collaborate with product, support, and operations teams to improve service stability and delivery.
Requirements
- 15+ years of overall IT experience
- 7+ years of relevant experience in Observability / Monitoring / Reliability Engineering
- Strong hands-on experience with enterprise observability tools, such as Instana, Dynatrace, AppDynamics, Prometheus, and Grafana
- Expertise in:
  - Monitoring and alerting design
  - Log management and analysis
  - Metrics and distributed tracing
  - Health checks and SLO/SLI concepts
- Experience monitoring AWS/Azure workloads
- Strong troubleshooting and incident analysis skills
- Experience defining operational and non-functional requirements
- Technical leadership and mentoring experience
- Automation and ITSM integration (ServiceNow workflows, incident automation)
- CI/CD and release management exposure
- Cloud integration and messaging exposure