Lead Integration & Observability Specialist (SRE Lead)

Ark Infotech Spectrum
McKinney, United States of America
14 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

McKinney, United States of America

Tech stack

API
Amazon Web Services (AWS)
Azure
Continuous Integration
Distributed Systems
Enterprise Messaging Systems
Release Management
Reliability Engineering
Prometheus
Data Logging
Grafana
Information Technology
Integration Frameworks
Cloud Integration
AppDynamics
Data Pipelines
Dynatrace
ServiceNow

Job description

We are seeking a Lead Integration & Observability Specialist to design, implement, and lead enterprise observability and reliability solutions, while supporting cloud-based integration platforms on AWS/Azure. The role focuses on monitoring, automation, and operational readiness of applications, APIs, data pipelines, and messaging systems.

  • Lead the implementation of enterprise observability for applications, APIs, services, batch jobs, and data pipelines.
  • Design and standardize monitoring, alerting, logging, metrics, and health checks across distributed systems.
  • Integrate observability platforms with incident management and automation tools to support proactive issue detection and remediation.
  • Support reliability and availability of integration platforms built on AWS/Azure.
  • Perform advanced troubleshooting using logs, metrics, and traces to resolve production issues.
  • Define operational readiness standards and non-functional requirements.
  • Mentor engineers on observability best practices and platform usage.
  • Collaborate with product, support, and operations teams to improve service stability and delivery.

Requirements

  • 15+ years of overall IT experience

  • 7+ years of relevant experience in Observability / Monitoring / Reliability Engineering

  • Strong hands-on experience with enterprise observability tools, such as Instana, Dynatrace, AppDynamics, Prometheus, and Grafana

Expertise in:

  • Monitoring and alerting design
  • Log management and analysis
  • Metrics and distributed tracing
  • Health checks and SLO/SLI concepts

Experience monitoring AWS/Azure workloads

Strong troubleshooting and incident analysis skills

Experience defining operational and non-functional requirements

Technical leadership and mentoring experience

Automation and ITSM integration (ServiceNow workflows, incident automation)

CI/CD and release management exposure

Cloud integration and messaging exposure
