Cloud Engineer - Sr
Role details
Job location
Tech stack
Job description
SRE background required. AWS and Python/Java. Expertise in observability tools such as Splunk, New Relic, and Observe (must have). Working on journey mapping on DFS intent.
Focus: Full-Stack Observability, System Traceability, and Executive Health Scoring
Requirements
We are seeking a hands-on Observability Specialist to accelerate the adoption of our Observe-based platform. The ideal candidate has an SRE mindset: the ability to explore how complex systems interact and to identify the exact data sets needed to provide a 360-degree view of the environment. You will bridge the gap between disparate Lines of Business (LOBs) to build end-to-end (E2E) traceability and unified "Health Indices" that reduce mean-time-to-detect (MTTD) from hours to minutes.
Technical Skill Requirements
- Core Observability & Tooling
- Platform Expertise: Deep experience with modern observability platforms. While we use Observe, proficiency in New Relic, Splunk, or Databricks is required for rapid ramp-up.
- Query & Data Fluency: Expert-level ability to write complex queries (SQL-based or in proprietary languages such as NRQL or SPL) to aggregate API success rates, latency, and crash-free session data.
- Dashboard Architecture: Proven track record of building "Drill-Down" architectures that move from high-level user journeys (e.g., Login) directly into microservice-level logs and traces.
- The Modern Tech Stack
- Infrastructure: Hands-on experience with AWS (ECS/Fargate/Lambda) and Docker.
- Languages: Ability to navigate and instrument code in Python or Java.
- Integrations: Familiarity with GraphQL for data fetching and Jenkins for CI/CD pipeline monitoring.
- Instrumentation: Hands-on experience with OpenTelemetry (OTel), and familiarity with New Relic APM or Datadog APM.
- SRE & Systems Architecture Mindset
- Cross-Domain Traceability: Experience monitoring digital customer engagement across disparate system boundaries (e.g., Comms, Phone, and Backend APIs) to expose "silent failures."
- Telemetry Mapping: Ability to map technical metrics to business outcomes, specifically creating Unified Health Indices for Senior Leadership (SLT).
- Root Cause Analysis (RCA): Skill in configuring alerts and correlations that enable instant pinpointing of failures within complex user flows.