SRE Lead & Monitoring Consultant

VALUE SPECTRUM TECHNOLOGIES LLC

Phoenix, United States of America

23 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Phoenix, United States of America

Tech stack

Java

Test Suite

Amazon Web Services (AWS)

Audit Trail

Azure

Bash

Cloud Computing

DevOps

Disaster Recovery

Fraud Prevention and Detection

Monitoring of Systems

Python

Machine Learning

Node.js

PCI Data Security Standards

Performance Tuning

Reliability Engineering

Site Reliability Engineering Practices

Ansible

Prometheus

Runbook

Datadog

Scripting (Bash/Python/Go/Ruby)

Google Cloud Platform

Grafana

Multi-Cloud

Hybrid Cloud

Kubernetes

Infrastructure Automation Frameworks

Performance Monitor

Kafka

Cloud Optimization

Terraform

Splunk

Docker

ELK

Job description

SRE Practice Development

Assess operational maturity and build an SRE transformation roadmap
Establish SLOs, SLIs, and error budgets for critical services
Design incident management processes and on-call strategies
Implement chaos engineering and resilience testing
Mentor teams on SRE principles and best practices

Monitoring & Observability

Deploy and configure Datadog, Splunk, Grafana, and Prometheus
Implement metrics collection, log aggregation, and APM
Build custom dashboards and alerting configurations
Set up anomaly detection and intelligent alerting
Configure automated health checks and remediation
Establish golden signals monitoring (latency, traffic, errors, saturation)

Reliability & Compliance

Conduct reliability reviews and performance optimization
Design disaster recovery and failover procedures
Implement security monitoring and audit logging
Configure fraud detection and transaction monitoring
Create runbooks and operational documentation

Deliverables

Fully configured monitoring stack with Datadog, Splunk, Grafana, and Prometheus
SLO/SLI definitions and error budgets
Custom dashboards, alerting, and automated remediation
Incident management framework and runbooks
Chaos engineering test suite

Requirements

Experience

7+ years in Site Reliability Engineering, DevOps, or infrastructure engineering.
3+ years in SRE leadership roles.
Strong expertise in Java, Node.js, Kafka, AWS Cloud, and modern AIOps/observability practices.
Experience implementing proactive monitoring and predictive alerting using AIOps platforms and machine learning-driven insights.
3+ years hands-on experience with Datadog, Splunk, Grafana, and Prometheus.
Strong hands-on experience with Java and Node.js application architectures.
Previous experience in fintech or regulated industries.
Proven track record building SRE practices from scratch.

Technical Skills

Deep understanding of SRE principles, error budgets, and SLO/SLI frameworks.
Expertise with cloud platforms (AWS, Azure, or Google Cloud Platform).
Proficiency with Kubernetes, Docker, and infrastructure as code (Terraform, Ansible).
Strong programming/scripting skills (Python, Go, Bash).
Experience with incident management and post-mortem culture.
Knowledge of compliance requirements (SOC 2, PCI-DSS, ISO 27001).

Soft Skills

Exceptional leadership and mentoring abilities.
Strong communication and stakeholder management.
Data-driven decision-making approach.
Collaborative mindset with the ability to drive cultural change.

Preferred Qualifications

Cloud certifications (AWS, Google Cloud Platform, Azure) or Kubernetes certifications (CKA/CKAD).
Experience with ELK stack.
Background in cloud cost optimization.
Multi-cloud or hybrid cloud experience.
