SRE Lead & Monitoring Consultant
VALUE SPECTRUM TECHNOLOGIES LLC
Phoenix, United States of America
23 days ago
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Job location: Phoenix, United States of America
Tech stack
Java
Test Suite
Amazon Web Services (AWS)
Audit Trail
Azure
Bash
Cloud Computing
DevOps
Disaster Recovery
Fraud Prevention and Detection
Monitoring of Systems
Python
Machine Learning
Node.js
PCI Data Security Standards
Performance Tuning
Reliability Engineering
Site Reliability Engineering Practices
Ansible
Prometheus
Runbook
Datadog
Scripting (Bash/Python/Go/Ruby)
Google Cloud Platform
Grafana
Multi-Cloud
Hybrid Cloud
Kubernetes
Infrastructure Automation Frameworks
Performance Monitor
Kafka
Cloud Optimization
Terraform
Splunk
Docker
ELK
Go
Job description
SRE Practice Development
- Assess operational maturity and build SRE transformation roadmap
- Establish SLOs, SLIs, and error budgets for critical services
- Design incident management processes and on-call strategies
- Implement chaos engineering and resilience testing
- Mentor teams on SRE principles and best practices
Monitoring & Observability
- Deploy and configure Datadog, Splunk, Grafana, and Prometheus
- Implement metrics collection, log aggregation, and APM
- Build custom dashboards and alerting configurations
- Set up anomaly detection and intelligent alerting
- Configure automated health checks and remediation
- Establish golden signals monitoring (latency, traffic, errors, saturation)
Reliability & Compliance
- Conduct reliability reviews and performance optimization
- Design disaster recovery and failover procedures
- Implement security monitoring and audit logging
- Configure fraud detection and transaction monitoring
- Create runbooks and operational documentation
Deliverables
- Fully configured monitoring stack with Datadog, Splunk, Grafana, and Prometheus
- SLO/SLI definitions and error budgets
- Custom dashboards, alerting, and automated remediation
- Incident management framework and runbooks
- Chaos engineering test suite
Requirements
Experience
- 7+ years in Site Reliability Engineering, DevOps, or infrastructure engineering
- 3+ years in SRE leadership roles.
- Strong expertise in Java, Node.js, Kafka, AWS, and modern AIOps/observability practices.
- Experience implementing proactive monitoring and predictive alerting using AIOps platforms and machine-learning-driven insights.
- 3+ years hands-on experience with Datadog, Splunk, Grafana, and Prometheus.
- Strong hands-on experience with Java and Node.js application architectures.
- Previous experience in fintech or regulated industries.
- Proven track record building SRE practices from scratch.
Technical Skills
- Deep understanding of SRE principles, error budgets, and SLO/SLI frameworks.
- Expertise with cloud platforms (AWS, Azure, or Google Cloud Platform).
- Proficiency with Kubernetes, Docker, and infrastructure as code (Terraform, Ansible).
- Strong programming/scripting skills (Python, Go, Bash).
- Experience with incident management and post-mortem culture.
- Knowledge of compliance requirements (SOC 2, PCI-DSS, ISO 27001).
Soft Skills
- Exceptional leadership and mentoring abilities.
- Strong communication and stakeholder management.
- Data-driven decision-making approach.
- Collaborative mindset with ability to drive cultural change.
Preferred Qualifications
- Cloud certifications (AWS, Google Cloud Platform, Azure) or Kubernetes certifications (CKA/CKAD).
- Experience with ELK stack.
- Background in cloud cost optimization.
- Multi-cloud or hybrid cloud experience.