SRE Engineer

Litmus7 Systems Consulting Inc.
San Francisco, United States of America

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior

Job location

Remote
San Francisco, United States of America

Tech stack

Artificial Intelligence
Application Performance Management
Application Services
JIRA
Automation of Tests
Azure
Bash
Software Design Patterns
DevOps
Distributed Systems
Monitoring of Systems
Python
Log Analysis
PowerShell
Reliability Engineering
Prometheus
Data Logging
Scripting (Bash/Python/Go/Ruby)
Data Ingestion
Cloud Monitoring
Grafana
Reliability of Systems
Containerization
Splunk
Dynatrace
Serverless Computing
Docker
ServiceNow
Microservices

Job description

  1. SRE Fundamentals & Reliability Engineering
     - Apply core SRE principles, including:
         - SLIs, SLOs, SLAs definition and governance
         - Error budgets and reliability trade-offs
         - Incident management and postmortems
     - Partner with SRE L2/L3 teams to improve system reliability and performance

  2. Observability Strategy & Tool Recommendation (Core Responsibility)
     - Act as the central point of expertise for Splunk and Dynatrace capabilities
     - Analyze requirements provided by:
         - Application developers
         - SRE L2/L3 engineers
     - Research and determine:
         - Whether requirements can be fulfilled using Splunk, Dynatrace, or both
         - The most efficient, scalable, and cost-effective implementation approach
     - Translate business and technical requirements into tool-specific solutions
     - Recommend best practices, design patterns, and architecture for observability
     - Continuously evaluate new features and enhancements in Splunk and Dynatrace

  3. Splunk Engineering
     - Design and optimize Splunk-based logging and monitoring solutions
     - Develop advanced SPL queries, dashboards, and alerts
     - Define log onboarding strategies and data models
     - Ensure data quality, governance, and cost efficiency
     - Provide guidance on when and how to use Splunk effectively

  4. Dynatrace Expertise
     - Configure and optimize Dynatrace for APM, RUM, and synthetic monitoring
     - Leverage AI-driven anomaly detection and root cause analysis
     - Map business transactions and critical user journeys
     - Guide teams on making the best use of Dynatrace capabilities

  5. Azure Observability
     - Implement and integrate monitoring solutions within Microsoft Azure
     - Work with services such as:
         - Azure App Services, AKS, Azure Functions
         - Azure Monitor, Log Analytics, Application Insights
     - Ensure seamless integration between Azure, Splunk, and Dynatrace

  6. Automation & Enablement
     - Develop automation scripts using Python, PowerShell, or Bash
     - Enable self-service observability for engineering teams
     - Integrate monitoring tools with ServiceNow, Jira, or similar platforms
     - Provide documentation, standards, and reusable templates

  7. Collaboration & Advisory
     - Act as a trusted advisor to developers and SRE teams
     - Conduct requirement intake sessions and translate them into solutions
     - Provide training and guidance on observability best practices
     - Drive adoption of standardized monitoring approaches across teams

Requirements

  - 5+ years of experience in SRE, DevOps, or Observability Engineering
  - Strong understanding of SRE fundamentals (SLIs, SLOs, error budgets, incident management)
  - Deep hands-on experience with:
      - Splunk (log ingestion, SPL, dashboards, alerting)
      - Dynatrace (APM, RUM, synthetic monitoring)
  - Strong expertise in Microsoft Azure
  - Experience supporting large-scale, customer-facing platforms
  - Proficiency in scripting (Python, PowerShell, or Bash)
  - Strong analytical and problem-solving skills

Preferred Qualifications

  - Experience in retail/e-commerce environments
  - Knowledge of microservices and distributed systems
  - Experience with AKS, Docker, and containerized environments
  - Familiarity with additional observability tools (Prometheus, Grafana, ELK)
  - Certifications in Splunk, Dynatrace, or Azure
