Site Reliability Engineer
Job description
We are looking for an SRE to improve reliability and operational readiness with a strong focus on metrics, alerting, and event management. You will build and maintain monitoring using Prometheus/VictoriaMetrics, integrate alerts and events with BigPanda, and participate in on-call rotations to drive fast incident response and continuous improvement across Windows and Linux environments.
- Build and operate metrics/monitoring platforms: Prometheus and/or VictoriaMetrics (scrape configs, exporters, recording rules)
- Design and maintain alerting strategy: thresholds, anomaly detection where applicable, alert routing, deduplication, and noise reduction
- Integrate monitoring/alerting and events with BigPanda (correlation, enrichment, routing, incident workflows)
- Create and maintain dashboards and operational visibility (Grafana or equivalent)
- Develop and maintain runbooks, operational playbooks, and incident response procedures
- Participate in on-call shifts: triage alerts, manage incidents, coordinate response, and lead communication during outages
- Perform root-cause analysis and postmortems, and implement corrective/preventive actions
- Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
- Support monitoring for core infrastructure and services on Windows and Linux, including HA components and clusters
- Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)

- Participation in a rotating on-call schedule (including nights/weekends as needed)
- Ownership of incident response: rapid triage, escalation, mitigation, and follow-up improvements
- Commitment to improving monitoring quality to reduce alert fatigue and improve MTTR
Requirements
Do you have experience in Windows?
- Experience in SRE / Operations / DevOps with production incident ownership
- Hands-on experience with Prometheus and/or VictoriaMetrics (exporters, alert rules, recording rules, troubleshooting)
- Experience integrating alerting/event pipelines with BigPanda (or similar event correlation tools)
- Strong troubleshooting skills across Linux and Windows systems (networking, OS, services)
- Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
- Experience with Git-based workflows for monitoring-as-code and configuration management
Nice to have
- Grafana administration and dashboard design standards
- Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)
- Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)
- Messaging/cache/proxy operations: RabbitMQ, Redis, Nginx
- Experience with Windows clustering or HA environments
- Experience defining SLOs/SLIs and operational KPIs
- Experience managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)
- Experience with load balancing components (F5 LTM, F5 GTM)
- Experience with virtualization platforms such as VMware or Hyper-V
- Experience administering AWS or Azure tenants