Site Reliability Engineer

Intermedia Intelligent Communications
3 days ago

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Shift work
Languages: English

Job location

Remote

Tech stack

Microsoft Windows
Amazon Web Services (AWS)
Azure
Bash
VoIP
Configuration Management
Data Deduplication
Noise Reduction
Linux
DevOps
Hyper-V
Python
Linux System Administration
Routing
Nginx
PowerShell
RabbitMQ
Redis
Reliability Engineering
Ansible
Prometheus
Virtualization Technology
Curam Configuration Tools
Load Balancing
F5 GTM
Grafana
MTTR
Git Flow
ArcSight Event Correlation
Windows Clustering
VMware

Job description

We are looking for an SRE to improve reliability and operational readiness with a strong focus on metrics, alerting, and event management. You will build and maintain monitoring using Prometheus/VictoriaMetrics, integrate alerts and events with BigPanda, and participate in on-call rotations to drive fast incident response and continuous improvement across Windows and Linux environments.

  • Build and operate metrics/monitoring platforms: Prometheus and/or VictoriaMetrics (scrape configs, exporters, recording rules)
  • Design and maintain alerting strategy: thresholds, anomaly detection where applicable, alert routing, deduplication, and noise reduction
  • Integrate monitoring/alerting and events with BigPanda (correlation, enrichment, routing, incident workflows)
  • Create and maintain dashboards and operational visibility (Grafana or equivalent)
  • Develop and maintain runbooks, operational playbooks, and incident response procedures
  • Participate in on-call shifts: triage alerts, manage incidents, coordinate response, and lead communication during outages
  • Perform root-cause analysis, postmortems, and implement corrective/preventive actions
  • Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
  • Support monitoring for core infrastructure and services on Windows and Linux, including HA components and clusters
  • Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)

  • Participation in a rotating on-call schedule (including nights/weekends as needed)
  • Ownership of incident response: rapid triage, escalation, mitigation, and follow-up improvements
  • Commitment to improving monitoring quality to reduce alert fatigue and improve MTTR

Requirements

Do you have experience in Windows?

  • Experience in SRE / Operations / DevOps with production incident ownership
  • Hands-on experience with Prometheus and/or VictoriaMetrics (exporters, alert rules, recording rules, troubleshooting)
  • Experience integrating alerting/event pipelines with BigPanda (or similar event correlation tools)
  • Strong troubleshooting skills across Linux and Windows systems (networking, OS, services)
  • Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
  • Experience with Git-based workflows for monitoring-as-code and configuration management

Nice to have

  • Grafana administration and dashboard design standards

  • Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)

  • Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)

  • Messaging/cache/proxy operations: RabbitMQ, Redis, Nginx

  • Experience with Windows clustering or HA environments

  • Experience defining SLOs/SLIs and operational KPIs

  • Experience managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)

  • Experience with load balancing components (F5 LTM, F5 GTM)

  • Experience with virtualization platforms such as VMware or Hyper-V

  • Experience with administering AWS or Azure tenants

About the company

Are you looking for a company where YOUR VOICE is heard? Where you can MAKE A DIFFERENCE? Do you THRIVE in a FAST-PACED work environment? Do you wake every morning EXCITED to work with GREAT PEOPLE and create SUCCESS TOGETHER? Then Intermedia is the place for you. Intermedia has established itself as a leading provider of cloud communications and collaboration tech that allows companies to connect better. We have a strong track record of growth, profitability, and creating an environment where everyone matters. Everyone. While we are fast-paced and admittedly a bit intense, we promise that you won't be bored. You will find Intermedia is a place where you can indulge your passion for creating and supporting great cloud technology. What's more, we always look to promote from within and have many employees who have been with us 10, 15, and 20+ years! Culture at Intermedia is built on teamwork and transparency. We hold each other accountable and always have each other's back! Are you ready to make your mark?
