Site Reliability Engineer
Job description
As a Site Reliability Engineer, you will be responsible for:
Operational Excellence & Incident Management
- Maintain and monitor production systems for availability, latency, and performance.
- Lead incident response efforts, including communication, resolution, and postmortem documentation.
- Design and implement health checks, alerting systems, and automated remediation workflows.
- Drive root cause analysis and implement permanent resolutions for recurring issues.
Observability & Insights
- Set up and maintain full observability stacks (logging, metrics, tracing) using tools like Prometheus, Grafana, Datadog, OpenTelemetry, or ELK.
- Analyze telemetry and logs to identify trends, anomalies, and opportunities for improvement.
- Conduct post-incident reviews and use insights to inform future engineering investments.
Performance & Systems Optimization
- Tune and optimize distributed systems, including Akka.NET actors, for performance and resource efficiency.
- Work with developers to evolve architecture and improve system throughput, latency, and stability.
- Optimize PostgreSQL performance, queries, and maintenance strategies.
CI/CD & Automation
- Design and maintain modern CI/CD pipelines using GitHub Actions, Azure Pipelines, or GitLab CI.
- Automate deployment, testing, and rollback processes to reduce friction and increase deployment frequency.
Requirements
- 5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
- Expertise in Kubernetes and container orchestration at scale.
- Strong experience with Akka.NET or similar actor-based frameworks.
- Proficiency with scripting and automation (Bash, PowerShell, Python).
- Experience with observability tools (Phobos, Datadog, Prometheus, Grafana, OpenTelemetry, ELK).
- Hands-on experience with cloud platforms (AWS, Azure, or GCP).
- Strong PostgreSQL knowledge: performance tuning, query optimization, and maintenance.
- Proven ability to lead incident management and drive postmortem processes.
- A builder's mindset with high standards for operational excellence and technical ownership.
Preferred Tools & Ecosystem Experience
- CI/CD: GitHub Actions, Azure Pipelines, GitLab CI
- Infrastructure: Kubernetes, Docker, Terraform
- Monitoring: Phobos (Akka.NET), Datadog, Prometheus
- Source Control: GitHub, GitLab, Azure DevOps
- Programming: C#, Python, Bash, PowerShell