Site Reliability Engineer

Apetan Consulting

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Tech stack

Amazon Web Services (AWS)

Azure

Bash

Computer Security

Computer Programming

Computer Networks

DDoS Mitigation

DevOps

Distributed Systems

DNS

Monitoring of Systems

Hypertext Transfer Protocol (HTTP)

Internet Security

Python

Open Source Technology

Performance Tuning

Reliability Engineering

Prometheus

Zero Trust Network Access

TCP/IP

Software Vulnerability Management

Data Logging

Scripting (Bash/Python/Go/Ruby)

Google Cloud Platform

Istio

Grafana

CloudFormation

Containerization

Kubernetes

Infrastructure Automation Frameworks

Low Latency

Linkerd (Service Mesh)

Terraform

DDoS

Docker

ELK

Job description

We are seeking a Senior Site Reliability Engineer to ensure the reliability, scalability, and security of our internet-facing security platform. You will work on high-availability systems that protect and process large-scale network traffic, driving automation, observability, and incident response excellence.

Design, build, and operate highly available, scalable, and secure infrastructure
Maintain uptime and performance of internet security platforms (e.g., WAF, DDoS protection, gateways)
Implement and improve observability (monitoring, logging, tracing, alerting)
Automate infrastructure provisioning and operational workflows
Lead incident response, root cause analysis, and postmortems
Collaborate with security, platform, and development teams to harden systems
Optimize system performance, latency, and cost efficiency
Define and enforce SLOs, SLIs, and error budgets

Requirements

Strong experience in Site Reliability Engineering, DevOps, or production engineering
Proficiency in Linux/Unix systems and networking fundamentals (TCP/IP, DNS, HTTP/S)
Experience with cloud platforms (AWS, Google Cloud Platform, or Azure)
Hands-on experience with containerization and orchestration (Docker, Kubernetes)
Strong scripting/programming skills (Python, Go, or Bash)
Experience with infrastructure as code (Terraform, CloudFormation)
Knowledge of monitoring tools (Prometheus, Grafana, ELK stack, etc.)

Understanding of internet security concepts (TLS, firewalls, WAF, Zero Trust)
Experience mitigating DDoS attacks and handling large-scale traffic patterns
Familiarity with CDN, edge networks, and secure proxy architectures
Knowledge of vulnerability management and system hardening

Experience operating high-scale distributed systems
Familiarity with incident management tools and on-call practices
Exposure to compliance standards (SOC 2, ISO 27001, etc.)
Experience with service mesh (e.g., Istio, Linkerd)

Drive reliability best practices across teams
Mentor junior engineers and improve operational maturity
Lead critical incident handling and continuous improvement initiatives

Strong problem-solving and analytical thinking
Clear communication during high-pressure incidents
Ownership mindset with a focus on reliability and security

Experience working in cybersecurity or internet-scale platforms
Contributions to open-source SRE or security tooling