Site Reliability Engineer

Jobgether

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Tech stack

Amazon Web Services (AWS)

Cloud Computing

Data Infrastructure

DevOps

Fault Tolerance

PostgreSQL

MySQL

Performance Tuning

Redis

Reliability Engineering

Data Logging

System Availability

Delivery Pipeline

Grafana

Reliability of Systems

Kubernetes

Vertica

Job description

This role offers the opportunity to play a critical part in scaling and maintaining a high-growth platform used by a global audience. You will be responsible for ensuring system reliability, performance, and security as infrastructure demands continue to expand. Working in a fully remote and highly collaborative environment, you will partner closely with engineering teams to build resilient, scalable systems. This is a hands-on position suited for someone who thrives in fast-paced environments and enjoys solving complex operational challenges. You'll have a direct impact on uptime, system health, and long-term infrastructure strategy while contributing to automation and continuous improvement initiatives.

Accountabilities

Act as a primary responder for incidents and outages, ensuring high availability and rapid resolution of production issues.
Own and continuously improve monitoring, alerting, and logging systems to enhance observability and system health.
Manage and optimize database infrastructure, including MySQL, PostgreSQL, ClickHouse, and Redis.
Maintain and enhance server infrastructure and deployment pipelines for improved efficiency and reliability.
Collaborate with engineering teams to design and implement scalable, fault-tolerant systems.
Contribute to the development of internal SRE tools and automation to streamline operations.

Requirements

3+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
Strong expertise in AWS and Kubernetes, with hands-on experience managing cloud-native systems.
Proven experience handling incident response and maintaining production-grade systems.
Solid background in database operations, performance tuning, and optimization.
Familiarity with observability tools, monitoring frameworks, and logging best practices.
Strong communication skills and ability to work effectively in a remote, asynchronous environment.
Fluency in English (written and spoken).
Bonus: Experience with SOC 2 compliance, scaling high-growth platforms, or working with ClickHouse or similar technologies.