Site Reliability Engineer - MX-based

VANHACK TECHNOLOGIES INC.

New York, United States of America

14 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

New York, United States of America

Tech stack

Amazon Web Services (AWS)

Application Performance Management

Systems Engineering

Cloud Computing

Cloud Computing Security

Cloud Engineering

Computer Networks

Databases

Continuous Delivery

Continuous Integration

DevOps

Distributed Systems

Web Servers

Performance Tuning

Reliability Engineering

Software Engineering

Virtualization Technology

Enterprise Software Applications

System Availability

Software Troubleshooting

Caching

Reliability of Systems

Kubernetes

Infrastructure Automation Frameworks

Information Technology

Deployment Automation

Terraform

DevSecOps

Microservices

Job description

We are looking for a Senior Site Reliability Engineer (SRE) to help build, maintain, and scale highly reliable cloud infrastructure and enterprise applications. This role is focused on ensuring platform stability, performance, scalability, automation, and operational excellence across AWS environments.

The ideal candidate combines strong software engineering fundamentals with deep operational and infrastructure expertise, and thrives in high-scale production environments.

Responsibilities

Design, implement, and maintain highly available and scalable infrastructure on AWS
Improve platform reliability, observability, and operational efficiency
Automate infrastructure provisioning and management using Terraform
Manage and support containerized environments using EKS or ECS
Build and enhance CI/CD pipelines and deployment automation processes
Monitor production systems and proactively identify reliability and performance issues
Lead incident response, troubleshooting, root cause analysis, and postmortem processes
Design and manage escalation response plans across monitoring, response, remediation, and retrospective activities
Collaborate with software engineering teams to improve system resilience and scalability
Optimize application performance for high-concurrency workloads, including caching strategies
Drive reliability engineering best practices, automation, and continuous improvement initiatives
Participate in architecture reviews and operational readiness processes

Requirements

Strong experience as an SRE, Cloud Engineer, DevOps Engineer, or Software Engineer supporting production infrastructure
Hands-on experience with AWS in large-scale production environments
Experience with infrastructure-as-code technologies, preferably Terraform
Experience with containerization and orchestration platforms, preferably EKS or ECS
Strong troubleshooting experience across web server platforms, application platforms, operating systems, networking components, virtualization technologies, storage systems, and database platforms
Experience working with CI/CD and continuous deployment environments
Experience supporting high-concurrency systems and caching strategies
Strong incident management, root cause analysis, and systems engineering skills
Ability to design and manage operational escalation processes in proactive and collaborative environments
Demonstrated experience managing highly scaled cloud infrastructure
Strong communication and problem-solving skills
Bachelor's degree in Computer Science or a related technical field, or equivalent practical experience

Nice to Have

Experience with observability and monitoring platforms
Kubernetes ecosystem knowledge
Experience with distributed systems and microservices architectures
Familiarity with SLOs, SLIs, and error budgets
Experience with performance tuning and capacity planning
Exposure to DevSecOps and cloud security best practices
