Site Reliability Engineer - MX based

VANHACK TECHNOLOGIES INC.
New York, United States of America
14 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

New York, United States of America

Tech stack

Amazon Web Services (AWS)
Application Performance Management
Systems Engineering
Cloud Computing
Cloud Computing Security
Cloud Engineering
Computer Networks
Databases
Continuous Delivery
Continuous Integration
DevOps
Distributed Systems
Web Servers
Performance Tuning
Reliability Engineering
Software Engineering
Virtualization Technology
Enterprise Software Applications
System Availability
Software Troubleshooting
Caching
Reliability of Systems
Kubernetes
Infrastructure Automation Frameworks
Information Technology
Deployment Automation
Terraform
DevSecOps
Microservices

Job description

We are looking for a Senior Site Reliability Engineer (SRE) to help build, maintain, and scale highly reliable cloud infrastructure and enterprise applications. This role is focused on ensuring platform stability, performance, scalability, automation, and operational excellence across AWS environments.

The ideal candidate combines strong software engineering fundamentals with deep operational and infrastructure expertise, and thrives in high-scale production environments.

Responsibilities

  • Design, implement, and maintain highly available and scalable infrastructure on AWS
  • Improve platform reliability, observability, and operational efficiency
  • Automate infrastructure provisioning and management using Terraform
  • Manage and support containerized environments using EKS or ECS
  • Build and enhance CI/CD pipelines and deployment automation processes
  • Monitor production systems and proactively identify reliability and performance issues
  • Lead incident response, troubleshooting, root cause analysis, and postmortem processes
  • Design and manage escalation response plans across monitoring, response, remediation, and retrospective activities
  • Collaborate with software engineering teams to improve system resilience and scalability
  • Optimize application performance for high-concurrency workloads, including caching strategies
  • Drive reliability engineering best practices, automation, and continuous improvement initiatives
  • Participate in architecture reviews and operational readiness processes

Requirements

  • Strong experience as an SRE, Cloud Engineer, DevOps Engineer, or Software Engineer supporting production infrastructure

  • Hands-on experience with AWS in large-scale production environments

  • Experience with infrastructure-as-code technologies, preferably Terraform

  • Experience with containerization and orchestration platforms, preferably EKS or ECS

  • Strong troubleshooting experience across web server platforms, application platforms, operating systems, networking components, virtualization technologies, storage systems, and database platforms

  • Experience working with CI/CD and continuous deployment environments

  • Experience supporting high-concurrency systems and caching strategies

  • Strong incident management, root cause analysis, and systems engineering skills

  • Ability to design and manage operational escalation processes in proactive and collaborative environments

  • Demonstrated experience managing highly scaled cloud infrastructure

  • Strong communication and problem-solving skills

  • Bachelor's degree in Computer Science or a related technical field, or equivalent practical experience

Nice to Have

  • Experience with observability and monitoring platforms
  • Kubernetes ecosystem knowledge
  • Experience with distributed systems and microservices architectures
  • Familiarity with SLOs, SLIs, and error budgets
  • Experience with performance tuning and capacity planning
  • Exposure to DevSecOps and cloud security best practices
