Site Reliability Engineer

Interon IT Solutions LLC

Malvern, United States of America

2 months ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Malvern, United States of America

Tech stack

Java

JavaScript

API

Amazon Web Services (AWS)

Cloud Computing

Software Debugging

Distributed Systems

Fault Tolerance

Monitoring of Systems

Intrusion Detection and Prevention

Python

Software Architecture

Reliability Engineering

Prometheus

Software Engineering

Systems Architecture

Scripting (Bash/Python/Go/Ruby)

Grafana

Event Driven Architecture

Deployment Automation

Dynatrace

Job description

As a Senior Reliability Engineer, you will play a critical role in solving impactful operational problems. You are curious and take a proactive approach to identifying problems and making improvements. You balance innovative thinking with pragmatism and understand the long-term impacts of technical decisions. You communicate complex ideas clearly and collaborate effectively to deliver scalable solutions.

Core Responsibilities

The team is focused on automating incident response and infrastructure management. While Java and Python receive a stronger emphasis, candidates with solid programming fundamentals in any language and the ability to adapt will be considered. Experience with AWS and event-driven architectures is also valuable.

From a technical standpoint, familiarity with observability concepts (e.g., distributed tracing) and tools like Prometheus or Grafana is beneficial, though not mandatory. More important is an understanding of the underlying principles, such as instrumentation and monitoring strategies.

Improve resiliency engineering practices across platforms and applications, including resilient application design patterns, system observability, and deployment strategies.
Detect, troubleshoot, and resolve incidents.
Develop automation for incident response and infrastructure management.
Develop and support OpenTelemetry integrations for multiple application platforms (browser, ECS, Lambda, etc.) and languages (JavaScript, Java).
Contribute to architectural decisions and support implementation of solutions.

Requirements

Deep knowledge of Java or JavaScript. Practical experience developing and operating software in distributed systems environments.
Problem-solving and analytical thinking: ability to diagnose complex issues and propose efficient solutions. Strong debugging and optimization skills for performance and scalability.
Cloud platforms: hands-on experience with AWS services and cloud infrastructure.
System architecture and design: ability to design scalable, secure, and maintainable systems.
Working knowledge of Python (or similar scripting language).
Strong knowledge of resiliency engineering techniques for both platforms and applications.
Experience troubleshooting complex production issues and implementing effective mitigations.
Familiarity with OpenTelemetry specification and core APIs.

From a screening perspective, we recommend focusing on:

How candidates approach software releases and validate functionality
Their understanding of system dependencies and fault tolerance
Experience with diagnosing and resolving production issues
Their ability to reflect on past incidents and identify improvements
Evidence of systems thinking and architectural awareness
