Application Reliability Engineer

Dahl Consulting

Brooklyn Park, United States of America

yesterday

Role details

Contract type

Temporary contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Compensation

$187K

Job location

Brooklyn Park, United States of America

Tech stack

Java

Amazon Web Services (AWS)

Azure

Software Bug Management

Cloud Computing

Code Review

Continuous Integration

Data Validation

ETL

Software Debugging

DevOps

Distributed Systems

Memory Management

Python

Performance Tuning

Prometheus

Software Engineering

Data Streaming

Datadog

Data Logging

Warehouse Management Systems

Real Time Systems

Grafana

Spark

Reliability of Systems

Backend

Git

Kotlin

Event Driven Architecture

Containerization

GitLab CI

Infrastructure Automation Frameworks

Low Latency

Production Code

Data Analytics

API Design

Software Coding

Terraform

Splunk

Software Version Control

Data Pipelines

Docker

Jenkins

Microservices

Job description

Our firm is partnering with a leading enterprise organization operating at significant scale across retail, supply chain, and data-driven planning platforms. We are seeking an Application Reliability Engineer to support and enhance demand forecasting and assortment planning applications that are critical to business operations.

This role uniquely blends production support, incident response, and hands-on software development. You will work within a globally distributed engineering model, collaborating closely with engineering and data science teams to ensure system reliability while also delivering targeted enhancements, fixes, and pipeline improvements.

The ideal candidate brings a DevOps-oriented mindset, strong debugging and development skills, and experience operating within complex, data-intensive systems.

As an Application Reliability Engineer, you will be responsible for the operational stability, performance, and continuous improvement of user-facing planning and forecasting applications. You will support real-time systems, diagnose and resolve production issues, and contribute directly to codebases and data pipelines that power forecasting and planning workflows.

This position bridges production operations and engineering, balancing incident response and user support with active development work to improve reliability, scalability, and functionality.

Monitor application health, availability, and performance across distributed systems
Diagnose and resolve production issues related to latency, throughput, memory, and data accuracy
Perform root-cause analysis and implement durable fixes, including code and configuration changes
Design, develop, and deploy bug fixes, enhancements, and small features across applications and pipelines
Support end users through ticket triage, escalation, and resolution
Partner closely with engineering and data science teams to design, implement, and validate solutions
Maintain and enhance batch and/or streaming data pipelines supporting forecasting systems
Improve observability through logging, metrics, dashboards, and alerting
Participate in incident management processes and operate within defined SLOs/SLAs
Contribute to documentation, runbooks, and operational best practices
Participate in code reviews and follow testing, deployment, and maintainability standards
Assist with CI/CD pipeline improvements and deployment processes

Requirements

Backend coding skills (Python, Java, Kotlin, or Scala); hands-on experience monitoring and troubleshooting distributed cloud applications; ability to diagnose production issues and support data pipelines (e.g., Spark/ETL).

Application Reliability & Performance

Experience monitoring, troubleshooting, and tuning distributed systems
Hands-on experience with observability tools such as Grafana, Prometheus, Splunk, Datadog, or similar
Ability to diagnose and resolve production issues involving latency, memory usage, or throughput
Strong root-cause analysis skills with experience implementing long-term fixes

Software Engineering & Development

Proficiency in at least one backend language: Python, Java, Kotlin, or Scala
Ability to read, debug, modify, and write production-grade code
Experience working with APIs, microservices, and event-driven architectures
Version control experience using Git and participation in structured code reviews

Data & Pipeline Operations

Working knowledge of batch and/or streaming data pipelines
Experience supporting or enhancing frameworks such as Apache Spark or similar tools
Understanding of data validation, logging, and error-handling practices
Ability to monitor, troubleshoot, and improve ETL/ELT workflows

Cloud & Infrastructure

Hands-on experience with AWS, GCP, or Azure
Familiarity with containerization technologies such as Docker and Kubernetes
Experience with CI/CD tools such as Jenkins or GitLab CI

Incident Response & Support

Production support or SRE experience, including triage, escalation, and resolution
Ability to work within defined SLOs/SLAs and contribute to reducing incident recurrence
Experience supporting end users and handling break-fix scenarios

Preferred Qualifications

Experience with infrastructure-as-code tools such as Terraform or Cloud Deployment Manager
Exposure to ML- or analytics-driven systems, including forecasting platforms
Advanced performance tuning and scalability optimization experience
Familiarity with retail, merchandising, or supply chain systems
Experience supporting globally distributed teams across multiple time zones
Knowledge of automated alerting, runbooks, and operational playbooks

Benefits & conditions

Dahl Consulting is proud to offer a comprehensive benefits package to eligible employees that will allow you to choose the best coverage to meet your family's needs. For details, please review the DAHL Benefits Summary: https://www.dahlconsulting.com/benefits-w2fta/.