Application Reliability Engineer

Dahl Consulting
Brooklyn Park, United States of America
yesterday

Role details

Contract type
Temporary contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Compensation
$187K

Job location

Brooklyn Park, United States of America

Tech stack

Java
Amazon Web Services (AWS)
Azure
Software Bug Management
Cloud Computing
Code Review
Continuous Integration
Data Validation
ETL
Software Debugging
DevOps
Distributed Systems
Memory Management
Python
Performance Tuning
Prometheus
Software Engineering
Data Streaming
Datadog
Data Logging
Warehouse Management Systems
Real Time Systems
Grafana
Spark
Reliability of Systems
Backend
Git
Kotlin
Event Driven Architecture
Containerization
GitLab CI
Infrastructure Automation Frameworks
Low Latency
Production Code
Data Analytics
API Design
Software Coding
Terraform
Splunk
Software Version Control
Data Pipelines
Docker
Jenkins
Microservices

Job description

Our firm is partnering with a leading enterprise organization operating at significant scale across retail, supply chain, and data-driven planning platforms. We are seeking an Application Reliability Engineer to support and enhance demand forecasting and assortment planning applications that are critical to business operations.

This role uniquely blends production support, incident response, and hands-on software development. You will work within a globally distributed engineering model, collaborating closely with engineering and data science teams to ensure system reliability while also delivering targeted enhancements, fixes, and pipeline improvements.

The ideal candidate brings a DevOps-oriented mindset, strong debugging and development skills, and experience operating within complex, data-intensive systems.

As an Application Reliability Engineer, you will be responsible for the operational stability, performance, and continuous improvement of user-facing planning and forecasting applications. You will support real-time systems, diagnose and resolve production issues, and contribute directly to codebases and data pipelines that power forecasting and planning workflows.

This position bridges production operations and engineering, balancing incident response and user support with active development work to improve reliability, scalability, and functionality.

  • Monitor application health, availability, and performance across distributed systems
  • Diagnose and resolve production issues related to latency, throughput, memory, and data accuracy
  • Perform root-cause analysis and implement durable fixes, including code and configuration changes
  • Design, develop, and deploy bug fixes, enhancements, and small features across applications and pipelines
  • Support end users through ticket triage, escalation, and resolution
  • Partner closely with engineering and data science teams to design, implement, and validate solutions
  • Maintain and enhance batch and/or streaming data pipelines supporting forecasting systems
  • Improve observability through logging, metrics, dashboards, and alerting
  • Participate in incident management processes and operate within defined SLOs/SLAs
  • Contribute to documentation, runbooks, and operational best practices
  • Participate in code reviews and follow testing, deployment, and maintainability standards
  • Assist with CI/CD pipeline improvements and deployment processes

Requirements

Backend coding skills (Python, Java, Kotlin, or Scala) and hands-on experience monitoring and troubleshooting distributed cloud applications, diagnosing production issues, and supporting data pipelines (e.g., Spark/ETL).

Application Reliability & Performance

  • Experience monitoring, troubleshooting, and tuning distributed systems
  • Hands-on experience with observability tools such as Grafana, Prometheus, Splunk, Datadog, or similar
  • Ability to diagnose and resolve production issues involving latency, memory usage, or throughput
  • Strong root-cause analysis skills with experience implementing long-term fixes

Software Engineering & Development

  • Proficiency in at least one backend language: Python, Java, Kotlin, or Scala
  • Ability to read, debug, modify, and write production-grade code
  • Experience working with APIs, microservices, and event-driven architectures
  • Version control experience using Git and participation in structured code reviews

Data & Pipeline Operations

  • Working knowledge of batch and/or streaming data pipelines
  • Experience supporting or enhancing frameworks such as Apache Spark or similar tools
  • Understanding of data validation, logging, and error-handling practices
  • Ability to monitor, troubleshoot, and improve ETL/ELT workflows

Cloud & Infrastructure

  • Hands-on experience with AWS, GCP, or Azure
  • Familiarity with containerization and orchestration technologies such as Docker and Kubernetes
  • Experience with CI/CD tools such as Jenkins or GitLab CI

Incident Response & Support

  • Production support or SRE experience, including triage, escalation, and resolution
  • Ability to work within defined SLOs/SLAs and contribute to reducing incident recurrence
  • Experience supporting end users and handling break-fix scenarios

Preferred Qualifications

  • Experience with infrastructure-as-code tools such as Terraform or Cloud Deployment Manager
  • Exposure to ML- or analytics-driven systems, including forecasting platforms
  • Advanced performance tuning and scalability optimization experience
  • Familiarity with retail, merchandising, or supply chain systems
  • Experience supporting globally distributed teams across multiple time zones
  • Knowledge of automated alerting, runbooks, and operational playbooks

Benefits & conditions

Dahl Consulting is proud to offer a comprehensive benefits package to eligible employees that will allow you to choose the best coverage to meet your family's needs. For details, please review the DAHL Benefits Summary: https://www.dahlconsulting.com/benefits-w2fta/.
