Application Reliability Engineer
Job description
Our firm is partnering with a leading enterprise organization operating at significant scale across retail, supply chain, and data-driven planning platforms. We are seeking an Application Reliability Engineer to support and enhance demand forecasting and assortment planning applications that are critical to business operations.
This role uniquely blends production support, incident response, and hands-on software development. You will work within a globally distributed engineering model, collaborating closely with engineering and data science teams to ensure system reliability while also delivering targeted enhancements, fixes, and pipeline improvements.
The ideal candidate brings a DevOps-oriented mindset, strong debugging and development skills, and experience operating within complex, data-intensive systems.
As an Application Reliability Engineer, you will be responsible for the operational stability, performance, and continuous improvement of user-facing planning and forecasting applications. You will support real-time systems, diagnose and resolve production issues, and contribute directly to codebases and data pipelines that power forecasting and planning workflows.
This position bridges production operations and engineering, balancing incident response and user support with active development work to improve reliability, scalability, and functionality.
- Monitor application health, availability, and performance across distributed systems
- Diagnose and resolve production issues related to latency, throughput, memory, and data accuracy
- Perform root-cause analysis and implement durable fixes, including code and configuration changes
- Design, develop, and deploy bug fixes, enhancements, and small features across applications and pipelines
- Support end users through ticket triage, escalation, and resolution
- Partner closely with engineering and data science teams to design, implement, and validate solutions
- Maintain and enhance batch and/or streaming data pipelines supporting forecasting systems
- Improve observability through logging, metrics, dashboards, and alerting
- Participate in incident management processes and operate within defined SLOs/SLAs
- Contribute to documentation, runbooks, and operational best practices
- Participate in code reviews and follow testing, deployment, and maintainability standards
- Assist with CI/CD pipeline improvements and deployment processes
Requirements
Backend coding skills (Python, Java, Kotlin, or Scala) and hands-on experience monitoring and troubleshooting distributed cloud applications, diagnosing production issues, and supporting data pipelines (e.g., Spark/ETL).
Application Reliability & Performance
- Experience monitoring, troubleshooting, and tuning distributed systems
- Hands-on experience with observability tools such as Grafana, Prometheus, Splunk, Datadog, or similar
- Ability to diagnose and resolve production issues involving latency, memory usage, or throughput
- Strong root-cause analysis skills with experience implementing long-term fixes
Software Engineering & Development
- Proficiency in at least one backend language: Python, Java, Kotlin, or Scala
- Ability to read, debug, modify, and write production-grade code
- Experience working with APIs, microservices, and event-driven architectures
- Version control experience using Git and participation in structured code reviews
Data & Pipeline Operations
- Working knowledge of batch and/or streaming data pipelines
- Experience supporting or enhancing frameworks such as Apache Spark or similar tools
- Understanding of data validation, logging, and error-handling practices
- Ability to monitor, troubleshoot, and improve ETL/ELT workflows
Cloud & Infrastructure
- Hands-on experience with AWS, GCP, or Azure
- Familiarity with containerization technologies such as Docker and Kubernetes
- Experience with CI/CD tools such as Jenkins or GitLab CI
Incident Response & Support
- Production support or SRE experience, including triage, escalation, and resolution
- Ability to work within defined SLOs/SLAs and contribute to reducing incident recurrence
- Experience supporting end users and handling break-fix scenarios
Preferred Qualifications
- Experience with infrastructure-as-code tools such as Terraform or Cloud Deployment Manager
- Exposure to ML- or analytics-driven systems, including forecasting platforms
- Advanced performance tuning and scalability optimization experience
- Familiarity with retail, merchandising, or supply chain systems
- Experience supporting globally distributed teams across multiple time zones
- Knowledge of automated alerting, runbooks, and operational playbooks
Benefits & conditions
Dahl Consulting is proud to offer a comprehensive benefits package to eligible employees that will allow you to choose the best coverage to meet your family's needs. For details, please review the DAHL Benefits Summary: https://www.dahlconsulting.com/benefits-w2fta/.