Lead Site Reliability Engineer

ChargeItSpot, LLC

Philadelphia, United States of America

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Philadelphia, United States of America

Tech stack

Bash

Cloud Computing

DevOps

Disaster Recovery

Python

Reliability Engineering

Prometheus

Datadog

Data Logging

Cloud Platform System

Grafana

Kubernetes

Infrastructure Automation Frameworks

Terraform

New Relic (SaaS)

Job description

The Lead Site Reliability Engineer is a senior technical leadership role within the Engineering organization, responsible for the reliability, availability, and operational excellence of ARC's cloud infrastructure and kiosk platform. This role owns uptime, SLAs, and incident response while driving long-term improvements to system resilience, observability, and operational maturity. The Lead SRE serves as both a hands-on technical leader and a force multiplier across platform, QA, and development teams.

Our core values:

We do what we say we will do.
Details matter. A lot.
Bias for action.
Customer obsession.
Diversity and authenticity.
No ego. Only outcome.
Think big.

Why Join Us

We're building something ambitious, and doing it with integrity, collaboration, and purpose. If our mission and values resonate with you, we'd love to hear how you'd like to contribute and be part of the journey.

If you have the unique combination of skills and qualities we are looking for, please submit your resume and a cover letter explaining your motivation for applying to this position at https://experiencearc.com/careers/.

ARC by ChargeItSpot is an Equal Opportunity Employer. Personnel are chosen on the basis of ability without regard to race, color, religion, sex, national origin, disability, marital status, or sexual orientation, in accordance with federal and state law.

Responsibilities

Own uptime, SLAs, and overall reliability of the cloud infrastructure and kiosk platform.
Lead incident response, root-cause analysis, and drive actionable postmortems.
Automate infrastructure, deployments, and operational tasks using modern IaC and scripting in collaboration with the Platform Engineering team.
Maintain and improve monitoring, alerting, and observability (Grafana, Prometheus, New Relic, etc.).
Manage, operate and recommend improvement of mo
Execute and continuously improve disaster recovery and business continuity plans.
Partner with platform engineering, QA, and development teams to ensure operational readiness.
Establish and maintain runbooks, operational standards, and reliability best practices.
Provide leadership, mentorship, and clear communication during both normal operations and incidents.
Optimize cloud and Kubernetes environments for reliability, performance, and scalability.

Requirements

This role is well suited to an experienced engineer who thrives in high-ownership environments and can balance real-time operational demands with strategic reliability initiatives. The Lead SRE will establish and evolve operational standards, disaster recovery practices, and automation frameworks, while leading incident response and postmortems with clarity and accountability. Strong communication, sound technical judgment, and a bias toward preventative engineering are critical to success in this role.

You think several steps ahead. You are relentless, strategic, and a long-term thinker. You believe the details are important, so you get them right. You are a fast learner. You take feedback well and implement it. You care about getting to the best outcome, not about being right or wrong.

8+ years in SRE, DevOps, or Platform Engineering roles; 2+ years in a senior or lead capacity.
Strong experience supporting production environments with strict SLAs and high uptime requirements.
Deep knowledge of Kubernetes, containers, and cloud-native infrastructure.
Proficiency in automation and scripting using Bash, Python, or Go.
Hands-on experience with CI/CD pipelines and release engineering in modern environments.
Expert-level familiarity with IaC tools (Terraform preferred).
Strong understanding of monitoring, alerting, logging, and observability tooling.
Experience implementing and managing GitOps workflows (ArgoCD or similar).
Demonstrated ability to lead incidents and communicate effectively with technical and non-technical stakeholders.
Solid understanding of disaster recovery planning, resilience practices, and system hardening.

About the company

Hybrid preferred (1-2 days onsite), or remote possible for the right candidate. Location to be discussed. We are headquartered in Philadelphia, PA, and the team operates on East Coast business hours.

Launched in late 2021 to serve frontline workers, ARC grew out of the consumer-facing technology that phone-charging provider ChargeItSpot brought to market in 2012. ARC is a device management solution integrated with smart lockers, designed to store, secure, and charge the company-owned handheld devices (e.g., from Zebra or Honeywell) that frontline workers use to perform their core job functions (e.g., package scanning, inventory lookup, task management, mobile point of sale).

Clients turn to ARC because it is extremely difficult to manage and maintain their investment in enterprise mobile devices after purchase. There's a ton of waste from legacy, manual processes. Devices frequently go missing (25% annually), stop working, or run out of power, costing payroll time, money, and productivity. ARC virtually eliminates these issues, ensuring that devices are functional, charged, and accounted for, all while improving productivity and the experience of ground teams.

Market demand for ARC has been overwhelming, and the company has been growing rapidly. Device management has been a huge unmet need for decades, with a problem space that is deceptively nuanced, complex, and costly. ARC is uniquely positioned to solve these problems given its decade of relevant technical expertise with ChargeItSpot phone-charging lockers, our legacy mobile device product. ARC builds on ChargeItSpot's competencies and has carried over its IP (protected by 8 patents and counting), deep technical know-how, and real-world experience gained while solving similar problems in a live field environment.

ARC's Mission

Minimize Device Waste. Maximize Worker Productivity. Make Life Easier.

ARC's Vision

Be the unrivaled leader in physical device management. With more than 25,000 ARC units deployed by 2030, ARC will simplify life for more than 1,000,000 workers every day.

Want to learn more? See the work we're doing with Sam's Club and Walmart Canada.

About the Team

At ARC, we surround ourselves with independent thinkers who are detail-oriented and customer-obsessed. Our clients have routinely called us "the most talented team they've ever worked with." We value determination, resourcefulness, imagination, and follow-through. We want people who are ready to get things done. Our focus is fierce, but it's not all hard work. We take time to get to know each other through a daily game of Jeopardy, meals together, and nights out for karaoke. We operate a hybrid work model, with most team members working in the office a couple of days a week and remotely the rest of the time.
