Senior DevOps Engineer

Lumicity LLC
Millbrae, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Millbrae, United States of America

Tech stack

Amazon Web Services (AWS)
Bash
Cloud Computing
Cloud Engineering
Continuous Integration
Data Systems
Software Debugging
DevOps
Distributed Systems
GitHub
Python
PostgreSQL
Reliability Engineering
Software Vulnerability Management
Datadog
Scripting (Bash/Python/Go/Ruby)
System Availability
Grafana
Backend
Sentry
Machine Learning Operations
Vertica
Terraform
Serverless Computing

Job description

We're hiring a Senior DevOps Engineer or Site Reliability Engineer - depending on where your experience and interests land.

Both roles sit within our engineering team, report into engineering leadership, and work closely with backend and ML engineers. The difference is in focus:

  • DevOps track: Infrastructure as code, CI/CD, deployment systems, developer experience, and platform reliability.
  • SRE track: Observability, incident management, SLO frameworks, and production reliability across distributed systems.

Whichever track you're on, this is a hands-on, high-ownership role. You'll have real production responsibility and real impact on how the platform performs at scale.

What you'll work on

  • Design and evolve AWS-based cloud infrastructure using Terraform
  • Own and improve CI/CD pipelines (GitHub Actions) for fast, safe deployments
  • Standardize deployment patterns across serverless workloads (Lambda), containerized services (ECS), and workflow orchestration systems
  • Define observability standards across metrics, logs, and traces using OpenTelemetry, Datadog, Grafana, and Sentry
  • Build proactive detection for reliability risks, latency regressions, and performance degradation
  • Partner with backend and ML teams to debug distributed system issues, including Postgres performance
  • Lead and support incident response and root cause analysis
  • Automate security and compliance workflows (access controls, audit readiness, vulnerability management)
  • Participate in the on-call rotation
  • Modern cloud-native stack: AWS, Terraform, GitHub Actions, ECS, Lambda, Aurora Postgres, Datadog, OpenTelemetry

Requirements

Must have:

  • 7+ years in DevOps, SRE, or infrastructure engineering in a B2B SaaS environment
  • Strong production AWS experience
  • Deep hands-on Terraform (IaC) experience
  • CI/CD pipeline ownership (GitHub Actions or equivalent)
  • Experience with serverless and containerized services in production
  • Postgres in production (performance, tuning, operations)
  • Observability tooling: metrics, logs, traces - and the ability to turn signals into action
  • Scripting fluency (Python, Bash, or similar)
  • High-ownership mindset - you don't wait to be assigned an incident; you're already thinking about failure modes

Nice to have:

  • Experience in healthcare, fintech, or other regulated environments
  • ClickHouse or high-scale analytics systems
  • OpenTelemetry and modern observability architecture
  • ML infrastructure experience

About the company

We are a Series B healthcare AI company with rapidly growing revenue. More than 100 enterprise healthcare organizations use our platform to automate complex, compliance-critical operational workflows - the kind of work that used to require large manual teams and still carries serious downstream risk if it breaks. We're about 100 people, well-funded, and at an inflection point: our platform is scaling fast, our engineering team is growing, and reliability is becoming mission-critical. This isn't a company that's been around long enough to accumulate decades of technical debt. You'd be building the right foundation from the start.
