Senior DevOps Engineer

Mission Pet Health
Watertown, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Watertown, United States of America

Tech stack

API
Amazon Web Services (AWS)
Application Services
Automation of Tests
Bash
Software as a Service
Cloud Computing
Static Program Analysis
Computer Networks
Linux
DevOps
GitHub
Identity and Access Management
Python
Key Management
MongoDB
OpenID
Redis
Reliability Engineering
Prometheus
Software Deployment
Software Engineering
Datadog
Scripting (Bash/Python/Go/Ruby)
Load Balancing
Autoscaling
Grafana
GitLab CI
Kubernetes
Kafka
Terraform
New Relic (SaaS)
Docker
Static Application Security Testing
Dynamic Application Security Testing

Job description

We're looking for a Senior DevOps Engineer to own our cloud infrastructure end-to-end - from operating a large multi-tenant Kubernetes environment to building CI/CD pipelines that teams actually trust. You'll work across AWS, drive infrastructure-as-code standards, and lead our migration toward GitLab CI and a Grafana-based observability stack while keeping production environments stable.

What You'll Do

  • Operate and scale a multi-tenant AWS EKS cluster where each client runs an isolated set of application services - owning tooling to onboard, scale, and observe hundreds of service instances reliably
  • Build and improve CI/CD pipelines in GitLab CI and GitHub Actions with automated testing, static analysis, and build-gated releases; maintain ArgoCD GitOps workflows for production deployments
  • Lead the migration from Datadog to a self-managed Grafana observability stack (Grafana, Loki, Mimir/Prometheus, Tempo) - dashboards, SLOs, alert routing, and on-call integration
  • Manage secrets, IAM, and security scanning pipelines using AWS KMS, Secrets Manager, External Secrets Operator, and Auth0/Dex OIDC - enforcing least privilege across all environments
  • Own and evolve the Redpanda (Kafka-compatible) streaming layer and its integrations with application workers
  • Drive cloud cost optimization through right-sizing, autoscaling, and shared infrastructure patterns on EKS
  • Document infrastructure with automated tooling (terraform-docs) and maintain standards that scale across teams
  • Automate operational toil - certificate renewal, clinic environment provisioning, deployment validation, runbook automation

Requirements

Screening question (required): Do you have experience with web application firewalls (WAF)?

  • 5+ years in DevOps or infrastructure engineering
  • 3+ years operating Kubernetes in production - AWS EKS preferred - including CSI drivers, cluster autoscaling, network policy (Calico), and pod identity
  • 3+ years hands-on with AWS core services (IAM, S3, KMS, Secrets Manager, STS, EKS, Load Balancer Controller, ECR)
  • Strong Terraform experience; GitOps experience with ArgoCD or Flux
  • Hands-on experience with GitLab CI and/or GitHub Actions
  • Scripting proficiency in Python and Bash
  • Experience with IAM design and security best practices (SAST/DAST, secret scanning, OIDC federation)
  • Familiarity with streaming or message-queue infrastructure (Redpanda, Kafka, or equivalent)

Nice to Have

  • Experience migrating from a SaaS observability tool (Datadog, New Relic) to a self-hosted Grafana stack
  • Grafana stack depth - Loki for logs, Mimir or Thanos for metrics, Tempo for traces, Alertmanager for routing
  • Experience with Redpanda specifically, or deep Kafka operations knowledge
  • Background in multi-tenant SaaS platforms or per-customer service isolation patterns
  • AWS certification
  • Familiarity with chaos engineering tooling (chaos-mesh or LitmusChaos)
  • Background in software engineering or scripting-heavy roles

Tech Stack

Current production: AWS (EKS, S3, KMS, Secrets Manager, STS, Load Balancer) · Terraform · GitHub Actions · ArgoCD · Kubernetes · Traefik · Coraza WAF · Redis HA · MongoDB · Auth0 · Dex · external-secrets · Datadog · Docker · Python · Bash · Linux

Where we're going: GitLab CI · Redpanda · Grafana · Loki · Prometheus/Mimir · Tempo · Alertmanager

Platform components you'll operate: ArgoCD · Traefik · Coraza WAF · Auth0 · Dex · Redis HA · MongoDB · API servers · client-facing portals · internal tooling

Benefits & conditions

  • Own infrastructure across a real multi-tenant platform serving production clinic environments
  • Lead the observability and streaming migrations - greenfield decisions with lasting impact
  • Collaborative engineering culture with high trust and low bureaucracy
  • Competitive salary, benefits, and flexible work arrangements
