DevOps & Infrastructure Lead
Job description
RV LIFE is looking for a Senior DevOps & Infrastructure Lead to help us stabilize, document, and modernize the infrastructure behind our products.
This is a hands-on senior role for someone comfortable inheriting real production systems, reducing operational risk, improving reliability, and moving us toward a documented, secure, automated, infrastructure-as-code operating model.
We run production across DigitalOcean, AWS, Cloudflare, and other hosting providers, and are consolidating onto managed, infrastructure-as-code platforms. We need deep, hands-on expertise across these environments.
RV LIFE is an AI-first engineering organization. We expect this person to use AI to accelerate discovery, documentation, runbooks, log review, scripting, and infrastructure-as-code drafting, while applying strict human judgment around security, secrets, production access, destructive commands, rollback, and correctness.
This role focuses on the infrastructure path to reliability; application-level architecture changes are handled in partnership with our engineering team. It is not just about keeping servers alive. It is about building durable practices that reduce single-person dependency, improve visibility, and make our systems safer to operate.
This is not a standard 9-to-5 role. Production issues do not keep business hours, so it carries real on-call responsibility: you need to be reachable and able to respond when unforeseen incidents arise.
What You'll Do
- Administer and improve existing DigitalOcean infrastructure.
- Support and improve Linux-based production server environments.
- Migrate self-managed databases onto managed database services, with validated failover, backups, and recovery.
- Move applications onto managed runtimes (including Laravel Cloud where it fits), replacing manual deploy processes with automated, repeatable pipelines.
- Expand and harden our use of Cloudflare for edge, static hosting, caching, and security.
- Build a clear inventory of servers, services, databases, domains, access paths, backups, monitoring, and operational risks.
- Create and maintain practical runbooks for common and emergency infrastructure workflows.
- Improve incident response, escalation paths, monitoring, logging, and alerting.
- Review and improve backup, restore, and disaster-recovery procedures.
- Identify recurring manual work and convert it into safer procedures, scripts, automation, or infrastructure-as-code.
- Help define infrastructure-as-code standards and move appropriate infrastructure into repeatable, version-controlled workflows.
- Work with AWS services where needed (Lambda, VPC, IAM, CloudWatch, S3, SSM/Secrets Manager, queues).
- Use AI tools to accelerate discovery, documentation, scripting, troubleshooting, and automation, with strong production-safety judgment.
- Partner with engineering leadership to prioritize infrastructure risk and modernization; track work clearly in Jira/GitHub and communicate proactively about risks, tradeoffs, and blockers.
What Success Looks Like
In the first 30-60 days, you'll take ownership of how we see and operate our infrastructure, building on what we already track and closing the gaps.
You'll validate and take ownership of what already exists:
- Our infrastructure inventory and server map
- Our monitoring and alerting
- Our DNS / Cloudflare configuration
- Our prioritized infrastructure risk register
You'll create what we're missing:
- An access and credential map
- Verified backup and restore status for critical systems (tested, not assumed)
- Runbooks for the highest-risk operational workflows
In the first 90 days, you'll move us toward a durable, consolidated model. Success means:
- The first core database migrated to a managed service, with a tested restore, plus a clear, sequenced plan for the rest.
- The first application running on a managed runtime (DigitalOcean App Platform or Laravel Cloud).
- The first static frontend served from Cloudflare Pages.
- A measurably stronger edge security posture.
- Critical systems are no longer understood by only one person; common tasks have documented procedures; manual processes are being converted to automation; and AI is used safely to reduce toil.
Who You Are
- Takes ownership without waiting to be told every next step.
- Is calm and practical during incidents.
- Can inherit messy systems without being judgmental or reckless.
- Prefers consolidating on platforms we already run over adding new vendors.
- Documents as they go.
- Uses AI as leverage, but does not blindly trust its output; verifies, tests, and applies senior judgment before anything touches production.
- Knows when to automate and when to stabilize first.
- Communicates clearly with technical and non-technical stakeholders.
- Understands that reliability is not just uptime: it is visibility, repeatability, recovery, and shared understanding.
- Wants to leave infrastructure better than they found it.
Requirements
- Senior-level experience operating production infrastructure.
- Deep, hands-on Linux server administration (the traditional, 'old-school' kind): operating, securing, and troubleshooting manually managed production servers (LAMP/LEMP, system services, cron, networking, SSH) directly at the command line, not only through a cloud console.
- Experience with DigitalOcean, Linode, AWS EC2, bare VPS hosting, or comparable environments.
- Senior database operations: migrating self-managed MySQL to a managed service, replication, backup validation, restore testing, and I/O isolation.
- Strong Cloudflare experience across DNS, WAF, CDN and caching behavior, page rules, Workers, Pages, and Zero Trust/Access, including traffic routing and origin protection.
- PHP/Laravel application environments, and experience with a managed Laravel runtime (Laravel Cloud and/or DigitalOcean App Platform).
- Datadog or a comparable observability platform for monitoring, alerting, dashboards, logs, and incident investigation.
- Infrastructure-as-code such as Terraform, Pulumi, AWS CDK, Serverless Framework, or CloudFormation.
- CI/CD pipelines and deployment automation.
- Practical AWS experience (Lambda, IAM, VPC, CloudWatch, S3, SSM/Secrets Manager, queues).
- Good judgment around production safety, access control, secrets, backups, and incident response.
- Willingness to carry real on-call responsibility and respond to production incidents outside normal business hours; this is not a strict 9-to-5 role.
- A habit of documenting what you learn and creating runbooks others can follow.
- Practical experience using AI tools (ChatGPT, Claude, Cursor, GitHub Copilot, or similar), with strong judgment about where human verification is required.
- Ability to work independently in a small, remote engineering organization where practical ownership matters more than bureaucracy.
Nice to Have
- Experience migrating manually managed services onto managed platforms or IaC.
- Experience moving static frontends onto Cloudflare Pages.
- Experience migrating MongoDB, OpenSearch, or Valkey/Redis onto managed services.
- Experience supporting Node.js, React, and React Native alongside PHP.
- Experience helping organizations reduce infrastructure bus-factor risk.
- Experience working with external DevOps/security partners or auditors.