TELECOMMUTE DevOps Staff Engineer - Platform & Reliability

The Search Solutions, LLC

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Tech stack

Artificial Intelligence

Amazon Web Services (AWS)

Cloud Computing

Software Quality

Continuous Integration

Data Stores

DevOps

Distributed Systems

Python

Systems Development Life Cycle

Data Processing

Google Cloud Platform

Large Language Models

Multi-Agent Systems

Model Validation

AI Platforms

Kubernetes

Virtual Agents

Terraform

Microservices

Job description

As the Staff DevOps Engineer, you will be part of the new Product Engineering team tasked with designing and building the next generation of Agentic AI-powered products. Acting as the Technical Lead and Primary Architect, you will be a hands-on leader responsible for the team's overall delivery of the runtime environment and automation for AI services and agents. You will lead a small squad by decomposing complex platform requirements (such as AI-specific CI/CD, agent observability, and automated scaling) into actionable tasks while remaining deeply embedded in the codebase.

Technical Lead & Execution: Lead the technical delivery of the Agentic Platform by translating high-level infrastructure roadmaps into actionable development tasks. You will own the task breakdown for your squad, ensuring high-quality output through technical mentorship and rigorous architectural oversight.

Automated Agent Delivery - CI/CD: Architect and implement high-velocity CI/CD pipelines specifically designed for the lifecycle of AI Agents and services, including automated model evaluation and blue-green deployments for agentic workflows on Google Cloud Platform.
Cloud Infrastructure Engineering: Lead the design and implementation of our cloud-native infrastructure on Google Cloud Platform using Terraform and Kubernetes (GKE). You will own the core runtime environment where autonomous agents and transactional microservices coexist.
Agentic Observability & SRE: Apply SRE principles to build a specialized monitoring and alerting stack for AI agents. You will implement tracing for agent "reasoning loops" and ensure the reliability of the underlying Vector and Graph data stores.
AI-Native SDLC Leadership: Actively use coding agents to plan, generate, and refactor platform code and Infrastructure as Code (IaC), maintaining high velocity while ensuring code quality.
Scale & Performance: Monitor and optimize the performance and cost-effectiveness of AI workloads, ensuring our platform can handle high-frequency agent calls and multi-modal data processing.
Security & Governance: Own the implementation of secure runtime boundaries, ensuring that both human users and AI agents operate within strict, audited permission sets.

Requirements

Experience: 10+ years of Software or Platform Engineering experience, with a background as a hands-on engineer who has successfully led technical squads.
Technical Stack: Expert mastery of Google Cloud Platform (GKE, Vertex AI), Terraform, Kubernetes, and Python.
Product AI Platform: Proven track record of designing and shipping production platforms for AI/LLM workloads, including specialized CI/CD and observability for agentic architectures.
Reliability Mindset: Strong command of SRE principles, including experience with SLOs, error budgets, and troubleshooting complex distributed systems.
Cloud Infrastructure: Experienced in working with cloud platforms (Google Cloud Platform, AWS) and deploying containerized services that are secure and scalable.
Coding Agents: Demonstrated proficiency in using coding agents to accelerate the SDLC and to plan and code complex engineering tasks.
