TELECOMMUTE DevOps Staff Engineer - Platform & Reliability

The Search Solutions, LLC
yesterday

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior

Job location

Remote

Tech stack

Artificial Intelligence
Amazon Web Services (AWS)
Cloud Computing
Software Quality
Continuous Integration
Data Stores
DevOps
Distributed Systems
Python
Systems Development Life Cycle
Data Processing
Google Cloud Platform
Large Language Models
Multi-Agent Systems
Model Validation
AI Platforms
Kubernetes
Virtual Agents
Terraform
Microservices

Job description

As the Staff DevOps Engineer, you will be part of the new Product Engineering team tasked with designing and building the next generation of Agentic AI-powered products. Acting as the Technical Lead and Primary Architect, you will be a hands-on leader responsible for the team's overall delivery of the runtime environment and automation for AI services and Agents. You will lead a small squad by decomposing complex platform requirements (such as AI-specific CI/CD, agent observability, and automated scaling) into actionable tasks while remaining deeply embedded in the codebase.

  • Technical Lead & Execution: Lead the technical delivery of the Agentic Platform by translating high-level infrastructure roadmaps into actionable development tasks. You will own task breakdown for your squad, ensuring high-quality output through technical mentorship and rigorous architectural oversight.

  • Automated Agent Delivery - CI/CD: Architect and implement high-velocity CI/CD pipelines specifically designed for the lifecycle of AI Agents and services, including automated model evaluation and blue-green deployments for agentic workflows on Google Cloud Platform.

  • Cloud Infrastructure Engineering: Lead the design and implementation of our cloud-native infrastructure on Google Cloud Platform using Terraform and Kubernetes (GKE). You will own the core runtime environment where autonomous agents and transactional microservices coexist.

  • Agentic Observability & SRE: Apply SRE principles to build a specialized monitoring and alerting stack for AI agents. You will implement tracing for agent "reasoning loops" and ensure the reliability of the underlying Vector and Graph data stores.

  • AI-Native SDLC Leadership: Actively use coding agents to plan, generate, and refactor platform code and Infrastructure as Code (IaC), maintaining high velocity while ensuring code quality.

  • Scale & Performance: Monitor and optimize the performance and cost-effectiveness of AI workloads, ensuring our platform can handle high-frequency agent calls and multi-modal data processing.

  • Security & Governance: Own the implementation of secure runtime boundaries, ensuring that both human users and AI agents operate within strict, audited permission sets.

Requirements

  • Experience: 10+ years of Software or Platform Engineering experience, with a background as a hands-on engineer who has successfully led technical squads.

  • Technical Stack: Expert mastery of Google Cloud Platform (GKE, Vertex AI), Terraform, Kubernetes, and Python.

  • Product AI Platform: Proven track record of designing and shipping production platforms for AI/LLM workloads, including specialized CI/CD and observability for agentic architectures.

  • Reliability Mindset: Strong command of SRE principles, including experience with SLOs, error budgets, and troubleshooting complex distributed systems.

  • Cloud Infrastructure: Experienced in working with cloud platforms (Google Cloud Platform, AWS) and deploying containerized services that are secure and scalable.

  • Coding Agents: Demonstrated proficiency in using coding agents to accelerate the SDLC and to plan and code complex engineering tasks.
