DevOps Engineer - AI Infrastructure & Platforms

Insight Global
Palo Alto, United States of America
6 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Tech stack

Kubernetes Security
Artificial Intelligence
Amazon Web Services (AWS)
Automation of Tests
Azure
Cloud Computing
Cloud Engineering
Continuous Integration
DevOps
Monitoring of Systems
Identity and Access Management
Python
Node.js
Open Web Application Security
Shell Script
Web Applications
AI Infrastructure
Pulumi
Data Processing
React
Delivery Pipeline
Large Language Models
Model Validation
Infrastructure as Code (IaC)
CloudFormation
Containerization
AI Platforms
Kubernetes
Machine Learning Operations
Terraform
Data Pipelines

Job description

We are seeking a Senior DevOps Engineer to join our specialized AI Engineering and Research team. This team is responsible for building Everse (an evaluation and simulation platform for AI agents) and advanced LLM data pipelines. Your role will focus on architecting the underlying infrastructure that allows our researchers and engineers to deploy, scale, and monitor complex AI models and web applications securely.

You will bridge the gap between AI research and production-grade stability, ensuring our Kubernetes clusters and CI/CD pipelines are optimized for high-performance AI workloads.

  • Infrastructure as Code (IaC): Design, build, and maintain scalable cloud infrastructure using Terraform or CloudFormation.

  • Kubernetes Orchestration: Manage and optimize secure Kubernetes clusters, specifically for hosting data-heavy React/Node.js applications and Python-based AI services.

  • CI/CD Pipeline Development: Build and automate robust deployment pipelines to ensure rapid, high-frequency releases for the Everse platform.

  • MLOps Support: Collaborate with AI scientists to streamline the deployment of LLM and RLHF workflows, managing the infrastructure required for model evaluation and simulation.

  • Security & Compliance: Implement security best practices (OWASP, IAM roles) to ensure data privacy within our annotation and video surveillance tools.

  • Monitoring & Observability: Establish deep visibility into system performance and cost-tracking for cloud resources (AWS/GCP/Azure).

Requirements

  • 6+ years of DevOps/SRE experience in a cloud-native environment.

  • Expert-level Kubernetes (K8s) knowledge, including cluster security, networking, and scaling.

  • Strong proficiency in Python (for automation scripts and data pipeline support) and Shell scripting.

  • Hands-on experience with Cloud Providers: Deep expertise in AWS, GCP, or Azure.

  • IaC Mastery: Proven experience with Terraform, Pulumi, or similar tools.

  • Security Mindset: Experience securing applications in Kubernetes and familiarity with container security scanning.

  • Prior experience supporting ML/AI teams or managing GPU-accelerated workloads.

  • Experience with MLOps tools (e.g., Kubeflow, MLflow, or Weights & Biases).

  • Familiarity with Vector Databases or high-scale data processing engines.

  • Background in automating complex simulation environments or sandboxes.
