DevOps Engineer - AI Infrastructure & Platforms

Insight Global

Palo Alto, United States of America

1 month ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Palo Alto, United States of America

Tech stack

Kubernetes Security

Artificial Intelligence

Amazon Web Services (AWS)

Automation of Tests

Azure

Cloud Computing

Cloud Engineering

Continuous Integration

DevOps

Monitoring of Systems

Identity and Access Management

Python

Node.js

Open Web Application Security

Shell Script

Web Applications

AI Infrastructure

Pulumi

Data Processing

React

Delivery Pipeline

Large Language Models

Model Validation

Infrastructure as Code (IaC)

CloudFormation

Containerization

AI Platforms

Kubernetes

Machine Learning Operations

Terraform

Data Pipelines

Job description

We are seeking a Senior DevOps Engineer to join our specialized AI Engineering and Research team. This team is responsible for building Everse (an evaluation and simulation platform for AI agents) and advanced LLM data pipelines. Your role will focus on architecting the underlying infrastructure that allows our researchers and engineers to deploy, scale, and monitor complex AI models and web applications securely.

You will bridge the gap between AI research and production-grade stability, ensuring our Kubernetes clusters and CI/CD pipelines are optimized for high-performance AI workloads.

Infrastructure as Code (IaC): Design, build, and maintain scalable cloud infrastructure using Terraform or CloudFormation.
Kubernetes Orchestration: Manage and optimize secure Kubernetes clusters, specifically for hosting data-heavy React/Node.js applications and Python-based AI services.
CI/CD Pipeline Development: Build and automate robust deployment pipelines to ensure rapid, high-frequency releases for the Everse platform.
MLOps Support: Collaborate with AI scientists to streamline the deployment of LLM and RLHF workflows, managing the infrastructure required for model evaluation and simulation.
Security & Compliance: Implement security best practices (OWASP, IAM roles) to ensure data privacy within our annotation and video surveillance tools.
Monitoring & Observability: Establish deep visibility into system performance and cost tracking for cloud resources (AWS/GCP/Azure).

Requirements

6+ years of DevOps/SRE experience in a cloud-native environment.
Expert-level Kubernetes (K8s) knowledge, including cluster security, networking, and scaling.
Strong proficiency in Python (for automation scripts and data pipeline support) and Shell scripting.
Cloud Providers: Hands-on experience with, and deep expertise in, AWS, GCP, or Azure.
IaC Mastery: Proven experience with Terraform, Pulumi, or similar tools.
Security Mindset: Experience securing applications in Kubernetes and familiarity with container security scanning.

Prior experience supporting ML/AI teams or managing GPU-accelerated workloads.
Experience with MLOps tools (e.g., Kubeflow, MLflow, or Weights & Biases).
Familiarity with Vector Databases or high-scale data processing engines.
Background in automating complex simulation environments or sandboxes.
