DevOps Engineer - AI Infrastructure & Platforms
Job description
We are seeking a Senior DevOps Engineer to join our specialized AI Engineering and Research team. This team is responsible for building Everse (an evaluation and simulation platform for AI agents) and advanced LLM data pipelines. Your role will focus on architecting the underlying infrastructure that allows our researchers and engineers to deploy, scale, and monitor complex AI models and web applications securely.
You will bridge the gap between AI research and production-grade stability, ensuring our Kubernetes clusters and CI/CD pipelines are optimized for high-performance AI workloads.
-
Infrastructure as Code (IaC): Design, build, and maintain scalable cloud infrastructure using Terraform or CloudFormation.
-
Kubernetes Orchestration: Manage and optimize secure Kubernetes clusters, specifically for hosting data-heavy React/Node.js applications and Python-based AI services.
-
CI/CD Pipeline Development: Build and automate robust deployment pipelines to ensure rapid, high-frequency releases for the Everse platform.
-
MLOps Support: Collaborate with AI scientists to streamline the deployment of LLM and RLHF workflows, managing the infrastructure required for model evaluation and simulation.
-
Security & Compliance: Implement security best practices (OWASP, IAM roles) to ensure data privacy within our annotation and video surveillance tools.
-
Monitoring & Observability: Establish deep visibility into system performance and track the cost of cloud resources (AWS/GCP/Azure).
Requirements
-
6+ years of DevOps/SRE experience in a cloud-native environment.
-
Expert-level Kubernetes (K8s) knowledge, including cluster security, networking, and scaling.
-
Strong proficiency in Python (for automation scripts and data pipeline support) and Shell scripting.
-
Cloud Providers: Hands-on experience with, and deep expertise in, AWS, GCP, or Azure.
-
IaC Mastery: Proven experience with Terraform, Pulumi, or similar tools.
-
Security Mindset: Experience securing applications in Kubernetes and familiarity with container security scanning.
-
Prior experience supporting ML/AI teams or managing GPU-accelerated workloads.
-
Experience with MLOps tools (e.g., Kubeflow, MLflow, or Weights & Biases).
-
Familiarity with Vector Databases or high-scale data processing engines.
-
Background in automating complex simulation environments or sandboxes.