Platform Engineer - AI Agent Infrastructure

Jobgether
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote

Tech stack

Artificial Intelligence
Airflow
Amazon Web Services (AWS)
Google BigQuery
Cascading
Cloud Computing
Cloud Engineering
Computer Networks
Databases
Software Debugging
Distributed Systems
Identity and Access Management
PostgreSQL
Enterprise Messaging Systems
MongoDB
NoSQL
Platform as a Service (PaaS)
RabbitMQ
Redis
Reliability Engineering
Message Oriented Middleware
SQL Databases
AI Infrastructure
Datadog
Data Logging
Pulumi
System Availability
Delivery Pipeline
Large Language Models
Snowflake
Grafana
Caching
Backend
Event Driven Architecture
Containerization
Kubernetes
Infrastructure Automation Frameworks
Kafka
Data Management
Machine Learning Operations
Virtual Agents
Terraform
Docker
Databricks

Job description

This is a high-impact engineering opportunity to shape the infrastructure behind a rapidly growing AI agent platform operating at global scale. You will take ownership of core cloud architecture, ensuring systems remain reliable, observable, and ready for continuous expansion. The role combines deep infrastructure expertise with strong architectural thinking, making it ideal for someone who enjoys building scalable distributed systems rather than simply maintaining existing environments. You will lead decisions around messaging systems, automation, deployment pipelines, and platform resilience. Working in a remote and innovation-driven environment, you will collaborate with high-performing teams using modern cloud and AI technologies. This is an opportunity to directly influence the future of AI infrastructure in production environments.

Accountabilities:

  • Own and evolve the cloud infrastructure supporting AI agents running at scale in production environments
  • Design and implement event-driven architectures using durable asynchronous messaging systems
  • Improve inter-service communication by replacing synchronous dependencies with scalable messaging patterns
  • Build and maintain infrastructure as code frameworks for provisioning, deployment, and environment consistency
  • Ensure platform reliability, scalability, and performance across distributed workloads
  • Develop advanced observability capabilities including dashboards, alerts, tracing, logging, and health monitoring
  • Lead incident response analysis and proactively improve system resilience based on production learnings
  • Evaluate emerging technologies and drive architectural decisions as the platform matures
  • Optimize databases, storage systems, and caching layers for speed, availability, and cost efficiency
  • Collaborate with engineering teams to support secure and efficient deployment of AI workloads
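The messaging-related accountabilities above center on one pattern: replacing a direct synchronous call between services with a publish/subscribe flow. As a rough illustration only, the sketch below models that pattern with an in-process bus in Go (the language named in the requirements); the `Bus` type, topic name, and payload are invented for this example and stand in for a durable broker such as Kafka, RabbitMQ, or NATS.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a minimal message envelope; the field names are
// illustrative, not taken from any specific broker's API.
type Event struct {
	Topic   string
	Payload string
}

// Bus is an in-process stand-in for a durable broker: producers
// enqueue events and move on instead of blocking on a synchronous
// downstream call.
type Bus struct {
	mu   sync.Mutex
	subs map[string][]chan Event
}

func NewBus() *Bus {
	return &Bus{subs: make(map[string][]chan Event)}
}

// Subscribe returns a buffered channel that receives events on topic.
func (b *Bus) Subscribe(topic string) <-chan Event {
	ch := make(chan Event, 16)
	b.mu.Lock()
	b.subs[topic] = append(b.subs[topic], ch)
	b.mu.Unlock()
	return ch
}

// Publish fans the event out without waiting for consumers to finish.
func (b *Bus) Publish(e Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs[e.Topic] {
		ch <- e // buffered send; a real broker would persist and retry
	}
}

func main() {
	bus := NewBus()
	done := bus.Subscribe("agent.task.completed")

	// The producer no longer calls the consumer directly.
	bus.Publish(Event{Topic: "agent.task.completed", Payload: "task-42"})

	e := <-done
	fmt.Println(e.Payload) // prints task-42
}
```

A real implementation would add the durability, retries, and consumer-group semantics that a broker provides; the point here is only the decoupling of producer from consumer.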

Requirements

  • 4+ years of experience in platform engineering, infrastructure engineering, SRE, or backend systems roles
  • Strong expertise in event-driven architecture and messaging systems such as Kafka, RabbitMQ, NATS, or similar
  • Deep AWS experience including EC2, VPC, IAM, S3, RDS, and internal networking concepts
  • Solid experience with SQL databases such as PostgreSQL and NoSQL systems such as MongoDB or Redis
  • Strong Docker knowledge including container lifecycle management, health checks, resource limits, and image optimization
  • Proven experience debugging distributed systems, asynchronous flows, and cascading production failures
  • Hands-on experience with Infrastructure as Code tools such as Terraform or Pulumi
  • Strong observability skills using Datadog or equivalent tools for APM, logging, monitoring, and tracing
  • Experience with Go or similar backend programming languages
  • Strong communication skills and ability to lead technical decisions in remote teams
  • Experience supporting AI or MLOps infrastructure, model serving, LLM inference, or GPU workloads
  • Familiarity with LangFuse, LangSmith, Braintrust, MLflow, or similar AI observability tools
  • Experience building multi-tenant container platforms or internal PaaS environments
  • Kubernetes migration or production operations experience
  • Exposure to Airflow, Prefect, Snowflake, BigQuery, Databricks, or similar data platforms
  • ECS and AI agent framework ecosystem knowledge is a plus

Benefits & conditions

  • Competitive compensation package
  • Fully remote work from anywhere
  • One-time home office setup allowance
  • Company-provided work equipment
  • Stock options
  • Health plan coverage regardless of location
  • Flexible paid time off
  • Language learning and professional development courses
  • Personal growth and continuous learning support
  • Opportunity to shape cutting-edge AI infrastructure at scale
