Platform Engineer - AI Agent Infrastructure

Jobgether
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote

Tech stack

Artificial Intelligence
Airflow
Amazon Web Services (AWS)
Google BigQuery
Cascading
Cloud Computing
Cloud Engineering
Computer Networks
Databases
Software Debugging
Distributed Systems
Identity and Access Management
PostgreSQL
Enterprise Messaging Systems
MongoDB
NoSQL
Platform as a Service (PaaS)
RabbitMQ
Redis
Reliability Engineering
Message Oriented Middleware
SQL Databases
AI Infrastructure
Datadog
Data Logging
Pulumi
System Availability
Delivery Pipeline
Large Language Models
Snowflake
Grafana
Caching
Backend
Event Driven Architecture
Containerization
Kubernetes
Infrastructure Automation Frameworks
Kafka
Data Management
Machine Learning Operations
Virtual Agents
Terraform
Docker
Databricks

Job description

This is a high-impact engineering opportunity to shape the infrastructure behind a rapidly growing AI agent platform operating at global scale. You will take ownership of core cloud architecture, ensuring systems remain reliable, observable, and ready for continuous expansion. The role combines deep infrastructure expertise with strong architectural thinking, making it ideal for someone who enjoys building scalable distributed systems rather than simply maintaining existing environments. You will lead decisions around messaging systems, automation, deployment pipelines, and platform resilience. Working in a remote and innovation-driven environment, you will collaborate with high-performing teams using modern cloud and AI technologies. This is an opportunity to directly influence the future of AI infrastructure in production environments.

Accountabilities:

  • Own and evolve the cloud infrastructure supporting AI agents running at scale in production environments
  • Design and implement event-driven architectures using durable asynchronous messaging systems
  • Improve inter-service communication by replacing synchronous dependencies with scalable messaging patterns
  • Build and maintain infrastructure as code frameworks for provisioning, deployment, and environment consistency
  • Ensure platform reliability, scalability, and performance across distributed workloads
  • Develop advanced observability capabilities including dashboards, alerts, tracing, logging, and health monitoring
  • Lead incident response analysis and proactively improve system resilience based on production learnings
  • Evaluate emerging technologies and drive architectural decisions as the platform matures
  • Optimize databases, storage systems, and caching layers for speed, availability, and cost efficiency
  • Collaborate with engineering teams to support secure and efficient deployment of AI workloads
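The messaging-related accountabilities above center on one pattern: replacing a direct synchronous call between services with a publish/subscribe flow. As a rough illustration only, the sketch below models that pattern with an in-process bus in Go (the language named in the requirements); the `Bus` type, topic name, and payload are invented for this example and stand in for a durable broker such as Kafka, RabbitMQ, or NATS.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a minimal message envelope; the field names are
// illustrative, not taken from any specific broker's API.
type Event struct {
	Topic   string
	Payload string
}

// Bus is an in-process stand-in for a durable broker: producers
// enqueue events and move on instead of blocking on a synchronous
// downstream call.
type Bus struct {
	mu   sync.Mutex
	subs map[string][]chan Event
}

func NewBus() *Bus {
	return &Bus{subs: make(map[string][]chan Event)}
}

// Subscribe returns a buffered channel that receives events on topic.
func (b *Bus) Subscribe(topic string) <-chan Event {
	ch := make(chan Event, 16)
	b.mu.Lock()
	b.subs[topic] = append(b.subs[topic], ch)
	b.mu.Unlock()
	return ch
}

// Publish fans the event out without waiting for consumers to finish.
func (b *Bus) Publish(e Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs[e.Topic] {
		ch <- e // buffered send; a real broker would persist and retry
	}
}

func main() {
	bus := NewBus()
	done := bus.Subscribe("agent.task.completed")

	// The producer no longer calls the consumer directly.
	bus.Publish(Event{Topic: "agent.task.completed", Payload: "task-42"})

	e := <-done
	fmt.Println(e.Payload) // prints task-42
}
```

A real implementation would add the durability, retries, and consumer-group semantics that a broker provides; the point here is only the decoupling of producer from consumer.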

Requirements

  • 4+ years of experience in platform engineering, infrastructure engineering, SRE, or backend systems roles
  • Strong expertise in event-driven architecture and messaging systems such as Kafka, RabbitMQ, NATS, or similar
  • Deep AWS experience including EC2, VPC, IAM, S3, RDS, and internal networking concepts
  • Solid experience with SQL databases such as PostgreSQL and NoSQL systems such as MongoDB or Redis
  • Strong Docker knowledge including container lifecycle management, health checks, resource limits, and image optimization
  • Proven experience debugging distributed systems, asynchronous flows, and cascading production failures
  • Hands-on experience with Infrastructure as Code tools such as Terraform or Pulumi
  • Strong observability skills using Datadog or equivalent tools for APM, logging, monitoring, and tracing
  • Experience with Go or similar backend programming languages
  • Strong communication skills and ability to lead technical decisions in remote teams
  • Experience supporting AI or MLOps infrastructure, model serving, LLM inference, or GPU workloads
  • Familiarity with LangFuse, LangSmith, Braintrust, MLflow, or similar AI observability tools
  • Experience building multi-tenant container platforms or internal PaaS environments
  • Kubernetes migration or production operations experience
  • Exposure to Airflow, Prefect, Snowflake, BigQuery, Databricks, or similar data platforms
  • ECS and AI agent framework ecosystem knowledge is a plus

Benefits & conditions

  • Competitive compensation package
  • Fully remote work from anywhere
  • One-time home office setup allowance
  • Company-provided work equipment
  • Stock options
  • Health plan coverage regardless of location
  • Flexible paid time off
  • Language learning and professional development courses
  • Personal growth and continuous learning support
  • Opportunity to shape cutting-edge AI infrastructure at scale
