Senior Platform Engineer - AI Infrastructure & Observability
Job description
We're looking for a Senior Platform Engineer (m/f/d) excited about building infrastructure for AI-first applications. You'll own our cloud platform - from Kubernetes clusters running real-time voice agents to ClickHouse analytics pipelines processing millions of events daily. You'll tackle novel observability challenges such as monitoring ClickHouse cluster health, ensuring sub-200ms latency for voice AI, and tracking data pipeline quality. We're AI-first not just in what we build, but in how we operate - leveraging AI-native tools for incident response and building automation that uses LLMs to accelerate debugging and root cause analysis.
Your mission
- Automate Incident Management: Implement AI-native incident management tools to accelerate response and automate root cause analysis.
- Manage Cloud Infrastructure: Operate and optimize AWS EKS infrastructure with Terraform, tailored for AI workloads and analytics pipelines.
- Ensure Data Reliability: Maintain ETL workflows, ClickHouse cluster health, and batch jobs, ensuring data freshness and quality.
- Optimize System Performance: Design API failover strategies, implement caching layers, and continuously optimize infrastructure.
- Improve Developer Experience: Maintain Skaffold-based local development environments, enhance CI/CD pipelines, and build internal productivity tooling.
- Enhance Observability: Implement and monitor SLOs, use AI tools for log analysis, and improve visibility through structured logging.
Requirements
- Infrastructure Expertise: 5+ years of software engineering experience and 3+ years running Kubernetes in production (AWS EKS preferred).
- IaC Mastery: Strong experience with Terraform and GitOps workflows, plus deep AWS knowledge (VPCs, RDS, ElastiCache, Lambda).
- Data & AI Focus: Experience monitoring ETL pipelines and analytics workloads (ClickHouse, Redshift, BigQuery), and enthusiasm for AI-native operations tooling, such as tools for log analysis or automated remediation.
- Backend & Leadership: Proficient in Python or Kotlin/Java, familiar with FastAPI, Spring Boot, Django, or gRPC. Able to work independently, mentor others, and drive technical decisions.
- Mindset: Strong written communication skills, comfort with ambiguity, and motivation to build at the intersection of AI and infrastructure.