Platform Engineer (Cloud Infrastructure, AI Platform)
Tech stack
Home office; Apache Airflow, ArgoCD, Artificial Intelligence (AI), Backend Development, Backup/Recovery, CI/CD (Continuous Integration/Delivery), and more
Job description
As a Platform Engineer within Advanced Analytics (DA3) in the Chief Data & AI Office at Allianz Partners, you will join our central AI team to build and operate the cloud infrastructure that powers AI-enabled solutions at global scale.

We are looking for an engineer with deep Kubernetes and cloud expertise to implement, automate, and maintain the platform foundations that enable teams to deploy and operate AI services reliably.
You will work in a cross-functional environment with Backend Engineers, ML Engineers, AI Architects, and Platform Architects, taking hands-on ownership of the infrastructure layer, from Kubernetes clusters and CI/CD pipelines to observability systems and security controls.
In this role, you will translate platform architecture into working infrastructure, reduce operational toil through automation, and ensure production systems meet reliability and security standards.
Your main responsibilities will include:
- Implement and operate Kubernetes infrastructure (AKS): cluster lifecycle, networking, resource management, auto-scaling, and multi-tenancy patterns.
- Build and maintain CI/CD pipelines using GitHub Actions and ArgoCD for automated testing, container builds, and GitOps deployments.
- Develop Infrastructure as Code (Terraform, Bicep) to provision and manage Azure resources with consistency and auditability.
- Operate container registries (ACR), artifact management, and image security scanning workflows.
- Implement and maintain observability infrastructure (Azure Monitor, Application Insights, Prometheus, Grafana), including dashboards, alerting, and distributed tracing.
- Manage async processing infrastructure: Celery workers, Redis queues, and workflow orchestration patterns supporting AI agent execution.
- Implement platform security controls: network policies, pod security standards, Key Vault integration, RBAC, and private endpoint configurations.
- Support database infrastructure: PostgreSQL management, backup/recovery, connection pooling, and performance tuning.
- Create self-service tooling and templates that enable development teams to deploy and operate services with minimal friction.
- Diagnose and resolve infrastructure issues across clusters, pipelines, and cloud services; perform root-cause analysis and implement preventative improvements.
- Collaborate with Platform Architects, Backend Engineers, and ML Engineers to translate architecture designs into reliable infrastructure.

Our employees play an integral part in our success as a business. We appreciate that each of our employees is unique, with their own needs and ambitions, and we enjoy being part of their journey. We are there to empower and encourage your personal and professional development, ensuring that you stay in control, by offering a large variety of courses and targeted development programs.
Requirements
- 5+ years of professional experience in platform engineering, SRE, or DevOps roles; experience supporting AI/ML workloads is a strong plus.
- Strong Kubernetes experience: cluster operations, networking (Ingress, network policies), storage, autoscaling, and troubleshooting.
- Solid Infrastructure as Code experience with Terraform, Bicep, or equivalent tools.
- Production experience with Azure cloud services: AKS, ACR, Key Vault, Azure Monitor, Virtual Networks, Private Endpoints, and Azure Policy.
- Strong CI/CD experience: GitHub Actions (self-hosted runners, reusable workflows), ArgoCD, or similar GitOps tooling.
- Proficiency in Python for automation, scripting, and tooling.
- Experience with container security: image scanning, runtime security, network policies, and least-privilege patterns.
- Experience with observability stacks: Prometheus, Grafana, centralized logging, and alerting configuration.
- Familiarity with async task processing: Celery, Redis, or equivalent message queue patterns.
- Strong Linux systems administration and networking fundamentals.
- Operational mindset with strong troubleshooting skills across infrastructure layers.
Ways of Working
- Comfortable in agile, iterative delivery environments with ownership and accountability.
- Clear communicator and collaborator across global, cross-functional stakeholders.
- Strong focus on reliability and automation: you measure success by system uptime and reduced manual toil.
- Proactive learner with pragmatic adoption of AI-assisted developer tools (GitHub Copilot, Claude Code) to improve automation and delivery.
Nice to Have
- Experience supporting AI/ML infrastructure: GPU scheduling, model serving platforms, or ML pipeline orchestration.
- Service mesh experience (Istio, Linkerd) for traffic management and security.
- Experience with Databricks or similar data platform infrastructure.
- Familiarity with workflow orchestration (Temporal, Airflow) for complex AI pipelines.
- Experience with cost optimization: FinOps practices, resource right-sizing, and reserved capacity planning.
- Experience in regulated environments where auditability and secure-by-default infrastructure are essential.
- Certifications: CKA/CKAD, Azure Administrator, or Terraform Associate.