Development Architect for the Autonomous Operations Platform (AIOps)
Job description
We want to transform cloud infrastructure operations through AI at SAP. You'll contribute to autonomous operations capabilities within the Apeiro Reference Architecture (ApeiroRA), part of the EU's IPCEI-CIS initiative to strengthen Europe's digital sovereignty. Our mission: minimize manual incident response and enable predictive detection across distributed cloud environments.
Working within the Apeiro ecosystem and the Linux Foundation's NeoNephos community, you'll collaborate daily across various teams. We share knowledge freely, review each other's code with kindness, and celebrate wins as a team. You will grow and so will we, because you're here.
The Role: As a Development Architect, you'll conceptualize and detail our vision of an autonomous operations platform. You'll plan and design distributed systems that enable AI-driven incident management, define integrations with AI/ML services, and develop Kubernetes-native operators that autonomously remediate infrastructure issues based on AI insights.
Your technical leadership will influence design and implementation approaches of the autonomous operations platform within the Apeiro Reference Architecture. You'll tackle challenges like telemetry correlation across logs/metrics/traces, automated root cause analysis, and knowledge graph systems that power runbook automation.
You'll design production systems that effectively utilize AI/ML services while ensuring core functionality remains intact during infrastructure failures and network partitions. This work will power autonomous operations across SAP's global cloud infrastructure, directly supporting the mission to maximize automated resolution rates and minimize manual effort at enterprise scale.
Requirements
- Expert Programming: Deep expertise in languages such as Python, Java, or Go, with a focus on distributed systems, service integration, and cloud-native architecture at scale
- Technical Leadership: Experience in a dedicated software architecture role with a proven track record of finding solutions to complex problems
- Cloud & Kubernetes: Kubernetes skills including operator development, custom controllers, and production operations across multi-cloud environments
- Observability Skills: Experience with technologies like Prometheus, OpenTelemetry, Grafana, time-series databases, and distributed tracing systems
- AI/ML Integration: Understanding of how to consume and integrate AI/ML services and interpret model outputs for operational decisions
- Cloud-Native Practices: Proficiency in CI/CD pipelines, Infrastructure-as-Code, and GitOps workflows (e.g. GitHub Actions, Terraform, ArgoCD)
- Resilience Patterns: Understanding of distributed systems failure modes, graceful degradation, and designing for unreliable infrastructure
Soft Skills
- Open Source Contributions: Active involvement in open-source communities - we'd love to see what you've built and shared with the world
- Team Spirit: You communicate openly, give and receive feedback gracefully, and genuinely enjoy working closely with others
- Language Skills: Fluency in written and spoken English (German is a plus, but not a dealbreaker - we'll figure it out together)
- Mentorship & Knowledge Sharing: Experience guiding junior or mid-level engineers and contributing to a learning culture within your team
- Innovation Drive: The urge to question the status quo, discover problems, and find creative solutions