AI Ops - Senior Architect
Job description
The Senior Architect will collaborate with cross-functional engineering, cloud, platform, and data science teams to deliver predictive, proactive, and automated operational outcomes.
- Lead the architecture and implementation of AI-powered operational frameworks, including predictive analytics, anomaly detection, NLP-driven automation, and auto-remediation systems.
- Define and evolve the overall AI Ops strategy, roadmap, standards, and governance.
- Implement intelligent monitoring and decision models that enhance reliability and operational efficiency.
- Architect solutions that integrate machine learning models into production operations workflows.
Observability, Monitoring & Automation
- Design end-to-end observability ecosystems (metrics, logs, traces, topology, events) integrated with AI/ML platforms.
- Build anomaly detection models using ML and time-series analysis to identify issues before failures occur.
- Drive automated incident detection, impact assessment, and classification using AI-based models.
- Implement proactive auto-healing and automated resolution workflows.
Cloud & Platform Engineering
- Architect scalable AI Ops platforms using AWS, Azure, or Google Cloud Platform cloud-native services.
- Design infrastructure and pipelines for AI-driven monitoring and operational insights.
- Integrate AI Ops capabilities with Kubernetes, service mesh, cloud-native microservices, and distributed systems.
- Optimize cost, performance, and reliability using intelligent orchestration and scaling.
Data Engineering & ML Ops Integration
- Partner with data engineering teams to build robust data pipelines for operational data ingestion.
- Work with ML Ops teams to operationalize ML models, including training, evaluation, deployment, and monitoring.
- Ensure continuous retraining and drift detection for AI Ops models.
- Define data taxonomies, quality standards, and metadata management for operational datasets.
SRE, DevOps & Automation Frameworks
- Align AI Ops with SRE principles, SLIs, SLOs, and error budgets.
- Integrate AI-driven insights into CI/CD pipelines and operational workflows.
- Develop event-driven, automated runbooks using ML and rule-based systems.
- Implement intelligent capacity planning, scaling, and resource optimization.
Security, Compliance & Governance
- Ensure AI Ops solutions meet enterprise security, compliance, and audit requirements.
- Define governance frameworks for AI model usage, transparency, and monitoring.
- Collaborate with cybersecurity teams on intelligent threat detection and risk analysis.
Leadership & Collaboration
- Provide architectural leadership and technical direction to engineering and operations teams.
- Mentor teams on AI Ops concepts, automation, and intelligent operations.
- Present architecture proposals and operational improvements to leadership stakeholders.
- Influence enterprise-wide transformation toward autonomous operations.
Requirements
We are seeking a highly skilled AI Ops - Senior Architect to lead the design, implementation, and optimization of AI-driven operational platforms across large-scale, mission-critical environments. The ideal candidate will possess deep expertise in machine learning-enabled operations, observability, automation frameworks, cloud engineering, and enterprise SRE/DevOps practices. This role will drive the transformation of traditional IT operations into intelligent, autonomous, self-healing systems.
- 12+ years of IT experience with 5+ years in SRE/DevOps/AI Ops architecture.
- Strong expertise in:
  - AI Ops platforms (Moogsoft, Dynatrace Davis AI, BigPanda, New Relic AI, Datadog AIOps)
  - Observability stacks (Prometheus, Grafana, ELK, Splunk, AppDynamics)
  - ML pipelines and ML Ops tooling (SageMaker, Vertex AI, MLflow, Databricks)
  - Cloud architectures on AWS / Azure / Google Cloud Platform
  - Event-driven systems and automation tools
- Strong programming/scripting in Python, Go, or Java for automation and ML integration.
- Experience with Kubernetes, Docker, microservices, and distributed systems.
- Deep understanding of time-series analysis, anomaly detection, NLP, and predictive analytics.
- Experience operationalizing ML models and integrating them into production systems.
Preferred Qualifications
- Certifications in cloud architecture or ML engineering.
- Background in enterprise-scale SRE, observability, or operations automation.
- Experience with LLM-based automation and AI agents for IT operations.
- Experience in highly regulated industries (Finance, Healthcare, Telecom).