Cloud Operations Engineer
Job description
- Act as L3 escalation for critical production incidents across AWS, Azure, and GCP workloads
- Lead deep triage and root cause analysis for recurring operational issues
- Drive problem management by identifying patterns and implementing permanent fixes
- Host and operate applications on VMs, containers, and managed cloud services
- Perform advanced maintenance including upgrades, patching, scaling, tuning, and hardening
- Troubleshoot complex networking issues including routing, DNS, load balancers, TLS, private endpoints, and firewall rules
- Validate and optimize VPC/VNet design, peering, VPN, and hybrid connectivity
- Create and maintain Terraform modules for standardized multi-cloud provisioning
- Manage Kubernetes clusters via Rafay and native tooling across EKS, AKS, and GKE
- Implement and manage CI/CD pipelines using Harness and GitOps workflows with ArgoCD and Crossplane
- Configure and manage API Gateways and Apigee for service traffic management
- Automate operational tasks including patching, certificate rotation, scaling, backups, and health checks
- Improve monitoring and alerting using Splunk, Splunk OTel, Prometheus, and Grafana
Requirements
- Strong experience in L3 Cloud Operations, SRE, or Platform Engineering supporting production environments
- Hands-on expertise across AWS, Azure, and GCP including deployment, operations, and troubleshooting
- Advanced troubleshooting across compute, storage, networking, IAM, DNS, TLS, and application runtime
- Solid Terraform experience including module development and state management
- Experience operating Kubernetes environments (EKS, AKS, GKE) with Rafay cluster management
- Proficiency with Harness CI/CD pipelines and GitOps practices using ArgoCD and Crossplane
- Experience with API Gateway and Apigee for API lifecycle management
- Strong observability experience using Splunk, Splunk OTel, Prometheus, Grafana, and AppDynamics
- Experience managing PostgreSQL and MongoDB in production environments
- Working knowledge of vector databases and their role in AI/ML workloads
- Familiarity with AIOps tools and practices for intelligent operations automation
- Proficiency with GitHub for source control and collaboration workflows
- Proven track record delivering RCAs, automation improvements, and stability enhancements
- Ability to create clear runbooks, SOPs, and operational documentation
Preferred Qualifications
- Experience building or supporting AI/ML infrastructure including vector database deployments
- Experience implementing HA/DR architectures and executing disaster recovery drills
- Exposure to private-only cloud architectures and secure ingress/egress controls
- Familiarity with FinOps practices in multi-cloud environments
- Experience with Vault for secrets management and secure credential injection
- Hands-on experience with Score/Manifesto for declarative workload definitions