Systems Engineer - Cloud Ops

AutoZone, Inc.

Memphis, United States of America

2 days ago

Role details

Contract type

Permanent contract

Employment type

Part-time / full-time

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Memphis, United States of America

Tech stack

Kubernetes Security

API

Artificial Intelligence

Application Packaging

Systems Engineering

Build Automation

Cloud Computing

Computer Networks

Continuous Integration

Software Debugging

DevOps

Programming Tools

DNS

Monitoring of Systems

Subnetting

Java Virtual Machine (JVM)

Key Management

Linux System Administration

Performance Tuning

Role-Based Access Control

Prometheus

Azure

Software Deployment

TCP/IP

Datadog

Data Logging

Google Cloud Platform

Load Balancing

Cloud Platform System

Cloud Monitoring

GitHub Copilot

Istio

Delivery Pipeline

Large Language Models

Grafana

Prompt Engineering

Kubernetes Helm Charts

Generative AI

Firewalls (Computer Science)

Containerization

Gitlab-ci

Kubernetes

Information Technology

Deployment Automation

Machine Learning Operations

Virtual Agents

Firewall Services Module

Terraform

Heap (Data Structure)

Dynatrace

Job description

As a Systems Engineer on the Cloud Operations team, you will be responsible for deploying, managing, and optimizing our cloud-based infrastructure on Google Cloud Platform (GCP). You will work with technologies such as Terraform, Kubernetes (GKE), GitOps/ArgoCD, CI/CD pipelines, and observability tools to ensure reliable, secure, and scalable platform operations.

You will also contribute to our AI/ML platform initiatives, supporting infrastructure for LLM-based applications and AI-powered automation tools that enhance developer productivity and operational efficiency.

You will collaborate with development teams, SREs, and platform architects to ensure seamless deployment and delivery of applications while maintaining the highest standards of reliability, security, and performance.

Responsibilities

Cloud Infrastructure, Automation & Operations:

Design, build, and maintain cloud infrastructure using Terraform to automate provisioning, scaling, and lifecycle management of resources on GCP
Develop and maintain CI/CD pipelines using GitLab CI to automate build, test, and deployment workflows
Implement and maintain GitOps practices using ArgoCD for declarative, version-controlled application deployment
Monitor system performance using observability tools (Dynatrace, Cloud Monitoring, Prometheus/Grafana) and troubleshoot production issues
Participate in the on-call rotation to provide 24/7 support for critical infrastructure incidents
Perform root cause analysis on incidents and implement preventive measures
Document runbooks, architecture decisions, and operational procedures

Kubernetes Platform Management:

Deploy, configure, and manage containerized applications on Google Kubernetes Engine (GKE), including GKE Autopilot and Standard clusters
Manage cluster lifecycle including upgrades, node pool configurations, and capacity planning
Troubleshoot pod failures, CrashLoopBackOff, OOMKilled events, and container resource issues
Configure and optimize resource requests/limits, Horizontal Pod Autoscaler (HPA), and Vertical Pod Autoscaler (VPA)
Manage Kubernetes networking including Services, Ingress controllers, Network Policies, and DNS configurations
Implement and manage service mesh (Istio) for traffic management, observability, and security
Manage secrets and configurations using Kubernetes Secrets, ConfigMaps, and external secret management tools
Implement pod security standards, RBAC policies, and workload identity configurations

AI/ML Platform & Automation:

Support infrastructure for AI/ML workloads including LLM-based applications and model serving platforms
Deploy and manage AI-powered developer tools such as coding assistants (Claude Code, GitHub Copilot) and agentic AI systems
Explore and implement AI-assisted incident response and automated remediation workflows
Build and maintain infrastructure for Retrieval-Augmented Generation (RAG) pipelines and vector databases
Configure GPU-enabled node pools and optimize resource allocation for AI/ML workloads
Implement MCP (Model Context Protocol) servers and AI agent integrations for operational automation
Stay current with emerging AI technologies and evaluate their applicability for infrastructure automation

Requirements

Kubernetes Expertise (Essential):

3+ years hands-on experience with Kubernetes in production environments
Deep understanding of Kubernetes architecture: API server, etcd, scheduler, controller manager, kubelet
Experience with GKE (Standard and Autopilot modes), including cluster creation, upgrades, and maintenance
Proficiency in troubleshooting workloads: analyzing pod logs, events, describe outputs, and container states
Strong understanding of resource management: requests, limits, QoS classes, and resource quotas
Experience with Kubernetes networking: Services (ClusterIP, NodePort, LoadBalancer), Ingress, Network Policies
Knowledge of Kubernetes storage: PersistentVolumes, PersistentVolumeClaims, StorageClasses, dynamic provisioning
Experience with Helm charts for application packaging and deployment
Familiarity with Kubernetes security: RBAC, Pod Security Standards, Secrets management, Workload Identity
Understanding of Kubernetes observability: metrics-server, kubectl top, container resource monitoring
Experience debugging common issues: ImagePullBackOff, CrashLoopBackOff, OOMKilled, Evicted pods, pending pods

Cloud & Infrastructure:

3+ years of experience with Google Cloud Platform (GCP) services including GKE, Cloud Run, Cloud SQL, Memorystore, Pub/Sub, and Cloud Logging
Strong experience with Terraform for infrastructure as code (IaC)
Understanding of cloud networking: VPCs, subnets, firewall rules, Cloud NAT, Private Service Connect

CI/CD & GitOps:

Proficiency with GitLab CI/CD pipelines
Experience with ArgoCD or similar GitOps tools
Understanding of Helm charts and Kustomize for Kubernetes manifest management

Observability & Troubleshooting:

Experience with monitoring and APM tools (Dynatrace, Datadog, Prometheus, Grafana)
Ability to analyze logs, metrics, and traces to diagnose production issues
Familiarity with JVM troubleshooting (heap dumps, thread analysis, GC tuning, connection pool issues)

AI/ML Knowledge:

Basic understanding of LLM concepts, prompt engineering, and AI model deployment
Familiarity with AI coding assistants and their integration into development workflows
Interest in agentic AI systems and autonomous automation tools
Exposure to vector databases (Pinecone, Weaviate, pgvector) and RAG architectures is a plus

Systems & Networking:

Strong Linux administration skills
Understanding of networking concepts (DNS, load balancing, firewalls, TCP/IP)
Experience with service mesh (Istio) is a plus

General:

Excellent problem-solving and analytical skills
Strong written and verbal communication
Ability to work effectively in a collaborative, cross-functional environment
Experience working in an Agile/DevOps culture
Bachelor's degree in Computer Science, Information Technology, or related field (or equivalent experience)

Benefits & conditions

AutoZone offers thoughtful benefits programs with one-on-one benefits guidance designed to improve AutoZoners' physical, mental and financial well-being.

All AutoZoners (Full-Time and Part-Time):

Competitive pay
Unrivaled company culture
Medical, dental and vision plans
Exclusive discounts and perks, including an AutoZone in-store discount
401(k) with company match and Stock Purchase Plan
AutoZoners Living Well Program for free mental health support
Opportunities for career growth

Additional Benefits for Full-Time AutoZoners:

Paid time off
Life, and short- and long-term disability insurance options
Health Savings and Flexible Spending Accounts with wellness rewards
Tuition reimbursement

About the company

Since opening our first store in 1979, AutoZone has grown into a leading retailer and distributor of automotive parts and accessories across the Americas. Our customer-first mindset and commitment to Going the Extra Mile define who we are, for both our customers and AutoZoners. Working at AutoZone means being part of a team that values dedication, teamwork, and growth. Whether you're helping customers or building your career, we provide tools and support to help you succeed and drive your future.
