AI Infrastructure Platform Engineer

Ark Infotech Spectrum
Charlotte, United States of America
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Shift work
Languages
English
Experience level
Intermediate

Job location

Charlotte, United States of America

Tech stack

Artificial Intelligence
Azure
Computer Clusters
Continuous Integration
Linux
Machine Learning
OpenShift
Performance Tuning
AI Infrastructure
Google Cloud Platform
Grafana
Multi-Cloud
Generative AI
AI Platforms
Kubernetes
Splunk
Serverless Computing

Job description

Lead complex infrastructure initiatives supporting Generative AI and Predictive AI platforms from design to production operations.

  • Serve as a technical lead for platforms supporting AI/ML model training, inference, and batch workloads.
  • Design, build, deploy, and operate OpenShift-based container platforms optimized for high-performance GPU workloads.
  • Build, support, and operate scalable GPU SuperPod architectures with large multi-node GPU clusters.
  • Own monitoring, alerting, and observability using Grafana, Splunk, and enterprise telemetry tools.
  • Define SLIs/SLOs and build actionable alerts to proactively detect performance, capacity, and resiliency risks.
  • Build AI- and agent-based automation tools for self-healing, scaling, diagnostics, and incident remediation.
  • Apply AIOps techniques to reduce alert fatigue and improve platform reliability.
  • Lead production incident analysis and ensure operational rigor and root-cause prevention.
  • Mentor engineers and influence stakeholders across a geographically distributed organization.

Requirements

5+ years of infrastructure engineering experience.

  • 5+ years troubleshooting complex end-to-end architectures (including CI/CD pipelines).
  • 5+ years of Linux systems experience.
  • 4+ years supporting AI/ML platforms.
  • 4+ years of Kubernetes / container platform experience including production support.

Desired Qualifications

  • Experience with Generative AI and Predictive AI platforms.
  • Hands-on GPU platform operations including scheduling, quota, and performance tuning.
  • Experience with OpenShift in GPU-enabled, multi-tenant environments.
  • Experience designing or operating GPU SuperPods.
  • Deep experience with observability using Grafana, Splunk, and custom telemetry pipelines.
  • Experience building AI- or agent-driven automation tooling (AIOps).
  • Hands-on experience supporting AI/ML workloads on Google Cloud Platform and Azure, including GPU-backed services and managed AI infrastructure.
  • Experience operating hybrid or multi-cloud AI platforms, with an understanding of cloud-native services, networking, identity, and cost optimization for Generative and Predictive AI.
  • Strong experience monitoring AI signals such as inference latency and GPU utilization.
  • Experience with BCP/DR, resiliency, and highly available architectures.
