AI Infrastructure Platform Engineer

Ark Infotech Spectrum

Charlotte, United States of America

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Shift work

Languages

English

Experience level

Intermediate

Job location

Charlotte, United States of America

Tech stack

Artificial Intelligence

Azure

Computer Clusters

Continuous Integration

Linux

Machine Learning

OpenShift

Performance Tuning

AI Infrastructure

Google Cloud Platform

Grafana

Multi-Cloud

Generative AI

AI Platforms

Kubernetes

Splunk

Serverless Computing

Job description

Lead complex infrastructure initiatives supporting Generative AI and Predictive AI platforms from design to production operations.

Serve as a technical lead for platforms supporting AI/ML model training, inference, and batch workloads.
Design, build, deploy, and operate OpenShift-based container platforms optimized for high-performance GPU workloads.
Build, support, and operate scalable GPU SuperPod architectures with large multi-node GPU clusters.
Own monitoring, alerting, and observability using Grafana, Splunk, and enterprise telemetry tools.
Define SLIs/SLOs and build actionable alerts to proactively detect performance, capacity, and resiliency risks.
Build AI- and agent-based automation tools for self-healing, scaling, diagnostics, and incident remediation.
Apply AIOps techniques to reduce alert fatigue and improve platform reliability.
Lead production incident analysis and ensure operational rigor and root-cause prevention.
Mentor engineers and influence stakeholders across a geographically distributed organization.

Requirements

5+ years of infrastructure engineering experience.

5+ years troubleshooting complex end-to-end architectures (including CI/CD pipelines).
5+ years Linux systems experience.
4+ years supporting AI/ML platforms.
4+ years of Kubernetes / container platform experience including production support.

Desired Qualifications

Experience with Generative AI and Predictive AI platforms.
Hands-on GPU platform operations including scheduling, quota, and performance tuning.
Experience with OpenShift in GPU-enabled, multi-tenant environments.
Experience designing or operating GPU SuperPods.
Deep experience with observability using Grafana, Splunk, and custom telemetry pipelines.
Experience building AI- or agent-driven automation tooling (AIOps).
Hands-on experience supporting AI/ML workloads on Google Cloud Platform and Azure, including GPU-backed services and managed AI infrastructure.
Experience operating hybrid or multi-cloud AI platforms, with an understanding of cloud-native services, networking, identity, and cost optimization for Generative and Predictive AI.
Strong experience monitoring AI signals such as inference latency and GPU utilization.
Experience with BCP/DR, resiliency, and highly available architectures.
