Senior Operations Lead - AI Infrastructure (Data Center)

Next Orbits Inc.

Los Angeles, United States of America

1 month ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$150K

Job location

Los Angeles, United States of America

Tech stack

Java

Microsoft Windows

API

Agile Methodologies

Artificial Intelligence

Amazon Web Services (AWS)

Azure

Databases

Microsoft Management Console

System Configuration

Data Centers

Linux

DevOps

Networking Hardware

Python

Network Architecture

Network Monitoring

Oracle Applications

Systems Development Life Cycle

Prometheus

Runbook

Software Engineering

SQL Databases

VMware Infrastructure

AI Infrastructure

Scripting (Bash/Python/Go/Ruby)

High Performance Computing

Delivery Pipeline

Grafana

Reliability of Systems

Kubernetes

Information Technology

Performance Monitor

Data Management

Hardware Infrastructure

Cloud Integration

Network Server

API Management

Programming Languages

Job description

We operate a GPU-dense AI infrastructure environment in Downtown Los Angeles. This is the senior onsite delivery role - the person who owns operational outcomes, leads the team, and holds the line on runbook compliance and SLAs every single shift. If you have deep data center experience, have led technical teams in mission-critical environments, and are looking for a high-accountability role with real ownership, read on.

Responsibilities

Lead onsite operations across power, network, compute, storage, and platform layers
Act as escalation authority for all incidents - you are the final call onsite
Enforce runbook compliance and governance standards across the team
Oversee MELT observability stack implementation (Metrics, Events, Logs, Traces)
Manage SLA adherence and drive incident resolution with full accountability
Lead, mentor, and develop onsite technical staff
Own expansion readiness for Phase 2 and Phase 3 infrastructure scaling
Serve as primary client-facing point of contact for operational matters

Oversee the daily operations of AI data centers, ensuring optimal performance and uptime for critical infrastructure components.
Lead the planning, deployment, and scaling of IT infrastructure including servers, storage systems, networking hardware, and cloud integrations (AWS, Azure).
Collaborate with software engineering teams to support AI development environments using operating systems such as Windows and Linux, ensuring compatibility with tools like Java, Python, SQL, and APIs.
Implement and manage DevOps practices to streamline deployment cycles (SDLC), automate workflows, and enhance system reliability.
Monitor system health through comprehensive CCTV surveillance, network monitoring tools, and asset management systems; respond swiftly to incidents or anomalies.
Coordinate with cybersecurity teams to enforce security protocols across physical and virtual infrastructure components.
Drive continuous improvement initiatives by adopting Agile methodologies for project management and operational workflows.
Maintain detailed documentation of system configurations, procedures, incident reports, and compliance records to ensure transparency and audit readiness.

Requirements

Do you have experience in system performance monitoring?

10+ years in data center or infrastructure operations
Proven leadership in shift-based, mission-critical environments
Strong working knowledge of networking, storage, and Kubernetes
Experience with observability tools - Prometheus, Grafana, or equivalent
Demonstrated ability to build and enforce operational process discipline
Clear, confident communication with both technical teams and client stakeholders

Direct experience with GPU-dense infrastructure environments
Familiarity with MELT observability frameworks
Background in managed services delivery

GPU infrastructure operations
Prometheus / Grafana / MELT stack
Kubernetes
Enterprise networking and storage
ITSM and incident management
Runbook governance and documentation

We are seeking an energetic and detail-oriented Senior Operations Lead specializing in AI Infrastructure within Data Center environments. This pivotal role drives the management, optimization, and scaling of complex data center operations that support cutting-edge artificial intelligence initiatives. You will lead cross-functional teams to ensure the seamless deployment, maintenance, and security of AI-focused IT infrastructure, fostering innovation and operational excellence. Your expertise will empower our organization to deliver high-performance AI solutions while maintaining robust system reliability and security standards.

Proven experience managing large-scale IT infrastructure within data centers supporting AI or high-performance computing environments.
Strong background in computer science principles with a focus on operating systems such as Windows and Linux.
Hands-on expertise with cloud platforms including AWS and Azure for infrastructure deployment and management.
Familiarity with programming languages like Java and Python for automation, scripting, and API integrations.
Knowledge of SQL databases such as Oracle or similar systems for data management tasks.
Experience implementing DevOps pipelines utilizing tools aligned with SDLC best practices in Agile settings.
Solid understanding of computer management concepts including hardware lifecycle management, API integration, network architecture, and system security protocols.
Ability to lead cross-disciplinary teams effectively while managing multiple projects in a fast-paced environment.

Join us to lead transformative AI infrastructure initiatives that push technological boundaries! Your expertise will shape the future of intelligent systems while ensuring operational excellence across our data centers.
