Senior Operations Lead - AI Infrastructure (Data Center)

Next Orbits Inc.
Los Angeles, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$150K

Job location

Los Angeles, United States of America

Tech stack

Java
Microsoft Windows
API
Agile Methodologies
Artificial Intelligence
Amazon Web Services (AWS)
Azure
Databases
Microsoft Management Console
System Configuration
Data Centers
Linux
DevOps
Networking Hardware
Python
Network Architecture
Network Monitoring
Oracle Applications
Systems Development Life Cycle
Prometheus
Runbook
Software Engineering
SQL Databases
VMware Infrastructure
AI Infrastructure
Scripting (Bash/Python/Go/Ruby)
High Performance Computing
Delivery Pipeline
Grafana
Reliability of Systems
Kubernetes
Information Technology
Performance Monitor
Data Management
Hardware Infrastructure
Cloud Integration
Network Server
API Management
Programming Languages

Job description

We operate a GPU-dense AI infrastructure environment in Downtown Los Angeles. This is the senior onsite delivery role - the person who owns operational outcomes, leads the team, and holds the line on runbook compliance and SLAs every single shift. If you have deep data center experience, have led technical teams in mission-critical environments, and are looking for a high-accountability role with real ownership, read on.

Responsibilities

  • Lead onsite operations across power, network, compute, storage, and platform layers
  • Act as escalation authority for all incidents - you are the final call onsite
  • Enforce runbook compliance and governance standards across the team
  • Oversee MELT observability stack implementation (Metrics, Events, Logs, Traces)
  • Manage SLA adherence and drive incident resolution with full accountability
  • Lead, mentor, and develop onsite technical staff
  • Own expansion readiness for Phase 2 and Phase 3 infrastructure scaling
  • Serve as primary client-facing point of contact for operational matters
  • Oversee the daily operations of AI data centers, ensuring optimal performance and uptime for critical infrastructure components.
  • Lead the planning, deployment, and scaling of IT infrastructure including servers, storage systems, networking hardware, and cloud integrations (AWS, Azure).
  • Collaborate with software engineering teams to support AI development environments using operating systems such as Windows and Linux, ensuring compatibility with tools like Java, Python, SQL, and APIs.
  • Implement and manage DevOps practices to streamline deployment cycles (SDLC), automate workflows, and enhance system reliability.
  • Monitor system health through comprehensive CCTV surveillance, network monitoring tools, and asset management systems; respond swiftly to incidents or anomalies.
  • Coordinate with cybersecurity teams to enforce security protocols across physical and virtual infrastructure components.
  • Drive continuous improvement initiatives by adopting Agile methodologies for project management and operational workflows.
  • Maintain detailed documentation of system configurations, procedures, incident reports, and compliance records to ensure transparency and audit readiness.

Requirements

  • Experience in system performance monitoring
  • 10+ years in data center or infrastructure operations
  • Proven leadership in shift-based, mission-critical environments
  • Strong working knowledge of networking, storage, and Kubernetes
  • Experience with observability tools - Prometheus, Grafana, or equivalent
  • Demonstrated ability to build and enforce operational process discipline
  • Clear, confident communication with both technical teams and client stakeholders
  • Direct experience with GPU-dense infrastructure environments
  • Familiarity with MELT observability frameworks
  • Background in managed services delivery
  • GPU infrastructure operations
  • Prometheus / Grafana / MELT stack
  • Kubernetes
  • Enterprise networking and storage
  • ITSM and incident management
  • Runbook governance and documentation

We are seeking an energetic and detail-oriented Senior Operations Lead specializing in AI Infrastructure within Data Center environments. This pivotal role drives the management, optimization, and scaling of complex data center operations that support cutting-edge artificial intelligence initiatives. You will lead cross-functional teams to ensure the seamless deployment, maintenance, and security of AI-focused IT infrastructure, fostering innovation and operational excellence. Your expertise will empower our organization to deliver high-performance AI solutions while maintaining robust system reliability and security standards.

  • Proven experience managing large-scale IT infrastructure within data centers supporting AI or high-performance computing environments.
  • Strong background in computer science principles with a focus on operating systems such as Windows and Linux.
  • Hands-on expertise with cloud platforms including AWS and Azure for infrastructure deployment and management.
  • Familiarity with programming languages like Java and Python for automation, scripting, and API integrations.
  • Knowledge of SQL databases such as Oracle or similar systems for data management tasks.
  • Experience implementing DevOps pipelines utilizing tools aligned with SDLC best practices in Agile settings.
  • Solid understanding of computer management concepts including hardware lifecycle management, API integration, network architecture, and system security protocols.
  • Ability to lead cross-disciplinary teams effectively while managing multiple projects in a fast-paced environment.

Join us to lead transformative AI infrastructure initiatives that push technological boundaries! Your expertise will shape the future of intelligent systems while ensuring operational excellence across our data centers.
