Infrastructure Operations Engineer Greensboro, NC

NSCALE, LLC

Winston-Salem, United States of America

1 month ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Compensation

$160K

Job location

Winston-Salem, United States of America

Tech stack

Bash

Border Gateway Protocol

Cloud Computing

Continuous Integration

Data Centers

Linux

File Systems

DNS

GitHub

Issue Tracking Systems

InfiniBand

IP Addressing

Subnetting

Job Scheduling

Python

Networking Basics

Routing

OpenStack

Remote Direct Memory Access

Ansible

Runbook

Virtual Local Area Networks

Software Troubleshooting

Git

Kubernetes

Terraform

Job description

We're looking for an engineer who has strong people, leadership, and technical skills.

A technical expert responsible for ensuring the efficiency, reliability, and scalability of data centre infrastructure.

Join the Support duty rotation and handle day-to-day tickets and alerts, escalating early and appropriately. Collaborate with Engineering, with guidance, when incidents or changes require it.
Accurately record, update, manage, and resolve tickets using the ticketing system whilst keeping all parties informed of each ticket's progress.
Follow established runbooks to resolve common issues. Propose improvements and contribute incremental fixes with review.
Keep tickets up to date with clear notes, next steps, and customer communications via the agreed channels.
Learn the Platform fundamentals so you can help customers get value from our services, asking for support when deeper expertise is needed.
Participate in monitoring, troubleshooting, and triage. Capture logs and facts to enable efficient handover.
Deliver assigned tasks and project work to agreed quality and timelines. Flag blockers early and seek help when needed.
Share knowledge by documenting steps you've validated and by contributing to training materials. Shadow seniors during complex work to build capability.
Take part in incident reviews as a contributor and help track preventative follow-ups in your scope.
Identify processes that could be automated and propose automation to optimize them.
Constantly endeavor to learn and upskill.
Collaborate with cross-functional teams for service improvements. Be the escalation point for onsite operations staff.
Participate in on-call or out-of-hours work when scheduled and after onboarding.
Be available to travel to Nscale or customer locations to assist with deployments, troubleshooting, and operational tasks, and to attend supplier-related training courses.

Requirements

Do you have experience in customer communication?

You're comfortable problem-solving and making decisions on complex topics with high levels of ambiguity in a results-driven environment.

You're comfortable influencing without authority and exceptional at building relationships with senior stakeholders across the business to get things done.
You have the understanding and skillset to grasp technical concepts and problems quickly.
You have strong analytical skills.
You're a doer who is extremely organized and diligent.
You're a self-starter, curious, and quick to learn, knowing what questions to ask to get up to speed quickly.
Growth mindset. Curious, dependable, and collaborative. You seek feedback, ask questions, and invest in learning to progress toward Senior.
Platform and DC fundamentals. Awareness of servers, networks, storage, and virtualization concepts, ideally from a support or operations background.
Linux fundamentals. Comfortable with the CLI, services via systemd, filesystems, permissions, and basic networking tools. Able to troubleshoot common issues and know when to escalate.
Networking basics. Solid grasp of IP addressing, subnets, VLANs, routing at a high level, DNS, and firewalls. Advanced topics like BGP or VXLAN are a plus, not required.
Kubernetes exposure. Understand core concepts like nodes, pods, services, and logs. Can perform basic troubleshooting and follow runbooks. Cluster-level administration experience is a nice to have.
GPU awareness. Familiar with basic diagnostics such as nvidia-smi.
Observability foundations. Able to use dashboards and alerts to identify symptoms, gather evidence, and follow runbooks. Comfortable proposing simple alert or dashboard tweaks with review.
Scripting and automation basics. Comfortable reading and writing simple Bash or Python snippets and using Git for version control. Experience with Ansible or Terraform is beneficial but not required.
Cloud and virtualization basics. Familiarity with common hypervisor or cloud troubleshooting flows. OpenStack experience is a plus, not a requirement.

Nice to Have:

Hands-on exposure to Kubernetes administration, operators, and storage or networking add-ons.

Deeper GPU/HPC concepts such as RDMA/InfiniBand, the basics of performant distributed workloads, or job schedulers. Awareness of, or experience using, NCCL for performance troubleshooting.
Infrastructure as Code and config management tools like Ansible or Terraform.
GitOps and CI/CD participation. Contributing to pipelines and modernizing scripts using GitHub Actions or similar.
Experience with access and security tooling used at Nscale, such as Teleport or Vault.
Progress toward relevant certifications over time (e.g., Linux, Kubernetes, cloud, or security)

Benefits & conditions

Highly competitive package (base + equity) with reviews every 12 months.
Join the fastest-growing tech startup: your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
Human-First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.

About the company

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.
