GPU Systems Engineer

Kforce Inc.

Bethesda, United States of America

1 month ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Bethesda, United States of America

Tech stack

Artificial Intelligence

Systems Engineering

Bash

Big Data

Cloud Computing

Configuration Management

Profiling

Data Centers

Software Debugging

Monitoring of Systems

Python

Linux System Administration

Network Interface

Systems Architecture

High Performance Computing

Grafana

Kubernetes

Infrastructure Automation Frameworks

Hardware Infrastructure

Job description

We are seeking a highly experienced systems engineer with deep expertise in high-performance computing environments, including GPU-based infrastructure, operating systems, and high-speed networking. This role focuses on designing, optimizing, and maintaining large-scale GPU-enabled environments that support advanced computational workloads, including AI/ML processing and data-intensive applications. This position requires hands-on work within a secure, on-site environment supporting complex technical systems and mission-critical operations.

Design, configure, and maintain GPU-based compute clusters supporting large-scale processing workloads

Collaborate with cross-functional engineering teams to define system architectures that meet performance, scalability, and efficiency requirements

Integrate GPU platforms into Linux-based environments, ensuring compatibility, reliability, and optimized performance

Analyze system and GPU performance, identify bottlenecks, and implement improvements across hardware and software layers

Develop and maintain tools for debugging, profiling, and performance analysis in Linux environments

Leverage scripting and automation tools such as Python, Bash, and configuration management frameworks to streamline operations

Maintain technical documentation including system architectures, configurations, and operational procedures

Support compliance efforts and ensure adherence to required security and operational standards

Requirements

Extensive experience in systems engineering, with a focus on high-performance or GPU-enabled environments

Strong background working with GPU data center platforms, including modern accelerator technologies

Experience with enterprise server hardware components, including storage systems, network interfaces, and high-speed interconnects

Advanced knowledge of Linux operating systems (such as common enterprise distributions)

Proven ability to troubleshoot complex system and infrastructure issues across hardware and software layers

Strong collaboration and problem-solving skills within technical team environments

Relevant industry certification aligned with information assurance or system security requirements

Preferred / Nice-to-Have Skills

Experience managing containerized or orchestrated environments, including Kubernetes-based systems

Familiarity with AI/ML workflow orchestration tools or similar pipeline frameworks

Exposure to GPU virtualization or cloud-based GPU infrastructure

Experience implementing or supporting system monitoring and observability tools

Familiarity with distributed workload scheduling systems used in high-performance computing environments

Work Environment

Full-time, on-site role within a secure operational facility

Daily on-site presence required

Collaboration with multidisciplinary technical teams supporting advanced computing environments
