GPU Systems Engineer

Kforce Inc.
Bethesda, United States of America
3 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Bethesda, United States of America

Tech stack

Artificial Intelligence
Systems Engineering
Bash
Big Data
Cloud Computing
Configuration Management
Profiling
Data Centers
Software Debugging
Monitoring of Systems
Python
Linux System Administration
Network Interface
Systems Architecture
High Performance Computing
Grafana
Kubernetes
Infrastructure Automation Frameworks
Hardware Infrastructure

Job description

We are seeking a highly experienced systems engineer with deep expertise in high-performance computing environments, including GPU-based infrastructure, operating systems, and high-speed networking. This role focuses on designing, optimizing, and maintaining large-scale GPU-enabled environments that support advanced computational workloads, including AI/ML processing and data-intensive applications. This position requires hands-on work within a secure, on-site environment supporting complex technical systems and mission-critical operations.

Design, configure, and maintain GPU-based compute clusters supporting large-scale processing workloads
Collaborate with cross-functional engineering teams to define system architectures that meet performance, scalability, and efficiency requirements
Integrate GPU platforms into Linux-based environments, ensuring compatibility, reliability, and optimized performance
Analyze system and GPU performance, identify bottlenecks, and implement improvements across hardware and software layers
Develop and maintain tools for debugging, profiling, and performance analysis in Linux environments
Leverage scripting and automation tools such as Python, Bash, and configuration management frameworks to streamline operations
Maintain technical documentation including system architectures, configurations, and operational procedures
Support compliance efforts and ensure adherence to required security and operational standards

Requirements

Extensive experience in systems engineering, with a focus on high-performance or GPU-enabled environments
Strong background working with GPU data center platforms, including modern accelerator technologies
Experience with enterprise server hardware components, including storage systems, network interfaces, and high-speed interconnects
Advanced knowledge of Linux operating systems (such as common enterprise distributions)
Proven ability to troubleshoot complex system and infrastructure issues across hardware and software layers
Strong collaboration and problem-solving skills within technical team environments
Relevant industry certification aligned with information assurance or system security requirements

Preferred / Nice-to-Have Skills

Experience managing containerized or orchestrated environments, including Kubernetes-based systems
Familiarity with AI/ML workflow orchestration tools or similar pipeline frameworks
Exposure to GPU virtualization or cloud-based GPU infrastructure
Experience implementing or supporting system monitoring and observability tools
Familiarity with distributed workload scheduling systems used in high-performance computing environments

Work Environment

Full-time, on-site role within a secure operational facility
Daily on-site presence required
Collaboration with multidisciplinary technical teams supporting advanced computing environments
