Software Engineer - AI Research Clusters/Remote

Apetan Consulting
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote

Tech stack

Artificial Intelligence
Amazon Web Services (AWS)
Azure
C++
Cloud Computing
Nvidia CUDA
Computer Programming
Distributed Systems
Fault Tolerance
General-Purpose Computing on Graphics Processing Units
Python
Performance Tuning
TensorFlow
Data Processing
Google Cloud Platform
PyTorch
Spark
Parallel Computation
Reliability of Systems
Infrastructure as Code (IaC)
Containerization
Kubernetes
Infrastructure Automation Frameworks
Information Technology
Data Management
Slurm
Machine Learning Operations
Docker
Go

Job description

  • Design and manage scalable AI research infrastructure and compute clusters.
  • Build and optimize distributed systems for large-scale model training and data processing.
  • Develop tools and frameworks to support researchers and ML engineers.
  • Work closely with AI researchers to understand workload requirements and improve system efficiency.
  • Optimize GPU/CPU utilization, storage, and networking performance.
  • Implement scheduling, resource allocation, and workload orchestration systems.
  • Ensure system reliability, monitoring, and fault tolerance.
  • Automate infrastructure provisioning using Infrastructure as Code (IaC).
  • Troubleshoot performance bottlenecks and system failures.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or related field.
  • Strong programming skills in Python, Go, C++, or similar.
  • Experience with distributed systems and parallel computing.
  • Hands-on experience with containerization and orchestration tools (Docker, Kubernetes).
  • Familiarity with cloud platforms (AWS, Azure, or Google Cloud Platform) or on-prem HPC clusters.
  • Understanding of networking, storage systems, and system performance tuning.
  • Experience with ML frameworks (TensorFlow, PyTorch).
  • Familiarity with GPU computing (CUDA, NCCL).
  • Knowledge of cluster schedulers (Slurm, Kubernetes schedulers).
  • Experience with big data tools (Spark, Ray).
  • Exposure to MLOps and experiment tracking tools.
  • Strong problem-solving and systems-thinking skills.
  • Ability to collaborate with research and engineering teams.
  • A performance-optimization mindset.
  • Ownership and accountability.

Apply for this position