Software Engineer - AI Research Clusters/Remote

Apetan Consulting
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote

Tech stack

Artificial Intelligence
Amazon Web Services (AWS)
Azure
C++
Cloud Computing
Nvidia CUDA
Computer Programming
Distributed Systems
Fault Tolerance
General-Purpose Computing on Graphics Processing Units
Python
Performance Tuning
TensorFlow
Data Processing
Google Cloud Platform
PyTorch
Spark
Parallel Computation
Reliability of Systems
Infrastructure as Code (IaC)
Containerization
Kubernetes
Infrastructure Automation Frameworks
Information Technology
Data Management
Slurm
Machine Learning Operations
Docker
Go

Job description

  • Design and manage scalable AI research infrastructure and compute clusters.
  • Build and optimize distributed systems for large-scale model training and data processing.
  • Develop tools and frameworks to support researchers and ML engineers.
  • Work closely with AI researchers to understand workload requirements and improve system efficiency.
  • Optimize GPU/CPU utilization, storage, and networking performance.
  • Implement scheduling, resource allocation, and workload orchestration systems.
  • Ensure system reliability, monitoring, and fault tolerance.
  • Automate infrastructure provisioning using Infrastructure as Code (IaC).
  • Troubleshoot performance bottlenecks and system failures.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or related field.
  • Strong programming skills in Python, Go, C++, or similar.
  • Experience with distributed systems and parallel computing.
  • Hands-on experience with containerization and orchestration tools (Docker, Kubernetes).
  • Familiarity with cloud platforms (AWS, Azure, or Google Cloud Platform) or on-prem HPC clusters.
  • Understanding of networking, storage systems, and system performance tuning.
  • Experience with ML frameworks (TensorFlow, PyTorch).
  • Familiarity with GPU computing (CUDA, NCCL).
  • Knowledge of cluster schedulers (Slurm, Kubernetes schedulers).
  • Experience with big data tools (Spark, Ray).
  • Exposure to MLOps and experiment tracking tools.
  • Strong problem-solving and systems-thinking skills.
  • Ability to collaborate with research and engineering teams.
  • A performance-optimization mindset.
  • Ownership and accountability.

Apply for this position