Machine Learning Systems Engineer (Remote - EU)

Jobgether
31 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate

Job location

Remote

Tech stack

Artificial Intelligence
Systems Engineering
C++
Nvidia CUDA
Software Debugging
Distributed Systems
Python
Machine Learning
Open Source Technology
Performance Tuning
System Programming
Large Language Models
Deep Learning
Containerization
Kubernetes
HuggingFace
Slurm
Machine Learning Operations
Docker

Job description

We are seeking a talented Machine Learning Systems Engineer to join a remote-first, globally distributed team working on cutting-edge AI infrastructure. In this role, you will contribute to the development of large-scale language model systems, focusing on high-performance training, inference, and self-improving AI agents. You will work at the intersection of machine learning research, distributed systems, and high-performance computing, building tools and frameworks that enable researchers and organizations worldwide to deploy advanced AI solutions.

This role offers the chance to work on technically demanding, open-source projects while collaborating with a passionate international team. Your work will have a direct impact on the future of scalable AI systems.

Accountabilities

  • Contribute to the development and optimization of large-scale language model frameworks
  • Implement high-performance distributed training algorithms using frameworks such as Megatron-LM, DeepSpeed, and vLLM
  • Develop and optimize inference engines and tools for model deployment, fine-tuning, and AI agent self-improvement
  • Integrate diverse machine learning ecosystems including HuggingFace and other LLM tools
  • Optimize performance across multi-GPU, multi-node architectures, leveraging HPC and CUDA/ROCm programming
  • Collaborate with the open-source community to enhance the codebase, implement features, and resolve issues
  • Research and implement advanced techniques for self-improving AI agents and high-efficiency ML pipelines

Requirements

  • 3+ years of experience in machine learning engineering or research
  • Proficiency in Python and C/C++, with strong systems programming skills
  • Deep understanding of high-performance computing concepts, including MPI, BSP, and distributed multi-GPU training
  • Solid experience with transformer architectures, gradient descent, backpropagation, and deep learning training
  • Familiarity with distributed training strategies: data parallelism, model parallelism, pipeline parallelism
  • Experience with containerization (Docker) and cluster orchestration (Kubernetes)
  • Demonstrated experience with ML frameworks like vLLM, Megatron-LM, HuggingFace, or similar
  • Commitment to open-source development and community collaboration
  • Excellent problem-solving, debugging, and performance optimization skills
  • Bonus: an advanced degree (MS/PhD), experience with SLURM, mixed-precision training, or MLOps, or prior contributions to major open-source ML projects

Benefits & conditions

  • Competitive compensation including salary and equity participation
  • Fully remote, work-from-anywhere flexibility
  • Comprehensive global benefits including mental health support
  • Open PTO policy and flexible working hours
  • Paid parental leave and support for personal well-being
  • Opportunities for continuous learning and professional development
  • Regular team offsites, virtual events, and global gatherings to foster collaboration
  • Inclusive, transparent, and supportive culture prioritizing growth and knowledge-sharing
