Machine Learning Systems Engineer (Remote - EU)
Job description
We are seeking a talented Machine Learning Systems Engineer to join a remote-first, globally distributed team working on cutting-edge AI infrastructure. In this role, you will contribute to the development of large-scale language model systems, focusing on high-performance training, inference, and self-improving AI agents. You will work at the intersection of machine learning research, distributed systems, and high-performance computing, building tools and frameworks that enable researchers and organizations worldwide to deploy advanced AI solutions.
This role offers the chance to work on technically demanding, open-source projects while collaborating with a passionate international team. Your work will have a direct impact on the future of scalable AI systems.
Accountabilities
- Contribute to the development and optimization of large-scale language model frameworks
- Implement high-performance distributed training and inference pipelines using frameworks such as Megatron-LM, DeepSpeed, and vLLM
- Develop and optimize inference engines and tools for model deployment, fine-tuning, and AI agent self-improvement
- Integrate diverse machine learning ecosystems including HuggingFace and other LLM tools
- Optimize performance across multi-GPU, multi-node architectures, leveraging HPC and CUDA/ROCm programming
- Collaborate with the open-source community to enhance the codebase, implement features, and resolve issues
- Research and implement advanced techniques for self-improving AI agents and high-efficiency ML pipelines
Requirements
- 3+ years of experience in machine learning engineering or research
- Proficiency in Python and C/C++, with strong systems programming skills
- Deep understanding of high-performance computing concepts, including MPI, BSP, and distributed multi-GPU training
- Solid experience with transformer architectures, gradient descent, backpropagation, and deep learning training
- Familiarity with distributed training strategies: data parallelism, model parallelism, pipeline parallelism
- Experience with containerization (Docker, Kubernetes) and cluster orchestration
- Demonstrated experience with ML frameworks like vLLM, Megatron-LM, HuggingFace, or similar
- Commitment to open-source development and community collaboration
- Excellent problem-solving, debugging, and performance optimization skills
- Bonus: an advanced degree (MS/PhD); experience with SLURM, mixed-precision training, or MLOps; or prior contributions to major open-source ML projects
Benefits & conditions
- Competitive compensation including salary and equity participation
- Fully remote, work-from-anywhere flexibility
- Comprehensive global benefits including mental health support
- Open PTO policy and flexible working hours
- Paid parental leave and support for personal well-being
- Opportunities for continuous learning and professional development
- Regular team offsites, virtual events, and global gatherings to foster team collaboration
- Inclusive, transparent, and supportive culture prioritizing growth and knowledge-sharing