Machine Learning Systems Engineer (Remote - EU)

Jobgether

31 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Remote

Tech stack

Artificial Intelligence

Systems Engineering

C++

Nvidia CUDA

Software Debugging

Distributed Systems

Python

Machine Learning

Open Source Technology

Performance Tuning

System Programming

Large Language Models

Deep Learning

Containerization

Kubernetes

HuggingFace

Slurm

Machine Learning Operations

Docker

Job description

We are seeking a talented Machine Learning Systems Engineer to join a remote-first, globally distributed team working on cutting-edge AI infrastructure. In this role, you will contribute to the development of large-scale language model systems, focusing on high-performance training, inference, and self-improving AI agents. You will work at the intersection of machine learning research, distributed systems, and high-performance computing, building tools and frameworks that enable researchers and organizations worldwide to deploy advanced AI solutions.

This role offers the chance to work on technically demanding, open-source projects while collaborating with a passionate international team. Your work will have a direct impact on the future of scalable AI systems.

Accountabilities

Contribute to the development and optimization of large-scale language model frameworks
Implement high-performance distributed training algorithms using frameworks such as Megatron-LM, DeepSpeed, and vLLM
Develop and optimize inference engines and tools for model deployment, fine-tuning, and AI agent self-improvement
Integrate diverse machine learning ecosystems including HuggingFace and other LLM tools
Optimize performance across multi-GPU, multi-node architectures, leveraging HPC and CUDA/ROCm programming
Collaborate with the open-source community to enhance the codebase, implement features, and resolve issues
Research and implement advanced techniques for self-improving AI agents and high-efficiency ML pipelines

Requirements

3+ years of experience in machine learning engineering or research
Proficiency in Python and C/C++, with strong systems programming skills
Deep understanding of high-performance computing concepts, including MPI, BSP, and distributed multi-GPU training
Solid experience with transformer architectures, gradient descent, backpropagation, and deep learning training
Familiarity with distributed training strategies: data parallelism, model parallelism, pipeline parallelism
Experience with containerization (Docker) and cluster orchestration (Kubernetes)
Demonstrated experience with ML frameworks like vLLM, Megatron-LM, HuggingFace, or similar
Commitment to open-source development and community collaboration
Excellent problem-solving, debugging, and performance optimization skills
Bonus: an advanced degree (MS/PhD); experience with SLURM, mixed-precision training, or MLOps; or prior contributions to major open-source ML projects

Benefits & conditions

Competitive compensation including salary and equity participation
Fully remote, work-from-anywhere flexibility
Comprehensive global benefits including mental health support
Open PTO policy and flexible working hours
Paid parental leave and support for personal well-being
Opportunities for continuous learning and professional development
Regular team offsites, virtual events, and global gatherings to foster team collaboration
Inclusive, transparent, and supportive culture prioritizing growth and knowledge-sharing