Senior Site Reliability Engineer - AI/ML-optimized GPU clusters
Job description
Your responsibilities will include:
- Ensure fault-tolerance, scale, and uninterrupted operations for the service.
- Use cutting-edge cloud technology to solve a variety of infrastructure problems.
- Implement and improve CI/CD processes.
Requirements
- Solid experience with programming languages (such as Go, Python, or C++), beyond scripting;
- Experience with multi-node environments running large numbers of GPUs;
- Good understanding of classic algorithms and data structures;
- Commercial experience with, and deep understanding of, Unix/Linux systems and network technology;
- Solid experience with CI/CD and IaC;
- Experience with containerization and configuration management (Ansible, Salt, Terraform, Docker, Kubernetes, Helm).
It will be an added bonus if you have:
- A desire to be involved in backend development;
- Experience designing, developing, and running high-load distributed systems;
- Experience with a variety of cloud platforms.
Coding interviews are part of the process.
Benefits & conditions
- Competitive salary and comprehensive benefits package.
- Opportunities for professional growth and taking ownership in a massively scaling environment.
- Flexible working arrangements.
- A dynamic and collaborative work environment that values initiative and innovation.
- On-site in Amsterdam or fully remote (across Europe).
Role details
- Business unit: The Next Chapter W&S
- Locations: Europe, Amsterdam
- Remote status: Hybrid
- Is work permit / visa sponsorship offered? Yes, but only for candidates already based in Europe.
- Is remote possible? This role is open for both on-site in the Netherlands and fully remote.
- Is freelance possible? No, this is a permanent job with a regular contract of employment.
- Which language skills are required (professional level)? English
- Employment type: Full-time, Regular - indefinite, Regular - temporary