DevOps / AI Infrastructure Engineer - GPU & Kubernetes
Job description
We are looking for a Senior DevOps / AI Infrastructure Engineer to design, build and operate GPU-accelerated AI/ML infrastructure. You will enable high-performance training and inference workflows by managing cloud/GPU platforms, Kubernetes clusters, IaC and AI tooling (Triton, Kubeflow, MLflow). The role combines deep platform engineering with automation and close collaboration with ML engineers and R&D teams.

Responsibilities
- Design, deploy and operate GPU-enabled Kubernetes clusters and associated platform services for training and inference.
- Build and maintain CI/CD, model CI and MLOps pipelines using tools such as Kubeflow, MLflow and Triton.
- Implement and manage cloud infrastructure on Azure (and other clouds as needed), with GPU instances and storage for large datasets.
- Automate provisioning and configuration using Terraform, Ansible and scripting (Python, Bash).
- Optimize container orchestration, scheduling and GPU utilization for high-performance workloads.
- Integrate AI inference platforms (NVIDIA Triton) and support model serving at scale (a minimal client sketch follows this list).
- Work with PLM/simulation and data teams to integrate model training and inference into engineering workflows.
- Monitor, troubleshoot and tune platform performance, reliability and cost.
- Define and enforce best practices for security, resource governance and data handling in AI pipelines.
- Document architectures, runbooks and operational procedures; transfer knowledge to engineering teams.
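
The serving bullet above is easiest to picture with a concrete call. Below is a minimal sketch of querying a Triton server from Python using the tritonclient package; the server URL, model name (resnet50) and tensor names (INPUT__0, OUTPUT__0) are illustrative placeholders, not details taken from this role.

```python
# Minimal Triton HTTP inference call. The URL, model name and
# tensor names below are hypothetical placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a dummy image batch and wrap it as a Triton input tensor.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# Run inference and read back the named output tensor.
response = client.infer(model_name="resnet50", inputs=[infer_input])
print(response.as_numpy("OUTPUT__0").shape)
```

In production the same call would typically sit behind batching and autoscaling configured on the Triton deployment.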
Requirements
Start: ASAP (or as agreed). Duration: 6 months (with possibility to extend). Experience: 8-10 years (including at least 1.5 years in DevOps/cloud/SRE focused on AI/ML). Language: English (fluent).

- 8-10 years of industry experience; minimum 1.5 years in DevOps, cloud engineering or SRE with an AI/ML focus.
- Hands-on experience with major cloud providers (Azure preferred; experience with GCP/AWS is valuable).
- Experience with GPU-accelerated environments and NVIDIA ecosystem.
- Deep understanding of Kubernetes and container orchestration for high-performance computing and model serving.
- Experience with AI platform tools: NVIDIA Triton, Kubeflow, MLflow (setup, pipelines, serving); a short MLflow tracking example follows this list.
- Strong scripting and programming skills (Python, Bash) for automation and data processing.
- Proficiency with Infrastructure-as-Code: Terraform and configuration management with Ansible.
- Solid knowledge of storage, networking and security considerations for large ML workloads.
- Good communication and collaboration skills; able to work with ML researchers and engineering teams.
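
To ground the MLflow requirement, here is a minimal tracking sketch; the tracking URI, experiment name and logged values are hypothetical.

```python
# Log a training run to an MLflow tracking server. The URI,
# experiment name and values are hypothetical placeholders.
import mlflow

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")
mlflow.set_experiment("gpu-training-demo")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("batch_size", 64)
    mlflow.log_metric("val_loss", 0.42, step=1)
```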
Preferred skills
- Experience integrating AI/ML workflows with PLM or simulation platforms.
- Familiarity with GPU scheduling solutions (NVIDIA GPU Operator, device plugins, Volcano, etc.); see the pod-spec sketch after this list.
- Knowledge of monitoring/observability for ML platforms (Prometheus, Grafana, ELK, metrics for GPU workloads).
- Experience with cost-optimization and autoscaling strategies for GPU clusters.
- Familiarity with model optimization techniques (quantization, batching, mixed precision) and inference performance tuning.
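
For context on the GPU scheduling bullet: with the NVIDIA device plugin (or GPU Operator) installed, workloads request GPUs through the nvidia.com/gpu extended resource. A minimal sketch using the official Kubernetes Python client follows; the pod name, image and namespace are placeholders.

```python
# Launch a pod that requests one GPU via the nvidia.com/gpu
# extended resource. Pod name, image and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvidia/cuda:12.2.0-base-ubuntu22.04",
                command=["nvidia-smi"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

The scheduler will only place this pod on a node advertising a free GPU; tainted GPU node pools additionally require a matching toleration.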
What we offer
- Work on cutting-edge AI infrastructure supporting R&D and engineering use cases.
- Opportunity to shape GPU and MLOps practices in a collaborative technical environment.
- Competitive compensation; Eindhoven-based role with flexible working arrangements.