AI Platform Engineer - OpenShift & Kubernetes F/M

Arkadin Cloud Communications
1 month ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Tech stack

Agile Methodologies
Artificial Intelligence
Apache HTTP Server
Systems Engineering
Ubuntu (Operating System)
CentOS
Continuous Integration
Linux
DevOps
InfiniBand
Python
Linux Distribution
Nginx
Node.js
OpenShift
Red Hat Enterprise Linux - RHEL
Simple Object Access Protocol (SOAP)
Systems Architecture
Web Services
CircleCI
Data Logging
Scripting (Bash/Python/Go/Ruby)
Enterprise Software Applications
Istio
System Availability
GitLab CI
Kubernetes
Information Technology
Deployment Automation
Operational Systems
Web Technologies
Machine Learning Operations
REST
API Management
Docker
Jenkins
Microservices

Job description

We are seeking a skilled and forward-thinking Systems Engineer with deep expertise in Linux-based operating systems, RedHat OpenShift & Kubernetes services. The ideal candidate will have hands-on experience managing large-scale, GPU-accelerated environments and a strong grasp of DevOps practices. You will play a pivotal role in deploying and maintaining AI/ML infrastructure, ensuring high availability, performance, and security across OpenShift clusters, containerized workloads, and high-throughput networking fabrics such as RoCE and InfiniBand. Your contributions will directly support the scalability and reliability of AI and data-driven platforms.

  • Manage, configure, and optimize Linux/Unix-based Operating Systems to support enterprise applications and services.
  • Design, deploy, and maintain Kubernetes clusters in production, ensuring reliability, scalability, and security.
  • Design and manage RedHat OpenShift clusters with a focus on integrating AI/ML workflows, leveraging platforms such as OpenShift AI, Mistral AI or similar to support scalable and reproducible AI and machine learning operations.
  • Architect and operate OpenShift environments optimized for GPU workloads, leveraging NVIDIA Enterprise, RUN:AI, and related orchestration tools to enable efficient resource allocation and accelerate AI/ML model training and inference at scale.
  • Implement, monitor, and troubleshoot Web Services (REST, SOAP, microservices) ensuring high availability and performance.
  • Collaborate with development teams to automate CI/CD pipelines using tools like Jenkins, GitLab CI, or similar.
  • Monitor system health and performance metrics; proactively address issues to minimize downtime.
  • Implement security best practices for OS, container orchestration, and web services.
  • Manage container lifecycle, including image creation, registry management, and deployment automation.
  • Provide support for incident management and root cause analysis.
  • Collaborate with cross-functional teams to enhance overall system architecture and deployment workflows.
  • Document system configurations, procedures, and best practices.

Requirements

  • Strong experience with Operating Systems (Linux distributions such as Ubuntu, CentOS, RedHat, or similar).
  • Hands-on experience with RedHat OpenShift administration, including cluster setup, networking, ingress, storage, and security.
  • Experience with GPU resource management in OpenShift, including configuration, scheduling, and monitoring using NVIDIA Enterprise Suite, RUN:AI, and related tools to support high-performance AI/ML workloads.
  • Good understanding of Web Services architecture, API management, and common protocols (HTTP, REST, SOAP).
  • Experience with containerization tools like Docker.
  • Familiarity with CI/CD pipelines and DevOps tooling (e.g., Jenkins, GitLab CI/CD, CircleCI).
  • Basic scripting and automation skills using Shell, Python, or similar languages.
  • Knowledge of monitoring and logging tools.
  • Ability to troubleshoot and resolve complex infrastructure and deployment issues.
  • Strong collaboration, communication, and documentation skills.

Preferred (Bonus) Skills:

  • Experience with web technologies (Node.js, Nginx, Apache, Traefik or similar).
  • Familiarity with AI and Machine Learning frameworks and platforms (RedHat OpenShift AI, Mistral AI, ZenML, ClearML).
  • Familiarity with DevOps practices and tools like Helm, Istio, or other service mesh solutions.
  • Knowledge of security standards, SSL/TLS, and compliance frameworks.
  • Experience working in Agile and CI/CD environments.
  • Certification like Red Hat Certified Specialist in OpenShift Administration (EX280).
  • Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).

About the company

NTT DATA is a $30+ billion trusted global innovator of business and technology services. We serve 75% of the Fortune Global 100 and are committed to helping clients innovate, optimize and transform for long-term success. We invest over $3.6 billion each year in R&D to help organizations and society move confidently and sustainably into the digital future. As a Global Top Employer, we have diverse experts in more than 50 countries and a robust partner ecosystem of established and start-up companies. Our services include business and technology consulting, data and artificial intelligence, industry solutions, as well as the development, implementation and management of applications, infrastructure, and connectivity. We are also one of the leading providers of digital and AI infrastructure in the world. NTT DATA is part of NTT Group and headquartered in Tokyo.
