AI Platform Engineer - OpenShift & Kubernetes (F/M)
Job description
We are seeking a skilled and forward-thinking Systems Engineer with deep expertise in Linux-based operating systems, Red Hat OpenShift, and Kubernetes services. The ideal candidate will have hands-on experience managing large-scale, GPU-accelerated environments and a strong grasp of DevOps practices. You will play a pivotal role in deploying and maintaining AI/ML infrastructure, ensuring high availability, performance, and security across OpenShift clusters, containerized workloads, and high-throughput networking fabrics such as RoCE and InfiniBand. Your contributions will directly support the scalability and reliability of AI and data-driven platforms.
- Manage, configure, and optimize Linux/Unix-based operating systems to support enterprise applications and services.
- Design, deploy, and maintain Kubernetes clusters in production, ensuring reliability, scalability, and security.
- Design and manage Red Hat OpenShift clusters with a focus on integrating AI/ML workflows, leveraging platforms such as OpenShift AI, Mistral AI, or similar to support scalable and reproducible AI and machine learning operations.
- Architect and operate OpenShift environments optimized for GPU workloads, leveraging NVIDIA Enterprise, RUN:AI, and related orchestration tools to enable efficient resource allocation and accelerate AI/ML model training and inference at scale.
- Implement, monitor, and troubleshoot Web Services (REST, SOAP, microservices) ensuring high availability and performance.
- Collaborate with development teams to automate CI/CD pipelines using tools like Jenkins, GitLab CI, or similar.
- Monitor system health and performance metrics; proactively address issues to minimize downtime.
- Implement security best practices for OS, container orchestration, and web services.
- Manage container lifecycle, including image creation, registry management, and deployment automation.
- Provide support for incident management and root cause analysis.
- Collaborate with cross-functional teams to enhance overall system architecture and deployment workflows.
- Document system configurations, procedures, and best practices.
Requirements
- Strong experience with operating systems (Linux distributions such as Ubuntu, CentOS, Red Hat, or similar).
- Hands-on experience with Red Hat OpenShift administration, including cluster setup, networking, ingress, storage, and security.
- Experience with GPU resource management in OpenShift, including configuration, scheduling, and monitoring using NVIDIA Enterprise Suite, RUN:AI, and related tools to support high-performance AI/ML workloads.
- Good understanding of Web Services architecture, API management, and common protocols (HTTP, REST, SOAP).
- Experience with containerization tools like Docker.
- Familiarity with CI/CD pipelines and DevOps tooling (e.g., Jenkins, GitLab CI/CD, CircleCI).
- Basic scripting and automation skills using Shell, Python, or similar languages.
- Knowledge of monitoring and logging tools.
- Ability to troubleshoot and resolve complex infrastructure and deployment issues.
- Strong collaboration, communication, and documentation skills.
Preferred (Bonus) Skills:
- Experience with web technologies (Node.js, Nginx, Apache, Traefik or similar).
- Familiarity with AI and machine learning frameworks and platforms (Red Hat OpenShift AI, Mistral AI, ZenML, ClearML).
- Familiarity with DevOps practices and tools like Helm, Istio, or other service mesh solutions.
- Knowledge of security standards, SSL/TLS, and compliance frameworks.
- Experience working in Agile and CI/CD environments.
- Certification such as Red Hat Certified Specialist in OpenShift Administration (EX280).
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience).