IaaS / Kubernetes Platform Engineer

CloudLinux

Palo Alto, United States of America

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Shift work

Languages

English

Experience level

Intermediate

Compensation

€ 80K

Job location

Remote

Municipality of Madrid, Spain

Tech stack

Proxmox

Kubernetes Security

API

JIRA

Intelligent Platform Management Interface

Border Gateway Protocol

Cluster API (CAPI)

Data Centers

Linux

RAID

File Systems

Distributed Data Store

DNS

Infrastructure as a Service (IaaS)

Networking Hardware

Internet Protocol Security (IPsec)

Junos

Python

Networking Basics

Routing

Network Segmentation

Performance Tuning

Software Architecture

Reliability Engineering

Site Reliability Engineering Practices

Ansible

Runbook

Server Administration

Virtual Local Area Networks

Ceph

Policy as Code

Data Import/Export

Load Balancing

Grafana

Software Troubleshooting

Reliability of Systems

Juniper

Kubernetes

Infrastructure Automation Frameworks

Iptables

Bare Metal

Cloud Optimization

Terraform

NVMe

VMware

Job description

Our infrastructure powers 500+ VMs across multiple datacenters, serving 20+ engineering teams. We are evolving from an OpenNebula-based virtualization platform toward a Kubernetes-native multi-tenant cloud with KubeVirt for VM orchestration, while maintaining reliability and operational excellence throughout the transition.

What You Will Do

Kubernetes Platform Engineering (Primary Focus - 40%)

Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero)
Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant
Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration
Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations
Deploy and manage Policy-as-Code using Kyverno or OPA Gatekeeper for admission control, resource quotas, and security policies
Build self-service capabilities using Crossplane or similar Kubernetes-native infrastructure provisioning tools

Storage Engineering (20%)

Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5)
Manage Rook-Ceph operator deployments at scale on modern Kubernetes (v1.28+)
Implement storage tiering: Ceph for bulk storage, local NVMe for high-IOPS workloads, LINSTOR/DRBD or TopoLVM for ultra-fast replicated storage
Design and implement per-VM / per-tenant I/O isolation on shared Ceph clusters
Manage CDI (Containerized Data Importer) for VM image lifecycle in KubeVirt environments

Networking (15%)

Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption
Implement Cluster Mesh for multi-datacenter pod-to-pod connectivity
Configure Multus CNI and SR-IOV for multi-NIC VM support in KubeVirt
Work with physical network infrastructure: Juniper switches (JunOS), BGP (eBGP/iBGP), EVPN/VXLAN, VLANs
Maintain IPSec site-to-site connectivity between datacenters

Reliability and Operations (15%)

Practice SRE discipline: define and maintain SLOs with error budgets, implement proactive capacity management with 6-12 month forecasting
Design and execute chaos engineering experiments to validate system resilience
Participate in on-call rotation for IaaS infrastructure (OpenNebula, Ceph, networking)
Write and maintain runbooks, DRP documentation, and postmortem analyses
Drive proactive improvement: identify reliability risks, performance bottlenecks, and toil, then propose and implement solutions without waiting for incidents

Infrastructure as Code and Automation (10%)

Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning
Write Ansible playbooks for bare-metal server configuration and fleet management
Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management
Implement FinOps practices: cost attribution, resource utilization analysis, right-sizing recommendations using OpenCost/Kubecost

Requirements

CloudLinux is seeking a Senior IaaS / Kubernetes Platform Engineer to contribute to our Infrastructure Department. This role involves designing and managing Kubernetes platforms, optimizing Ceph storage, and ensuring network reliability. The ideal candidate will have over 5 years of engineering experience with Kubernetes and Linux systems and a proactive approach to improvements. Benefits include flexible hours, remote work, and professional development opportunities.

5+ years in infrastructure/platform engineering roles with Kubernetes production experience.
Deep Linux systems knowledge and Infrastructure as Code expertise.
Experience with Ceph distributed storage and network fundamentals.

Design, build, and operate a multi-tenant Kubernetes platform.
Manage and optimize Ceph distributed storage clusters.
Implement network solutions for pod connectivity and security.

Skills

Kubernetes management
Infrastructure automation
Linux systems knowledge
Proactive mindset

Must have

5+ years in infrastructure/platform engineering roles, with at least 3 years operating production Kubernetes clusters (not just deploying apps on K8s, but building and managing the platform itself)
Production experience with at least 3 of the following:

KubeVirt or similar VM-on-K8s technology
Cluster API (CAPI) for declarative cluster lifecycle management
Cilium or Calico (advanced CNI with eBPF or BGP integration)
Rook-Ceph or other Kubernetes storage operators at scale (100+ OSDs)
ArgoCD or Flux for GitOps-driven infrastructure management

Deep Linux systems knowledge: kernel tuning, networking stack (iptables/nftables, routing, bonding, VLAN), filesystem operations, performance troubleshooting
Ceph distributed storage experience: cluster operations, OSD lifecycle, pool management, performance tuning, troubleshooting degraded states
Infrastructure as Code: Terraform/OpenTofu + Ansible at production scale
Bare-metal infrastructure experience: IPMI/iDRAC, PXE boot, RAID configuration, hardware diagnostics, datacenter operations
Networking fundamentals: BGP, VLAN, IPSec/WireGuard, DNS, load balancing
Strong written and verbal English (B2+ minimum) - documentation, postmortems, and cross-team communication are in English
Proactive mindset: demonstrated history of identifying problems before they become incidents and driving improvements without being asked

Nice to have

Experience building multi-tenant Kubernetes platforms (vCluster, Capsule, or custom namespace isolation)
Crossplane or similar Kubernetes-native infrastructure abstraction
Policy-as-Code: Kyverno, OPA Gatekeeper, or Kubewarden
Container security: image signing (Sigstore/cosign), runtime security (Falco), sandboxed execution (Kata Containers, gVisor)
SRE practices: SLO/SLI design, error budget policies, chaos engineering (LitmusChaos, Chaos Mesh), incident management frameworks
FinOps: OpenCost, Kubecost, cloud cost optimization
Immutable OS experience: Talos Linux, Flatcar Container Linux, or similar
OpenNebula experience (we are migrating FROM it, so understanding it accelerates the transition)
Experience with LINSTOR/DRBD or TopoLVM for local high-performance storage
SR-IOV and DPDK experience for hardware-accelerated networking
Experience migrating from traditional virtualization (VMware, OpenNebula, Proxmox) to Kubernetes/KubeVirt
Grafana LGTM stack (Mimir, Loki, Tempo) for observability
Compliance environment experience (SOC2, ISO 27001, NIS2)
Go or Python programming for infrastructure tooling
Experience with Juniper JunOS switch configuration

What We're Looking For

Proactive mindset. Our current IaaS workload is still around 50% unplanned work, including incidents and ad-hoc support requests. We're looking for someone who can reduce that through better automation, preventive controls, and more resilient systems.
Platform-minded. You look for ways to replace repetitive support work with scalable solutions, for example, building self-service workflows instead of provisioning VMs manually, or introducing automated QoS policies instead of handling limits case by case.
Able to work across the current and future stack. We operate OpenNebula and Ceph today while moving toward a Kubernetes-native platform. This role requires someone who can keep the current environment reliable while helping build the next stage in a practical way.
Transparent in communication. We value technical discussions, architectural decisions, and incident reviews happening in shared channels and documented formats. That includes ADRs, postmortems, and clear written updates.
Focused on knowledge sharing. You document your work, write runbooks as you go, and help make the platform easier for others to operate and support.
Strong English communication. Documentation, postmortems, Jira updates, Slack discussions, and cross-team collaboration are conducted in English.

Benefits & conditions

A focus on professional development
Interesting and challenging projects
Fully remote work with flexible working hours, allowing you to schedule your day and work from any location worldwide
24 days of paid vacation per year, 10 days of national holidays, and unlimited sick leave
Compensation for private medical insurance
Co-working and gym/sports reimbursement
Budget for education
The opportunity to receive a reward for the most innovative idea that the company can patent

About the company

Join a company where people build innovative products and thrive in a remote-friendly environment.

🔗 Learn more:
cloudlinux.com | imunify360.com | tuxcare.com
