IaaS / Kubernetes Platform Engineer
Job description
Our infrastructure powers 500+ VMs across multiple datacenters, serving 20+ engineering teams. We are evolving from an OpenNebula-based virtualization platform toward a Kubernetes-native multi-tenant cloud with KubeVirt for VM orchestration, while maintaining reliability and operational excellence throughout the transition.
What You Will Do
Kubernetes Platform Engineering (Primary Focus - 40%)
- Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero)
- Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant
- Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration
- Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations
- Deploy and manage Policy-as-Code using Kyverno or OPA Gatekeeper for admission control, resource quotas, and security policies
- Build self-service capabilities using Crossplane or similar Kubernetes-native infrastructure provisioning tools
Storage Engineering (20%)
- Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5)
- Manage Rook-Ceph operator deployments at scale on modern Kubernetes (v1.28+)
- Implement storage tiering: Ceph for bulk storage, local NVMe for high-IOPS workloads, LINSTOR/DRBD or TopoLVM for ultra-fast replicated storage
- Design and implement per-VM / per-tenant I/O isolation on shared Ceph clusters
- Manage CDI (Containerized Data Importer) for VM image lifecycle in KubeVirt environments
Networking (15%)
- Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption
- Implement Cluster Mesh for multi-datacenter pod-to-pod connectivity
- Configure Multus CNI and SR-IOV for multi-NIC VM support in KubeVirt
- Work with physical network infrastructure: Juniper switches (JunOS), BGP (eBGP/iBGP), EVPN/VXLAN, VLANs
- Maintain IPsec site-to-site connectivity between datacenters
Reliability and Operations (15%)
- Practice SRE discipline: define and maintain SLOs with error budgets, and implement proactive capacity management with 6-12 month forecasting
- Design and execute chaos engineering experiments to validate system resilience
- Participate in on-call rotation for IaaS infrastructure (OpenNebula, Ceph, networking)
- Write and maintain runbooks, disaster recovery plan (DRP) documentation, and postmortem analyses
- Drive proactive improvement: identify reliability risks, performance bottlenecks, and toil, then propose and implement solutions without waiting for incidents
Infrastructure as Code and Automation (10%)
- Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning
- Write Ansible playbooks for bare-metal server configuration and fleet management
- Automate infrastructure lifecycle: PXE boot images, hardware provisioning (Foreman), IPMI management
- Implement FinOps practices: cost attribution, resource utilization analysis, right-sizing recommendations using OpenCost/Kubecost
Requirements
CloudLinux is seeking a Senior IaaS / Kubernetes Platform Engineer to contribute to our Infrastructure Department. This role involves designing and managing Kubernetes platforms, optimizing Ceph storage, and ensuring network reliability. The ideal candidate will have over 5 years of engineering experience with Kubernetes and Linux systems and a proactive approach to improvements. Benefits include flexible hours, remote work, and professional development opportunities.
- 5+ years in infrastructure/platform engineering roles with Kubernetes production experience.
- Deep Linux systems knowledge and Infrastructure as Code expertise.
- Experience with Ceph distributed storage and networking fundamentals.
- Design, build, and operate a multi-tenant Kubernetes platform.
- Manage and optimize Ceph distributed storage clusters.
- Implement network solutions for pod connectivity and security.
Skills
Kubernetes management, Infrastructure automation, Linux systems knowledge, Proactive mindset
Must have
- 5+ years in infrastructure/platform engineering roles, with at least 3 years operating production Kubernetes clusters (not just deploying apps on K8s, but building and managing the platform itself)
- Production experience with at least 3 of the following:
  - KubeVirt or similar VM-on-K8s technology
  - Cluster API (CAPI) for declarative cluster lifecycle management
  - Cilium or Calico (advanced CNI with eBPF or BGP integration)
  - Rook-Ceph or other Kubernetes storage operators at scale (100+ OSDs)
  - ArgoCD or Flux for GitOps-driven infrastructure management
- Deep Linux systems knowledge: kernel tuning, networking stack (iptables/nftables, routing, bonding, VLAN), filesystem operations, performance troubleshooting
- Ceph distributed storage experience: cluster operations, OSD lifecycle, pool management, performance tuning, troubleshooting degraded states
- Infrastructure as Code: Terraform/OpenTofu + Ansible at production scale
- Bare-metal infrastructure experience: IPMI/iDRAC, PXE boot, RAID configuration, hardware diagnostics, datacenter operations
- Networking fundamentals: BGP, VLAN, IPsec/WireGuard, DNS, load balancing
- Strong written and verbal English (B2+ minimum) - documentation, postmortems, and cross-team communication are in English
- Proactive mindset: demonstrated history of identifying problems before they become incidents and driving improvements without being asked
Nice to have
- Experience building multi-tenant Kubernetes platforms (vCluster, Capsule, or custom namespace isolation)
- Crossplane or similar Kubernetes-native infrastructure abstraction
- Policy-as-Code: Kyverno, OPA Gatekeeper, or Kubewarden
- Container security: image signing (Sigstore/cosign), runtime security (Falco), sandboxed execution (Kata Containers, gVisor)
- SRE practices: SLO/SLI design, error budget policies, chaos engineering (LitmusChaos, Chaos Mesh), incident management frameworks
- FinOps: OpenCost, Kubecost, cloud cost optimization
- Immutable OS experience: Talos Linux, Flatcar Container Linux, or similar
- OpenNebula experience (we are migrating FROM it, so understanding it accelerates the transition)
- Experience with LINSTOR/DRBD or TopoLVM for local high-performance storage
- SR-IOV and DPDK experience for hardware-accelerated networking
- Experience migrating from traditional virtualization (VMware, OpenNebula, Proxmox) to Kubernetes/KubeVirt
- Grafana LGTM stack (Mimir, Loki, Tempo) for observability
- Compliance environment experience (SOC2, ISO 27001, NIS2)
- Go or Python programming for infrastructure tooling
- Experience with Juniper JunOS switch configuration
What We're Looking For
- Proactive mindset. Our current IaaS workload is still around 50% unplanned work, including incidents and ad-hoc support requests. We're looking for someone who can reduce that through better automation, preventive controls, and more resilient systems.
- Platform-minded. You look for ways to replace repetitive support work with scalable solutions, for example, building self-service workflows instead of provisioning VMs manually, or introducing automated QoS policies instead of handling limits case by case.
- Able to work across the current and future stack. We operate OpenNebula and Ceph today while moving toward a Kubernetes-native platform. This role requires someone who can keep the current environment reliable while helping build the next stage in a practical way.
- Transparent in communication. We value technical discussions, architectural decisions, and incident reviews happening in shared channels and documented formats. That includes ADRs, postmortems, and clear written updates.
- Focused on knowledge sharing. You document your work, write runbooks as you go, and help make the platform easier for others to operate and support.
- Strong English communication. Documentation, postmortems, Jira updates, Slack discussions, and cross-team collaboration are conducted in English.
Benefits & conditions
- A focus on professional development
- Interesting and challenging projects
- Fully remote work with flexible working hours, allowing you to schedule your day and work from any location worldwide
- 24 days of paid vacation per year, 10 days of national holidays, and unlimited sick leave
- Compensation for private medical insurance
- Co-working and gym/sports reimbursement
- Budget for education
- The opportunity to receive a reward for the most innovative idea that the company can patent
About the company
Join a company where people build innovative products and thrive in a remote-friendly environment.
🔗 Learn more:
cloudlinux.com | imunify360.com | tuxcare.com