Cloudera Public Cloud Platform Engineer

PROPERTY CONSULTANT FIN SVC

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote

Tech stack

Artificial Intelligence
Amazon Web Services (AWS)
Data analysis
Azure
Bash
Command-Line Interface
Cloud Computing
Cloudera Impala
Computer Networks
System Configuration
Data as a Service
Data Validation
Information Engineering
Data Infrastructure
Data Integration
Data Security
Data Visualization
Software Debugging
Disaster Recovery
Hive
Identity and Access Management
Python
Kerberos (Protocol)
Log Analysis
Metadata
Performance Tuning
Reliability Engineering
Cloud Services
Prometheus
Software Defined Everything
Cloudera
Service Pack
Data Streaming
Transport Layer Security
Google Cloud Platform
Cloud Platform System
Data Ingestion
Autoscaling
Cloudera Manager
System Availability
Grafana
Spark
Containerization
Data Lake
AI Platforms
Kubernetes
Data Lineage
Kafka
Apache NiFi
Data Management
Machine Learning Operations
Terraform
Software Version Control
Data Pipelines
Serverless Computing
Legacy Systems

Job description

  • Enterprise-scale Cloudera CDP platform supporting data engineering, analytics, and AI workloads across multiple applications
  • Modernization of legacy platforms and applications into cloud-native CDP services
  • Operational support and scaling of:
      • Data services (CDE, CDW, CDF, CDL)
      • AI/ML platforms (CAI, inference, workbenches)
  • Platform performance optimization, observability, and reliability engineering for mission-critical workloads

Why This Role Matters

  • Ensures availability, stability, and performance of the CDP platform supporting all data and AI workloads
  • Enables successful modernization of legacy applications into scalable, cloud-native services
  • Maintains high availability, observability, and operational excellence across enterprise platforms
  • Acts as the backbone for data engineering, analytics, and AI initiatives
  • This role focuses on platform reliability and infrastructure operations and does not include data-layer ownership (e.g., Iceberg table management or data validation).

We are seeking a highly skilled Cloudera Public Cloud Platform Engineer to operate and manage the end-to-end CDP platform ecosystem, including data services, NiFi, Kafka, AI/ML platforms, and enterprise observability.

This role is responsible for ensuring availability, scalability, security, and performance of all platform services supporting data, analytics, and AI workloads across environments.

The ideal candidate brings strong expertise in CDP on-prem, public cloud services, cloud infrastructure, Kubernetes-based runtime environments, and platform observability, supporting high-concurrency, mission-critical workloads at multi-terabyte to petabyte scale.

This role is critical to ensuring uninterrupted operation of data, analytics, and AI platforms; any degradation directly impacts downstream business reporting, data pipelines, and model execution.

Key Responsibilities

CDP Platform & Multi-Service Operations

  • Own end-to-end operational responsibility for Cloudera Public Cloud services across Dev / Stage / UAT / Prod:
      • CDE, CDW, COD, CDL, CDF (NiFi), CDV, CAI, Kafka
  • Ensure multi-cluster stability, workload isolation, and SLA adherence
  • Support onboarding and operations of multiple applications across environments
  • Manage and support multi-environment, multi-cluster deployments with strict isolation, governance, and release coordination across Dev/UAT/Prod

AI/ML Platform Operations

  • Operate and support Cloudera AI (CAI) environments:
      • AI Workbenches, AI Studios
      • Model training and development environments
      • AI inference endpoints and model serving
  • Troubleshoot:
      • Resource contention (CPU/GPU)
      • Model deployment/runtime failures

CDP Runtime & Kubernetes-Aware Operations

  • Operate CDP services running on Cloudera-managed Kubernetes infrastructure
  • Apply strong understanding of containerized workloads and Kubernetes concepts for troubleshooting
  • Diagnose and resolve:
      • Pod failures, restarts, and resource contention
      • Spark job failures in containerized environments (CDE)
      • Service-to-service communication issues
  • Analyze logs and metrics to identify runtime failures and performance issues
  • Collaborate with Cloudera support for managed service-level issues

Data Integration & Platform Services

  • Operate and support:
      • CDF (NiFi) for ingestion pipelines
      • CDV (Data Visualization) for reporting workloads
      • Octopai for data lineage and catalog integration
  • Ensure reliability and performance of data pipelines and integrations
  • Monitor and troubleshoot Kafka environments:
      • Topic configurations, partitions, and replication
      • Consumer lag and throughput issues
      • Broker connectivity and performance bottlenecks
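
For illustration only, consumer lag for a group is commonly understood as the gap between each partition's latest (end) offset and the group's committed offset. The sketch below computes it from two plain dictionaries; the function name, the sample offsets, and the choice to treat a partition with no commit as starting from offset 0 are all assumptions made for the example, not part of any Cloudera or Kafka API.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return (per-partition lag, total lag) for one consumer group.

    end_offsets:       {partition: latest offset in the log}
    committed_offsets: {partition: last offset committed by the group}
    """
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset is counted from 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero so a stale end-offset reading never yields negative lag.
        lag[partition] = max(end - committed, 0)
    return lag, sum(lag.values())


# Hypothetical sample values for a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed_offsets = {0: 1450, 1: 1200}

per_partition, total = consumer_lag(end_offsets, committed_offsets)
print(per_partition, total)
```

In practice the two offset maps would come from a Kafka admin client or an existing monitoring exporter; a lag value that keeps growing on one partition usually points at a slow or stuck consumer rather than a broker problem.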

Security, Governance & SDX Administration

  • Implement and manage:
      • Kerberos, TLS/SSL, Ranger policies
  • Administer SDX for:
      • Centralized security
      • Metadata and policy enforcement
  • Support Atlas and Octopai integration
  • Manage and troubleshoot user access and identity mapping across layers, including:
      • Cloud IAM roles and permissions
      • CDP users/groups and identity providers
      • Ranger policies for fine-grained data access
  • Resolve access-related issues impacting:
      • Data access (S3/ADLS)
      • Query execution (CDW/CDE)
      • Application and service-level permissions

Cloud Infrastructure & Networking

  • Troubleshoot:
      • S3 / ADLS storage issues
      • IAM roles and permissions
      • VPC, subnets, routing, security groups
      • Bastion host access and connectivity
  • Ensure secure and reliable connectivity across services
  • Understand and troubleshoot S3-based data lake patterns, including:
      • Bucket structure, prefix design, and access patterns
      • Performance issues related to small files, request rates, and throughput limits
      • Encryption (SSE-S3, SSE-KMS) and access policies
  • Manage and troubleshoot cross-account IAM roles and access patterns for CDP environments
  • Ensure secure access between:
      • CDP environments and cloud resources
      • Multiple AWS accounts (dev/prod separation)

Disaster Recovery & Resiliency

  • Support and validate disaster recovery and failover strategies across CDP environments
  • Ensure backup, recovery, and environment resiliency for critical workloads
  • Participate in DR drills and recovery validation

Observability, Monitoring & Alerting (Critical)

  • Implement and manage end-to-end observability:
      • Metrics, logs, and alerting
  • Use:
      • Cloudera Observability, Cloudera Manager, Prometheus, Grafana
  • Monitor:
      • Cluster health
      • Workload performance
      • AI inference endpoints
  • Enable proactive issue detection and prevention
  • Define and implement SLIs/SLOs and alerting thresholds to ensure platform reliability and performance
  • Support high-severity (P1/P2) incident response, triage, and resolution within defined SLAs
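
As a rough illustration of the SLI/SLO work described above, an availability SLO implies an error budget (the downtime or failure share the target still allows), and alert thresholds are often expressed as a burn rate against that budget. The sketch below shows the arithmetic; the function names, the 99.9% target, the 30-day window, and the request counts are hypothetical values chosen for the example, not figures from this role.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of allowed unavailability for an SLO over a rolling window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def burn_rate(bad_events, total_events, slo_target):
    """How fast the error budget is being consumed.

    1.0 means the budget would be used up exactly at the end of the window;
    values above 1.0 mean it would run out early.
    """
    if total_events == 0:
        return 0.0  # no traffic observed, so nothing was burned
    observed_error_rate = bad_events / total_events
    return observed_error_rate / (1.0 - slo_target)


# Hypothetical example: 99.9% availability target over 30 days.
budget = error_budget_minutes(0.999, 30)
# Hypothetical SLI sample: 50 failed requests out of 10,000.
rate = burn_rate(50, 10_000, 0.999)
print(round(budget, 1), round(rate, 1))
```

A team might page on a high burn rate over a short window and open a ticket on a lower burn rate over a longer one; the exact thresholds are a policy decision for each platform.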

Operational Support & On-Call

  • Participate in on-call rotation to support 24/7 platform operations
  • Respond to production incidents, alerts, and service disruptions within defined SLAs
  • Handle P1/P2 incidents, including triage, troubleshooting, and resolution
  • Perform root cause analysis (RCA) and implement preventive measures

Upgrades, Patching & Platform Lifecycle

  • Execute:
      • CDP upgrades and version management
      • Security patches and hotfixes
  • Perform:
      • Rolling upgrades
      • Validation and rollback strategies

Performance Optimization & Cost Efficiency

  • Optimize:
      • Platform-level performance (Spark, Hive, Impala workloads)
      • Cluster utilization and workload distribution
  • Drive:
      • Autoscaling strategies
      • Cost optimization (FinOps practices)

Automation & Operational Excellence

  • Utilize and support existing automation frameworks for:
      • Platform provisioning
      • Monitoring and alerting
      • Routine operational tasks
  • Work with infrastructure teams that manage Infrastructure-as-Code (Terraform) for environment setup and changes
  • Leverage scripting (Python / Shell) for:
      • Operational support
      • Task automation
      • Troubleshooting and diagnostics
  • Maintain and follow runbooks, SOPs, and operational procedures to ensure consistent platform operations

Requirements

  • 12+ years of experience in Big Data Platform Engineering / Cloud Platform Operations / Infrastructure roles
  • 6+ years of hands-on experience with the Cloudera ecosystem (CDH / CDP / Cloudera Public Cloud)
  • Demonstrated ability to quickly learn and adapt to new technologies and evolving platform capabilities, beyond the currently defined CDP stack
  • Strong expertise in:
      • End-to-end CDP platform operations (CDE, CDW, CDF, CDL, CAI)
      • Advanced troubleshooting across multi-cluster, multi-environment deployments
      • Kubernetes-based runtime environments (troubleshooting and diagnostics)
      • Observability frameworks, including SLIs/SLOs, alerting, and performance tuning
  • Proven experience in:
      • Leading P1/P2 incident response, triage, and resolution
      • Managing platform upgrades, patching, and lifecycle events
      • Supporting large-scale environments (TB/PB scale, high concurrency workloads)
  • Strong understanding of:
      • Cloud infrastructure (IAM, VPC, networking, storage)
      • Security and governance (Ranger, Kerberos, TLS/SSL, SDX)
  • Expected to:
      • Lead complex troubleshooting and drive root cause resolution across platform layers
      • Mentor and guide L2 engineers
      • Coordinate with Cloudera support and infrastructure teams for critical issues
  • Hands-on experience in developing and troubleshooting NiFi (CDF) data flows, including:
      • Flow design and configuration
      • Processor-level debugging and performance tuning
      • Handling backpressure, throughput optimization, and failure recovery

Required Skills

  • Strong experience with Cloudera CDP Public Cloud
  • Expertise in:
      • Cloud platforms (AWS/Azure/Google Cloud Platform)
      • Kubernetes concepts (troubleshooting-focused)
  • Hands-on with:
      • CDE, CDW, CDF (NiFi), CAI
  • Knowledge of:
      • IAM, networking, observability tools
      • Platforms operating at multi-terabyte to petabyte scale with high concurrency workloads
  • Hands-on experience with:
      • Kafka (or similar streaming platforms), including monitoring, troubleshooting, and performance tuning
  • Experience with the Cloudera CDP CLI (Command Line Interface) for:
      • Platform operations and administration
      • Job execution and service management (CDE/CDW/CDL)
      • Automation of routine operational tasks
  • Strong working knowledge of:
      • Cloud IAM (AWS IAM / Azure AD), including roles, policies, and cross-service access
      • User and group mapping across CDP, cloud IAM, and Ranger policies
      • Troubleshooting access issues across storage (S3/ADLS), CDP services, and data access layers

Preferred Skills

  • Experience with:
      • Modernization of legacy data platforms/applications to Cloudera CDP Public Cloud
      • Migration and onboarding of workloads to CDE, CDW, and CAI environments
      • Supporting hybrid or multi-environment transitions (on-prem to cloud)
  • Familiarity with:
      • Cloud platforms (AWS, Azure, Google Cloud Platform), including storage, IAM, and networking concepts
      • Kubernetes-based runtime environments (troubleshooting-focused)
  • Strong scripting and automation skills (Python, Shell, Terraform) for platform operations
