Cloudera Public Cloud Platform Engineer
Job description
- Enterprise-scale Cloudera CDP platform supporting data engineering, analytics, and AI workloads across multiple applications
- Modernization of legacy platforms and applications into cloud-native CDP services
- Operational support and scaling of:
  - Data services (CDE, CDW, CDF, CDL)
  - AI/ML platforms (CAI, inference, workbenches)
- Platform performance optimization, observability, and reliability engineering for mission-critical workloads
Why This Role Matters
- Ensures availability, stability, and performance of the CDP platform supporting all data and AI workloads
- Enables successful modernization of legacy applications into scalable, cloud-native services
- Maintains high availability, observability, and operational excellence across enterprise platforms
- Acts as the backbone for data engineering, analytics, and AI initiatives
- This role focuses on platform reliability and infrastructure operations and does not include data-layer ownership (e.g., Iceberg table management or data validation).
We are seeking a highly skilled Cloudera Public Cloud Platform Engineer to operate and manage the end-to-end CDP platform ecosystem, including data services, NiFi, Kafka, AI/ML platforms, and enterprise observability.
This role is responsible for ensuring availability, scalability, security, and performance of all platform services supporting data, analytics, and AI workloads across environments.
The ideal candidate brings strong expertise in CDP on-prem, public cloud services, cloud infrastructure, Kubernetes-based runtime environments, and platform observability, supporting high-concurrency, mission-critical workloads at multi-terabyte to petabyte scale.
This role is critical to ensuring uninterrupted operation of data, analytics, and AI platforms: any degradation directly impacts downstream business reporting, data pipelines, and model execution.
Key Responsibilities
CDP Platform & Multi-Service Operations
- Own end-to-end operational responsibility for Cloudera Public Cloud services across Dev / Stage / UAT / Prod:
  - CDE, CDW, COD, CDL, CDF (NiFi), CDV, CAI, Kafka
- Ensure multi-cluster stability, workload isolation, and SLA adherence
- Support onboarding and operations of multiple applications across environments
- Manage and support multi-environment, multi-cluster deployments with strict isolation, governance, and release coordination across Dev/UAT/Prod
AI/ML Platform Operations
- Operate and support Cloudera AI (CAI) environments:
  - AI Workbenches, AI Studios
  - Model training and development environments
  - AI inference endpoints and model serving
- Troubleshoot:
  - Resource contention (CPU/GPU)
  - Model deployment/runtime failures
CDP Runtime & Kubernetes-Aware Operations
- Operate CDP services running on Cloudera-managed Kubernetes infrastructure
- Apply strong understanding of containerized workloads and Kubernetes concepts for troubleshooting
- Diagnose and resolve:
  - Pod failures, restarts, and resource contention
  - Spark job failures in containerized environments (CDE)
  - Service-to-service communication issues
- Analyze logs and metrics to identify runtime failures and performance issues
- Collaborate with Cloudera support for managed service-level issues
Data Integration & Platform Services
- Operate and support:
  - CDF (NiFi) for ingestion pipelines
  - CDV (Data Visualization) for reporting workloads
  - Octopai for data lineage and catalog integration
- Ensure reliability and performance of data pipelines and integrations
- Monitor and troubleshoot Kafka environments:
  - Topic configurations, partitions, and replication
  - Consumer lag and throughput issues
  - Broker connectivity and performance bottlenecks
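To illustrate the consumer-lag monitoring mentioned above, the following is a minimal sketch rather than a description of the team's actual tooling. It assumes that per-partition log end offsets and committed consumer offsets have already been collected (for example, from a Kafka admin client or a metrics exporter); the function names, the sample topic, and the threshold are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# A partition is identified by (topic name, partition number).
TopicPartition = Tuple[str, int]


def consumer_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, Optional[int]],
) -> Dict[TopicPartition, int]:
    """Return lag per partition as (log end offset - committed offset).

    Assumption for this sketch: a partition with no committed offset is
    reported with lag equal to its end offset, i.e. treated as fully unread.
    """
    lag: Dict[TopicPartition, int] = {}
    for tp, end in end_offsets.items():
        committed = committed_offsets.get(tp)
        if committed is None:
            lag[tp] = end
        else:
            # Clamp at zero in case the two snapshots were taken at
            # slightly different times.
            lag[tp] = max(end - committed, 0)
    return lag


def partitions_over_threshold(
    lag: Dict[TopicPartition, int], threshold: int
) -> List[TopicPartition]:
    """List partitions whose lag exceeds the (hypothetical) alert threshold."""
    return sorted(tp for tp, value in lag.items() if value > threshold)


if __name__ == "__main__":
    end = {("orders", 0): 1500, ("orders", 1): 900, ("orders", 2): 40}
    committed = {("orders", 0): 1450, ("orders", 1): 100}
    lag = consumer_lag(end, committed)
    print(lag)
    print(partitions_over_threshold(lag, threshold=100))
```

In practice the threshold would likely be tuned per consumer group, and a sustained upward trend in lag is usually more informative than a single snapshot.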
Security, Governance & SDX Administration
- Implement and manage:
  - Kerberos, TLS/SSL, Ranger policies
- Administer SDX for:
  - Centralized security
  - Metadata and policy enforcement
- Support Atlas and Octopai integration
- Manage and troubleshoot user access and identity mapping across layers, including:
  - Cloud IAM roles and permissions
  - CDP users/groups and identity providers
  - Ranger policies for fine-grained data access
- Resolve access-related issues impacting:
  - Data access (S3/ADLS)
  - Query execution (CDW/CDE)
  - Application and service-level permissions
Cloud Infrastructure & Networking
- Troubleshoot:
  - S3 / ADLS storage issues
  - IAM roles and permissions
  - VPC, subnets, routing, security groups
  - Bastion host access and connectivity
- Ensure secure and reliable connectivity across services
- Understand and troubleshoot S3-based data lake patterns, including:
  - Bucket structure, prefix design, and access patterns
  - Performance issues related to small files, request rates, and throughput limits
  - Encryption (SSE-S3, SSE-KMS) and access policies
- Manage and troubleshoot cross-account IAM roles and access patterns for CDP environments
- Ensure secure access between:
  - CDP environments and cloud resources
  - Multiple AWS accounts (dev/prod separation)
Disaster Recovery & Resiliency
- Support and validate disaster recovery and failover strategies across CDP environments
- Ensure backup, recovery, and environment resiliency for critical workloads
- Participate in DR drills and recovery validation
Observability, Monitoring & Alerting (Critical)
- Implement and manage end-to-end observability:
  - Metrics, logs, and alerting
- Use:
  - Cloudera Observability, Cloudera Manager, Prometheus, Grafana
- Monitor:
  - Cluster health
  - Workload performance
  - AI inference endpoints
- Enable proactive issue detection and prevention
- Define and implement SLIs/SLOs and alerting thresholds to ensure platform reliability and performance
- Support high-severity (P1/P2) incident response, triage, and resolution within defined SLAs
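As a rough illustration of the SLI/SLO work described above, the sketch below computes an availability SLI from request counts and the share of error budget still unspent against an SLO target. It is a simplified example under stated assumptions, not the platform's actual alerting logic: the function names, the 99.9% target, and the request counts are hypothetical.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that met the success criterion.

    Assumption for this sketch: a window with no traffic counts as
    fully available.
    """
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(
    good_events: int, total_events: int, slo_target: float
) -> float:
    """Share of the error budget still unspent (1.0 = untouched, <0 = overspent).

    The error budget is the number of failures the SLO tolerates in the
    window: total_events * (1 - slo_target).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = total_events * (1.0 - slo_target)
    if allowed_failures == 0:
        return 1.0
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 100,000 requests, 50 failures, 99.9% SLO.
    print(availability_sli(99_950, 100_000))
    print(error_budget_remaining(99_950, 100_000, slo_target=0.999))
```

An alerting threshold could then be expressed in terms of remaining budget (or its burn rate) rather than raw error counts, which keeps alerts tied to the SLO itself.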
Operational Support & On-Call
- Participate in on-call rotation to support 24/7 platform operations
- Respond to production incidents, alerts, and service disruptions within defined SLAs
- Handle P1/P2 incidents, including triage, troubleshooting, and resolution
- Perform root cause analysis (RCA) and implement preventive measures
Upgrades, Patching & Platform Lifecycle
- Execute:
  - CDP upgrades and version management
  - Security patches and hotfixes
- Perform:
  - Rolling upgrades
  - Validation and rollback strategies
Performance Optimization & Cost Efficiency
- Optimize:
  - Platform-level performance (Spark, Hive, Impala workloads)
  - Cluster utilization and workload distribution
- Drive:
  - Autoscaling strategies
  - Cost optimization (FinOps practices)
Automation & Operational Excellence
- Utilize and support existing automation frameworks for:
  - Platform provisioning
  - Monitoring and alerting
  - Routine operational tasks
- Work with infrastructure teams that manage Infrastructure-as-Code (Terraform) for environment setup and changes
- Leverage scripting (Python / Shell) for:
  - Operational support
  - Task automation
  - Troubleshooting and diagnostics
- Maintain and follow runbooks, SOPs, and operational procedures to ensure consistent platform operations
Requirements
- 12+ years of experience in Big Data Platform Engineering / Cloud Platform Operations / Infrastructure roles
- 6+ years of hands-on experience with Cloudera ecosystem (CDH/CDP/Cloudera Public Cloud)
- Demonstrated ability to quickly learn and adapt to new technologies and evolving platform capabilities, beyond the currently defined CDP stack
- Strong expertise in:
  - End-to-end CDP platform operations (CDE, CDW, CDF, CDL, CAI)
  - Advanced troubleshooting across multi-cluster, multi-environment deployments
  - Kubernetes-based runtime environments (troubleshooting and diagnostics)
  - Observability frameworks, including SLIs/SLOs, alerting, and performance tuning
- Proven experience in:
  - Leading P1/P2 incident response, triage, and resolution
  - Managing platform upgrades, patching, and lifecycle events
  - Supporting large-scale environments (TB/PB scale, high concurrency workloads)
- Strong understanding of:
  - Cloud infrastructure (IAM, VPC, networking, storage)
  - Security and governance (Ranger, Kerberos, TLS/SSL, SDX)
- Expected to:
  - Lead complex troubleshooting and drive root cause resolution across platform layers
  - Mentor and guide L2 engineers
  - Coordinate with Cloudera support and infrastructure teams for critical issues
- Hands-on experience in developing and troubleshooting NiFi (CDF) data flows, including:
  - Flow design and configuration
  - Processor-level debugging and performance tuning
  - Handling backpressure, throughput optimization, and failure recovery
Required Skills
- Strong experience with Cloudera CDP Public Cloud
- Expertise in:
  - Cloud platforms (AWS/Azure/Google Cloud Platform)
  - Kubernetes concepts (troubleshooting-focused)
- Hands-on with:
  - CDE, CDW, CDF (NiFi), CAI
- Knowledge of:
  - IAM, networking, observability tools
- Experience supporting platforms operating at multi-terabyte to petabyte scale with high-concurrency workloads
- Hands-on experience with:
  - Kafka (or similar streaming platforms), including monitoring, troubleshooting, and performance tuning
- Experience with Cloudera CDP CLI (Command Line Interface) for:
  - Platform operations and administration
  - Job execution and service management (CDE/CDW/CDL)
  - Automation of routine operational tasks
- Strong working knowledge of:
  - Cloud IAM (AWS IAM / Azure AD), including roles, policies, and cross-service access
  - User and group mapping across CDP, cloud IAM, and Ranger policies
  - Troubleshooting access issues across storage (S3/ADLS), CDP services, and data access layers
Preferred Skills
- Experience with:
  - Modernization of legacy data platforms/applications to Cloudera CDP Public Cloud
  - Migration and onboarding of workloads to CDE, CDW, and CAI environments
  - Supporting hybrid or multi-environment transitions (on-prem to cloud)
- Familiarity with:
  - Cloud platforms (AWS, Azure, Google Cloud Platform), including storage, IAM, and networking concepts
  - Kubernetes-based runtime environments (troubleshooting-focused)
- Strong scripting and automation skills (Python, Shell, Terraform) for platform operations