Senior Technical Lead - DevOps, Python, Kubernetes

HCL America Inc.
Santa Clara, United States of America
7 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$134K

Job location

Santa Clara, United States of America

Tech stack

Amazon Web Services (AWS)
Bash
Configuration Management
Data as a Service
DevOps
Disaster Recovery
Distributed Data Store
Distributed Systems
Python
Lightweight Directory Access Protocol (LDAP)
PostgreSQL
Linux System Administration
Performance Tuning
Prometheus
Service Discovery
Software Vulnerability Management
Apache Zookeeper
Scripting (Bash/Python/Go/Ruby)
Cloud Monitoring
Grafana
Apigee
Data Layers
Kubernetes
Infrastructure Automation Frameworks
Cassandra
Terraform

Job description

We are seeking an experienced Data Services Lead Engineer to own the technical direction, architecture, and operational excellence of our data platform. This role requires deep expertise in Cassandra, ZooKeeper, and Consul operations, strong leadership skills, and a passion for building robust, scalable distributed data systems. You will guide the team on best practices, lead complex technical projects, and act as the primary escalation point for data-platform-related issues. The team is also responsible for ZooKeeper, Consul, LDAP, PostgreSQL, and Qpid.

Lead the design, architecture, and implementation of highly available, scalable, and performant distributed data stores (including Cassandra and PostgreSQL) across cloud and on-premises environments.
Define and drive the technical roadmap and strategy for the persistence services layer within Apigee Edge Data Services.
Lead incident response and management with clear communication.
Lead comprehensive post-mortem analyses for production incidents to identify root causes, document findings, and drive the implementation of preventative measures across the data platform.
Lead vulnerability management initiatives, including the execution of regular version and security upgrades for all supported data services.
Establish and enforce best practices for distributed systems data modeling, capacity planning, performance tuning, security, and disaster recovery.
Develop and improve automation for cluster provisioning, configuration management, and upgrades.
Serve as the primary technical escalation point for complex production issues, including root cause analysis.
Mentor and provide technical guidance to other engineers across the organization.
Collaborate with Engineering, SRE, and Support teams to align the data layer with platform requirements.
Drive continuous improvement initiatives to enhance reliability and maintainability.
Participate in the team's on-call rotation for production support.

Requirements

7+ years of experience managing large-scale, mission-critical distributed data systems (e.g., Cassandra, ZooKeeper) in a production environment.
Understanding of Consul for service discovery and configuration management.
Deep understanding of distributed system architectures, data modeling, internals, and performance tuning.
Proficiency in Linux environments and scripting languages (e.g., Python, Bash).
Experience with infrastructure-as-code tools (e.g., Terraform).
Experience with monitoring and alerting systems (e.g., Prometheus, Grafana, Cloud Monitoring).
Experience working in cloud environments (GCP, AWS, etc.).
