Systems Engineer
Job description
· Design, develop, and maintain large-scale data processing pipelines using Apache Spark.
· Monitor and troubleshoot Spark job failures, including driver/executor crashes and performance bottlenecks.
· Manage and optimize workloads running on Hadoop (HDFS, YARN) clusters.
· Provide support for onboarding new data pipelines and services into the platform.
· Analyze and resolve resource allocation issues, such as exceeded CPU/memory quotas, in Kubernetes environments.
· Build and maintain ETL pipelines for ingesting, transforming, and loading large datasets.
· Ensure data quality, consistency, and integrity across distributed data systems.
· Implement alerting rules and thresholds using monitoring platforms (e.g., Prometheus-based systems).
· Work with Kafka to manage data streaming pipelines, including topic configuration and access control.
· Troubleshoot Kafka consumer/producer issues, including lag, permissions, and connectivity errors.
· Implement and maintain data retention policies in Lakehouse architectures.
· Perform log analysis and debugging using distributed logging tools.
· Coordinate with infrastructure teams to resolve cluster-level or networking issues.
· Configure and manage storage paths, table-level retention, and lifecycle policies for datasets.
· Develop and execute SQL queries for data analysis, validation, and reporting.
· Automate workflows and monitoring using Python scripts and APIs (e.g., GitHub API, SQLAlchemy).
· Continuously improve system efficiency, scalability, and cost-effectiveness.
· Analyze alerts from monitoring systems and take proactive action to prevent outages.
· Investigate production incidents (P1/P2) and perform root cause analysis (RCA).
· Collaborate with cross-functional teams (developers, SREs, data engineers) to resolve system issues.
· Conduct data validation and reconciliation between upstream and downstream systems.
· Maintain dashboards and observability tools (e.g., Hubble, internal monitoring systems).
· Optimize performance of distributed jobs by tuning configurations and execution plans.
· Handle identity and access management issues across systems (Kerberos, service accounts, ACLs).
· Support migration and integration of new data technologies into existing ecosystems.
· Work on Kubernetes-based Spark deployments and troubleshoot pod scheduling and quota issues.
· Ensure high availability and reliability of data pipelines and streaming jobs.
· Participate in on-call rotations and respond to critical production alerts.
· Document system architecture, troubleshooting steps, and operational procedures.
· Ensure compliance with organizational data governance and security standards.
Requirements
· Experience in Python.
· Bachelor's degree in Computer Science, Computer Engineering, Computer Information Systems, Information Technology, or Data Science.