ETL Developer
Job description
We are looking for an experienced ETL Developer to join our data engineering team. You will be responsible for building robust, scalable data pipelines that transform vast amounts of raw data into actionable business intelligence. The ideal candidate has deep expertise in the Hadoop ecosystem and is a master at optimizing PySpark-based processing. You will bridge the gap between complex raw data sources and clean, high-performance data lakes, ensuring our systems remain efficient, reliable, and secure.
- Distributed Data Processing: Architect and develop high-performance ETL pipelines using PySpark to process large-scale datasets within a Hadoop environment
- Pipeline Optimization: Diagnose and resolve performance bottlenecks in data jobs, focusing on cluster resource utilization, memory management, and data partitioning strategies
- Data Architecture: Design scalable data models and storage structures within Hadoop (HDFS, Hive, etc.) to support high-volume analytics
- Automation & Orchestration: Build and maintain automated data workflows, ensuring data quality, consistency, and timely delivery
- System Design & Performance: Apply advanced concepts to data engineering, including distributed system architecture, load balancing for processing nodes, and caching strategies to minimize latency
- AI Integration: Leverage modern AI tools to accelerate pipeline development, code generation, and automated data quality testing
Requirements
- Big Data Expertise: Extensive hands-on experience with the Hadoop ecosystem (HDFS, Hive, MapReduce, etc.)
- PySpark Mastery: Advanced proficiency in PySpark for large-scale data manipulation and transformation
- ETL Workflow: Proven experience designing end-to-end ETL processes from ingestion to consumption
- Performance Tuning: Strong ability to diagnose system slowness, optimize Spark configurations, and resolve data skew issues
- Architectural Mindset: Ability to design distributed systems that are modular, scalable, and resilient
- Scripting/Programming: Advanced Python skills, with a focus on writing maintainable and testable code
Desired skills
- Experience with cloud-based big data services (AWS EMR, Databricks, or Google Cloud Dataproc)
- Knowledge of NoSQL databases (e.g., HBase, Cassandra) or modern data warehouses (Snowflake, BigQuery)