Senior PySpark Data Engineer
Role details
Job location
Tech stack
Job description
We are seeking a highly skilled and motivated Data Engineer to play a pivotal role in designing, building, and optimizing our next-generation scalable data pipelines. This position requires expertise in processing massive datasets using cutting-edge technologies like Apache Spark, PySpark, and Hive within a dynamic cloud environment. Your primary objective will be to ensure the utmost data reliability, speed, and efficiency, providing a robust foundation for downstream business intelligence and advanced analytics initiatives. Roles & Responsibilities:
-
Data Pipeline Development & Maintenance: Design, build, and maintain highly scalable and efficient ETL/ELT data pipelines utilizing PySpark and Spark SQL for complex data transformations.
-
Cloud Data Infrastructure Management: Deploy, manage, and scale critical data infrastructure components on leading cloud platforms such as Amazon Web Services (AWS) (e.g., EMR, Glue), Microsoft Azure (e.g., Databricks, Synapse), or Google Cloud Platform (GCP).
-
Data Warehousing & Storage Optimization: Strategically manage data layout, partitioning, and indexing within Apache Hive and various cloud data lake solutions to optimize performance and accessibility.
-
Performance Tuning & Optimization: Proactively identify and resolve performance bottlenecks in Spark jobs, leveraging Spark UI for in-depth analysis, effectively managing data skewness, and optimizing memory utilization.
-
Diverse Data Integration: Develop robust solutions for ingesting high-volume and diverse datasets from both structured relational databases and unstructured flat files into our data ecosystem.
-
Automated Workflow Orchestration: Implement and manage automated data workflows using industry-standard scheduling tools like Apache Airflow or platform-native schedulers, ensuring timely and reliable data delivery.
-
Strategic Collaboration: Partner closely with data scientists, business analysts, and cross-functional enterprise teams to translate complex business requirements into technically sound and efficient data solutions.
Requirements
Do you have experience in Spark?, Do you have a Bachelor's degree?, * Big Data Frameworks Expertise: Demonstrated high proficiency in Apache Spark architecture, including a deep understanding of drivers, executors, and Directed Acyclic Graphs (DAGs).
-
Advanced Programming: Exceptional coding skills in Python and extensive experience with the PySpark API for developing intricate data transformations and processing logic.
-
Querying & Schema Management: Strong command of HiveQL and ANSI SQL, coupled with expertise in data partitioning techniques and effective schema definition.
-
Optimized Storage Formats: In-depth understanding and practical experience with optimized big data storage file formats such as Parquet, ORC, and Avro.
-
Cloud Ecosystem Development: Hands-on development experience utilizing cloud-native big data utilities (e.g., AWS EMR, Azure Databricks) with in major cloud platforms.
-
Data Warehousing Fundamentals: Solid foundation in Dimensional Data Modeling, including Star and Snowflake schemas, and practical experience with Data Lakes concepts and implementation.
Preferred Qualifications
-
CI/CD & DevOps Automation: Experience with Continuous Integration/Continuous Deployment (CI/CD) practices and automation tools like Git, Jenkins, or Ansible.
-
NoSQL Database Integration: Exposure to and experience with NoSQL databases such as HBase, Cassandra, or MongoDB.
-
Professional Cloud Certifications: Relevant professional cloud certifications (e.g., AWS Certified Data Engineer, Microsoft Certified: Azure Data Engineer Associate) are highly valued, Qualifications : BACHELOR OF COMPUTER SCIENCE