Data Engineer (Python/PySpark/Apache Spark)
Job description
We are seeking a highly skilled and motivated Data Engineer to join our dynamic team. As a Data Engineer, you will play a crucial role in designing, developing, and maintaining our clients' data infrastructure. Your expertise in Apache Spark, Python, PySpark, ETL processes, and CI/CD (Jenkins or GitHub Actions), along with your experience with both streaming and batch workflows, will be essential in ensuring the efficient flow and processing of data to support our clients.

- Data Architecture and Design: Collaborate with cross-functional teams to understand data requirements and design robust data architecture solutions. Develop data models and schema designs to optimize data storage and retrieval.
- ETL Development: Implement robust ETL processes to extract, transform, and load data from various sources. Ensure data quality, integrity, and consistency throughout the ETL pipeline.
- Distributed Computing & Spark Development: Use your expertise in Apache Spark, Python, and PySpark to develop efficient, large-scale data processing and analysis scripts. Optimize code for performance, memory management, and scalability, and keep up to date with the latest industry best practices.
- Data Integration: Integrate data from different systems and sources to provide a unified view for analytical purposes. Collaborate with data scientists and analysts to implement solutions that meet their data integration needs.
- Streaming and Batch Workflows: Design and implement streaming workflows using Spark Structured Streaming (via PySpark) or other relevant technologies. Develop batch processing workflows for large-scale data processing and analysis.
- CI/CD Implementation: Implement and maintain continuous integration and continuous deployment (CI/CD) pipelines using Jenkins or GitHub Actions. Automate testing, code deployment, and monitoring processes to ensure the reliability of data pipelines.
Requirements
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
- Proven experience as a Data Engineer or in a similar role.
- Strong programming skills in Python and deep expertise in Apache Spark and PySpark for both batch and streaming data processing.
- Hands-on experience developing, tuning, and troubleshooting distributed data pipelines.
- Solid understanding of ETL tools, data modeling, database design, and data warehousing concepts.
- Familiarity with CI/CD tools such as Jenkins or GitHub Actions.
- Excellent problem-solving, analytical, communication, and collaboration skills.

- Experience with Ab Initio (e.g., GDE, Co-Operating System, EME) or a strong background in enterprise ETL modernization.
- Knowledge of cloud platforms such as AWS, Azure, or Google Cloud.
- Experience with version control systems (e.g., Git).
- Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
- Understanding of data security and privacy best practices.

- Are you currently legally authorized to work in the US for any employer?
- Do you now, or will you ever, require sponsorship from Infinitive to continue to be authorized to work in the US?
- Do you have experience with Python?
- Do you have experience with AWS Glue?
- Do you have experience with AWS Lambda?
- Do you have experience with AWS CloudWatch?
- Do you have experience with Databricks?
- Do you have experience with Spark?
- Do you have experience with Snowflake?
- Do you have experience with DynamoDB?
- Do you have experience with Apache Kafka?
- Do you have experience with Apache Airflow?