Data Engineer Python

KLEEVER
Paris, France
31 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
€ 125K

Job location

Paris, France

Tech stack

Artificial Intelligence
Airflow
Test Automation
Batch Processing
Big Data
Code Coverage
Code Review
Databases
Continuous Integration
Data as a Service
Information Engineering
ETL
Data Retrieval
Data Stores
Relational Databases
Fault Tolerance
Python
PostgreSQL
Modular Design
NoSQL
Prometheus
Software Engineering
SQL Databases
Systems Integration
Data Storage Technologies
Azure
Git
Pytest
Information Technology
Cassandra
Amazon Web Services (AWS)
Software Version Control
Data Pipelines

Job description

As a Senior Data Engineer, you will design, build, and maintain scalable data pipelines and workflows to support our growing data ecosystem. You will focus on creating production-ready ETL processes using Apache Airflow, integrating with diverse data stores, and ensuring all code meets rigorous development standards, including peer review, scalability, and comprehensive test coverage.

Responsibilities:

Develop and optimize ETL pipelines using Apache Airflow to ingest, transform, and load data from various sources into target systems.
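For a concrete flavor of this orchestration work, here is a minimal sketch of such a pipeline using Airflow's TaskFlow API; the DAG name, schedule, and sample records are illustrative assumptions rather than details from the role.

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
)
def orders_etl():
    @task
    def extract() -> list[dict]:
        # Pull raw records from a source system (placeholder data here).
        return [{"order_id": 1, "amount": "19.90"}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Normalize types so the load step is deterministic.
        return [{**r, "amount": float(r["amount"])} for r in rows]

    @task
    def load(rows: list[dict]) -> None:
        # A real pipeline would write to the target store here.
        print(f"loaded {len(rows)} rows")

    # TaskFlow infers the extract -> transform -> load dependency chain.
    load(transform(extract()))


orders_etl()
```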

Implement production-ready code for data workflows, ensuring scalability, fault tolerance, and adherence to best practices such as modular design, error handling, and automated testing (unit, integration, and end-to-end).
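As one example of the error-handling side of this, a small, unit-testable retry helper might look like the following sketch; the function and its limits are hypothetical, not part of the posting.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger(__name__)


def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Run fn, retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: let the real error surface
            log.warning("attempt %d/%d failed; retrying", attempt, attempts)
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```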

Collaborate with data scientists, analysts, and engineering teams to build and maintain RAG pipelines that enhance AI/ML applications with accurate, context-aware data retrieval.
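The retrieval step of a RAG pipeline can be sketched in a few lines. Here `embed` is a hypothetical stand-in for whichever embedding model the team adopts; with unit-norm vectors, a dot product gives cosine similarity for ranking.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    # Placeholder: a real pipeline would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(384)
    return vec / np.linalg.norm(vec)


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Rank stored chunks by cosine similarity to the query embedding.
    q = embed(query)
    scores = [float(q @ embed(c)) for c in chunks]  # unit vectors, so dot = cosine
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]
```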

Participate in code reviews to enforce high coding standards, promote clean, readable code, and integrate CI/CD practices for automated testing and deployment.

Monitor and troubleshoot data pipelines for performance, reliability, and data quality, implementing observability tools to detect and resolve issues proactively.
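Proactive observability of this kind often starts with exported metrics; the sketch below shows one way to instrument a worker with the Prometheus Python client. The metric names and port are assumptions for illustration.

```python
from prometheus_client import Counter, Histogram, start_http_server

ROWS_LOADED = Counter("etl_rows_loaded_total", "Rows loaded into the target store")
BATCH_SECONDS = Histogram("etl_batch_duration_seconds", "Wall time per batch")


def process_batch(rows: list[dict]) -> None:
    with BATCH_SECONDS.time():  # records the batch duration on exit
        # ... transform and load the batch ...
        ROWS_LOADED.inc(len(rows))


if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    process_batch([{"id": 1}, {"id": 2}])
```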

Design and optimize data storage solutions, integrating with relational and NoSQL databases to support real-time and batch processing needs.
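On the relational side, batch loads typically rely on idempotent upserts; a minimal PostgreSQL sketch using psycopg2 might look like this, with the DSN, table, and columns as illustrative assumptions.

```python
import psycopg2
from psycopg2.extras import execute_values


def upsert_orders(dsn: str, rows: list[tuple[int, float]]) -> None:
    # Batch upsert: re-running the same load leaves the table unchanged.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        execute_values(
            cur,
            """
            INSERT INTO orders (order_id, amount)
            VALUES %s
            ON CONFLICT (order_id) DO UPDATE SET amount = EXCLUDED.amount
            """,
            rows,
        )
```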

Requirements

The ideal candidate is a proficient developer who treats data engineering as software engineering, with hands-on experience in RAG (Retrieval-Augmented Generation) pipelines and a track record of delivering reliable, maintainable systems.

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).

5+ years of hands-on experience as a Data Engineer or in a similar role, with a proven background as a strong developer (e.g., proficiency in Python, SQL, and related languages).

Excellent proficiency with Apache Airflow for orchestrating complex ETL workflows, including DAG creation, scheduling, and dependency management.

Demonstrated experience building scalable ETL pipelines that handle large datasets, with a focus on production-ready implementation including comprehensive test coverage (e.g., using pytest or similar frameworks).
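Test coverage of this kind usually targets pure transform functions first; a short pytest sketch (with a hypothetical function under test) illustrates the expectation.

```python
import pytest


def normalize_amount(row: dict) -> dict:
    # Hypothetical transform under test: cast string amounts to floats.
    return {**row, "amount": float(row["amount"])}


def test_normalize_amount_casts_strings() -> None:
    out = normalize_amount({"order_id": 1, "amount": "19.90"})
    assert out["amount"] == pytest.approx(19.9)


def test_normalize_amount_rejects_garbage() -> None:
    with pytest.raises(ValueError):
        normalize_amount({"order_id": 2, "amount": "not-a-number"})
```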

Strong emphasis on software engineering practices: experience with peer code reviews, version control (e.g., Git), and ensuring code is modular, documented, and scalable to prevent common pitfalls like brittle or unmaintainable pipelines.

Familiarity with data modeling, transformation, and integration in distributed environments.

Excellent problem-solving skills and the ability to work in a fast-paced, collaborative environment.

Preferred Qualifications:

Experience with RAG pipelines, including vector databases and embedding techniques for AI-driven applications.

Hands-on experience with databases such as PostgreSQL (for relational data), StarRocks (for analytical workloads), Cassandra or ScyllaDB (for high-throughput NoSQL), and Qdrant (for vector search).
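For the vector-search piece, a minimal Qdrant sketch might look like the following; the collection name, vector size, and payload are illustrative assumptions (the in-memory mode is handy for local tests).

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # in-memory instance for experimentation

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"text": "hello"})],
)
hits = client.search(collection_name="docs", query_vector=[0.1, 0.2, 0.3, 0.4], limit=1)
print(hits[0].payload["text"])
```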

Knowledge of cloud data services (e.g., AWS Glue, Azure Data Factory) and orchestration tools beyond Airflow.

Familiarity with monitoring and observability tools like Prometheus or OpenSearch for data pipeline health.

Apply for this position