Senior Data Engineer - AI Focused

DOCTOLIB SAS
Canton de Levallois-Perret, France

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Canton de Levallois-Perret, France

Tech stack

Artificial Intelligence
Airflow
Google BigQuery
Cloud Storage
Databases
Continuous Integration
Data Engineering
Data Governance
Dataflow
Python
Machine Learning
MongoDB
NoSQL
SQL Databases
Version Control
Google Cloud Platform
Data Ingestion
Large Language Models
Change Tracking
Kubernetes
Information Technology
Machine Learning Operations
Data Pipelines
Docker

Job description

  • Ensure high standards of data quality for AI model inputs.
  • Design, build, and maintain scalable data pipelines on Google Cloud Platform (GCP) for AI and machine learning use cases.
  • Implement data ingestion and transformation frameworks that power retrieval systems and training datasets for LLMs and multimodal models (a minimal pipeline sketch follows this list).
  • Architect and manage NoSQL and Vector Databases to store and retrieve embeddings, documents, and model inputs efficiently.
  • Collaborate with ML and platform teams to define data schemas, partitioning strategies, and governance rules that ensure privacy, scalability, and reliability.
  • Integrate unstructured and structured data sources (text, speech, image, documents, metadata) into unified data models ready for AI consumption.
  • Optimize performance and cost of data pipelines using GCP native services (BigQuery, Dataflow, Pub/Sub, Cloud Storage, Vertex AI).
  • Contribute to data quality and lineage frameworks, ensuring AI models are trained on validated, auditable, and compliant datasets.
  • Continuously evaluate and improve our data stack to accelerate AI experimentation and deployment.
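
A minimal, illustrative sketch (not Doctolib's actual pipeline) of the kind of ingestion workflow described above: an Airflow DAG that lists new documents in Cloud Storage, chunks them, and loads the chunks into a staging table for downstream embedding and retrieval. The bucket path, table, and chunking logic are placeholders.

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["ai", "ingestion"])
def document_ingestion():
    @task
    def list_new_documents() -> list[str]:
        # Placeholder: in practice, list objects in a GCS bucket,
        # e.g. google.cloud.storage.Client().list_blobs("example-bucket").
        return ["gs://example-bucket/docs/report_2024.pdf"]

    @task
    def chunk_documents(paths: list[str]) -> list[dict]:
        # Split each document into text chunks ready for embedding;
        # real chunking would parse the file and apply overlap rules.
        return [{"source": p, "chunk_id": 0, "text": "..."} for p in paths]

    @task
    def load_chunks(chunks: list[dict]) -> None:
        # Placeholder: load rows into a BigQuery staging table,
        # e.g. google.cloud.bigquery.Client().insert_rows_json(...).
        print(f"Would load {len(chunks)} chunks")

    load_chunks(chunk_documents(list_new_documents()))


document_ingestion()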

Requirements

  • Master's or Ph.D. degree in Computer Science, Data Engineering, or a related field.
  • 5+ years of experience in Data Engineering, ideally supporting AI or ML workloads.
  • Strong experience with the GCP data ecosystem.
  • Proficiency in Python and SQL, with experience in data pipeline orchestration (e.g., Airflow, Dagster, Cloud Composer).
  • Deep understanding of NoSQL systems (e.g., MongoDB) and vector databases (e.g., FAISS, Vector Search); a minimal vector-search sketch follows this list.
  • Experience designing data architectures for RAG, embeddings, or model training pipelines.
  • Knowledge of data governance, security, and compliance for sensitive or regulated data.
  • Familiarity with W&B / MLflow / Braintrust / DVC for experiment tracking and dataset versioning (extract snapshots, change tracking, reproducibility).
  • Familiarity with containerized environments (Docker, Kubernetes) and CI/CD for data workflows.
  • A collaborative mindset and passion for building the data foundations of next-generation AI systems.
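
Purely as an illustrative sketch of the vector-database side of the role (FAISS is named above): index a batch of document embeddings and retrieve the nearest neighbours for a query vector. The embedding dimension and random data are placeholders.

import faiss
import numpy as np

dim = 384                              # embedding size, model-dependent
rng = np.random.default_rng(0)

doc_embeddings = rng.random((1000, dim), dtype=np.float32)
faiss.normalize_L2(doc_embeddings)     # normalise so inner product = cosine similarity

index = faiss.IndexFlatIP(dim)         # exact inner-product index
index.add(doc_embeddings)

query = rng.random((1, dim), dtype=np.float32)
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)   # top-5 most similar document embeddings
print(ids[0], scores[0])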

Benefits & conditions

  • Free mental health and coaching services through our partner Moka.care
  • For caregivers and workers with disabilities, a package including an adaptation of the remote work policy, extra days off for medical reasons, and psychological support
  • Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
  • Works Council subsidy to refund part of a sports club membership or a creative class
  • Up to 14 days of RTT
  • Lunch voucher with Swile card
