Senior Research Data Engineer - Foundation Models
Job description
- Work on ambitious frontier research projects as part of a team consisting of research scientists and research data engineers.
- Architect, design and build data pipelines that can handle petabytes of multi-modal unstructured data.
- Build a modern data engineering stack grounded in state-of-the-art technology for orchestration and parallel computation, making extensive use of actively developed open-source solutions.
- From the lowest-level components to the bird's-eye view of a system: find performance bottlenecks, debug issues, and build pipelines with a focus on stability.
- Leverage our large on-prem data centers and AWS cloud infrastructure for blazing-fast data processing.
- Go beyond "Big Data" and ETL: engineer and operate complex Python data solutions for real-world unstructured data, including text, code, image, and audio modalities.
- Collaborate with stakeholders, research scientists, other research data engineers, and the data tooling and platform teams.
- Raise the standard for excellence and act as owner and champion for the quality and availability of our foundation model training data.
- Ensure mission-critical reliability of data pipeline jobs, and maintain high-quality code.
Play to your strengths and contribute with creativity, thoroughness, pragmatism, foresight, ingenuity, persistence, and every part of you that elevates the team.
Requirements
- Professional experience in data, platform, or software engineering, ideally with a focus on large-scale unstructured data.
- Python: Extensive professional experience in Python software engineering. Ideally, experience in maintaining proprietary or open-source software products.
- Data: Experience with exploratory data analysis, cleaning, validation and quality control beyond business intelligence and analytics scale.
- Pipelines: Experience with building reproducible pipelines for storing and processing petabytes of data.
- Operations: Proficiency in containerization and automated deployment. Ideally, experience with Kubernetes container orchestration and cloud infrastructure.
- Scaling: Experience with highly scalable, parallel compute workloads (e.g., Dask, Ray, Celery).
- Performance: Experience with writing and optimizing highly performant code.
- Cross-functional Affinity: Ability to work directly with our researchers and engineering stakeholders to translate their needs into data products with the desired user experience and performance.
- Soft Skills: Excellent problem-solving abilities, strong communication skills, and a collaborative mindset.
Ideally, you have domain-specific experience in:
- LLM or VLM training data preparation.
- NLP, text classification, reinforcement learning, model-based/GPU workflows.
- Dynamic workflow orchestration frameworks like Argo Workflows, Airflow, Dagster or Flyte.
- Linguistics expertise or fluency in multiple languages.
- Experience in a high-performance programming language like C++, Go or Rust.
About the company
Helping people overcome communication barriers is the heart of what we do. Founded in Germany in 2017 by a team of engineers and researchers, DeepL has developed the world’s most accurate AI translation technology—enabling real-time, human-sounding translation.
Accessible via a web translator, browser extensions, desktop and mobile apps, and an API, DeepL supports a best-in-class translation experience in 34 languages and counting. Our 550-person team operates across four European hubs in Germany, the Netherlands, the UK, and Poland.