Data Engineer
Job description
Jasper Research is seeking an experienced Data Engineer to support our image research team. You will help design, scale, and maintain our data infrastructure and the data processing pipelines that power the training of state-of-the-art multimodal models.
In this role, you will work closely with our research scientists and research engineers to collect, clean, and process large-scale datasets from a variety of sources, ensuring that our models are built on the best possible data foundations.
This role is open to candidates located in France. It is a hybrid position, and you will be expected to come into the office when necessary. The office is based at Station F in Paris, the vibrant hub of the French startup ecosystem. Our efficient and lean team at Station F thrives on innovation and collaboration.
What you will do at Jasper
- Design and implement end-to-end scalable data pipelines to ingest, transform, and load data into our data warehouse.
- Analyze existing datasets and implement robust data validation, deduplication, and bias mitigation processes to ensure the highest quality and diversity of training data.
- Create training sets from existing data, using classical computer vision algorithms, vision models and LLMs.
- Optimize data loading, preprocessing, and augmentation workflows to eliminate bottlenecks and maximize training efficiency.
- Document all data processes, schemas, and transformations to ensure full reproducibility and transparency for the research team.
- Work hand-in-hand with research scientists and engineers to understand their data needs, provide actionable insights, and rapidly iterate on pipeline improvements.
- Source new multimodal data from public sources.
Requirements
- Bachelor's or Master's degree in Computer Science, Data Engineering, or a related field.
- Strong experience as a Data Engineer or in a similar data-focused role.
- Strong experience in image manipulation at scale and understanding of computer vision.
- Hands-on experience with distributed computing frameworks and cloud platforms for distributed ML training.
- Familiarity with cloud-based data warehousing and storage solutions (e.g., BigQuery).
- Strong attention to detail, commitment to data quality, and a proactive approach to supporting research needs.
Preferred Qualifications
- Knowledge of data transformation and enrichment techniques, including clustering, deduplication, and synthetic data generation.
- Experience with vector databases for ML data.
- Proficiency in Python and SQL for data manipulation and analysis.
- Proficiency in at least one ML library (TensorFlow, PyTorch, JAX). PyTorch preferred.
- Contributions to open-source data tools or projects.
- Familiarity with data privacy and compliance regulations (GDPR, CCPA).
Benefits & conditions
- Comprehensive healthcare plan through Alan, with mutuelle coverage for hospitalisation and mental health care
- Flexible PTO with a FlexExperience budget (€552 annually) to help you make the most of your time away from work
- FlexWellness program (€1,640 annually) to help support your personal health goals
- Generous budget for home office setup
- €1,375 annual learning and development stipend