Generative AI Engineer
DeepRec
Role details
Contract type: Temporary contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English, Spanish
Experience level: Intermediate
Job location: Remote
Tech stack
Artificial Intelligence
Amazon Web Services (AWS)
Test Automation
Cloud Computing
Python
Machine Learning
Open Source Technology
Software Reliability Testing
Software Version Management
PyTorch
Large Language Models
Multi-Agent Systems
Generative AI
Git
Information Technology
HuggingFace
Docker
Data Generation
Job description
Founded in 2019, our client has grown into one of Europe's most recognized deep-tech scale-ups, backed by major global strategic investors and EU innovation funds.
Their quantum and AI technologies have already transformed how enterprise clients build and deploy intelligent systems, achieving up to 95% model compression and 50-80% inference cost reduction.
The company has been recognized by CB Insights (2023 & 2025) as one of the Top 100 most promising AI companies globally and is often described as a "quantum-AI unicorn in the making."
Role Highlights
The AI Evaluation Data Scientist will be responsible for:
- Designing and leading evaluation strategies for Agentic AI and RAG systems, translating complex workflows into measurable performance metrics.
- Developing multi-step task-based evaluations to capture reasoning quality, factual accuracy, and end-user success in real-world scenarios.
- Building reproducible evaluation pipelines with automated test suites, dataset tracking, and performance versioning.
- Curating and generating synthetic and adversarial datasets to strengthen system robustness.
- Implementing LLM-as-a-judge frameworks aligned with human feedback (a minimal sketch of this pattern follows this list).
- Conducting error analysis and ablations to identify reasoning gaps, hallucinations, and tool-use failures.
- Collaborating with ML engineers to create a continuous data flywheel linking evaluation outcomes to product improvements.
- Defining and monitoring operational metrics such as latency, reliability, and cost to meet production standards.
- Maintaining high standards in engineering, documentation, and reproducibility.
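To make the evaluation responsibilities above more concrete, here is a minimal, illustrative sketch of an LLM-as-a-judge scoring loop in Python. It is a sketch under stated assumptions, not the client's actual framework: the rubric wording, the JSON score format, and the `call_judge_model` callable (any function that sends a prompt to a judge LLM and returns its text reply) are hypothetical stand-ins.

```python
"""Illustrative LLM-as-a-judge scoring loop (assumptions noted in comments)."""
import json
import re
from typing import Callable

# Hypothetical grading rubric; a real framework would calibrate this against human ratings.
JUDGE_RUBRIC = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score factual accuracy from 1 (wrong) to 5 (fully correct).
Reply with JSON: {{"score": <int>, "rationale": "<one sentence>"}}"""


def judge_answer(
    question: str,
    reference: str,
    candidate: str,
    call_judge_model: Callable[[str], str],  # assumed: prompt in, model text out
) -> dict:
    """Ask a judge LLM to score one candidate answer against a reference."""
    prompt = JUDGE_RUBRIC.format(
        question=question, reference=reference, candidate=candidate
    )
    raw = call_judge_model(prompt)
    # Tolerate judges that wrap the JSON verdict in extra prose.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    return json.loads(match.group(0)) if match else {"score": None, "rationale": raw}


def evaluate_dataset(
    examples: list[dict],
    call_judge_model: Callable[[str], str],
) -> float:
    """Average judge score over an evaluation set; ignores unparseable verdicts."""
    scores = [
        judge_answer(
            ex["question"], ex["reference"], ex["candidate"], call_judge_model
        )["score"]
        for ex in examples
    ]
    valid = [s for s in scores if isinstance(s, int)]
    return sum(valid) / len(valid) if valid else float("nan")
```

In a production pipeline of the kind described above, a loop like this would be wired into automated test suites and dataset versioning, and the judge prompts themselves would be periodically validated against human feedback.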
Requirements
- Master's or Ph.D. in Computer Science, Machine Learning, Physics, Engineering, or related field.
- 3+ years (mid-level) or 5+ years (senior) of experience in Data Science, ML Engineering, or Research roles in applied AI/ML projects.
- Proven experience designing and implementing evaluation methodologies for machine learning or Generative AI systems.
- Hands-on experience with LLMs, RAG pipelines, and agentic architectures.
- Proficiency in Python, Git, Docker, and major ML frameworks (PyTorch, HuggingFace, LangGraph, LlamaIndex).
- Familiarity with cloud environments (AWS preferred).
- Excellent communication skills and fluency in English.
Preferred
- Ph.D. in a relevant technical discipline.
- Experience with synthetic data generation, adversarial testing, and multi-agent evaluation frameworks.
- Strong background in LLM error analysis and reliability testing.
- Open-source contributions or publications related to AI evaluation.
- Fluency in Spanish.