AI Evaluation Data Scientist - AI/ML/LLM - (Hybrid) - Madrid

European Tech Recruit

2 days ago

Role details

Contract type

Temporary contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Tech stack

Artificial Intelligence

Amazon Web Services (AWS)

Software Quality

Code Review

Databases

Python

Machine Learning

Quantum Computing

Software Engineering

PyTorch

Large Language Models

Multi-Agent Systems

Generative AI

Git

Pandas

Information Technology

HuggingFace

Software Version Control

Docker

Data Generation

Job description

AI Evaluation Data Scientist

A fantastic opportunity for a driven AI Data Scientist to join a leading Quantum AI company working on cutting-edge solutions that make AI faster, greener, and more accessible. You'll work alongside world-leading experts in quantum computing and AI on challenging projects that shape the future of Generative AI systems.

This is initially a 9-month fixed-term contract, with scope to extend. Hybrid working from sites in Madrid or Barcelona.

Responsibilities:

* Design and lead the evaluation strategy for our Agentic AI and RAG systems, turning customer workflows and business needs into measurable metrics and clear success criteria.
* Contribute to the end-to-end design of Agentic AI and RAG systems, injecting a data-and-evaluation perspective into retrieval strategies, orchestration policies, tool usage, and memory to solve complex, real-world problems across industries.
* Develop task-based, multi-step evaluations that reflect how the different components of our systems (retrieval, planning, tool use, memory) perform in real-world scenarios across cloud and edge deployments.
* Develop and refine rigorous evaluation frameworks that reflect real-world performance, going beyond model benchmarks to assess task success, reasoning capabilities, factual consistency, reliability, and user success metrics across diverse problem domains.
* Build and maintain a reproducible evaluation pipeline, including datasets, scenarios, configs, test suites, versioned assets, and automated runs to track regressions and improvements over time.
* Curate and generate high-quality datasets for evaluation, including synthetic and adversarial data, to strengthen coverage and robustness.
* Implement and calibrate LLM-as-a-judge evaluations, aligning automated scoring with human feedback and ensuring fairness, robustness, and representativeness.
* Perform deep error analyses and ablations to uncover failure patterns, maintain a taxonomy of failure modes (reasoning, grounding, hallucinations, tool failures), and provide actionable insights to engineers to improve model and system performance.
* Partner with ML specialists to create a data flywheel, where evaluation continuously informs new dataset creation and improvements to prompts, tool usage, model training, and system refinements, quantifying improvements over time.
* Define and monitor operational metrics (latency, cost, reliability) to ensure evaluations align with production and customer expectations.
* Maintain high engineering standards, including clear documentation, reproducible experiments, robust version control, and well-structured ML pipelines.
* Contribute to team learning and mentorship, guiding junior engineers and sharing expertise in LLM development, evaluation, and deployment best practices.
* Participate in code reviews, offering thoughtful, constructive feedback to maintain code quality, readability, and consistency.

Requirements

Required minimum qualifications:

* Master's or Ph.D. in Computer Science, Machine Learning, Data Science, Physics, Engineering, or a related technical field, with relevant industry experience.
* Solid hands-on experience (3+ years for mid-level, 5+ years for senior) working as a Data Scientist, ML Engineer, or Research Scientist on applied AI/ML projects deployed in production environments.
* Strong background in evaluation of machine learning systems, ideally with experience in LLMs, RAG pipelines, or multi-agent systems.
* Proven ability to design and implement evaluation methodologies that go beyond static benchmarks, capturing real-world task success, reasoning, and robustness.
* Hands-on experience with dataset creation and curation (including synthetic data generation) for training and evaluation.
* Proven experience with agent-based architectures (task decomposition, tool use, reasoning workflows), RAG architectures (retrievers, vector databases, rerankers), and orchestration frameworks (LangGraph, LlamaIndex).
* Strong problem-solving skills, with the ability to navigate ambiguity and design practical solutions to open-ended user or business needs.
* Strong software engineering skills, with proficiency in Python, Docker, and Git, and experience building robust, modular, and scalable ML codebases.
* Familiarity with common ML and data libraries and frameworks (e.g., PyTorch, HuggingFace, LangGraph, LlamaIndex, Pandas).
* Experience with cloud platforms (ideally AWS).
* Fluent in English.

By applying to this role, you understand that we may collect, store, and process your personal data on our systems. For more information, please see our Privacy Notice.