AI Evaluation Data Scientist - AI/ML/LLM - (Hybrid) - Madrid
Job description
AI Evaluation Data Scientist

A fantastic opportunity for a driven AI Data Scientist to join a leading Quantum AI company that works on cutting-edge solutions to make AI faster, greener, and more accessible. You'll be working alongside world-leading experts in quantum computing and AI, with the opportunity to work on challenging projects and shape the future of Generative AI systems.

This is initially a 9-month fixed-term contract, with scope to extend. Hybrid working from sites in Madrid or Barcelona.

Responsibilities:
* Design and lead the evaluation strategy for our Agentic AI and RAG systems, turning customer workflows and business needs into measurable metrics and clear success criteria.
* Contribute to the end-to-end design of Agentic AI and RAG systems, injecting a data-and-evaluation perspective into retrieval strategies, orchestration policies, tool usage, and memory to solve complex, real-world problems across industries.
* Develop task-based, multi-step evaluations that reflect how the different components of our systems (retrieval, planning, tool use, memory) perform in real-world scenarios across cloud and edge deployments.
* Develop and refine rigorous evaluation frameworks that reflect real-world performance, going beyond model benchmarks to assess task success, reasoning capabilities, factual consistency, reliability, and user success metrics across diverse problem domains.
* Build and maintain a reproducible evaluation pipeline, including datasets, scenarios, configs, test suites, versioned assets, and automated runs to track regressions and improvements over time.
* Curate and generate high-quality datasets for evaluation, including synthetic and adversarial data, to strengthen coverage and robustness.
* Implement and calibrate LLM-as-a-judge evaluations, aligning automated scoring with human feedback and ensuring fairness, robustness, and representativeness.
* Perform deep error analyses and ablations to uncover failure patterns, maintain a taxonomy of failure modes (reasoning, grounding, hallucinations, tool failures), and provide actionable insights to engineers to improve model and system performance.
* Partner with ML specialists to create a data flywheel, where evaluation continuously informs new dataset creation, improvements to prompts, tool usage, model training, and system refinements, quantifying improvements over time.
* Define and monitor operational metrics (latency, cost, reliability) to ensure evaluations align with production and customer expectations.
* Maintain high engineering standards, including clear documentation, reproducible experiments, robust version control, and well-structured ML pipelines.
* Contribute to team learning and mentorship, guiding junior engineers and sharing expertise in LLM development, evaluation, and deployment best practices.
* Participate in code reviews, offering thoughtful, constructive feedback to maintain code quality, readability, and consistency.

Requirements

Required minimum qualifications:
* Master's or Ph.D. in Computer Science, Machine Learning, Data Science, Physics, Engineering, or a related technical field, with relevant industry experience.
* Solid hands-on experience (3+ years for mid-level, 5+ years for senior) working as a Data Scientist, ML Engineer, or Research Scientist in applied AI/ML projects deployed in production environments.
* Strong background in evaluation of machine learning systems, ideally with experience in LLMs, RAG pipelines, or multi-agent systems.
* Proven ability to design and implement evaluation methodologies that go beyond static benchmarks, capturing real-world task success, reasoning, and robustness.
* Hands-on experience with dataset creation and curation (including synthetic data generation) for training and evaluation.
* Proven experience with agent-based architectures (task decomposition, tool use, reasoning workflows), RAG architectures (retrievers, vector databases, rerankers), and orchestration frameworks (LangGraph, LlamaIndex).
* Strong problem-solving skills, with the ability to navigate ambiguity and design practical solutions to open-ended user or business needs.
* Strong software engineering skills, with proficiency in Python, Docker, and Git, and experience building robust, modular, and scalable ML codebases.
* Familiarity with common ML and data libraries and frameworks (e.g., PyTorch, HuggingFace, LangGraph, LlamaIndex, Pandas).
* Experience with cloud platforms (ideally AWS).
* Fluent in English.

By applying to this role, you understand that we may collect your personal data and store and process it on our systems. For more information, please see our Privacy Notice.