AI Evaluation Engineer

FirstIgnite
Lalín, Spain
2 days ago

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Junior
Compensation: € 80K

Job location

Lalín, Spain

Tech stack

Artificial Intelligence
Automated Storage and Retrieval Systems
Unit Testing
Cloud Computing
Information Retrieval
Regression Testing
Next.js
Delivery Pipeline
Large Language Models

Job description

FirstIgnite in Galicia, Spain is seeking an AI Evaluation Engineer to oversee the quality of AI features. You'll be responsible for designing the evaluation infrastructure that determines feature effectiveness. This role involves defining measurable success factors and running rigorous evaluations of model performance.

We're hiring an AI Evaluation Engineer to own the quality bar for every LLM-powered feature we ship. You'll design, build, and scale the infrastructure that tells us - with evidence - whether a prompt change, model swap, or agent refactor made things better or worse.

This is a high-leverage role. Every customer-facing AI capability at FirstIgnite flows through your evals. You'll work directly with the Head of Engineering and partner closely with product, applied AI, and the full-stack team to establish evaluation as a first-class discipline across the company.

What You'll Do

  • Build evaluation infrastructure: Design and maintain eval suites using Promptfoo, LLM-as-judge methodologies, and custom harnesses for features like our expert search system, natural language grants search, and AI SDR agents.
  • Define what 'good' means: Partner with product and domain experts to translate fuzzy customer outcomes ("does this surface the right principal investigator?") into precise, measurable rubrics.
  • Own the feedback loop: Instrument production traffic, curate golden datasets from real customer interactions, and build pipelines that turn user behavior into regression tests.
  • Ship quickly under uncertainty: We routinely run 48-hour eval sprints for greenfield features with no production traffic. You'll be comfortable bootstrapping quality signal from scratch.
  • Model and prompt evaluation: Run rigorous A/B comparisons across models (OpenAI, Anthropic, open-weight), prompt strategies, and agent architectures. Quantify tradeoffs between cost, latency, and quality.
  • Agent evaluation: Help us measure multi-step agent behavior built on the OpenAI Agents SDK, Vercel AI SDK, and Temporal Cloud - including tool-use correctness, trajectory quality, and end-to-end task completion.
  • Raise the floor for the team: Create templates, documentation, and tooling so every engineer can write and run evals as part of normal development. Evals should feel as natural as unit tests.

Requirements

With a strong teamwork focus, the ideal candidate will use their experience in LLM evaluation to improve our AI capabilities while working alongside product management and engineering teams.

  • 3+ years of engineering experience, with at least 1 year focused on LLM/ML evaluation, applied AI, or data quality systems.

  • Hands-on experience with LLM evaluation frameworks - Promptfoo, Braintrust, LangSmith, OpenAI Evals, DeepEval, or equivalent in-house tooling.

  • Strong grasp of LLM-as-judge methodology, including its failure modes (position bias, verbosity bias, judge-model drift) and how to mitigate them.

  • Statistical literacy - you know the difference between a real regression and noise, and you can design experiments that answer the question actually being asked.

  • Product instincts. You can sit with a customer success call transcript, identify the three failure modes that matter, and ship an eval for each by end of week.

  • Strong written communication. Evals are useless if the engineers shipping features don't trust or read the results.

  • Experience evaluating retrieval systems (RAG, hybrid search, reranking) - especially over structured or semi-structured domains like research, grants, or patents.

  • Exposure to agent orchestration frameworks (Temporal, LangGraph, OpenAI Agents SDK) and the specific challenges of evaluating multi-step, tool-using systems.

  • Background in information retrieval, search relevance, or a research-adjacent domain.

  • Experience building internal tooling or dashboards that non-engineers (PMs, domain experts) actually use to label and review model outputs.

About the company

FirstIgnite is the AI-powered business development platform for university technology transfer offices (TTOs). We help research institutions turn breakthroughs into partnerships, licenses, and companies by combining deep LLM-driven workflows with the relationships that actually move deals forward. Our product suite spans expert discovery, grants search, and AI-driven outreach - all built on a modern, agentic stack. We ship fast, we measure everything, and we believe evaluations are the difference between AI features that demo well and AI features that work in production.

  • You'll be the first dedicated evals hire. The scope, standards, and tooling are yours to define.
  • AI quality is existential for our product. This isn't a compliance role tucked into a corner - it's directly on the critical path to revenue.
  • Small, senior team. ~10 engineers, distributed globally, with a strong bias toward shipping and measuring.
  • Direct access to real-world, high-stakes LLM use cases - research discovery, grants, outbound - across a customer base that deeply values accuracy.
