AI Evaluation Engineer

First Talent

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Junior

Job location

Tech stack

Artificial Intelligence

Automated Storage and Retrieval Systems

Unit Testing

Cloud Computing

Information Retrieval

Regression Testing

Next.js

Delivery Pipeline

Large Language Models

Job description

You'll design, build, and scale the infrastructure that tells us - with evidence - whether a prompt change, model swap, or agent refactor made things better or worse. This is a high-leverage role. Every customer-facing AI capability at First Ignite flows through your evals. You'll work directly with the Head of Engineering and partner closely with product, applied AI, and the full-stack team to establish evaluation as a first-class discipline across the company.

What You'll Do

Build evaluation infrastructure: Design and maintain eval suites using Promptfoo, LLM-as-judge methodologies, and custom harnesses for features like our expert search system, natural language grants search, and AI SDR agents.

Define what 'good' means: Partner with product and domain experts to translate fuzzy customer outcomes ("does this surface the right principal investigator?") into precise, measurable rubrics.

Own the feedback loop: Instrument production traffic, curate golden datasets from real customer interactions, and build pipelines that turn user behavior into regression tests.

Ship quickly under uncertainty: We routinely run 48-hour eval sprints for greenfield features with no production traffic. You'll be comfortable bootstrapping quality signal from scratch.

Model and prompt evaluation: Run rigorous A/B comparisons across models (OpenAI, Anthropic, open-weight), prompt strategies, and agent architectures. Quantify tradeoffs between cost, latency, and quality.

Agent evaluation: Help us measure multi-step agent behavior built on the OpenAI Agents SDK, Vercel AI SDK, and Temporal Cloud - including tool-use correctness, trajectory quality, and end-to-end task completion.

Raise the floor for the team: Create templates, documentation, and tooling so every engineer can write and run evals as part of normal development. Evals should feel as natural as unit tests.

Requirements

3+ years of engineering experience, with at least 1 year focused on LLM/ML evaluation, applied AI, or data quality systems.

Hands-on experience with LLM evaluation frameworks - Promptfoo, Braintrust, LangSmith, OpenAI Evals, DeepEval, or equivalent in-house tooling.

Strong grasp of LLM-as-judge methodology, including its failure modes (position bias, verbosity bias, judge-model drift) and how to mitigate them.

Statistical literacy - you know the difference between a real regression and noise, and you can design experiments that answer the question actually being asked.

Product instincts. You can sit with a customer success call transcript, identify the three failure modes that matter, and ship an eval for each by end of week.

Strong written communication. Evals are useless if the engineers shipping features don't trust or read the results.

Preferred Qualifications

Experience evaluating retrieval systems (RAG, hybrid search, reranking) - especially over structured or semi-structured domains like research, grants, or patents.

Exposure to agent orchestration frameworks (Temporal, LangGraph, OpenAI Agents SDK) and the specific challenges of evaluating multi-step, tool-using systems.

Background in information retrieval, search relevance, or a research-adjacent domain.

Experience building internal tooling or dashboards that non-engineers (PMs, domain experts) actually use to label and review model outputs.

Why This Role

You'll be the first dedicated evals hire. The scope, standards, and tooling are yours to define.

AI quality is existential for our product. This isn't a compliance role tucked into a corner - it's directly on the critical path to revenue.

Small, senior team. ~10 engineers, distributed globally, with a strong bias toward shipping and measuring.

Direct access to real-world, high-stakes LLM use cases - research discovery, grants, outbound - across a customer base that deeply values accuracy.

As part of your LinkedIn application submission, please use Loom to record a video and email it to us within the

About the company

Date posted: 2026-05-20 (valid through 2026-06-30)

About First Ignite

First Ignite is the AI-powered business development platform for university technology transfer offices (TTOs). We help research institutions turn breakthroughs into partnerships, licenses, and companies by combining deep LLM-driven workflows with the relationships that actually move deals forward. Our product suite spans expert discovery, grants search, and AI-driven outreach - all built on a modern, agentic stack. We ship fast, we measure everything, and we believe evaluations are the difference between AI features that demo well and AI features that work in production.

The Role

We're hiring an AI Evaluation Engineer to own the quality bar for every LLM-powered feature we ship.
