AI Evaluation Engineer
Job description
directly with the Head of Engineering and partner closely with product, applied AI, and the full-stack team to establish evaluation as a first-class discipline across the company.

What You'll Do
- Build evaluation infrastructure: Design and maintain eval suites using Promptfoo, LLM-as-judge methodologies, and custom harnesses for features like our expert search system, natural language grants search, and AI SDR agents.
- Define what 'good' means: Partner with product and domain experts to translate fuzzy customer outcomes ("does this surface the right principal investigator?") into precise, measurable rubrics.
- Own the feedback loop: Instrument production traffic, curate golden datasets from real customer interactions, and build pipelines that turn user behavior into regression tests.
- Ship quickly under uncertainty: We routinely run 48-hour eval sprints for greenfield features with no production traffic. You'll be comfortable bootstrapping quality signal from scratch.
- Model and prompt evaluation: Run rigorous A/B comparisons across models (OpenAI, Anthropic, open-weight), prompt strategies, and agent architectures. Quantify tradeoffs between cost, latency, and quality.
- Agent evaluation: Help us measure multi-step agent behavior built on the OpenAI Agents SDK, Vercel AI SDK, and Temporal Cloud - including tool-use correctness, trajectory quality, and end-to-end task completion.
- Raise the floor for the team: Create templates, documentation, and tooling so every engineer can write and run evals as part of normal development. Evals should feel as natural as unit tests.

Requirements
- 3+ years of engineering experience, with at least 1 year focused on LLM/ML evaluation, applied AI, or data quality systems.
- Hands-on experience with LLM evaluation frameworks - Promptfoo, Braintrust, LangSmith, OpenAI Evals, DeepEval, or equivalent in-house tooling.
- Strong grasp of LLM-as-judge methodology, including its failure modes (position bias, verbosity bias, judge-model drift) and how to mitigate them.
- Statistical literacy - you know the difference between a real regression and noise, and you can design experiments that answer the question actually being asked.
- Product instincts. You can sit with a customer success call transcript, identify the three failure modes that matter, and ship an eval for each by end of week.
- Strong written communication. Evals are useless if the engineers shipping features don't trust or read the results.

Preferred Qualifications
- Experience evaluating retrieval systems (RAG, hybrid search, reranking) - especially over structured or semi-structured domains like research, grants, or patents.
- Exposure to agent orchestration frameworks (Temporal, LangGraph, OpenAI Agents SDK) and the specific challenges of evaluating multi-step, tool-using systems.
- Background in information retrieval, search relevance, or a research-adjacent domain.
- Experience building internal tooling or dashboards that non-engineers (PMs, domain experts) actually use to label and review model outputs.

Why This Role
- You'll be the first dedicated evals hire. The scope, standards, and tooling are yours to define.
- AI quality is existential for our product. This isn't a compliance role tucked into a corner - it's directly on the critical path to revenue.
- Small, senior team. ~10 engineers, distributed globally, with a strong bias toward shipping and measuring.
- Direct access to real-world, high-stakes LLM use cases - research discovery, grants, outbound - across a customer base that deeply values accuracy.