AI Evaluation Engineer
Job description
design, build, and scale the infrastructure that tells us - with evidence - whether a prompt change, model swap, or agent refactor made things better or worse.

This is a high-leverage role. Every customer-facing AI capability at FirstIgnite flows through your evals. You'll work directly with the Head of Engineering and partner closely with product, applied AI, and the full-stack team to establish evaluation as a first-class discipline across the company.

What You'll Do
- Build evaluation infrastructure: Design and maintain eval suites using Promptfoo, LLM-as-judge methodologies, and custom harnesses for features like our expert search system, natural language grants search, and AI SDR agents.
- Define what 'good' means: Partner with product and domain experts to translate fuzzy customer outcomes ("does this surface the right principal investigator?") into precise, measurable rubrics.
- Own the feedback loop: Instrument production traffic, curate golden datasets from real customer
interactions, and build pipelines that turn user behavior into regression tests.
- Ship quickly under uncertainty: We routinely run 48-hour eval sprints for greenfield features with no production traffic. You'll be comfortable bootstrapping quality signal from scratch.
- Model and prompt evaluation: Run rigorous A/B comparisons across models (OpenAI, Anthropic, open-weight), prompt strategies, and agent architectures. Quantify tradeoffs between cost, latency, and quality.
- Agent evaluation: Help us measure multi-step agent behavior built on the OpenAI Agents SDK, Vercel AI SDK, and Temporal Cloud - including tool-use correctness, trajectory quality, and end-to-end task completion.
- Raise the floor for the team: Create templates, documentation, and tooling so every engineer can write and run evals as part of normal development. Evals should feel as natural as unit tests.

Requirements
- 3+ years of engineering experience, with at least 1 year focused on LLM/ML evaluation, applied AI, or data quality systems.
- Hands-on experience with LLM evaluation frameworks - Promptfoo, Braintrust, LangSmith, OpenAI Evals, DeepEval, or equivalent in-house tooling.
- Strong grasp of LLM-as-judge methodology, including its failure modes (position bias, verbosity bias, judge-model drift) and how to mitigate them.
- Statistical literacy - you know the difference between a real regression and noise, and you can design experiments that answer the question actually being asked.
- Product instincts. You can sit with a customer success call transcript, identify the three failure modes that matter, and ship an eval for each by end of week.
- Strong written communication. Evals are useless if the engineers shipping features don't trust or read the results.

Preferred Qualifications
- Experience evaluating retrieval systems (RAG, hybrid search, reranking) - especially over structured or semi-structured domains like research, grants, or patents.
- Exposure to agent orchestration frameworks (Temporal, LangGraph, OpenAI Agents SDK) and the specific challenges of evaluating multi-step, tool-using systems.
- Background in information retrieval, search relevance, or a research-adjacent domain.
- Experience building internal tooling or dashboards that non-engineers (PMs, domain experts) actually use to label and review model outputs.

Why This Role
- You'll be the first dedicated evals hire. The scope, standards, and tooling are yours to define.
- AI quality is existential for our product. This isn't a compliance role tucked into a corner - it's directly on the critical path to revenue.
- Small, senior team. ~10 engineers, distributed globally, with a strong bias toward shipping and measuring.
- Direct access to real-world, high-stakes LLM use cases - research discovery, grants, outbound - across a customer base that deeply values accuracy.

As part of your LinkedIn application submission, please use Loom to record a video and email it to us within the