Data Scientist - $84.13 - $132.21 per hour
Job description
This opportunity is for an Applied Data Scientist focused on LLM evaluation, experimentation, quality measurement, and applied machine learning systems. The role is responsible for building an evaluation function from the ground up, defining what high-quality generated technical content means, and creating the infrastructure needed to measure model and pipeline changes with confidence.
The work centers on evaluating non-deterministic LLM outputs across a complex multi-stage content generation pipeline. This includes building statistical evaluation methods, developing gold-standard datasets, designing rubrics, creating automated quality signals, and helping engineering and product teams understand whether changes improve, degrade, or preserve output quality across languages, repository types, and content formats.
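Because outputs are non-deterministic, a single run per input is rarely enough. A minimal sketch of multi-run summarization (function and input names are illustrative, not from the posting): score each evaluation input several times, then report per-input mean and spread so a model change can be distinguished from sampling noise.

```python
import statistics

def multi_run_summary(scores_by_input):
    """Summarize repeated evaluation runs of a non-deterministic pipeline.

    `scores_by_input` maps each evaluation input (e.g. a repository) to the
    rubric scores it received across repeated generation runs.
    """
    summary = {}
    for name, scores in scores_by_input.items():
        summary[name] = {
            "mean": statistics.mean(scores),
            # Sample standard deviation needs at least two runs.
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary
```

Inputs whose run-to-run standard deviation dwarfs the observed mean shift are candidates for more runs before any conclusion is drawn.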
What You'll Do
- Own the LLM evaluation strategy from first principles through production-ready infrastructure.
- Build the evaluation function from the ground up and help grow the team as the function matures.
- Define quality metrics for generated technical content across multiple content types and abstraction levels.
- Build and curate gold-standard evaluation datasets across programming languages and repository archetypes, including monorepos, microservices, libraries, and applications.
- Design evaluation rubrics that measure accuracy, completeness, usefulness, readability, and overall content quality.
- Create automated evaluation pipelines that score generated output against reference datasets.
- Instrument content generation workflows to support A/B comparisons between models, context strategies, and pipeline approaches.
- Build tooling for LLM-as-judge evaluation, regression detection, and automated quality monitoring.
- Integrate evaluation into CI workflows so pipeline changes are supported by measurable quality evidence.
- Develop quality checks that flag degraded output without requiring manual review of every document.
- Monitor content quality trends over time and identify meaningful changes in output performance.
- Design sampling strategies for human review that maximize signal while minimizing annotation effort.
- Run experiments on model selection, context strategies, and pipeline architecture changes.
- Quantify cost, quality, and latency tradeoffs to support technical and product decisions.
- Partner with engineering teams to translate evaluation insights into shipped product and system improvements.
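The A/B comparison and LLM-as-judge responsibilities above can be sketched as follows. This is a hypothetical outline, not the company's actual tooling: `judge_score` stands in for a real judge-model call (which would prompt a model with a rubric and parse its response) and here uses a trivial deterministic heuristic so the sketch runs without API access.

```python
import statistics

def judge_score(document: str) -> float:
    """Hypothetical stand-in for an LLM-as-judge call.

    A real implementation would send the document and a rubric to a judge
    model; this placeholder scores by word count so the example is runnable.
    """
    return min(len(document.split()) / 100, 1.0)

def compare_pipelines(outputs_a, outputs_b):
    """Score outputs from two pipeline variants and report mean rubric scores."""
    mean_a = statistics.mean(judge_score(d) for d in outputs_a)
    mean_b = statistics.mean(judge_score(d) for d in outputs_b)
    return {"a": mean_a, "b": mean_b, "delta": mean_b - mean_a}
```

The same comparison function can back a CI gate: a pipeline change ships only when `delta` on a reference dataset stays above an agreed regression threshold.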
Requirements
- Bachelor's, Master's, or PhD in Statistics, Machine Learning, Data Science, Computational Linguistics, or a related quantitative field.
- 3-5 years of experience in applied science, machine learning engineering, or data science roles focused on evaluation, NLP, or generative AI; 7+ years of relevant experience preferred.
- Strong foundation in experimental design, hypothesis testing, confidence intervals, effect sizes, and power analysis.
- Experience designing and running evaluations for LLM or NLP systems, especially open-ended text outputs.
- Proficiency in Python and the scientific data stack, including pandas, NumPy, SciPy, and scikit-learn.
- Comfort working in Jupyter notebooks for exploration and prototyping, then converting that work into automated pipelines.
- Experience with LLM-as-judge approaches, inter-annotator agreement, and rubric design for subjective quality assessment.
- Familiarity with evaluating non-deterministic systems, including variance decomposition, multi-run methodology, and distinguishing signal from noise at scale.
- Strong data storytelling skills, with the ability to turn experiment results into clear recommendations for engineering and product decisions.
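The statistical foundations listed above might look like the following in practice: comparing quality scores from two pipeline variants with Welch's t-test and a Cohen's d effect size. The scores and function name are illustrative assumptions, not part of the posting.

```python
import numpy as np
from scipy import stats

def compare_scores(scores_a, scores_b, alpha=0.05):
    """Welch's t-test plus Cohen's d for two sets of quality scores."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    # Welch's test: no equal-variance assumption between variants.
    t, p = stats.ttest_ind(a, b, equal_var=False)
    # Cohen's d with a pooled standard deviation as a rough effect size.
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (b.mean() - a.mean()) / pooled
    return {"t": t, "p": p, "cohens_d": d, "significant": p < alpha}
```

Reporting the effect size alongside the p-value is what turns "the change is significant" into a recommendation: a significant but tiny effect may not justify a costlier model.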
Preferred Skills
- Experience with LLM APIs and prompt engineering across multiple providers.
- Familiarity with evaluation frameworks such as RAGAS, DeepEval, or custom evaluation harnesses.
- Experience building data pipelines or ETL workflows using tools such as Airflow, Dagster, or similar systems.
- Comfort with SQL and working directly with production data stores.
- Experience with visualization tools such as Matplotlib, Plotly, or Streamlit for internal dashboards and reports.
- Background in code understanding, developer tools, or technical documentation.
- Experience building or managing annotation pipelines and human evaluation workflows.
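Inter-annotator agreement, mentioned in the requirements, is often measured with Cohen's kappa. A minimal self-contained sketch (the label values are made-up examples):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Corrects raw agreement for the agreement expected by chance given
    each annotator's label frequencies.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1 - expected)
```

Low kappa on a rubric dimension is usually a signal to tighten the rubric wording before trusting either human labels or an LLM judge calibrated against them.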
Benefits & conditions
- Competitive cash compensation and equity package
- Flexible work culture
- Unlimited time off
- 12 paid company holidays
- Health, dental, and vision insurance
- Life insurance
- FSA accounts
- 401(k) retirement account options (Traditional, Roth, or both)
- Quarterly team offsites