Data Scientist - $84.13 - $132.21 per hour

RECRUITER LLC
3 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours per week)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate
Compensation
Up to $275K per year (the top of the hourly range, annualized)

Job location

Remote

Tech stack

Airflow
Computational Linguistics
ETL
Data Presentation
Data Visualization
Programming Tools
Statistical Hypothesis Testing
Python
Machine Learning
NumPy
Operational Databases
SciPy
SQL Databases
Jupyter Notebook
Delivery Pipeline
Large Language Models
Prompt Engineering
Pandas
Matplotlib
scikit-learn
Power Analysis (Statistics)
Plotly
Streamlit
Data Pipelines
Programming Languages
Microservices

Job description

This opportunity is for an Applied Data Scientist focused on LLM evaluation, experimentation, quality measurement, and applied machine learning systems. The role is responsible for building an evaluation function from the ground up, defining what high-quality generated technical content means, and creating the infrastructure needed to measure model and pipeline changes with confidence.

The work centers on evaluating non-deterministic LLM outputs across a complex multi-stage content generation pipeline. This includes building statistical evaluation methods, developing gold-standard datasets, designing rubrics, creating automated quality signals, and helping engineering and product teams understand whether changes improve, degrade, or preserve output quality across languages, repository types, and content formats.
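
To give a flavor of the measurement problem (an illustrative sketch only, not the team's actual tooling): evaluating a non-deterministic pipeline typically means scoring repeated runs and reporting uncertainty rather than a single number. Here, generate_and_score is a hypothetical stand-in for one pipeline run plus automated scoring:

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_and_score(prompt: str) -> float:
        # Hypothetical stand-in: run the generation pipeline once and
        # return a quality score in [0, 1]. Real scores would come from
        # a rubric, a reference comparison, or an automated metric.
        return float(np.clip(rng.normal(0.7, 0.1), 0.0, 1.0))

    def multi_run_estimate(prompt: str, runs: int = 30, boots: int = 2000):
        # Score several runs, then bootstrap a 95% confidence interval
        # for the mean so run-to-run noise is visible in the result.
        scores = np.array([generate_and_score(prompt) for _ in range(runs)])
        means = [rng.choice(scores, size=runs, replace=True).mean()
                 for _ in range(boots)]
        low, high = np.percentile(means, [2.5, 97.5])
        return scores.mean(), (low, high)

    mean, (low, high) = multi_run_estimate("document module X")
    print(f"mean quality {mean:.3f}, 95% CI [{low:.3f}, {high:.3f}]")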

What You'll Do

  • Own the LLM evaluation strategy from first principles through production-ready infrastructure.
  • Build the evaluation function from the ground up and help grow the team as the function matures.
  • Define quality metrics for generated technical content across multiple content types and abstraction levels.
  • Build and curate gold-standard evaluation datasets across programming languages and repository archetypes, including monorepos, microservices, libraries, and applications.
  • Design evaluation rubrics that measure accuracy, completeness, usefulness, readability, and overall content quality.
  • Create automated evaluation pipelines that score generated output against reference datasets.
  • Instrument content generation workflows to support A/B comparisons between models, context strategies, and pipeline approaches.
  • Build tooling for LLM-as-judge evaluation, regression detection, and automated quality monitoring (a minimal judge-agreement sketch follows this list).
  • Integrate evaluation into CI workflows so pipeline changes are supported by measurable quality evidence.
  • Develop quality checks that flag degraded output without requiring manual review of every document.
  • Monitor content quality trends over time and identify meaningful changes in output performance.
  • Design sampling strategies for human review that maximize signal while minimizing annotation effort.
  • Run experiments on model selection, context strategies, and pipeline architecture changes.
  • Quantify cost, quality, and latency tradeoffs to support technical and product decisions.
  • Partner with engineering teams to translate evaluation insights into shipped product and system improvements.
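
To make the LLM-as-judge bullet concrete, here is a minimal sketch of checking whether two judge configurations agree well enough to be trusted; the judge function, rubric, and documents are all hypothetical stand-ins (a real judge would prompt a model with the rubric and parse a structured rating back):

    import random
    from sklearn.metrics import cohen_kappa_score

    RUBRIC = ["accuracy", "completeness", "usefulness", "readability"]

    def judge(doc: str, criterion: str, judge_id: str) -> int:
        # Hypothetical stand-in for an LLM-as-judge call: returns a
        # 1-5 rating for one rubric criterion.
        return random.Random(f"{doc}|{criterion}|{judge_id}").randint(1, 5)

    docs = ["README for repo A", "API docs for repo B", "guide for repo C"]

    # Rate every document on every criterion with two judge
    # configurations (e.g. different prompts or models).
    a = [judge(d, c, "judge-a") for d in docs for c in RUBRIC]
    b = [judge(d, c, "judge-b") for d in docs for c in RUBRIC]

    # Cohen's kappa measures agreement beyond chance; a low kappa means
    # the judge setup is too noisy to gate pipeline changes on.
    print("inter-judge kappa:", cohen_kappa_score(a, b))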

Requirements

  • Bachelor's, Master's, or PhD in Statistics, Machine Learning, Data Science, Computational Linguistics, or a related quantitative field.
  • 3-5 years of experience in applied science, machine learning engineering, or data science roles focused on evaluation, NLP, or generative AI; 7+ years of relevant experience preferred.
  • Strong foundation in experimental design, hypothesis testing, confidence intervals, effect sizes, and power analysis (see the worked sketch after this list).
  • Experience designing and running evaluations for LLM or NLP systems, especially open-ended text outputs.
  • Proficiency in Python and the scientific data stack, including pandas, NumPy, SciPy, and scikit-learn.
  • Comfort working in Jupyter notebooks for exploration and prototyping, then converting that work into automated pipelines.
  • Experience with LLM-as-judge approaches, inter-annotator agreement, and rubric design for subjective quality assessment.
  • Familiarity with evaluating non-deterministic systems, including variance decomposition, multi-run methodology, and distinguishing signal from noise at scale.
  • Strong data storytelling skills, with the ability to turn experiment results into clear recommendations for engineering and product decisions.
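
As an example of the power-analysis expectation (an illustration using SciPy, not a prescribed method), the standard normal-approximation formula answers how many samples an A/B comparison needs per arm:

    import math
    from scipy.stats import norm

    def n_per_group(effect_size: float, alpha: float = 0.05,
                    power: float = 0.8) -> int:
        # Normal-approximation sample size for a two-sample comparison:
        # n >= 2 * ((z_{1-alpha/2} + z_{power}) / d)^2, with d = Cohen's d.
        z_alpha = norm.ppf(1 - alpha / 2)
        z_power = norm.ppf(power)
        return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

    # Documents needed per arm to detect a 0.2-standard-deviation
    # quality shift at alpha = 0.05 with 80% power:
    print(n_per_group(0.2))  # 393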

Preferred Skills

  • Experience with LLM APIs and prompt engineering across multiple providers.
  • Familiarity with evaluation frameworks such as RAGAS, DeepEval, or custom evaluation harnesses.
  • Experience building data pipelines or ETL workflows using tools such as Airflow, Dagster, or similar systems (a minimal DAG sketch follows this list).
  • Comfort with SQL and working directly with production data stores.
  • Experience with visualization tools such as Matplotlib, Plotly, or Streamlit for internal dashboards and reports.
  • Background in code understanding, developer tools, or technical documentation.
  • Experience building or managing annotation pipelines and human evaluation workflows.
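
For the pipeline bullet above, a nightly evaluation job in Airflow might look like the following minimal sketch (assuming Airflow 2.4+, where schedule replaces schedule_interval; the dag_id, schedule, and scoring step are illustrative assumptions):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def score_latest_outputs():
        # Hypothetical step: score the newest generated documents
        # against the gold-standard set and persist the results.
        ...

    with DAG(
        dag_id="nightly_llm_eval",   # illustrative name
        start_date=datetime(2024, 1, 1),
        schedule="0 2 * * *",        # 02:00 daily
        catchup=False,
    ) as dag:
        PythonOperator(task_id="score_outputs",
                       python_callable=score_latest_outputs)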

Benefits & conditions

  • Competitive cash compensation and equity package
  • Flexible work culture
  • Unlimited time off
  • 12 paid company holidays
  • Health, dental, and vision insurance
  • Life insurance
  • FSA accounts
  • 401(k) retirement account options, including Traditional, Roth, or both
  • Quarterly team offsites
