Data Scientist - $84.13 - $132.21 per hour
Job description
This opportunity is for an Applied Data Scientist focused on LLM evaluation, experimentation, quality measurement, and applied machine learning systems. The role is responsible for building an evaluation function from the ground up, defining what high-quality generated technical content means, and creating the infrastructure needed to measure model and pipeline changes with confidence.
The work centers on evaluating non-deterministic LLM outputs across a complex multi-stage content generation pipeline. This includes building statistical evaluation methods, developing gold-standard datasets, designing rubrics, creating automated quality signals, and helping engineering and product teams understand whether changes improve, degrade, or preserve output quality across languages, repository types, and content formats.
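Because outputs are non-deterministic, a single run per input is rarely enough. A minimal sketch of multi-run summarization (function and input names are illustrative, not from the posting): score each evaluation input several times, then report per-input mean and spread so a model change can be distinguished from sampling noise.

```python
import statistics

def multi_run_summary(scores_by_input):
    """Summarize repeated evaluation runs of a non-deterministic pipeline.

    `scores_by_input` maps each evaluation input (e.g. a repository) to the
    rubric scores it received across repeated generation runs.
    """
    summary = {}
    for name, scores in scores_by_input.items():
        summary[name] = {
            "mean": statistics.mean(scores),
            # Sample standard deviation needs at least two runs.
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary
```

Inputs whose run-to-run standard deviation dwarfs the observed mean shift are candidates for more runs before any conclusion is drawn.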
What You'll Do
- Own the LLM evaluation strategy from first principles through production-ready infrastructure.
- Build the evaluation function from the ground up and help grow the team as the function matures.
- Define quality metrics for generated technical content across multiple content types and abstraction levels.
- Build and curate gold-standard evaluation datasets across programming languages and repository archetypes, including monorepos, microservices, libraries, and applications.
- Design evaluation rubrics that measure accuracy, completeness, usefulness, readability, and overall content quality.
- Create automated evaluation pipelines that score generated output against reference datasets.
- Instrument content generation workflows to support A/B comparisons between models, context strategies, and pipeline approaches.
- Build tooling for LLM-as-judge evaluation, regression detection, and automated quality monitoring.
- Integrate evaluation into CI workflows so pipeline changes are supported by measurable quality evidence.
- Develop quality checks that flag degraded output without requiring manual review of every document.
- Monitor content quality trends over time and identify meaningful changes in output performance.
- Design sampling strategies for human review that maximize signal while minimizing annotation effort.
- Run experiments on model selection, context strategies, and pipeline architecture changes.
- Quantify cost, quality, and latency tradeoffs to support technical and product decisions.
- Partner with engineering teams to translate evaluation insights into shipped product and system improvements.
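The A/B comparison and LLM-as-judge responsibilities above can be sketched as follows. This is a hypothetical outline, not the company's actual tooling: `judge_score` stands in for a real judge-model call (which would prompt a model with a rubric and parse its response) and here uses a trivial deterministic heuristic so the sketch runs without API access.

```python
import statistics

def judge_score(document: str) -> float:
    """Hypothetical stand-in for an LLM-as-judge call.

    A real implementation would send the document and a rubric to a judge
    model; this placeholder scores by word count so the example is runnable.
    """
    return min(len(document.split()) / 100, 1.0)

def compare_pipelines(outputs_a, outputs_b):
    """Score outputs from two pipeline variants and report mean rubric scores."""
    mean_a = statistics.mean(judge_score(d) for d in outputs_a)
    mean_b = statistics.mean(judge_score(d) for d in outputs_b)
    return {"a": mean_a, "b": mean_b, "delta": mean_b - mean_a}
```

The same comparison function can back a CI gate: a pipeline change ships only when `delta` on a reference dataset stays above an agreed regression threshold.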
Requirements
- Bachelor's, Master's, or PhD in Statistics, Machine Learning, Data Science, Computational Linguistics, or a related quantitative field.
- 3-5 years of experience in applied science, machine learning engineering, or data science roles focused on evaluation, NLP, or generative AI; 7+ years of relevant experience preferred.
- Strong foundation in experimental design, hypothesis testing, confidence intervals, effect sizes, and power analysis.
- Experience designing and running evaluations for LLM or NLP systems, especially open-ended text outputs.
- Proficiency in Python and the scientific data stack, including pandas, NumPy, SciPy, and scikit-learn.
- Comfort working in Jupyter notebooks for exploration and prototyping, then converting that work into automated pipelines.
- Experience with LLM-as-judge approaches, inter-annotator agreement, and rubric design for subjective quality assessment.
- Familiarity with evaluating non-deterministic systems, including variance decomposition, multi-run methodology, and distinguishing signal from noise at scale.
- Strong data storytelling skills, with the ability to turn experiment results into clear recommendations for engineering and product decisions.
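The statistical foundations listed above might look like the following in practice: comparing quality scores from two pipeline variants with Welch's t-test and a Cohen's d effect size. The scores and function name are illustrative assumptions, not part of the posting.

```python
import numpy as np
from scipy import stats

def compare_scores(scores_a, scores_b, alpha=0.05):
    """Welch's t-test plus Cohen's d for two sets of quality scores."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    # Welch's test: no equal-variance assumption between variants.
    t, p = stats.ttest_ind(a, b, equal_var=False)
    # Cohen's d with a pooled standard deviation as a rough effect size.
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (b.mean() - a.mean()) / pooled
    return {"t": t, "p": p, "cohens_d": d, "significant": p < alpha}
```

Reporting the effect size alongside the p-value is what turns "the change is significant" into a recommendation: a significant but tiny effect may not justify a costlier model.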
Preferred Skills
- Experience with LLM APIs and prompt engineering across multiple providers.
- Familiarity with evaluation frameworks such as RAGAS, DeepEval, or custom evaluation harnesses.
- Experience building data pipelines or ETL workflows using tools such as Airflow, Dagster, or similar systems.
- Comfort with SQL and working directly with production data stores.
- Experience with visualization tools such as Matplotlib, Plotly, or Streamlit for internal dashboards and reports.
- Background in code understanding, developer tools, or technical documentation.
- Experience building or managing annotation pipelines and human evaluation workflows.
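Inter-annotator agreement, mentioned in the requirements, is often measured with Cohen's kappa. A minimal self-contained sketch (the label values are made-up examples):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Corrects raw agreement for the agreement expected by chance given
    each annotator's label frequencies.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1 - expected)
```

Low kappa on a rubric dimension is usually a signal to tighten the rubric wording before trusting either human labels or an LLM judge calibrated against them.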
Benefits & conditions
- Competitive cash compensation and equity package
- Flexible work culture
- Unlimited time off
- 12 paid company holidays
- Health, dental, and vision insurance
- Life insurance
- FSA accounts
- 401(k) retirement account options (Traditional, Roth, or both)
- Quarterly team offsites