AI Benchmark Software Engineer

Turing Technology, Inc.

yesterday

Role details

Contract type

Contract

Employment type

Part-time (≤ 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Tech stack

JavaScript

Artificial Intelligence

Software Bug Management

Software Debugging

Django

Python

Node.js

Open Source Technology

Software Engineering

Flask

Large Language Models

Multi-Agent Systems

FastAPI

Build Management

Pytest

Git Flow

Codebase

Docker

Job description

We are looking for experienced software engineers to design and build high-quality multi-agent benchmark tasks based on real-world software engineering workflows.

In this role, you will create tasks grounded in real open-source code changes such as bug fixes, migrations, and refactors. These tasks are used to evaluate how effectively AI agents can understand large codebases, apply precise modifications, and produce correct, testable outputs.

You will work within a structured evaluation framework (Harbor), define clear task instructions, design verification logic, and decompose complex engineering problems across multiple specialized agents.

What the day-to-day looks like:

Build multi-agent benchmark tasks based on real-world open-source code changes (bug fixes, migrations, refactors)
Work with the Harbor evaluation framework to run and validate tasks inside Docker environments
Write clear, precise task instructions specifying file paths, function signatures, expected behavior, and constraints
Design and implement Python-based verification scripts to validate correctness of agent-generated code changes
Create decomposition strategies that split complex code changes across multiple independent sub-agents
Run, debug, and refine tasks within containerized environments to ensure reproducibility and determinism
Evaluate task performance signals and improve task quality, clarity, and difficulty
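
To give a sense of the verification work described above, here is a minimal sketch of a Python check that an agent-produced source file defines a function with the expected parameters. It is illustrative only: the function name `normalize_path`, the sample source, and the helper names are hypothetical, and it does not use or represent the Harbor framework's API.

```python
import ast


def find_function(source: str, name: str):
    """Return the module-level function node called `name`, or None."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return node
    return None


def verify_signature(source: str, name: str, expected_params: list) -> tuple:
    """Check that `name` exists and takes exactly `expected_params`, in order.

    Returns (passed, reason) so a task runner can report why a check failed.
    """
    try:
        node = find_function(source, name)
    except SyntaxError as exc:
        return False, f"source does not parse: {exc.msg}"
    if node is None:
        return False, f"function {name!r} not found"
    actual = [arg.arg for arg in node.args.args]
    if actual != expected_params:
        return False, f"expected parameters {expected_params}, got {actual}"
    return True, "ok"


# Hypothetical agent output; a real task would read the modified file from disk.
AGENT_OUTPUT = '''
def normalize_path(path, strict=False):
    return path.strip("/")
'''

if __name__ == "__main__":
    passed, reason = verify_signature(AGENT_OUTPUT, "normalize_path", ["path", "strict"])
    print("PASS" if passed else f"FAIL: {reason}")
```

A static check like this is deterministic and never executes untrusted code, but it only covers the signature; behavioral checks would typically run the project's tests (for example with pytest) inside the task's container.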

Requirements

5+ years of experience in Python and JavaScript development
Experience with AI coding benchmarks (e.g., SWE-bench, Terminal-Bench)
Strong experience reading and navigating large open-source codebases (e.g., Django, Flask, FastAPI, Node.js, or similar)
Familiarity with Git workflows, including pull requests, diffs, cherry-picking, and working with specific commits
Comfortable working with Docker (writing Dockerfiles, building images, debugging container issues)
Experience writing test scripts (pytest, unittest, or custom assertion-based testing)
Ability to write clear, precise, and unambiguous technical specifications

Perks of Freelancing With Turing

Work on cutting-edge AI projects with leading foundation model companies
Collaborate on high-impact work at the frontier of LLM evaluation and reasoning
Remote, flexible opportunities with global teams

About the company

Turing is one of the world's fastest-growing AI companies, accelerating the advancement and deployment of powerful AI systems. Turing helps customers in two ways: working with the world's leading AI labs to advance frontier model capabilities in thinking, reasoning, coding, agentic behavior, multimodality, multilinguality, STEM, and frontier knowledge; and leveraging that work to build real-world AI systems that solve mission-critical priorities for companies.

Turing.com
Company size: 201-500
Palo Alto, California, United States
