Senior Software Engineer - AI Evaluation & Benchmarks (Python)

G2i Inc.

Delray Beach, United States of America

17 days ago

Role details

Contract type

Temporary to permanent

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$208K

Job location

Remote

Delray Beach, United States of America

Tech stack

JavaScript

Artificial Intelligence

Unit Testing

C++

Continuous Integration

Software Debugging

JUnit

Python

Software Engineering

Software Organization

Large Language Models

Git

Build Management

Pytest

Code Testing

Codebase

Mocha

Data Pipelines

Job description

Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:

Design coding benchmarks that evaluate frontier models on real-world programming tasks - reasoning, debugging, and production-quality code
Build and maintain scalable data pipelines for evaluation workflows
Analyze model-generated code for correctness, reliability, and edge-case failures
Construct structured evaluation scenarios across large repos and multi-language environments
Provide detailed technical feedback on model performance and failure patterns
Contribute to evaluation frameworks that set the bar for how coding ability is measured

End result: benchmarks that meaningfully separate what frontier models can and can't do - and shape how the next generation is trained and improved.

AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.
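
For candidates who want a concrete picture of that loop, here is a minimal sketch in Python. It is illustrative only and not G2i's actual harness: the Task class, the failure categories, and stub_model are all hypothetical names invented for this example, and the stub stands in for a real model call. A production harness would also sandbox generated code rather than calling exec directly.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """Hypothetical benchmark task: a prompt plus hidden test cases."""
    name: str
    prompt: str
    entry_point: str   # function name the generated code must define
    cases: list        # list of (args_tuple, expected_result)


def run_task(task: Task, model: Callable[[str], str]) -> str:
    """Run one task and return 'pass' or a coarse failure category."""
    source = model(task.prompt)
    namespace: dict = {}
    try:
        # Assumption: trusted toy code. Real harnesses isolate this step.
        exec(source, namespace)
    except SyntaxError:
        return "syntax_error"
    except Exception:
        return "load_error"
    func = namespace.get(task.entry_point)
    if not callable(func):
        return "missing_entry_point"
    for args, expected in task.cases:
        try:
            if func(*args) != expected:
                return "wrong_answer"
        except Exception:
            return "runtime_error"
    return "pass"


def evaluate(tasks: list, model: Callable[[str], str]) -> Counter:
    """Aggregate outcomes so failure patterns are visible at a glance."""
    return Counter(run_task(task, model) for task in tasks)


def stub_model(prompt: str) -> str:
    """Stand-in for a model: solves 'add' but misses an edge case in 'safe_div'."""
    if "safe_div" in prompt:
        return "def safe_div(a, b):\n    return a / b\n"
    return "def add(a, b):\n    return a + b\n"


TASKS = [
    Task("add", "Write add(a, b).", "add", [((1, 2), 3), ((-1, 1), 0)]),
    Task(
        "safe_div",
        "Write safe_div(a, b) that returns None when b is 0.",
        "safe_div",
        [((6, 3), 2.0), ((1, 0), None)],  # second case is the edge case
    ),
]

results = evaluate(TASKS, stub_model)
print(dict(results))
```

The edge case in the second task is what separates the outcomes: a benchmark with only the happy-path case would score the stub as fully correct, which is the kind of gap the failure analysis step is meant to surface.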

Requirements

4+ years of professional software engineering experience (non-negotiable)
Expert Python - clean, performant, well-tested code
Hands-on experience working in large, complex codebases
Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
Strong command of Git and modern development workflows
Track record at a high-growth tech company or top-tier software organization
Strong written English communication

Identity verification: Applicants will be required to verify their identity and confirm they have valid documentation to work as an independent contractor in their country of residence.

Nice to have

Senior or Lead-level profile with a history of technical ownership
Bachelor's or Master's in CS, ML, or related field (or equivalent professional experience)
Proficiency in additional languages: JavaScript, Go, C++, or others
CI/CD experience and writing robust unit tests (pytest, Mocha, JUnit)
Background in security engineering or significant open-source contributions
Familiarity with AI/ML evaluation methodologies or model benchmarking

Benefits & conditions

Location: Fully remote - work from anywhere on the accepted locations list
Compensation: $80-$100/hr based on location and seniority
Contract length: 3 months, with potential for extension
Hours: Full-time availability preferred - hours vary by project and are not guaranteed week to week
Engagement: 1099 independent contractor
Payment: Weekly via PayPal or Stripe

Important: Hours are project-dependent and can vary week to week. We recommend keeping other work options open alongside this engagement rather than relying on it as your sole source of income.
