Senior Software Engineer - AI Evaluation & Benchmarks (Python)
Job description
Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:
- Design coding benchmarks that evaluate frontier models on real-world programming tasks: reasoning, debugging, and writing production-quality code
- Build and maintain scalable data pipelines for evaluation workflows
- Analyze model-generated code for correctness, reliability, and edge-case failures
- Construct structured evaluation scenarios across large repos and multi-language environments
- Provide detailed technical feedback on model performance and failure patterns
- Contribute to evaluation frameworks that set the bar for how coding ability is measured
End result: benchmarks that meaningfully separate what frontier models can and can't do - and shape how the next generation is trained and improved.
AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.
Requirements
- 4+ years of professional software engineering experience (non-negotiable)
- Expert Python - clean, performant, well-tested code
- Hands-on experience working in large, complex codebases
- Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
- Strong command of Git and modern development workflows
- Track record at a high-growth tech company or top-tier software organization
- Strong written English communication
Identity verification: Applicants will be required to verify their identity and confirm they have valid documentation to work as an independent contractor in their country of residence.
Nice to have
- Senior or Lead-level profile with a history of technical ownership
- Bachelor's or Master's in CS, ML, or related field (or equivalent professional experience)
- Proficiency in additional languages: JavaScript, Go, C++, or others
- Experience with CI/CD and with writing robust unit tests (pytest, Mocha, JUnit)
- Background in security engineering or significant open-source contributions
- Familiarity with AI/ML evaluation methodologies or model benchmarking
Benefits & conditions
- Location: Fully remote - work from anywhere on the accepted locations list
- Compensation: $80-$100/hr based on location and seniority
- Contract length: 3 months, with potential for extension
- Hours: Full-time availability preferred - hours vary by project and are not guaranteed week to week
- Engagement: 1099 independent contractor
- Payment: Weekly via PayPal or Stripe
Important: Hours are project-dependent and can vary week to week. We recommend keeping other work options open alongside this engagement rather than relying on it as your sole source of income.