Remote Senior Python Engineer - LLM Evaluation (US-based)
Job description
As a software engineering evaluator, you will work closely with researchers to create datasets for training, benchmarking, and advancing large language models. This includes curating code examples, providing precise solutions, and making corrections across the full stack: Python for backend and ML workflows, JavaScript (React, Node.js) for frontend and API layers, and additional work in C/C++, Java, Rust, and Go. You will evaluate and refine AI-generated code for efficiency, scalability, and reliability, and work with cross-functional teams to enhance enterprise-level AI-driven coding solutions.
What Does a Typical Day Look Like?
- Work on AI model training initiatives by curating code examples, building solutions, and correcting code across both Python and JavaScript (React, Node.js), with additional work in C/C++, Java, Rust, and Go.
- Evaluate and refine AI-generated code across backend and frontend contexts to ensure that it is efficient, scalable, and reliable.
- Collaborate with cross-functional teams to improve AI-driven coding solutions and measure them against industry performance benchmarks.
- Build agents that can verify the quality of the code and identify error patterns across full-stack applications.
- Form hypotheses about each step of the software engineering cycle (prototyping, architecture design, API design, production implementation, launch, experiments, monitoring, operational maintenance) and evaluate model capabilities on those steps.
- Design verification mechanisms that can automatically verify a solution to a software engineering task.
Requirements
- At least 3 years of software engineering experience.
- Strong expertise in building full-stack applications using Python and JavaScript (React, Node.js), with the ability to work across backend and frontend codebases.
- Experience deploying scalable, production-grade software using modern languages and tools.
- Deep understanding of software architecture, design, development, debugging, and code quality/review assessment.
- Excellent oral and written communication skills, including the ability to write clear, structured evaluation rationales.
Engagement Details
- Commitment: flexible engagement, minimum 10 hrs/week, up to 40 hrs/week