Machine Learning and AI Engineer
Job description
Build and maintain ML/AI features
- Develop and improve the components that make AI agents intelligent: prompt engineering, classifier pipelines, goal evaluation logic, and post-call analysis
- Work with the LangChain/LangGraph agent framework to build, test, and refine conversation flows that handle real-world customer interactions
- Implement and evaluate data extraction pipelines - turning unstructured conversation transcripts into structured fields (names, dates, postcodes, appointment preferences) reliably
Integrate and evaluate models
- Integrate LLM providers (OpenAI, Anthropic, Groq, Google) into the platform's agent orchestration layer, including prompt construction, response parsing, and error handling
- Run model evaluations - comparing output quality, latency, and cost across providers and model versions to inform which models the platform uses in production
- Work with the existing LangSmith tracing infrastructure to monitor model performance and identify regressions
Support the voice and classification pipeline
- Contribute to the STT (speech-to-text) and TTS (text-to-speech) integration layer - understanding how audio becomes text, how text becomes an agent response, and how that response becomes audio again
- Help build and extend the classification system that determines conversation outcomes (was the call successful? did the customer want a callback? was it a voicemail?) - including writing evaluation prompts, defining ground truth datasets, and measuring accuracy
- Assist with data preparation, feature engineering, and dataset curation for evaluation and fine-tuning tasks
Write production-quality code
- Write clean, tested Python that runs in a production FastAPI application - not throwaway scripts
- Participate in code reviews, both giving and receiving - learning from the senior developer's feedback and contributing your own perspective
- Contribute to documentation that helps the rest of the engineering team understand how AI components work and how to use them correctly
- You've worked with LangChain, LangGraph, or similar agent frameworks - even in a personal project or hackathon
- You've built something with the OpenAI or Anthropic API that went beyond "hello world" - a chatbot, a classifier, a data extraction pipeline, an evaluation harness
- You understand the basics of how voice AI works: STT → LLM → TTS - even if you've only read about it rather than built it
- You've worked with structured evaluation of LLM outputs - comparing model responses against expected answers, not just eyeballing whether it "looks right"
- You have opinions about prompt engineering - you've iterated on prompts and observed how small changes affect output quality
What You Won't Be Doing
- Working in isolation on research problems - this is a product engineering role embedded in a delivery team
- Training large models from scratch - the platform uses hosted LLM APIs; your job is integration, evaluation, and orchestration, not pretraining
- Waiting to be told what to do - you'll have guidance and mentorship from the senior developer, but you're expected to take ownership of your tasks and ask questions when you're stuck
Requirements
Do you have experience in Python?
The platform runs on a practical AI stack: LangChain and LangGraph for agent orchestration, OpenAI and Anthropic for LLMs, Deepgram for speech-to-text, ElevenLabs for text-to-speech, and LiveKit for real-time voice infrastructure. You don't need to know all of these coming in, but you do need to be comfortable working with APIs, understanding model behaviour, and writing Python that runs in production - not just in notebooks.
You've finished your degree or equivalent, and you've spent some time - whether through jobs, internships, or serious personal projects - working with ML or AI in a way that went beyond coursework.
- 1-2 years of experience working with ML/AI (including internships, placement years, or substantial personal/open-source projects)
- Solid Python skills - you can write functions, classes, and tests confidently, not just Jupyter notebooks
- Familiarity with at least some of: LLMs and prompt engineering, NLP, text classification, or information extraction - you don't need depth in all of them, but you need to have worked with at least one area hands-on
- Basic understanding of how ML models are evaluated - you know what precision, recall, and F1 mean and why they matter; you've compared model outputs against ground truth at least once
- Comfortable working with APIs and reading documentation - a significant part of this role involves integrating and configuring third-party AI services, not building models from scratch
- Familiar with Git and working in a team codebase - you've committed code that other people have reviewed, and you've reviewed other people's code
Benefits & conditions
We're building something global at Narwhal, and we mean that in every sense. The work we do requires different ways of thinking - and different ways of thinking come from different people.
At Narwhal, we're committed to building a diverse and inclusive team. We welcome applications from people of all backgrounds, identities, and experiences, and we actively work to ensure our hiring process is fair and accessible for everyone. Reasonable adjustments are available at every stage; just reach out and we'll make it happen.
Pay: £75,000.00-£100,000.00 per year