Machine Learning Engineer
Role details
Job location
Tech stack
Job description
- Practices: Lead the team in adopting professional engineering standards. This includes owning the strategy for unit/integration testing, peer code reviews, and applying SOLID principles to ML codebases to ensure they remain modular and maintainable.
- ML Observability: Establish and own the telemetry framework for the AI stack. Implement proactive monitoring for system health and model-specific metrics, such as data drift, concept drift, and prediction accuracy.
- FinOps & Cost Management: Own the strategy for AI cloud spend. Build monitoring and alerting frameworks to track compute costs (training and inference) and implement optimization strategies like auto-scaling and spot-instance usage.
- AI Systems Engineering: Act as a lead software engineer to integrate models into the product ecosystem. Develop high-performance, secure APIs and microservices that wrap our ML capabilities for production consumption.
- Data & Model Governance: Own the versioning strategy for the "Holy Trinity" of ML: code, data, and model artifacts. Ensure clear documentation and audit trails for all production deployments.
Requirements
What we're looking for:
Essential skills (entry requirements)
- Demonstrating strong software engineering fundamentals, including production-quality Python, testing, CI/CD practices, and version control.
- Designing and operating reliable, versioned REST APIs using an API-first approach.
- Building, deploying, and operating backend services in cloud environments, with AWS as the primary platform (experience on other major clouds considered transferable).
- Using containerisation and modern deployment approaches, including Docker, automated pipelines, and basic observability.
- Working effectively with real-world data and production systems in collaboration with product, data, and platform teams.
- Bringing either hands-on experience delivering machine-learning systems in production or a very strong software-engineering background with clear motivation to grow into ML and MLOps.
Desirable skills (strong differentiators)
- Using AWS SageMaker for training, deploying, and operating machine-learning workloads, or demonstrating equivalent experience on similar cloud ML platforms.
- Exposing machine-learning models via APIs (e.g. FastAPI-based inference services) and operating them reliably at scale.
- Applying MLOps practices, including model and version management, monitoring, and handling model or data drift.
- Implementing advanced service patterns such as asynchronous processing, event-driven architectures, or multi-version services.
- Serving LLM or GenAI-based capabilities in production, including model serving, RAG pipelines, and inference controls.
- Designing reusable, platform-level services and shared ML patterns rather than one-off implementations.
- Managing cloud operational trade-offs, including cost efficiency, latency, scalability, and reliability.
Health and Safety Responsibilities
- Fostering a safety culture by leading by example.
- Following established safety procedures and reporting potential hazards promptly to maintain a secure and efficient workplace.
- Participating in safety training sessions and adhering to preventive guidelines and procedures, with the objective of minimizing risks and protecting yourself and your colleagues.
Benefits
- Medical and dental insurance: Fully funded medical and dental insurance.
- Flexible benefits: Exchange part of your salary and make tax savings on meal and transport vouchers, childcare, and training.
- Well-being: Free access to the Calm app (for up to 5 users), 24/7