Principal Machine Learning Engineer
Job description
We are seeking a Principal Machine Learning Engineer (SageMaker, MLOps, Model Governance & Explainability) to provide technical leadership across the full lifecycle of machine learning systems powering a new matching platform. This role is accountable for defining ML architecture, establishing engineering standards, driving MLOps maturity, and ensuring that our models are scalable, secure, explainable, and governed to enterprise-grade standards.
You will contribute to the strategic direction of our ML platform, spanning data pipelines, model development, deployment automation, inference runtime design, telemetry, drift detection, and cross-account productionisation. You will mentor engineers, influence product and architectural decisions, and ensure that our ML systems operate reliably at scale, underpinned by a robust governance and compliance framework.
This is a highly hands-on, highly technical, principal-level role that combines architectural vision with deep practical expertise in ML engineering and AWS-native MLOps.
Technical Leadership & Architecture
- Define the end-to-end ML architecture for the matching platform, including data pipelines, model training workflows, inference runtimes, and telemetry ecosystems.
- Lead adoption of best-in-class MLOps patterns, platform tooling, and AWS SageMaker capabilities across training, processing, registry, monitoring, and deployment.
- Partner with platform, security, and data engineering teams to implement scalable, lakehouse-oriented feature architectures and enterprise-grade ML governance.
- Champion engineering standards for model quality, documentation, observability, and platform resilience.
Feature Engineering & Data Architecture
- Architect highly scalable, production-ready feature pipelines within lakehouse environments.
- Set the technical direction for resilience strategies (e.g., fallback pipelines).
- Establish and enforce data-quality guardrails, validation schemas, and monitoring frameworks.
- Drive adoption and standards for enterprise feature stores.
Model Development & Technical Excellence
- Lead the design of ranking, scoring, and similarity models tailored to the matching platform's requirements.
- Define model calibration, scoring logic, confidence thresholds, and optimisation strategies.
- Mentor teams on advanced ML techniques using frameworks such as PyTorch, TensorFlow, and XGBoost.
- Review and approve technical designs for complex modelling workflows.
Explainability & Regulatory-Grade Reasoning
- Establish explainability standards across the ML stack, using SHAP or equivalent frameworks.
- Define patterns to generate regulator-ready reason codes, aligned with compliance requirements.
- Ensure explainability artefacts are accurate, robust, and traceable across model versions.
ML Deployment & Automation (MLOps)
- Architect automated training, deployment, and retraining pipelines using AWS SageMaker.
- Set standards for model registry usage, automated approvals, and rollback orchestration.
- Drive infrastructure-as-code and CI/CD maturity for ML systems across multiple environments.
- Lead design of enterprise-wide weight-update patterns and lineage-aware deployment strategies.
Inference Runtime & Cross-Account Productionisation
- Architect low-latency, high-throughput inference services that meet strict matching platform SLAs.
- Lead the design of secure cross-account IAM patterns for model consumption.
- Own end-to-end telemetry design, including scoring metrics, latency, error analytics, and SLOs.
- Partner with platform teams to optimise cost, scale, and reliability of inference endpoints.
Monitoring, Drift Detection & Observability
- Define observability standards for feature drift, concept drift, performance degradation, and data integrity.
- Lead the creation of dashboards, benchmarks, and automated alerting across the ML ecosystem.
- Ensure telemetry pipelines adhere to privacy, data minimisation, and compliance policies.
- Drive adoption of proactive failover, shadow-mode testing, and continuous validation patterns.
Security, Compliance & ML Governance
- Set and enforce ML-specific security standards including data minimisation, encryption, and PII handling.
- Oversee creation of Model Cards, lineage artefacts, and compliance documentation.
- Ensure ML systems meet governance standards for auditability, reproducibility, versioning, and traceability.
- Collaborate with InfoSec and Risk teams to define ML governance frameworks and secure cross-environment workflows.
Testing, Validation & Performance Engineering
- Lead validation strategies using golden datasets, behavioural tests, and benchmark suites.
- Architect performance testing for latency-sensitive inference paths and model hot paths.
- Establish standards for A/B testing, shadow deployments, canary rollouts, and controlled experiments.
Requirements
- Proven track record architecting and delivering production ML systems at scale in enterprise environments.
- Deep expertise with AWS SageMaker (training, processing, pipelines, endpoints, registry) and complementary AWS services.
- Expert-level proficiency in Python and ML frameworks (e.g., PyTorch, TensorFlow, XGBoost).
- Strong thought leadership in MLOps automation, CI/CD for ML, and model lifecycle management.
- Advanced experience designing explainability systems, reason codes, and governance artefacts.
- Expertise in low-latency inference architectures and real-time model serving.
- Strong grounding in drift detection, telemetry pipelines, observability patterns, and model QA.
- Experience shaping ML security practices, including cross-account IAM, data minimisation, and PII-safe design.
- Ability to influence architecture, mentor senior engineers, and set long-term technical direction.
Nice to Have
- Experience building or leading feature store adoption.
- Background in ranking, search relevance, entity matching, or similarity modelling.
- Experience designing or governing multi-account AWS ML platforms.
- Knowledge of distributed training, GPU/accelerator optimisation, and scaling strategies.
- Bachelor's degree in a STEM subject (e.g., mathematics, physics, engineering, computer science) or an adjacent field.
- Master's degree, PhD, or equivalent experience in STEM is desirable but not essential.