Principal Machine Learning Engineer - Production Systems

SoftInWay UK Ltd.
Bradley Stoke, United Kingdom
1 month ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Bradley Stoke, United Kingdom

Tech stack

.NET
API
Automation of Tests
C Sharp (Programming Language)
Nvidia CUDA
Data Validation
Protocol Buffers
Interoperability
Python
Machine Learning
TensorFlow
Management of Software Versions
Data Logging
PyTorch
Delivery Pipeline
Keras
Containerization
Kubernetes
Docker

Job description

Overview

SoftInWay UK Ltd. is seeking a highly experienced ML Systems Architect to design and implement a scalable, production-grade architecture for our machine learning solver. This role bridges research prototypes and commercial deployment, ensuring reliability, maintainability, and performance in a mixed technology stack.

Responsibilities

Architect the ML Solver Platform:
Define modular architecture for data preprocessing, model execution, and post-processing.
Establish clear API contracts between Python/TensorFlow and C# services.

Productionize ML Workflows:
Convert research code into robust, testable, and observable services.
Implement CI/CD pipelines, automated testing, and reproducibility standards.

Integration & Interoperability:
Design REST/gRPC endpoints for cross-language communication.
Ensure compatibility with C#/.NET services.

Performance & Scalability:
Optimize GPU/CPU utilization, batching strategies, and memory management.
Plan for multi-model and multi-tenant scenarios.

MLOps & Lifecycle Management:
Implement model versioning, artifact registries, and deployment workflows.
Set up monitoring, logging, and alerting for solver performance.

Security & Compliance:
Apply best practices for secrets management, dependency scanning, and secure artifact storage.

Requirements

Required Skills & Experience

ML Frameworks: Expert in TensorFlow (TF2/Keras), experience with ONNX Runtime for inference.
Programming: Advanced Python for ML; strong understanding of packaging, type checking, and performance profiling.
Architecture: Proven experience designing scalable ML systems for production.
APIs: Proficiency in gRPC/Protobuf and REST for cross-language integration.
MLOps: CI/CD pipelines, containerization (Docker/Kubernetes), model registries, reproducibility.
Performance Optimization: GPU acceleration (CUDA/cuDNN), mixed precision, XLA, profiling.
Observability: Metrics, tracing, structured logging, dashboards.
Security: SBOM, image signing, role-based access, vulnerability scanning.

Preferred Qualifications

Experience with ONNX Runtime Training, PyTorch, or hybrid ML architectures.
Familiarity with distributed training strategies and multi-GPU setups.
Knowledge of feature stores and data validation frameworks.
Exposure to regulated environments and compliance frameworks.

Benefits & conditions

Tools & Technologies

ML: TensorFlow, ONNX Runtime, tf2onnx.
APIs: FastAPI, gRPC.
DevOps: GitLab CI/GitHub Actions, Docker, Kubernetes.
Monitoring: Prometheus, Grafana, OpenTelemetry.
Security: HashiCorp Vault, Sigstore.

Why Join Us?

Work on cutting-edge ML solutions integrated into commercial engineering software.
Define architecture that scales across global deployments.
Collaborate with a team of experts in ML, software engineering, and UI development.
Competitive salary and benefits.

To apply: Send your resume and a brief cover letter to #####
