Principal Machine Learning Engineer - Production Systems

SoftInWay Inc.
Weston-Super-Mare, United Kingdom
27 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Weston-Super-Mare, United Kingdom

Tech stack

.NET
API
Automation of Tests
C Sharp (Programming Language)
Profiling
Nvidia CUDA
Computer Programming
Continuous Integration
Data Validation
DevOps
Distributed Computing Environment
Memory Management
GitHub
Protocol Buffers
Design of User Interfaces
Interoperability
Python
Key Management
Machine Learning
Performance Tuning
TensorFlow
Prometheus
Software Engineering
Data Logging
PyTorch
Delivery Pipeline
Grafana
Keras
FastAPI
Containerization
GitLab CI
Kubernetes
ONNX (Open Neural Network Exchange) Format
HashiCorp
Machine Learning Operations
Software Version Control
API Management
Docker
Vulnerability Analysis

Job description

We are seeking a highly experienced ML Systems Architect to design and implement a scalable, production-grade architecture for our machine learning solver. This role bridges research prototypes and commercial deployment, ensuring reliability, maintainability, and performance in a mixed technology stack.

Architect the ML Solver Platform:

  • Define modular architecture for data preprocessing, model execution, and post-processing.
  • Establish clear API contracts between Python/TensorFlow and C# services.

Productionize ML Workflows:

  • Convert research code into robust, testable, and observable services.
  • Implement CI/CD pipelines, automated testing, and reproducibility standards.

Integration & Interoperability:

  • Design REST/gRPC endpoints for cross-language communication.
  • Ensure compatibility with C#/.NET services.

Performance & Scalability:

  • Optimize GPU/CPU utilization, batching strategies, and memory management.
  • Plan for multi-model and multi-tenant scenarios.

MLOps & Lifecycle Management:

  • Implement model versioning, artifact registries, and deployment workflows.
  • Set up monitoring, logging, and alerting for solver performance.

Security & Compliance:

  • Apply best practices for secrets management, dependency scanning, and secure artifact storage.

  • ML: TensorFlow, ONNX Runtime, tf2onnx.
  • APIs: FastAPI, gRPC.
  • DevOps: GitLab CI/GitHub Actions, Docker, Kubernetes.
  • Monitoring: Prometheus, Grafana, OpenTelemetry.
  • Security: HashiCorp Vault, Sigstore.

Why Join Us?

  • Work on cutting-edge ML solutions integrated into commercial engineering software.
  • Define architecture that scales across global deployments.
  • Collaborate with a team of experts in ML, software engineering, and UI development.

Requirements

  • ML Frameworks: Expert in TensorFlow (TF2/Keras); experience with ONNX Runtime for inference.
  • Programming: Advanced Python for ML; strong understanding of packaging, type checking, and performance profiling.
  • Architecture: Proven experience designing scalable ML systems for production.
  • APIs: Proficiency in gRPC/Protobuf and REST for cross-language integration.
  • MLOps: CI/CD pipelines, containerization (Docker/Kubernetes), model registries, reproducibility.
  • Performance Optimization: GPU acceleration (CUDA/cuDNN), mixed precision, XLA, profiling.
  • Observability: Metrics, tracing, structured logging, dashboards.
  • Security: SBOM, image signing, role-based access, vulnerability scanning.

  • Experience with ONNX Runtime Training, PyTorch, or hybrid ML architectures.
  • Familiarity with distributed training strategies and multi-GPU setups.
  • Knowledge of feature stores and data validation frameworks.
  • Exposure to regulated environments and compliance frameworks.
