Principal Machine Learning Engineer - Production Systems

SoftInWay Inc.

Weston-Super-Mare, United Kingdom

2 months ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Weston-Super-Mare, United Kingdom

Tech stack

.NET

API

Automation of Tests

C Sharp (Programming Language)

Profiling

Nvidia CUDA

Computer Programming

Continuous Integration

Data Validation

DevOps

Distributed Computing Environment

Memory Management

GitHub

Protocol Buffers

Design of User Interfaces

Interoperability

Python

Key Management

Machine Learning

Performance Tuning

TensorFlow

Prometheus

Software Engineering

Data Logging

PyTorch

Delivery Pipeline

Grafana

Keras

FastAPI

Containerization

GitLab CI

Kubernetes

ONNX (Open Neural Network Exchange) Format

HashiCorp

Machine Learning Operations

Software Version Control

API Management

Docker

Vulnerability Analysis

Job description

We are seeking a highly experienced ML Systems Architect to design and implement a scalable, production-grade architecture for our machine learning solver. This role bridges research prototypes and commercial deployment, ensuring reliability, maintainability, and performance in a mixed technology stack.

Architect the ML Solver Platform:

Define modular architecture for data preprocessing, model execution, and post-processing.
Establish clear API contracts between Python/TensorFlow and C# services.

Productionize ML Workflows:

Convert research code into robust, testable, and observable services.
Implement CI/CD pipelines, automated testing, and reproducibility standards.

Integration & Interoperability:

Design REST/gRPC endpoints for cross-language communication.
Ensure compatibility with C#/.NET services.

Performance & Scalability:

Optimize GPU/CPU utilization, batching strategies, and memory management.
Plan for multi-model and multi-tenant scenarios.

MLOps & Lifecycle Management:

Implement model versioning, artifact registries, and deployment workflows.
Set up monitoring, logging, and alerting for solver performance.

Security & Compliance:

Apply best practices for secrets management, dependency scanning, and secure artifact storage.

ML: TensorFlow, ONNX Runtime, tf2onnx.
APIs: FastAPI, gRPC.
DevOps: GitLab CI/GitHub Actions, Docker, Kubernetes.
Monitoring: Prometheus, Grafana, OpenTelemetry.
Security: HashiCorp Vault, Sigstore.

Why Join Us?

Work on cutting-edge ML solutions integrated into commercial engineering software.
Define architecture that scales across global deployments.
Collaborate with a team of experts in ML, software engineering, and UI development.

Requirements

ML Frameworks: Expert in TensorFlow (TF2/Keras); experience with ONNX Runtime for inference.
Programming: Advanced Python for ML; strong understanding of packaging, type checking, and performance profiling.
Architecture: Proven experience designing scalable ML systems for production.
APIs: Proficiency in gRPC/Protobuf and REST for cross-language integration.
MLOps: CI/CD pipelines, containerization (Docker/Kubernetes), model registries, reproducibility.
Performance Optimization: GPU acceleration (CUDA/cuDNN), mixed precision, XLA, profiling.
Observability: Metrics, tracing, structured logging, dashboards.
Security: SBOM, image signing, role-based access, vulnerability scanning.

Experience with ONNX Runtime Training, PyTorch, or hybrid ML architectures.
Familiarity with distributed training strategies and multi-GPU setups.
Knowledge of feature stores and data validation frameworks.
Exposure to regulated environments and compliance frameworks.
