Founding Machine Learning Engineer - Evaluation

Talisman Brands, Inc.

Los Altos, United States of America

11 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Los Altos, United States of America

Tech stack

Artificial Intelligence

Data analysis

Computer Vision

Clinical Data Repository

Python

Machine Learning

Machine Learning Operations

Job description

My client is building evaluation and evidence infrastructure for safety-critical AI systems, starting with diagnostic medical imaging.

AI systems are increasingly used in settings where their outputs affect clinical decisions and patient outcomes. In medical imaging, benchmark accuracy alone is not enough. Hospitals, regulators, and clinical stakeholders need evidence that models will behave reliably across real-world deployment environments, populations, scanners, and workflows.

This role sits at the intersection of:

medical imaging AI,
model robustness and evaluation,
regulatory evidence generation,
and real-world deployment behavior.

The work is highly investigative and requires strong technical judgment, scientific reasoning, and the ability to operate effectively in ambiguous environments.

The Role

This is not a traditional "train models on benchmark datasets" ML role.

You will work directly with medical imaging companies and healthcare stakeholders to investigate how AI systems behave in practice and what evidence is required for deployment, regulatory, and clinical decisions.

You will:

Design and execute evaluations for medical imaging AI systems
Investigate model failure modes, robustness, and generalization gaps
Analyze behavior across populations, scanners, imaging protocols, and clinical settings
Determine what evidence is sufficient for stakeholders making deployment or regulatory decisions
Translate technical findings into actionable recommendations for customers and clinical stakeholders
Build reusable evaluation pipelines, evidence schemas, and model assessment frameworks
Work with messy, incomplete, and noisy real-world clinical data
Help shape how evaluation investigations are conducted across the organization

The important work is not simply running experiments. It is identifying which questions actually matter, what evidence is missing, and how to generate defensible conclusions under real-world constraints.

Requirements

Strong experience in machine learning for medical imaging (radiology, pathology, cardiac imaging, or related domains)
Experience evaluating or validating real-world ML systems, not just training models
Deep understanding of:
model robustness,
distribution shift,
uncertainty,
failure analysis,
and real-world deployment behavior
Strong Python skills across the full investigation workflow:
data analysis,
experimentation,
evaluation,
and reporting
Experience working with noisy or imperfect clinical datasets
Ability to communicate technical findings clearly to both technical and non-technical stakeholders
High tolerance for ambiguity and open-ended investigative work

Strongly Preferred:

Experience with FDA-regulated AI/ML systems, Software as a Medical Device (SaMD), or medical device submissions (510(k), De Novo, etc.)
Experience with medical imaging deployment evaluation or clinical validation
Experience with interpretability, post-deployment monitoring, uncertainty estimation, or model auditing
Experience designing reproducible evaluation frameworks or benchmarking systems
Background in healthcare AI or other safety-critical ML domains
Customer-facing or cross-functional technical leadership experience
PhD or equivalent research depth in ML, medical imaging, computer vision, or related areas

Benefits & conditions

Candidates who succeed in this role often come from backgrounds such as:

Medical imaging ML research
FDA or healthcare AI evaluation
Clinical AI validation
AI robustness and reliability research
Applied ML investigation in safety-critical environments
Healthcare-focused computer vision research

What Success Looks Like:

The strongest people in this role become experts in how medical AI systems behave in the real world.

They develop the judgment to answer questions such as:

Where are the model's true weaknesses?
Which deployment conditions introduce risk?
What concerns are real versus theoretical?
What evidence is sufficient for a hospital or regulator to trust the system?
What additional validation is required before deployment proceeds?
