Data scientist
Role details
Job description
Vision-Language-Action (VLA) models and diffusion policies currently define the state of the art in robot learning, yet they typically rely on 2D inputs that limit spatial reasoning. This research explores the potential of integrating advanced 3D foundation models, such as VGGT, directly into VLA-based diffusion architectures. By leveraging pre-trained 3D knowledge, capable of reasoning over non-overlapping views and complex geometry, the project aims to move beyond the limitations of standard RGB-only approaches. The core challenge is to effectively fuse these rich spatial modalities to achieve faster learning and superior generalization on a physical robot arm.
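To give a flavour of the kind of fusion the project is about, the sketch below shows one possible way to condition a diffusion action head on both 2D vision-language tokens and 3D geometry tokens via cross-attention. It is a minimal illustration, not the project's actual architecture: the encoders, token widths, and action dimensionality are placeholder assumptions rather than the real VGGT or VLA interfaces.

```python
# Minimal sketch (PyTorch, assumed stack): fuse 2D VLA tokens with 3D geometry
# tokens, then predict the denoising direction for a diffusion action head.
# All dimensions and encoders below are hypothetical placeholders.
import torch
import torch.nn as nn


class SpatialFusionPolicy(nn.Module):
    """Cross-attention fusion of 2D and 3D tokens conditioning a diffusion head."""

    def __init__(self, d_model=256, action_dim=7, n_heads=8):
        super().__init__()
        self.proj_2d = nn.Linear(768, d_model)   # assumed VLA token width
        self.proj_3d = nn.Linear(1024, d_model)  # assumed 3D-model token width
        self.fuse = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Embedding of the diffusion timestep.
        self.t_embed = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        # Denoising head: noisy action + fused context + timestep -> predicted noise.
        self.head = nn.Sequential(
            nn.Linear(action_dim + 2 * d_model, d_model), nn.SiLU(),
            nn.Linear(d_model, action_dim),
        )

    def forward(self, tokens_2d, tokens_3d, noisy_action, t):
        q = self.proj_2d(tokens_2d)               # (B, N2, d)
        kv = self.proj_3d(tokens_3d)              # (B, N3, d)
        fused, _ = self.fuse(q, kv, kv)           # 2D tokens attend to 3D geometry
        ctx = fused.mean(dim=1)                   # (B, d) pooled spatial context
        t_emb = self.t_embed(t.unsqueeze(-1))     # (B, d)
        return self.head(torch.cat([noisy_action, ctx, t_emb], dim=-1))


if __name__ == "__main__":
    policy = SpatialFusionPolicy()
    eps_hat = policy(
        torch.randn(2, 64, 768),    # placeholder VLA tokens
        torch.randn(2, 128, 1024),  # placeholder 3D geometry tokens
        torch.randn(2, 7),          # noisy action sample
        torch.rand(2),              # diffusion timestep in [0, 1]
    )
    print(eps_hat.shape)            # torch.Size([2, 7])
```

In practice the 3D encoder would likely stay frozen, with only the projection, fusion, and diffusion layers trained on robot data; how exactly the modalities are fused is one of the open questions of the project.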
Requirements
- Have an MSc degree in Artificial Intelligence, Computer Science, or Robotics
- Have strong programming skills
- Have experience with deep learning
- Have affinity with natural language processing or 3D computer vision
- Have a profound interest in human-technology interaction
- Embrace open science principles and modern communication tools