Paper Title
Multimodal and self-supervised representation learning for automatic gesture recognition in surgical robotics
Paper Authors
Paper Abstract
Self-supervised, multi-modal learning has been successful in producing holistic representations of complex scenarios. This can be useful for consolidating information from multiple modalities that have multiple, versatile uses. Its application in surgical robotics can simultaneously develop a generalised machine understanding of the surgical process and reduce the dependency on high-quality expert annotations, which are generally difficult to obtain. We develop a self-supervised, multi-modal representation learning paradigm that learns representations for surgical gestures from video and kinematics. We use an encoder-decoder network configuration that encodes representations from surgical videos and decodes them to yield kinematics. We quantitatively demonstrate the efficacy of our learnt representations for gesture recognition (with accuracy between 69.6% and 77.8%), transfer learning across multiple tasks (with accuracy between 44.6% and 64.8%), and surgeon skill classification (with accuracy between 76.8% and 81.2%). Further, we qualitatively demonstrate that our self-supervised representations cluster according to semantically meaningful properties (surgeon skill and gestures).
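The core technical idea in the abstract is a cross-modal encoder-decoder: video is encoded into a latent representation that is decoded into kinematics, so the model can be trained without gesture labels. Below is a minimal PyTorch sketch of that idea; the module names, layer sizes, kinematics dimensionality, and the CNN-encoder-plus-LSTM-decoder choice are illustrative assumptions, not the authors' actual architecture.

```python
# Illustrative sketch of a self-supervised video-to-kinematics encoder-decoder.
# All names, dimensions, and layer choices are assumptions for demonstration only.
import torch
import torch.nn as nn


class VideoEncoder(nn.Module):
    """Encodes a clip of video frames into a per-frame latent representation."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(  # small per-frame 2D CNN (assumed)
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.project = nn.Linear(64, latent_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b * t, c, h, w)).flatten(1)  # (b*t, 64)
        return self.project(feats).reshape(b, t, -1)                      # (b, t, latent_dim)


class KinematicsDecoder(nn.Module):
    """Decodes the latent sequence into a kinematics sequence (e.g. tool poses)."""

    def __init__(self, latent_dim: int = 128, kin_dim: int = 14):
        super().__init__()
        self.rnn = nn.LSTM(latent_dim, 64, batch_first=True)
        self.head = nn.Linear(64, kin_dim)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(latents)
        return self.head(out)  # (b, t, kin_dim)


if __name__ == "__main__":
    encoder, decoder = VideoEncoder(), KinematicsDecoder()
    clip = torch.randn(2, 16, 3, 64, 64)   # dummy batch: 2 clips of 16 frames
    target_kin = torch.randn(2, 16, 14)    # dummy kinematics targets
    pred_kin = decoder(encoder(clip))
    # Self-supervision signal: regress kinematics from video, no gesture labels needed.
    loss = nn.functional.mse_loss(pred_kin, target_kin)
    loss.backward()
    print(pred_kin.shape, float(loss))
```

In such a setup, the encoder's latent sequence is what would later be reused as the learnt representation for downstream gesture recognition or skill classification; the kinematics regression loss serves only as the self-supervised training signal.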