Paper Title
Single-stage intake gesture detection using CTC loss and extended prefix beam search
Paper Authors
Paper Abstract
Accurate detection of individual intake gestures is a key step towards automatic dietary monitoring. Both inertial sensor data of wrist movements and video data depicting the upper body have been used for this purpose. The most advanced methods to date use a two-stage approach, in which (i) frame-level intake probabilities are learned from the sensor data using a deep neural network, and then (ii) sparse intake events are detected by finding the maxima of the frame-level probabilities. In this study, we propose a single-stage approach which directly decodes the probabilities learned from sensor data into sparse intake detections. This is achieved by weakly supervised training using Connectionist Temporal Classification (CTC) loss, and decoding using a novel extended prefix beam search decoding algorithm. Benefits of this approach include (i) end-to-end training for detections, (ii) simplified timing requirements for intake gesture labels, and (iii) improved detection performance compared to existing approaches. Across two separate datasets, we achieve relative $F_1$ score improvements between 1.9% and 6.2% over the two-stage approach for intake detection and eating/drinking detection tasks, for both video and inertial sensors.
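To make the single-stage idea concrete, below is a minimal sketch of weakly supervised CTC training on frame-level sensor features. It assumes PyTorch; the network, feature dimensions, and batch layout are illustrative stand-ins rather than the paper's implementation, and a simple greedy collapse decoder is shown in place of the paper's novel extended prefix beam search.

```python
# Minimal sketch (assumption: PyTorch). The model and data shapes are
# hypothetical; greedy decoding is a simple stand-in for the paper's
# extended prefix beam search.
import torch
import torch.nn as nn

T, N, C = 128, 4, 2   # frames per window, batch size, classes (blank, intake)
BLANK = 0

# Hypothetical frame-level model: any network emitting per-frame class scores.
model = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, C))

ctc_loss = nn.CTCLoss(blank=BLANK)

x = torch.randn(T, N, 6)                   # e.g. 6-axis inertial sensor data
log_probs = model(x).log_softmax(dim=-1)   # (T, N, C), as CTCLoss expects

# Weak labels: only the sequence of intake events per window, no frame timing.
targets = torch.tensor([1, 1, 1,  1, 1,  1, 1, 1, 1,  1])   # flattened labels
target_lengths = torch.tensor([3, 2, 4, 1])                  # events per sample
input_lengths = torch.full((N,), T, dtype=torch.long)        # frames per sample

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # end-to-end gradient for sparse detections

def greedy_decode(log_probs_1d):
    """Collapse repeated classes and drop blanks -> sparse event sequence."""
    best = log_probs_1d.argmax(dim=-1).tolist()   # best class per frame, (T,)
    events, prev = [], BLANK
    for c in best:
        if c != BLANK and c != prev:
            events.append(c)
        prev = c
    return events

print(greedy_decode(log_probs[:, 0]))   # e.g. [1, 1, 1] for 3 detected intakes
```

Note how the targets carry only the number and order of intake events in each window, not their frame-level onsets and offsets; this is the simplified timing requirement for labels that the abstract refers to, in contrast to the two-stage approach, which needs frame-level probabilities aligned to labeled gesture intervals.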