Paper Title

Nonverbal Sound Detection for Disordered Speech

Paper Authors

Lea, Colin; Huang, Zifang; Jain, Dhruv; Tooley, Lauren; Liaghat, Zeinab; Thelapurath, Shrinath; Findlater, Leah; Bigham, Jeffrey P.

Paper Abstract

Voice assistants have become an essential tool for people with various disabilities because they enable complex phone- or tablet-based interactions without the need for fine-grained motor control, such as with touchscreens. However, these systems are not tuned for the unique characteristics of individuals with speech disorders, including many of those who have a motor-speech disorder, are deaf or hard of hearing, have a severe stutter, or are minimally verbal. We introduce an alternative voice-based input system which relies on sound event detection using fifteen nonverbal mouth sounds like "pop," "click," or "eh." This system was designed to work regardless of one's speech abilities and allows full access to existing technology. In this paper, we describe the design of a dataset, model considerations for real-world deployment, and efforts towards model personalization. Our fully-supervised model achieves segment-level precision and recall of 88.6% and 88.4% on an internal dataset of 710 adults, while achieving 0.31 false positives per hour on aggressors such as speech. Five-shot personalization enables satisfactory performance in 84.5% of cases where the generic model fails.
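
The abstract quotes segment-level precision/recall and false positives per hour on "aggressor" audio such as ordinary speech. Below is a minimal sketch, assuming fixed-length segments each labeled with a set of detected event classes, of how such metrics could be computed; the segment length, label names, and toy data are illustrative assumptions, not the authors' evaluation protocol.

```python
# A minimal sketch (not the paper's evaluation code) of segment-level
# precision/recall and false positives per hour for sound event detection.
from typing import List, Set


def segment_metrics(pred_segments: List[Set[str]],
                    true_segments: List[Set[str]],
                    segment_sec: float = 1.0):
    """Compare per-segment predicted vs. reference sound-event labels.

    Each element is the set of event classes active in that segment
    (e.g. {"pop"}, {"click"}, or an empty set for silence/speech).
    """
    tp = fp = fn = 0
    for pred, true in zip(pred_segments, true_segments):
        tp += len(pred & true)   # events correctly detected
        fp += len(pred - true)   # spurious detections
        fn += len(true - pred)   # missed events

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    # False positives per hour: spurious detections normalized by audio length.
    hours = len(pred_segments) * segment_sec / 3600.0
    fp_per_hour = fp / hours if hours else 0.0
    return precision, recall, fp_per_hour


if __name__ == "__main__":
    # Toy example: four one-second segments of aggressor audio (ordinary
    # speech) where the detector should stay silent but fires once.
    preds = [set(), {"pop"}, set(), set()]
    truth = [set(), set(), set(), set()]
    p, r, fph = segment_metrics(preds, truth)
    print(f"precision={p:.2f} recall={r:.2f} false positives/hour={fph:.0f}")
```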
