Paper Title

End-to-End Learning of Speech 2D Feature-Trajectory for Prosthetic Hands

Paper Authors

Mohsen Jafarzadeh, Yonas Tadesse

Paper Abstract

Speech is one of the most common forms of communication in humans. Speech commands are an essential part of multimodal control of prosthetic hands. In past decades, researchers used automatic speech recognition systems to control prosthetic hands with speech commands. Automatic speech recognition systems learn how to map human speech to text; natural language processing or a look-up table then maps the estimated text to a trajectory. However, the performance of conventional speech-controlled prosthetic hands is still unsatisfactory. Recent advances in general-purpose graphics processing units (GPGPUs) enable intelligent devices to run deep neural networks in real time. Thus, architectures of intelligent systems have rapidly shifted from the paradigm of optimizing composite subsystems to the paradigm of end-to-end optimization. In this paper, we propose an end-to-end convolutional neural network (CNN) that maps speech 2D features directly to trajectories for prosthetic hands. The proposed convolutional neural network is lightweight, and thus it runs in real time on an embedded GPGPU. The proposed method can use any type of speech 2D feature that has local correlations in each dimension, such as spectrograms, MFCCs, or PNCCs. We omit the speech-to-text step in controlling the prosthetic hand in this paper. The network is written in Python with the Keras library on a TensorFlow backend. We optimized the CNN for the NVIDIA Jetson TX2 developer kit. Our experiments on this CNN demonstrate a root-mean-square error of 0.119 and a 20 ms running time to produce trajectory outputs corresponding to the voice input data. To achieve a lower error in real time, a similar CNN can be optimized for a more powerful embedded GPGPU such as the NVIDIA AGX Xavier.
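The abstract names the spectrogram as one example of a speech 2D feature with local correlations in both time and frequency. As a minimal, hedged illustration (not the paper's actual preprocessing pipeline, whose frame length, hop size, and windowing are not given here), a log-magnitude spectrogram of the kind such a CNN could consume can be sketched with NumPy alone; the frame and hop lengths below are assumed values for demonstration:

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Compute a log-magnitude spectrogram: a 2D (time x frequency) feature.

    frame_len and hop are illustrative choices, not values from the paper.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Window each frame to reduce spectral leakage, then take the
        # magnitude of the one-sided FFT and compress with a log.
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        mag = np.abs(np.fft.rfft(frame))
        frames.append(np.log(mag + 1e-8))
    # Rows index time frames; columns index frequency bins.
    return np.stack(frames)

# Example: 1 second of a synthetic 440 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
spec = log_spectrogram(x)
print(spec.shape)  # (124, 129): 124 frames, 129 frequency bins
```

A 2D array like this, where neighboring cells are correlated along both axes, is exactly the kind of input the convolutional layers of the proposed end-to-end network are suited to, in place of a speech-to-text stage.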
