Paper Title
Wav2Vec2.0 on the Edge: Performance Evaluation
Paper Authors
Paper Abstract
Wav2Vec2.0 is a state-of-the-art model that learns speech representations from unlabeled speech data, i.e., through self-supervised learning. The pretrained model is then fine-tuned on small amounts of labeled data for speech-to-text and machine translation tasks. Wav2Vec2.0 is a transformative solution for low-resource languages because it is developed mainly from unlabeled audio data; obtaining large amounts of labeled data is resource-intensive and especially challenging for low-resource languages such as Swahili and Tatar. Furthermore, Wav2Vec2.0's word error rate (WER) matches or surpasses that of recent supervised learning algorithms while using 100x less labeled data. Given its importance and enormous potential for enabling speech-based tasks in the world's roughly 7,000 languages, it is essential to evaluate the accuracy, latency, and efficiency of this model on low-resource, low-power edge devices and to investigate the feasibility of using it on such devices for private, secure, and reliable speech-based tasks. On-device speech processing avoids sending audio data to a server, inherently providing privacy, reduced latency, and enhanced reliability. In this paper, the accuracy and latency of the Wav2Vec2.0 model, combined with the KenLM language model, are evaluated on a Raspberry Pi for speech recognition tasks. We also discuss how to tune certain parameters to reach the desired WER and latency while meeting the CPU, memory, and energy budgets of the product.
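As a minimal sketch of the kind of pipeline described above, the Python snippet below pairs a pretrained Wav2Vec2.0 acoustic model with a KenLM-backed CTC beam-search decoder, as one might run it on a Raspberry Pi. The Hugging Face transformers and pyctcdecode libraries, the facebook/wav2vec2-base-960h checkpoint, the lm.arpa path, and the alpha, beta, and beam_width values are illustrative assumptions, not the paper's exact setup; beam width and LM weights are examples of the parameters that can be tuned to trade WER against latency.

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from pyctcdecode import build_ctcdecoder

# Load a pretrained Wav2Vec2.0 acoustic model (checkpoint chosen for illustration only).
model_id = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
model.eval()

# Build a CTC beam-search decoder backed by a KenLM n-gram model.
# "lm.arpa" is a placeholder path; alpha (LM weight), beta (word insertion bonus),
# and beam_width below are the kind of knobs that trade WER against latency.
# Depending on the checkpoint and LM, the labels may also need normalization
# (e.g., lowercasing) to match the casing of the language-model training text.
vocab_dict = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab_dict.items(), key=lambda kv: kv[1])]
decoder = build_ctcdecoder(labels, kenlm_model_path="lm.arpa", alpha=0.5, beta=1.5)

# Transcribe a 16 kHz mono WAV file.
speech, sample_rate = sf.read("sample.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits[0].numpy()

print(decoder.decode(logits, beam_width=100))
```

Reducing beam_width speeds up decoding at the cost of some WER, which is one way to fit the CPU and latency budget of a device like the Raspberry Pi.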