Paper Title
Learn an Effective Lip Reading Model without Pains
Paper Authors
Paper Abstract
Lip reading, also known as visual speech recognition, aims to recognize the speech content in videos by analyzing lip dynamics. Appealing progress has been made in recent years, benefiting much from rapidly developed deep learning techniques and recent large-scale lip-reading datasets. Most existing methods obtain high performance by constructing a complex neural network together with several customized training strategies, which are often described only briefly or shown only in the source code. We find that proper use of these strategies can consistently bring exciting improvements without changing much of the model. Considering the non-negligible effects of these strategies and the current difficulty of training an effective lip reading model, we perform, for the first time, a comprehensive quantitative study and comparative analysis of the effects of several different choices for lip reading. By introducing only a few easy-to-apply refinements to the baseline pipeline, we improve performance from 83.7% to 88.4% and from 38.2% to 55.7% on the two largest publicly available lip reading datasets, LRW and LRW-1000, respectively. These results are comparable to, and even surpass, the existing state-of-the-art.
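To make the "baseline pipeline" mentioned in the abstract concrete, below is a minimal sketch of a typical word-level lip-reading model (a 3D convolutional front-end, a per-frame 2D ResNet trunk, and a bidirectional GRU back-end), assuming PyTorch and torchvision. The layer sizes, the ResNet-18 trunk, and the input shape (29 grayscale 88x88 mouth crops, 500 LRW word classes) are illustrative assumptions, not the authors' exact architecture or training recipe.

# Minimal sketch of a word-level lip-reading baseline: 3D-conv front-end,
# frame-wise 2D ResNet trunk, BiGRU temporal back-end, average pooling over time.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision


class LipReadingBaseline(nn.Module):
    def __init__(self, num_classes=500):  # LRW defines 500 word classes
        super().__init__()
        # Spatio-temporal front-end: one 3D conv over grayscale mouth-crop clips.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Per-frame 2D trunk: ResNet-18 without its first conv/pool and classifier,
        # so it accepts the 64-channel feature maps produced by the front-end.
        resnet = torchvision.models.resnet18(weights=None)
        self.trunk = nn.Sequential(*list(resnet.children())[4:-1])  # layer1..layer4 + avgpool
        # Temporal back-end: bidirectional GRU over the per-frame 512-d features.
        self.backend = nn.GRU(512, 256, num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        # x: (batch, 1, frames, height, width), e.g. (B, 1, 29, 88, 88) for LRW clips.
        b = x.size(0)
        x = self.frontend(x)                  # (B, 64, T, H', W')
        x = x.transpose(1, 2).flatten(0, 1)   # (B*T, 64, H', W'): run the 2D trunk per frame
        x = self.trunk(x).flatten(1)          # (B*T, 512)
        x = x.view(b, -1, 512)                # (B, T, 512)
        x, _ = self.backend(x)                # (B, T, 512)
        return self.classifier(x.mean(dim=1))  # average over time, then classify


if __name__ == "__main__":
    model = LipReadingBaseline(num_classes=500)
    clip = torch.randn(2, 1, 29, 88, 88)      # two dummy 29-frame mouth-crop clips
    print(model(clip).shape)                   # torch.Size([2, 500])

The refinements studied in the paper (data and training choices layered on top of such a baseline) would plug into the training loop around a model like this rather than change its structure.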