Title
Estimating Uniqueness of I-Vector Representation of Human Voice
Authors
Abstract
We study the individuality of the human voice with respect to a widely used feature representation of speech utterances, namely, the i-vector model. As a first step toward this goal, we compare and contrast uniqueness measures proposed for different biometric modalities. Then, we introduce a new uniqueness measure that evaluates the entropy of i-vectors while taking into account speaker-level variations. Our measure operates in the discrete feature space and relies on accurate estimation of the distribution of i-vectors. Therefore, i-vectors are quantized while ensuring that the quantized and original representations yield similar speaker verification performance. Uniqueness estimates are obtained from two newly generated datasets and the public VoxCeleb dataset. The first custom dataset contains more than one and a half million speech samples of 20,741 speakers obtained from TEDx Talks videos. The second includes over twenty-one thousand speech samples extracted from the movie dialogues of 1,595 actors. Using these data, we analyze how several factors, such as the number of speakers, the number of samples per speaker, sample duration, and the diversity of utterances, affect uniqueness estimates. Most notably, we determine that the discretization of i-vectors does not cause a reduction in speaker recognition performance. Our results show that the degree of distinctiveness offered by the i-vector-based representation may reach 43-70 bits for 5-second long speech samples; however, under less constrained variations in speech, uniqueness estimates are found to decrease by around 30 bits. We also find that doubling the sample duration increases the distinctiveness of the i-vector representation by around 20 bits.
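The core idea of the measure described above, entropy computed over a quantized feature space, can be illustrated with a minimal sketch. This is not the paper's actual method: the per-dimension uniform quantizer, bin count, and toy data below are illustrative assumptions, and the paper's measure additionally accounts for speaker-level variation, which is not modeled here.

```python
import numpy as np

def quantize(vectors, n_bins=4):
    """Uniformly quantize each dimension of real-valued vectors into n_bins levels
    (an assumed quantizer for illustration, not the paper's)."""
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    scaled = (vectors - lo) / np.where(hi > lo, hi - lo, 1.0)
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def empirical_entropy_bits(codes):
    """Shannon entropy (in bits) of the empirical distribution over discrete codes."""
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Toy stand-in for i-vectors (real i-vectors are typically a few hundred dims).
rng = np.random.default_rng(0)
ivecs = rng.standard_normal((1000, 4))
codes = quantize(ivecs, n_bins=4)
print(empirical_entropy_bits(codes))  # bounded above by log2(1000) ≈ 9.97 bits
```

With N samples the empirical entropy is capped at log2(N), which is why the paper's estimates depend on the number of speakers and samples available; accurate distribution estimation is the limiting factor the abstract alludes to.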