Paper title
CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with Generative Adversarial Networks
Paper authors
Paper abstract
How can deep neural networks encode information that corresponds to words in human speech into raw acoustic data? This paper proposes two neural network architectures for modeling unsupervised lexical learning from raw acoustic inputs, ciwGAN (Categorical InfoWaveGAN) and fiwGAN (Featural InfoWaveGAN), that combine a Deep Convolutional GAN architecture for audio data (WaveGAN; arXiv:1705.07904) with an information-theoretic extension of GAN -- InfoGAN (arXiv:1606.03657), and propose a new latent space structure that can model featural learning simultaneously with a higher-level classification and allows for a very low-dimensional vector representation of lexical items. Lexical learning is modeled as emergent from an architecture that forces a deep neural network to output data such that unique information is retrievable from its acoustic outputs. The networks trained on lexical items from TIMIT learn to encode unique information corresponding to lexical items in the form of categorical variables in their latent space. By manipulating these variables, the network outputs specific lexical items. The network occasionally outputs innovative lexical items that violate training data, but are linguistically interpretable and highly informative for cognitive modeling and neural network interpretability. Innovative outputs suggest that phonetic and phonological representations learned by the network can be productively recombined and directly paralleled to productivity in human speech: a fiwGAN network trained on `suit' and `dark' outputs innovative `start', even though it never saw `start' or even a [st] sequence in the training data. We also argue that setting latent featural codes to values well beyond the training range results in almost categorical generation of prototypical lexical items and reveals underlying values of each latent code.
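The latent-space structure the abstract describes (a categorical one-hot code in ciwGAN, a binary featural code in fiwGAN, each concatenated with uniform noise and fed to a WaveGAN-style generator) can be sketched as below. This is a minimal illustration, not the paper's implementation: the dimensions, function names, and the `scale` parameter (mimicking "setting latent featural codes to values well beyond training range") are assumptions chosen for clarity.

```python
import numpy as np

def ciwgan_latent(n_classes, noise_dim, class_idx, rng):
    """ciwGAN-style latent: one-hot categorical code + uniform noise.

    Illustrative sketch: one latent one-hot slot per lexical item,
    concatenated with noise z ~ U(-1, 1).
    """
    code = np.zeros(n_classes)
    code[class_idx] = 1.0  # categorical variable encoding one lexical item
    z = rng.uniform(-1.0, 1.0, noise_dim)
    return np.concatenate([code, z])

def fiwgan_latent(feature_bits, noise_dim, rng, scale=1.0):
    """fiwGAN-style latent: binary featural code + uniform noise.

    Each bit is an independent feature, so n bits can represent up to
    2**n lexical classes. scale > 1 pushes the code beyond the binary
    training range, which the paper argues yields near-categorical,
    prototypical outputs.
    """
    z = rng.uniform(-1.0, 1.0, noise_dim)
    return np.concatenate([scale * np.asarray(feature_bits, dtype=float), z])

rng = np.random.default_rng(0)
# 10-way categorical code + 90-dim noise -> 100-dim latent (assumed sizes)
lat_c = ciwgan_latent(10, 90, class_idx=3, rng=rng)
# 3 featural bits (up to 8 classes) + 97-dim noise, code scaled to 5x
lat_f = fiwgan_latent([1.0, 0.0, 1.0], 97, rng, scale=5.0)
```

In the full models, an auxiliary Q-network is trained jointly with the generator to recover the code from the generated waveform, which is the pressure that forces the network to encode retrievable lexical information in these variables.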