Paper Title
Training a Tokenizer for Free with Private Federated Learning
Paper Authors
Paper Abstract
Federated learning with differential privacy, i.e., private federated learning (PFL), makes it possible to train models on private data distributed across users' devices without harming privacy. PFL is efficient for models, such as neural networks, that have a fixed number of parameters, and thus a fixed-dimensional gradient vector. Such models include neural-net language models, but not tokenizers, the topic of this work. Training a tokenizer requires frequencies of words from an unlimited vocabulary, and existing methods for finding an unlimited vocabulary need a separate privacy budget. A workaround is to train the tokenizer on publicly available data. However, in this paper we first show that a tokenizer trained on mismatched data results in worse model performance compared to a privacy-violating "oracle" tokenizer that accesses user data, with perplexity increasing by 20%. We also show that sub-word tokenizers are better suited to the federated context than word-level ones, since they can encode new words, though with more tokens per word. Second, we propose a novel method to obtain a tokenizer without using any additional privacy budget. During private federated learning of the language model, we sample from the model, train a new tokenizer on the sampled sequences, and update the model embeddings. We then continue private federated learning, and obtain performance within 1% of the "oracle" tokenizer. Since this process trains the tokenizer only indirectly on private data, we can use the "post-processing guarantee" of differential privacy and thus use no additional privacy budget.
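The tokenizer-refresh step the abstract describes (sample sequences from the DP-trained language model, train a new tokenizer on those samples, remap the embeddings, then resume PFL) can be sketched as follows. This is a minimal sketch, assuming the Hugging Face `tokenizers` library and PyTorch; the function names, the BPE model with `vocab_size=8000`, and the embedding-remap heuristic (copy the old vector when a token's surface form already existed, otherwise average the old sub-token vectors) are illustrative assumptions, not the paper's exact procedure.

```
import torch
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace


def train_new_tokenizer(sampled_texts, vocab_size=8000):
    """Train a fresh BPE tokenizer on sequences sampled from the
    DP-trained language model; the samples, not raw user data,
    are the training corpus (hypothetical vocab_size)."""
    tok = Tokenizer(BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train_from_iterator(sampled_texts, trainer=trainer)
    return tok


def remap_embeddings(old_emb, old_tok, new_tok, dim):
    """One plausible way to update the model embeddings for the new
    vocabulary: reuse the old vector when the token string already had
    an id; otherwise average the old sub-token vectors, falling back
    to a small random init."""
    new_vocab = new_tok.get_vocab()  # token string -> new id
    new_emb = torch.empty(len(new_vocab), dim)
    torch.nn.init.normal_(new_emb, mean=0.0, std=0.02)
    for token, new_id in new_vocab.items():
        old_id = old_tok.token_to_id(token)
        if old_id is not None:
            new_emb[new_id] = old_emb[old_id]  # surface form unchanged
        else:
            pieces = old_tok.encode(token).ids  # decompose under old vocab
            if pieces:
                new_emb[new_id] = old_emb[pieces].mean(dim=0)
    return new_emb


# Hypothetical usage: `sampled_texts` would come from decoding sequences
# sampled from the current DP-trained language model, e.g.
#   sampled_texts = [decode(sample_from_model(lm)) for _ in range(100_000)]
# after which training continues with the remapped embedding table.
```

Because the new tokenizer sees only sequences decoded from the DP-trained model, never raw user text, the post-processing property of differential privacy applies, which is why this step consumes no additional privacy budget.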