Paper Title

Prior Art Search and Reranking for Generated Patent Text

Paper Authors

Jieh-Sheng Lee, Jieh Hsiang

Paper Abstract

Generative models, such as GPT-2, have demonstrated impressive results recently. A fundamental question we would like to address is: where did the generated text come from? This work is our initial effort toward answering the question by using prior art search. The purpose of the prior art search is to find the most similar prior text in the training data of GPT-2. We take a reranking approach and apply it to the patent domain. Specifically, we pre-train GPT-2 models from scratch by using patent data from the USPTO. The input for the prior art search is the patent text generated by the GPT-2 model. We also pre-train BERT models from scratch for converting patent text to embeddings. The steps of reranking are: (1) search for the most similar text in the training data of GPT-2 by taking a bag-of-words ranking approach (BM25), (2) convert the search results in text format to BERT embeddings, and (3) provide the final result by ranking the BERT embeddings based on their similarities with the patent text generated by GPT-2. The experiments in this work show that such reranking is better than ranking with embeddings alone. However, our mixed results also indicate that calculating semantic similarity among long text spans is still challenging. To our knowledge, this work is the first to implement a reranking system to retrospectively identify the most similar inputs to a GPT model based on its output.
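
The three-step pipeline in the abstract is straightforward to prototype. Below is a minimal sketch of BM25 retrieval followed by BERT-embedding reranking; the toy corpus, the `rank_bm25` library, the `bert-base-uncased` checkpoint (the authors pre-train BERT from scratch on patent data instead), and the mean-pooling step are all illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the abstract's reranking pipeline, under assumed components:
# rank_bm25 for step 1, an off-the-shelf BERT and mean pooling for steps 2-3.
import numpy as np
import torch
from rank_bm25 import BM25Okapi                     # bag-of-words ranker (step 1)
from transformers import AutoTokenizer, AutoModel   # BERT encoder (steps 2-3)

def embed(texts, tokenizer, model):
    """Convert texts to fixed-size vectors by mean-pooling BERT outputs
    (an assumption; the paper may pool differently)."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state        # (batch, seq, hidden)
    mask = enc["attention_mask"].unsqueeze(-1)      # ignore padding positions
    return ((out * mask).sum(1) / mask.sum(1)).numpy()

# Toy stand-in for the GPT-2 training corpus (the paper uses USPTO patent text).
corpus = ["A method for encrypting data in transit ...",
          "An apparatus for wireless charging of devices ...",
          "A system for automated gene sequencing ..."]
# Query: text generated by the GPT-2 model whose provenance we want to trace.
query = "A method for securely transmitting encrypted data ..."

# Step 1: BM25 retrieves the top-k most similar training texts.
bm25 = BM25Okapi([doc.split() for doc in corpus])
scores = bm25.get_scores(query.split())
top_k = np.argsort(scores)[::-1][:2]
candidates = [corpus[i] for i in top_k]

# Steps 2-3: embed candidates and query, then rerank by cosine similarity.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased")
cand_vecs = embed(candidates, tok, bert)
q_vec = embed([query], tok, bert)[0]
cos = cand_vecs @ q_vec / (np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(q_vec))
reranked = [candidates[i] for i in np.argsort(cos)[::-1]]
print(reranked[0])   # most similar prior text under the reranking
```

Swapping in the from-scratch patent GPT-2/BERT checkpoints and the full USPTO training corpus would turn this sketch into the setup the abstract describes.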
