Paper Title

Simple and Effective Gradient-Based Tuning of Sequence-to-Sequence Models

Paper Authors

Jared Lichtarge, Chris Alberti, Shankar Kumar

Paper Abstract

Recent trends towards training ever-larger language models have substantially improved machine learning performance across linguistic tasks. However, the huge cost of training larger models can make tuning them prohibitively expensive, motivating the study of more efficient methods. Gradient-based hyper-parameter optimization offers the capacity to tune hyper-parameters during training, yet has not previously been studied in a sequence-to-sequence setting. We apply a simple and general gradient-based hyperparameter optimization method to sequence-to-sequence tasks for the first time, demonstrating both efficiency and performance gains over strong baselines for both Neural Machine Translation and Natural Language Understanding (NLU) tasks (via T5 pretraining). For translation, we show the method generalizes across language pairs, is more efficient than Bayesian hyper-parameter optimization, and that learned schedules for some hyper-parameters can out-perform even optimal constant-valued tuning. For T5, we show that learning hyper-parameters during pretraining can improve performance across downstream NLU tasks. When learning multiple hyper-parameters concurrently, we show that the global learning rate can follow a schedule over training that improves performance and is not explainable by the `short-horizon bias' of greedy methods \citep{wu2018}. We release the code used to facilitate further research.
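As a concrete illustration of the kind of gradient-based hyper-parameter tuning the abstract describes, below is a minimal, self-contained sketch that adapts a single hyper-parameter (the global learning rate) during training on a toy regression problem. This is not the paper's implementation: it uses the simple hypergradient-descent rule of Baydin et al. (2018), and the toy data, the hyper learning rate `beta`, and the step count are illustrative assumptions.

```python
# Minimal sketch of gradient-based hyper-parameter tuning: the learning rate is
# itself updated during training using a gradient signal (hypergradient descent,
# Baydin et al. 2018). Toy linear regression; values are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X @ w_true + noise
X = rng.normal(size=(256, 8))
w_true = rng.normal(size=8)
y = X @ w_true + 0.01 * rng.normal(size=256)

w = np.zeros(8)                # model parameters
lr = 1e-3                      # hyper-parameter tuned online
beta = 1e-4                    # hyper learning rate (assumed value)
prev_grad = np.zeros_like(w)

def grad(w):
    # Gradient of mean-squared error with respect to w
    return 2.0 / len(y) * X.T @ (X @ w - y)

for step in range(500):
    g = grad(w)
    # d(loss_t)/d(lr_{t-1}) = -g_t . g_{t-1}, so gradient descent on the
    # learning rate raises lr when consecutive gradients align and lowers
    # it when they oppose each other.
    lr += beta * float(g @ prev_grad)
    w -= lr * g
    prev_grad = g

print("final loss:", float(np.mean((X @ w - y) ** 2)), "final lr:", lr)
```

The printed learning rate traces out a schedule over training rather than staying at its initial constant value, which is the general behavior the abstract refers to when it reports that learned schedules can outperform optimal constant-valued tuning.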
