Paper title
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Paper authors
Paper abstract
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
