
Talk 13: Understanding and Improving LLM Training: Insights into Adam and Advent of Adam-mini
2024/10/21


Speaker: Ruoyu Sun (The Chinese University of Hong Kong)


Title: Understanding and Improving LLM Training: Insights into Adam and Advent of Adam-mini


Abstract: Adam is the default algorithm for training large foundation models. In this talk, we aim to understand why Adam outperforms SGD on training large foundation models, and we propose a memory-efficient alternative called Adam-mini. First, we show that the original version of Adam does converge; we also explain that earlier work on the non-convergence of Adam used an unusual notion of convergence. Second, we provide an explanation for the failure of SGD on Transformers: (i) Transformers are "heterogeneous": the Hessian spectrum varies dramatically across parameter blocks; (ii) heterogeneity hampers SGD: SGD performs poorly on problems with block heterogeneity. Third, motivated by this finding, we introduce Adam-mini, which partitions the parameters according to the Hessian structure and assigns a single second-moment term to all weights in a block. We show empirically that Adam-mini saves 45-50% of Adam's memory without compromising performance on various models, including 8B-parameter language models and ViT.
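The core idea in the abstract, keeping a per-parameter first moment as in Adam but sharing a single scalar second moment across each parameter block, can be sketched as follows. This is a minimal NumPy illustration of the concept only; the function name, state layout, and blocking scheme are assumptions, not the paper's actual implementation (which partitions blocks according to the Hessian structure).

```python
import numpy as np

def adam_mini_step(params, grads, m_state, v_state, lr=1e-3,
                   beta1=0.9, beta2=0.999, eps=1e-8, t=1):
    """One hypothetical Adam-mini-style step (illustrative sketch).

    params, grads, m_state: dicts mapping block name -> array.
    v_state: dict mapping block name -> a single scalar per block,
    which is where the memory saving over Adam comes from.
    """
    for name in params:
        g = grads[name]
        # First moment: per-parameter, exactly as in Adam.
        m_state[name] = beta1 * m_state[name] + (1 - beta1) * g
        # Second moment: one scalar per block, updated with the
        # mean squared gradient over the whole block.
        v_state[name] = beta2 * v_state[name] + (1 - beta2) * np.mean(g * g)
        # Standard bias correction, then the Adam-style update.
        m_hat = m_state[name] / (1 - beta1 ** t)
        v_hat = v_state[name] / (1 - beta2 ** t)
        params[name] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params
```

Because `v_state` stores one scalar per block instead of one value per weight, the second-moment state shrinks from the size of the model to the number of blocks, which is the intuition behind the reported 45-50% memory saving.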
