YuE: Scaling Open Foundation Models for Long-Form Music Generation
- Ruibin Yuan,
- Hanfeng Lin,
- Shuyue Guo,
- Ge Zhang,
- Jiahao Pan,
- Yongyi Zang,
- Haohe Liu,
- Yiming Liang,
- Wenye Ma,
- Xingjian Du,
- Xinrun Du,
- Zhen Ye,
- Tianyu Zheng,
- Yinghao Ma,
- Minghao Liu,
- Zeyue Tian,
- Ziya Zhou,
- Liumeng Xue,
- Xingwei Qu,
- Yizhi Li,
- Shangda Wu,
- Tianhao Shen,
- Ziyang Ma,
- Junlin Zhan,
- Chunhui Wang,
- Yatian Wang,
- Xiaowei Chi,
- Xinyue Zhang,
- Zhen Yang,
- Xiangzhou Wang,
- Shansong Liu,
- Ling Mei,
- Pengfei Li,
- Junjie Wang,
- Jianwei Yu,
- Guojian Pang,
- Xu Li,
- Zihao Wang,
- Xiaohuan Zhou,
- Lijun Yu,
- Emmanouil Benetos,
- Yong Chen,
- Chenghua Lin,
- Xie Chen,
- Gus G. Xia,
- Zhaoxiang Zhang,
- Chao Zhang,
- Wenhu Chen,
- Xinyu Zhou,
- Xipeng Qiu,
- Roger Dannenberg,
- Jiaheng Liu,
- Jian Yang,
- Wenhao Huang,
- Wei Xue,
- Xu Tan,
- Yike Guo
ICLR 2026
We tackle the task of long-form music generation, in particular the challenging **lyrics-to-song** problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe that promotes convergence and generalization. We also redesign in-context learning for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some proprietary systems in musicality and vocal agility. Fine-tuning YuE further enables additional controls and enhanced support for tail languages. Beyond generation, we show that YuE's learned representations perform well on music understanding tasks, matching or exceeding state-of-the-art methods on the MARBLE benchmark.

Keywords: lyrics2song, song generation, long-form, foundation model, music generation
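To make the track-decoupled next-token prediction idea concrete, here is a minimal sketch (our own illustration under assumed conventions, not YuE's released code): rather than modeling the dense stereo mixture as a single token stream, the vocal and accompaniment tracks are tokenized separately and their per-frame tokens are interleaved, so a standard decoder-only LM still trains with ordinary next-token prediction while each track remains individually recoverable. The helper names `interleave_tracks` and `split_tracks` are hypothetical.

```python
# Hypothetical sketch of track-decoupled next-token prediction, not YuE's
# actual code. Assumes each track has already been converted to discrete
# audio-codec tokens at the same frame rate.
from typing import List, Tuple


def interleave_tracks(vocal: List[int], accomp: List[int]) -> List[int]:
    """Interleave per-frame vocal and accompaniment tokens into one stream.

    Frame t contributes the pair (vocal[t], accomp[t]), so a decoder-only
    LM can be trained with plain next-token prediction on the flat
    sequence while each track stays separately recoverable.
    """
    assert len(vocal) == len(accomp), "tracks must share a frame rate"
    mixed: List[int] = []
    for v, a in zip(vocal, accomp):
        mixed.extend((v, a))
    return mixed


def split_tracks(mixed: List[int]) -> Tuple[List[int], List[int]]:
    """Invert interleave_tracks: even positions are vocal, odd accompaniment."""
    return mixed[0::2], mixed[1::2]


# Usage with two toy 4-frame token streams.
vocal = [101, 102, 103, 104]
accomp = [201, 202, 203, 204]
seq = interleave_tracks(vocal, accomp)  # [101, 201, 102, 202, ...]
assert split_tracks(seq) == (vocal, accomp)
```

The design point this illustrates is that decoupling happens at the token level, not the model level: one autoregressive backbone sees both tracks, but the fixed interleaving lets generated output be deterministically split back into separate vocal and accompaniment streams.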