Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs
Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available. Low-resource languages include OCaml, Racket, and several others.
This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. Our approach, called MultiPL-T, translates training data from a high-resource language into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize tests for commented code from a high-resource language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate the Python code to a target low-resource language, and use the tests to validate the translation. We apply this approach to generate tens of thousands of new, validated training items for Julia, Lua, OCaml, R, and Racket. Furthermore, we use an open model (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done.
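At a high level, the two-step pipeline described above can be pictured as follows. This is an illustrative Python sketch, not the paper's implementation: the helpers llm.generate_tests, llm.translate, run_python_tests, coverage_of, and run_target_tests are hypothetical placeholders standing in for the test-generation, translation, and validation machinery.

```python
# Illustrative sketch of a MultiPL-T-style pipeline (hypothetical helpers).
def build_low_resource_items(python_functions, target_lang, llm, min_coverage=0.9):
    """Turn commented Python functions into validated training items
    for a low-resource target language."""
    items = []
    for fn in python_functions:
        # Step 1: synthesize candidate tests for the Python function.
        candidate_tests = llm.generate_tests(fn)
        # Keep only tests that actually pass against the original code.
        passing = [t for t in candidate_tests if run_python_tests(fn, [t])]
        # Drop functions whose surviving tests exercise too little code.
        if not passing or coverage_of(fn, passing) < min_coverage:
            continue

        # Step 2: translate the function and its tests to the target language.
        translated_fn = llm.translate(fn, target_lang)
        translated_tests = [llm.translate(t, target_lang) for t in passing]
        # Accept the translation only if every translated test still passes.
        if run_target_tests(target_lang, translated_fn, translated_tests):
            items.append({
                "language": target_lang,
                "function": translated_fn,
                "tests": translated_tests,
            })
    return items
```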
Using MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket that outperform other fine-tunes of these base models. We also present Racket fine-tunes for two very recent models, DeepSeekCoder and StarCoder2, to show that MultiPL-T continues to outperform other fine-tuning approaches for low-resource languages. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
Thu 24 Oct (displayed time zone: Pacific Time, US & Canada)
13:40 - 15:20

13:40 (20m) Talk: CYCLE: Learning to Self-Refine the Code Generation. OOPSLA 2024. Yangruibo Ding (Columbia University), Marcus J. Min (Columbia University), Gail Kaiser (Columbia University), Baishakhi Ray (Columbia University, New York; AWS AI Lab). DOI

14:00 (20m) Talk: Evaluating the effectiveness of Deep Learning Models for Foundational Program Analysis Tasks. OOPSLA 2024. Qian Chen (Nanjing University), Chenyang Yu (Department of Computer Science and Technology, Nanjing University), Ruyan Liu (Department of Computer Science and Technology, Nanjing University), Chi Zhang (Nanjing University), Yu Wang (Nanjing University), Ke Wang, Ting Su (East China Normal University), Linzhang Wang (Nanjing University). DOI

14:20 (20m) Talk: Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs. OOPSLA 2024. Federico Cassano (Northeastern University), John Gouwar (Northeastern University), Francesca Lucchetti (Northeastern University), Claire Schlesinger (Northeastern University), Anders Freeman (Wellesley College), Carolyn Jane Anderson (Wellesley College), Molly Q Feldman (Oberlin College), Michael Greenberg (Stevens Institute of Technology), Abhinav Jangda (Microsoft Research), Arjun Guha (Northeastern University; Roblox). DOI, Pre-print

14:40 (20m) Talk: Statically Contextualizing Large Language Models with Typed Holes. OOPSLA 2024. Andrew Blinn (University of Michigan), Xiang Li (University of Michigan, Ann Arbor), June Hyung Kim (University of Michigan), Cyrus Omar (University of Michigan). DOI

15:00 (20m) Talk: WhiteFox: White-box Compiler Fuzzing Empowered by Large Language Models. OOPSLA 2024. Chenyuan Yang (University of Illinois at Urbana-Champaign), Yinlin Deng (University of Illinois at Urbana-Champaign), Runyu Lu (Huazhong University of Science and Technology), Jiayi Yao (The Chinese University of Hong Kong, Shenzhen), Jiawei Liu (University of Illinois at Urbana-Champaign), Reyhaneh Jabbarvand (University of Illinois at Urbana-Champaign), Lingming Zhang (University of Illinois at Urbana-Champaign). DOI