Thu 24 Oct 2024, 14:20 - 14:40, at IBR East - Machine Learning and Programming Languages. Chair(s): Loris D'Antoni

Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available. Low-resource languages include OCaml, Racket, and several others.

This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. Our approach, called MultiPL-T, translates training data from high-resource languages into training data for low-resource languages in the following way. (1) We use a Code LLM to synthesize tests for commented code from a high-resource language, filtering out faulty tests and code with low test coverage. (2) We use a Code LLM to translate Python code to a target low-resource language, and use tests to validate the translation. We apply this approach to generate tens of thousands of new, validated training items for Julia, Lua, OCaml, R, and Racket. Furthermore, we use an open model (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done.
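The two-step pipeline described in the abstract can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the llm_complete helper is a hypothetical stand-in for a call to a Code LLM such as StarCoderBase, the test runner is approximated with a subprocess call, and the coverage filter and target-language validation are only indicated in comments.

```python
# Hedged sketch of a MultiPL-T-style data pipeline (not the authors' code).
# `llm_complete` is a hypothetical placeholder for a Code LLM call; the
# subprocess-based test runner and the omitted coverage check are assumptions
# made for illustration only.

import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a completion call to a Code LLM."""
    raise NotImplementedError("wire this up to a model such as StarCoderBase")


def synthesize_tests(python_fn: str) -> str:
    """Step 1: ask the LLM to write unit tests for a commented Python function."""
    return llm_complete(f"{python_fn}\n\n# Unit tests for the function above:\n")


def tests_pass(python_fn: str, tests: str) -> bool:
    """Filter: run the synthesized tests; drop items whose tests fail or error.
    (The approach also filters on test coverage; that check is omitted here.)"""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.py"
        src.write_text(python_fn + "\n\n" + tests + "\n")
        try:
            proc = subprocess.run(
                ["python", str(src)], capture_output=True, timeout=30
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0


def translate(code: str, target_lang: str) -> str:
    """Step 2: ask the LLM to translate Python code to a low-resource language."""
    return llm_complete(
        f"# Translate this Python code to {target_lang}:\n{code}\n# {target_lang}:\n"
    )


def build_training_item(python_fn: str, target_lang: str) -> Optional[str]:
    """Produce one validated training item, or None if any filter rejects it."""
    tests = synthesize_tests(python_fn)
    if not tests_pass(python_fn, tests):
        return None
    translated = translate(python_fn, target_lang)
    translated_tests = translate(tests, target_lang)
    # Validating `translated` against `translated_tests` requires the target
    # language's toolchain (e.g., lua, ocaml, racket) and is elided here.
    return translated
```

In the actual pipeline, tens of thousands of such validated items are accumulated per target language and then used to fine-tune a pretrained Code LLM.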

Using MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket that outperform other fine-tunes of these base models. We also present Racket fine-tunes for two very recent models, DeepSeek Coder and StarCoder2, to show that MultiPL-T continues to outperform other fine-tuning approaches for low-resource languages. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.

Thu 24 Oct

Displayed time zone: Pacific Time (US & Canada)

13:40 - 15:20
Machine Learning and Programming Languages (OOPSLA 2024) at IBR East
Chair(s): Loris D'Antoni UCSD
13:40
20m
Talk
CYCLE: Learning to Self-Refine the Code Generation
OOPSLA 2024
Yangruibo Ding Columbia University, Marcus J. Min Columbia University, Gail Kaiser Columbia University, Baishakhi Ray Columbia University, New York; AWS AI Lab
DOI
14:00
20m
Talk
Evaluating the Effectiveness of Deep Learning Models for Foundational Program Analysis Tasks
OOPSLA 2024
Qian Chen Nanjing University, Chenyang Yu Department of Computer Science and Technology, Nanjing University, Ruyan Liu Department of Computer Science and Technology, Nanjing University, Chi Zhang Nanjing University, Yu Wang Nanjing University, Ke Wang, Ting Su East China Normal University, Linzhang Wang Nanjing University
DOI
14:20
20m
Talk
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs
OOPSLA 2024
Federico Cassano Northeastern University, John Gouwar Northeastern University, Francesca Lucchetti Northeastern University, Claire Schlesinger Northeastern University, Anders Freeman Wellesley College, Carolyn Jane Anderson Wellesley College, Molly Q Feldman Oberlin College, Michael Greenberg Stevens Institute of Technology, Abhinav Jangda Microsoft Research, Arjun Guha Northeastern University; Roblox
DOI Pre-print
14:40
20m
Talk
Statically Contextualizing Large Language Models with Typed Holes
OOPSLA 2024
Andrew Blinn University of Michigan, Xiang Li University of Michigan, Ann Arbor, June Hyung Kim University of Michigan, Cyrus Omar University of Michigan
DOI
15:00
20m
Talk
WhiteFox: White-box Compiler Fuzzing Empowered by Large Language Models
OOPSLA 2024
Chenyuan Yang University of Illinois at Urbana-Champaign, Yinlin Deng University of Illinois at Urbana-Champaign, Runyu Lu Huazhong University of Science and Technology, Jiayi Yao The Chinese University of Hong Kong, Shenzhen, Jiawei Liu University of Illinois at Urbana-Champaign, Reyhaneh Jabbarvand University of Illinois at Urbana-Champaign, Lingming Zhang University of Illinois at Urbana-Champaign
DOI