SWE-Bench: Evaluating Language Models on Real-World GitHub Issues

21/12/2024 22 min

Listen "SWE-Bench: Evaluating Language Models on Real-World GitHub Issues"

Episode Synopsis

This research paper introduces SWE-Bench, a new benchmark for testing how well large language models solve real problems in computer code. Its tasks come from real issues and code on GitHub, a website where programmers share and collaborate on code. These problems are more complex than the ones language models are usually tested on: to resolve an issue, a model has to understand a large codebase and often make changes across multiple files. The researchers also created SWE-Bench Lite, a smaller subset of SWE-Bench, and SWE-Llama, a language model fine-tuned specifically to fix code. The study found that even the best language models could only solve the easiest problems, showing there is still a long way to go before they can be genuinely helpful to programmers. The paper also suggests measuring how complex the required code changes are to better understand what language models are actually learning.

Paper: https://arxiv.org/pdf/2310.06770
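
To make the setup concrete, here is a minimal sketch of what a SWE-Bench task looks like, assuming the Hugging Face `datasets` library and the publicly released "princeton-nlp/SWE-bench_Lite" dataset; the field names reflect that released dataset and are stated here as assumptions, not as the paper's own API.

```python
# Minimal sketch: load SWE-Bench Lite and inspect one task instance.
# Assumes the "princeton-nlp/SWE-bench_Lite" dataset id and its field names.
from datasets import load_dataset

# Each task pairs a real GitHub issue with the repository state it was filed against.
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

task = lite[0]
print(task["repo"])               # e.g. "astropy/astropy" -- the source repository
print(task["base_commit"])        # the commit the model's patch must apply to
print(task["problem_statement"])  # the issue text the model is given
print(task["patch"])              # the reference (gold) fix, used only for comparison

# A generated patch counts as a success if it makes the instance's
# FAIL_TO_PASS tests pass without breaking the PASS_TO_PASS tests.
print(task["FAIL_TO_PASS"])
```

The key design choice this illustrates is that success is judged by running the repository's own tests against the model's patch, not by comparing text to the reference fix, which is why resolving an issue usually requires reading and editing code spread across several files.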
