[short] LLM in a flash: Efficient Large Language Model Inference with Limited Memory

20/12/2023 2 min


Episode Synopsis

This paper addresses the challenge of running large language models (LLMs) efficiently on devices with limited DRAM capacity by storing model parameters in flash memory and loading them into DRAM on demand. The authors propose two techniques, "windowing" and "row-column bundling," which together enable running models up to twice the size of the available DRAM, with a 4-5x (CPU) and 20-25x (GPU) increase in inference speed compared to naive loading approaches.
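
As a rough illustration of the windowing idea, the sketch below keeps in DRAM only the neurons that were active for the last few tokens, so each new token triggers a flash read only for neurons that are not already resident. The class name `NeuronWindow`, the `window_size` parameter, and the `step` method are hypothetical names chosen for this example; this is a toy model of the bookkeeping under those assumptions, not the authors' implementation.

```python
from collections import deque


class NeuronWindow:
    """Toy bookkeeping for a sliding window of active neurons (hypothetical helper)."""

    def __init__(self, window_size):
        # One set of active neuron ids per recent token; the deque drops
        # the oldest set automatically once window_size is reached.
        self.recent = deque(maxlen=window_size)
        # Neuron ids whose weights are assumed to be resident in DRAM.
        self.resident = set()

    def step(self, active):
        """Record the active neurons for a new token.

        Returns (to_load, to_evict): ids to read from flash and ids to drop from DRAM.
        """
        active = set(active)
        # Only neurons not already in DRAM need a flash read.
        to_load = active - self.resident
        self.recent.append(active)
        # Everything used by a token still inside the window stays resident.
        needed = set().union(*self.recent)
        to_evict = self.resident - needed
        self.resident = needed
        return to_load, to_evict


if __name__ == "__main__":
    window = NeuronWindow(window_size=2)
    for token_neurons in ([1, 2, 3], [2, 3, 4], [3, 5]):
        load, evict = window.step(token_neurons)
        print("load:", sorted(load), "evict:", sorted(evict))
```

Because consecutive tokens tend to activate overlapping sets of neurons, the `to_load` set in a scheme like this is usually much smaller than the full active set, which is where the saving in flash traffic would come from.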

https://arxiv.org/abs/2312.11514

YouTube: https://www.youtube.com/@ArxivPapers

TikTok: https://www.tiktok.com/@arxiv_papers

Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016

Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers
