NoWag: Unified Compression for Large Language Models

26/04/2025 17 min

Listen "NoWag: Unified Compression for Large Language Models"

Episode Synopsis

We discuss NoWag (Normalized Weight and Activation Guided Compression), a novel framework for compressing large language models (LLMs) while preserving the structure of their weight matrices. The unified approach covers both pruning (removing less important connections) and vector quantization (grouping weights and reducing their precision), with both variants built on a shared normalization step guided by weight and activation statistics. Experiments on Llama models show that NoWag's quantization variant significantly outperforms existing state-of-the-art zero-shot quantization methods while using less calibration data, and that its pruning variant achieves competitive results, suggesting a shared underlying principle for effective LLM compression.
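To make the "weight and activation guided" idea concrete, here is a minimal, hedged sketch of activation-guided pruning in the spirit the episode describes. It is not the paper's exact NoWag formulation: the function name and the Wanda-style score |W_ij| * ||X_:,j||_2 are assumptions chosen purely to illustrate how calibration activations can normalize weight importance before dropping low-scoring weights.

```python
import numpy as np

def activation_guided_prune(W, X, sparsity=0.5):
    """Illustrative activation-guided pruning (hypothetical helper, not NoWag itself).

    W: (out_features, in_features) weight matrix of a linear layer.
    X: (num_tokens, in_features) calibration activations fed into that layer.
    Scores each weight as |W_ij| * ||X_:,j||_2 and zeros the lowest-scoring
    fraction of weights in each output row.
    """
    col_norms = np.linalg.norm(X, axis=0)             # per-input-channel activation scale
    scores = np.abs(W) * col_norms[None, :]           # activation-normalized importance
    k = int(W.shape[1] * sparsity)                    # number of weights to drop per row
    cutoff = np.partition(scores, k, axis=1)[:, k:k + 1]
    mask = scores >= cutoff                           # keep only the higher-scoring weights
    return W * mask

# Toy usage: prune half the weights of a random layer with random calibration data.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(128, 16))
W_pruned = activation_guided_prune(W, X, sparsity=0.5)
print((W_pruned == 0).mean())  # roughly 0.5
```

The same normalized scores could, in principle, feed a vector-quantization step instead of a pruning mask, which is the unification the episode highlights.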
