"Factuality, Alignment, and Edge Efficiency"
Episode Synopsis
Generated with Google NotebookLM

This week's roundup distills 15 brand-new arXiv papers that are bending the curve on large-language-model accuracy, efficiency, and safety:

Truth under pressure – A RAG-powered adversarial pipeline shreds GPT-4o's fact-checker, proving that evaluators need retrieval too.
API docs, minus the bloat – Smart chunking plus a "Discovery Agent" trims OpenAPI specs while raising endpoint recall.
Alignment, re-weighted – FocalPO boosts Direct Preference Optimisation by doubling down on pairs the model already ranks right.
Seeing, thinking, scheming – MultiMind merges facial cues, vocal tone, Theory-of-Mind, and MCTS to out-bluff humans in Werewolf.
Token thrift as design law – A manifesto argues that pruning isn't just for speed; it cuts hallucinations and stabilises training.
Cheaper RL finetunes – MoPPS predicts prompt difficulty on the fly and slashes rollout counts.
Edge-ready inference – DeltaLLM exploits temporal sparsity, while HCAttention squeezes the KV cache to 25%, letting Llama-3-8B read 4 M tokens on a single A100.
LLMs that draw – A ReAct + RAG agent converts natural-language briefs straight into AutoCAD code.
Tool orchestration at scale – SciToolAgent uses a knowledge-graph spine to automate hundreds of domain-specific apps.
Where models get lost – MazeEval exposes huge language-bound gaps in spatial navigation.
Red-team reality check – 1.8 M attacks show nearly every frontier agent breaks policy within 100 prompts; robustness ≠ size.
Proving corrigibility – Five lexicographic "core safety values" deliver the first provable obedience guarantees.
Open-source powerhouse – Kimi K2 (MoE, 32 B active / 1 T total parameters) tops agentic leaderboards with a new MuonClip optimiser.

From adversarial fact-checking to provably safe utility heads, these papers reveal the state of the art, and the cracks that still need sealing.
Tune in for a 30-minute tour of:

efficiency tricks that make billion-param models mobile-friendly,
alignment methods that actually move preferences,
benchmarks that stress-test reasoning across space, language, and social strategy, and
frameworks that weld LLMs to real-world tools without burning GPU budgets.

If you build with, bet on, or just geek out over LLMs, this episode will arm you with the freshest insights, and plenty of rabbit holes for the weekend.

Sources:
https://arxiv.org/pdf/2410.14651
https://arxiv.org/pdf/2411.19804
https://arxiv.org/pdf/2501.06645
https://arxiv.org/pdf/2504.18039
https://arxiv.org/pdf/2505.18227
https://arxiv.org/pdf/2507.04632
https://arxiv.org/pdf/2507.19608
https://arxiv.org/pdf/2507.19771
https://arxiv.org/pdf/2507.19823
https://arxiv.org/pdf/2507.20280
https://arxiv.org/pdf/2507.20395
https://arxiv.org/pdf/2507.20526
https://arxiv.org/pdf/2507.20534
https://arxiv.org/pdf/2507.20796
https://arxiv.org/pdf/2507.20964