Mini-o3: Scaling Reasoning for Visual Search

10/09/2025 12 min


Episode Synopsis

This September 2025 paper introduces Mini-o3, a Vision-Language Model (VLM) designed to overcome the limitations of existing VLMs on complex visual search tasks that require multi-turn reasoning and trial-and-error exploration. The researchers developed a three-component training recipe: the Visual Probe Dataset of challenging, high-resolution images; a pipeline for synthesizing diverse multi-turn trajectories for supervised fine-tuning; and an over-turn masking technique for reinforcement learning. This masking prevents penalization of long, incomplete reasoning paths, encouraging deeper exploration without increasing training time. Mini-o3 demonstrates state-of-the-art performance on various visual search benchmarks, showcasing its enhanced ability for complex, adaptive visual understanding through iterative observation, thought, and action.

Source: https://arxiv.org/pdf/2509.07969
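The over-turn masking idea in the synopsis can be illustrated with a minimal sketch. This is not the paper's implementation: the `Rollout` type, the `masked_advantages` function, the binary reward, and the choice to compute the group baseline from completed rollouts only are all assumptions made for illustration. The point it shows is that a rollout which runs out of turns before answering contributes a zero advantage instead of a negative one, so long explorations are not pushed down.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """Hypothetical record of one sampled trajectory (illustration only)."""
    reward: float    # assumed: 1.0 if the final answer is correct, else 0.0
    completed: bool  # False if the turn budget ran out before an answer


def masked_advantages(rollouts, eps=1e-6):
    """Group-normalized advantages with incomplete rollouts masked to zero.

    Assumption: the baseline statistics use completed rollouts only.
    """
    done = [r.reward for r in rollouts if r.completed]
    if not done:
        # Nothing finished, so there is no learning signal for this group.
        return [0.0 for _ in rollouts]
    mu = mean(done)
    sigma = pstdev(done)
    return [
        (r.reward - mu) / (sigma + eps) if r.completed else 0.0
        for r in rollouts
    ]


group = [Rollout(1.0, True), Rollout(0.0, True), Rollout(0.0, False)]
advantages = masked_advantages(group)
```

Here the third rollout hit the turn limit, so its advantage is zero and it neither raises nor lowers the likelihood of its actions, while the two completed rollouts receive positive and negative advantages as usual.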

More episodes of the podcast AI: post transformers