INSAIT Presents – GaussianVLM: 3D Scene Understanding from Video

INSAIT researchers, in collaboration with international partners, have introduced GaussianVLM—the first Vision-Language Model capable of understanding fully immersive 3D scenes reconstructed from ordinary smartphone videos, without the need for specialized hardware.

GaussianVLM leverages Gaussian splats, a compact and photorealistic 3D representation, enabling the model to interpret complex spatial environments and answer open-ended, natural-language questions about them. This capability opens new possibilities in areas such as robotics, augmented reality, and human-computer interaction. For example, a robot equipped with GaussianVLM could navigate a room and respond to queries like “What’s on the table?” or “Are there enough seats for the guests?”

In addition to its novel architecture, GaussianVLM demonstrates a significant advance in efficiency: it reduces the number of tokens needed to represent a scene from 40,000 to just 132, enabling faster and more scalable processing.

The research has received broad attention, ranking among the Top 10 most-read papers on Scholar Inbox in the first week following its release.

Authors: Anna-Maria Halacheva, Dr. Jan-Nico Zaech, Dr. Xi Wang, Dr. Danda Pani Paudel, and Prof. Luc Van Gool.

More info: https://insait-institute.github.io/gaussianvlm.github.io/
Paper: https://arxiv.org/abs/2507.00886