Computer Vision & Robotics

Robotic Foundation Models

Embodied AI enables our algorithms to interact with users and to operate in challenging environments that test their real-world performance.

INSAIT is building robotic foundation models with a strong focus on visual understanding. We investigate challenges such as out-of-domain generalization of foundation models, deep integration of 3D representations, and learning from simulation. Our recent works include ReVLA, which sets the state of the art for generalization among open robotic foundation models.
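As a rough illustration of the policy interface behind such models, the sketch below wires a toy image encoder and a bag-of-words instruction encoder into an action head. The architecture, dimensions, and the ToyVLAPolicy name are illustrative assumptions, not ReVLA.

```python
# Minimal sketch of a vision-language-action (VLA) policy interface, assuming a
# toy setup: a small CNN image encoder, a bag-of-words instruction encoder, and
# an MLP action head. Illustrative only; not ReVLA's architecture.
import torch
import torch.nn as nn


class ToyVLAPolicy(nn.Module):
    def __init__(self, vocab_size: int = 1000, embed_dim: int = 64, action_dim: int = 7):
        super().__init__()
        # Image encoder: a few strided convolutions followed by global pooling.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Instruction encoder: mean of learned token embeddings (bag of words).
        self.text = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        # Action head: fused features -> continuous action (e.g., 6-DoF pose + gripper).
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(), nn.Linear(128, action_dim),
        )

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.vision(image), self.text(token_ids)], dim=-1)
        return self.head(fused)


if __name__ == "__main__":
    policy = ToyVLAPolicy()
    image = torch.rand(1, 3, 224, 224)        # one RGB observation
    tokens = torch.randint(0, 1000, (1, 6))   # a tokenized instruction
    action = policy(image, tokens)
    print(action.shape)                       # torch.Size([1, 7])
```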


3D Vision

Reconstruction, understanding, and generation of 3D shapes are core research topics at INSAIT. We focus on Gaussian splatting and NeRF reconstruction, enabling applications in embodied AI, scene understanding, and city-scale vision.

INSAIT has published a large dataset of more than 65,000 high-quality Gaussian splats to enable foundation-model training with 3D data. Our applied work investigates interactable 3D representations, LLM-based 3D generation, and policy learning from 3D.
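For context, Gaussian splatting renders a pixel by alpha-compositing the depth-sorted Gaussians that cover it; the equation below is the standard formulation from the 3DGS literature, not an INSAIT-specific model.

```latex
% Front-to-back alpha compositing over the depth-sorted Gaussians N covering a pixel:
% c_i is the i-th Gaussian's color and \alpha_i its opacity weighted by the projected
% 2D Gaussian falloff at that pixel.
C = \sum_{i \in \mathcal{N}} c_i \, \alpha_i \prod_{j=1}^{i-1} \left( 1 - \alpha_j \right)
```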


Transfer Learning

Generalizing deep learning models to downstream tasks is essential for advancing AI applications.

INSAIT is addressing challenges such as limited labeled data, domain shifts, and novel vocabularies. Our recent works include CD-ViTO, which introduces a new benchmark for cross-domain few-shot object detection (CD-FSOD) and establishes a new state-of-the-art method by enhancing open-set detectors. We are also working on open-vocabulary object detection for Earth observation imagery, i.e., locating anything on Earth (LAE).
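To illustrate the few-shot setting, the sketch below classifies detector region features against class prototypes built from a handful of labeled support boxes. This prototypical-classifier baseline is a common pattern in few-shot pipelines; the function name and tensor shapes are assumptions, not CD-ViTO's actual method.

```python
# Minimal sketch of a prototype-based few-shot classifier over region features,
# a common baseline in few-shot detection pipelines; not CD-ViTO itself.
# Support features come from K labeled boxes per novel class in the target domain;
# queries are candidate region features produced by an open-set detector.
import torch
import torch.nn.functional as F


def prototype_classify(support_feats: torch.Tensor,  # (num_classes, K, D)
                       query_feats: torch.Tensor     # (num_queries, D)
                       ) -> torch.Tensor:
    """Assign each query region to the class with the most similar prototype."""
    # One prototype per class: the mean of its K normalized support embeddings.
    prototypes = F.normalize(support_feats, dim=-1).mean(dim=1)  # (num_classes, D)
    prototypes = F.normalize(prototypes, dim=-1)
    queries = F.normalize(query_feats, dim=-1)
    # Cosine similarity between every query and every class prototype.
    logits = queries @ prototypes.T                              # (num_queries, num_classes)
    return logits.argmax(dim=-1)


if __name__ == "__main__":
    support = torch.randn(5, 10, 256)  # 5 novel classes, 10 shots, 256-d features
    queries = torch.randn(32, 256)     # 32 candidate regions from the detector
    print(prototype_classify(support, queries).shape)  # torch.Size([32])
```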


Multimodal Learning

INSAIT is actively exploring multimodal learning methods, focusing on multi-sensor fusion, multi-task learning, and multimodal applications. 

We have proposed an all-in-one unified RGB-X tracker for video object tracking, a robust multi-sensor fusion method for panoptic segmentation, a text-to-image generation model, and a multimodal museum dataset of more than 200K image-table pairs that supports applications built around museum exhibits. We are also exploring multimodal methods for egocentric videos.
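As a rough illustration of multi-sensor fusion, the sketch below lets RGB patch tokens cross-attend to tokens from an auxiliary modality ("X", e.g. depth, thermal, or event frames). The module name, dimensions, and design are illustrative assumptions, not the published tracker's architecture.

```python
# Minimal sketch of an RGB-X feature-fusion block, assuming a generic two-stream
# setup where "X" is an auxiliary modality. RGB tokens attend to the auxiliary
# tokens via cross-attention; this is a common fusion pattern, not the exact
# architecture of the published tracker.
import torch
import torch.nn as nn


class RGBXFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, x_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from RGB; keys/values come from the auxiliary modality.
        fused, _ = self.cross_attn(rgb_tokens, x_tokens, x_tokens)
        return self.norm(rgb_tokens + fused)  # residual keeps the RGB features intact


if __name__ == "__main__":
    fusion = RGBXFusion()
    rgb = torch.randn(2, 196, 256)  # batch of 2, 14x14 RGB patch tokens
    aux = torch.randn(2, 196, 256)  # matching depth/thermal/event tokens
    print(fusion(rgb, aux).shape)   # torch.Size([2, 196, 256])
```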
