Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Mantis is a versatile vision-language-action (VLA) model that enables robots to perform complex manipulation tasks. It combines three mechanisms: disentangled visual foresight, which captures latent action trajectories depicting visual dynamics without overburdening the backbone network; a progressive training strategy, which introduces multimodal data in stages to preserve and enhance the language understanding and reasoning abilities of the vision-language model (VLM) backbone; and adaptive temporal integration, which dynamically adjusts the integration strength to keep control stable while reducing inference cost. Built on multi-stage pretraining over multiple datasets and then fine-tuned, Mantis delivers strong performance on robotic manipulation tasks and generalizes well to both in-domain and out-of-domain instructions.

  • Disentangled Visual Foresight: Automatically captures latent actions depicting visual trajectories without adding extra burden to the backbone network.
  • Progressive Training: Introduces modalities in stages to preserve the language understanding and reasoning capabilities of the Vision-Language Model (VLM) backbone.
  • Adaptive Temporal Integration: Dynamically adjusts the temporal integration strength to maintain stable control while reducing inference costs.
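The integration rule itself is not spelled out here. As a rough illustration only: chunked-action policies commonly ensemble the overlapping per-timestep predictions produced by successive inferences using exponential weights, and an "adaptive" variant could modulate the decay by how much those predictions disagree. A minimal NumPy sketch under those assumptions (the function names and the adaptation rule are hypothetical, not Mantis's actual method):

```python
import numpy as np

def integrate_chunks(predictions, alpha=0.3):
    """Exponentially weighted average of overlapping action-chunk
    predictions for the current timestep.

    predictions: list of action vectors, oldest first, each produced
    by a different past inference for the same timestep.
    alpha: decay rate; larger alpha down-weights older predictions faster.
    """
    preds = np.stack(predictions)              # (k, action_dim)
    ages = np.arange(len(predictions))[::-1]   # oldest prediction has the largest age
    weights = np.exp(-alpha * ages)
    weights /= weights.sum()
    return (weights[:, None] * preds).sum(axis=0)

def adaptive_alpha(predictions, base=0.3, scale=1.0):
    """Hypothetical adaptation: when overlapping predictions disagree,
    raise the decay so recent predictions dominate the blend."""
    preds = np.stack(predictions)
    disagreement = float(preds.std(axis=0).mean())
    return base + scale * disagreement
```

With `alpha=0` the blend reduces to a plain average; as disagreement between overlapping predictions grows, `adaptive_alpha` shifts weight toward the most recent inference.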

Models and Datasets

Model Versions

| Model | Description |
| --- | --- |
| Mantis-Base | Base Mantis model trained through the three-stage pretraining process |
| Mantis-SSV2 | Mantis model after first-stage pretraining on the SSV2 dataset |
| Mantis-LIBERO | Mantis model fine-tuned on the LIBERO dataset |

Datasets Used

| Dataset | Description |
| --- | --- |
| Something-Something-v2 | Human action video dataset used in the first-stage pretraining |
| DROID-Lerobot | Robotic dataset used in the second and third stages of pretraining |
| LLaVA-OneVision-1.5-Instruct-Data | Multimodal dataset used in the third-stage pretraining |
| LIBERO-Lerobot | LIBERO dataset used for fine-tuning |

Please first download the LIBERO datasets and the base Mantis model.
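If the checkpoints and datasets are hosted on the Hugging Face Hub, they can be fetched with the Hub CLI. The repository IDs below are assumptions for illustration; check the project page for the actual paths:

```shell
# Install the Hugging Face Hub CLI.
pip install -U "huggingface_hub[cli]"

# Base model checkpoint (hypothetical repo ID).
huggingface-cli download zhijie-group/Mantis-Base --local-dir ./checkpoints/Mantis-Base

# LIBERO fine-tuning data (hypothetical repo ID).
huggingface-cli download zhijie-group/LIBERO-Lerobot --repo-type dataset --local-dir ./data/LIBERO-Lerobot
```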

Visit zhijie-group/Mantis for the source code and further information.