Mantis is a versatile vision-language-action (VLA) model that enables robots to perform complex manipulation tasks. It is built on a vision-language model (VLM) backbone and fine-tuned through multi-stage pretraining on multiple datasets. Its key features are:

- **Disentangled visual foresight** efficiently captures latent action trajectories without overburdening the backbone network.
- **Progressive training** introduces multimodal data in stages, preserving and enhancing the language understanding and reasoning capabilities of the VLM backbone.
- **Adaptive temporal integration** dynamically adjusts integration strength, ensuring control stability while reducing inference cost.

Mantis delivers strong performance on robotic manipulation tasks and generalizes well to both in-domain and out-of-domain instructions.
| Model | Description |
|---|---|
| Mantis-Base | The base Mantis model trained through a three-stage pretraining process |
| Mantis-SSV2 | Mantis model pretrained in the first stage on the SSV2 dataset |
| Mantis-LIBERO | Mantis model fine-tuned on the LIBERO dataset |

| Dataset | Description |
|---|---|
| Something-Something-v2 | Human action video dataset used in the first-stage pretraining |
| DROID-Lerobot | Robotic dataset used in the second and third stages of pretraining |
| LLaVA-OneVision-1.5-Instruct-Data | Multimodal dataset used in the third-stage pretraining |
| LIBERO-Lerobot | LIBERO dataset used for fine-tuning |
Before fine-tuning, first download the LIBERO-Lerobot dataset and the Mantis-Base model.
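As a minimal sketch, the download step could look like the following, assuming the assets are hosted on the Hugging Face Hub and fetched with `huggingface_hub.snapshot_download`. The repository IDs and the `checkpoints/` directory layout here are placeholders, not the project's actual identifiers:

```python
# Hypothetical helper for fetching the Mantis-Base checkpoint and the
# LIBERO-Lerobot dataset. Repo IDs below are PLACEHOLDERS -- substitute
# the actual IDs from the project page.
from pathlib import Path

def local_dirs(root: str, assets: dict[str, str]) -> dict[str, Path]:
    """Map each asset name to the local directory it should be saved into."""
    return {name: Path(root) / name for name in assets}

ASSETS = {
    "Mantis-Base": "org/Mantis-Base",        # placeholder repo ID
    "LIBERO-Lerobot": "org/LIBERO-Lerobot",  # placeholder repo ID
}

if __name__ == "__main__":
    # Requires `pip install huggingface_hub`; performs network downloads.
    from huggingface_hub import snapshot_download

    targets = local_dirs("checkpoints", ASSETS)
    for name, repo_id in ASSETS.items():
        snapshot_download(repo_id=repo_id, local_dir=str(targets[name]))
```

Equivalently, the same fetch can be done from the command line with `huggingface-cli download <repo-id> --local-dir <dir>`, again substituting the real repository IDs.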