HY-World 1.5 (WorldPlay) is an interactive world-modeling framework introduced by Tencent Hunyuan, built around two core properties: low-latency real-time interaction and long-term geometric consistency. It addresses the limitations of the first-generation HY-World 1.0, which ran offline and lacked real-time interaction. Built on a flow-based video diffusion model, it generates long-horizon streaming video at 24 FPS, supports first-person and third-person perspective switching across both realistic and stylized scenes, and enables applications such as 3D reconstruction, promptable events, and infinite world expansion. The framework is fully open-source, providing model versions with different parameter counts and complete training and inference workflows.
HY-World 1.5 achieves real-time interaction and ensures long-term geometric consistency, overcoming the speed-memory trade-off present in current methods:
After inputting a single image or a text prompt describing the world, the model performs a next-chunk (16 video frames) prediction task, generating future videos based on user actions. During the generation of each chunk, it dynamically reconstructs context memory from past chunks to ensure long-term temporal and geometric consistency.
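The chunk-wise streaming loop described above can be sketched as follows. This is a hypothetical illustration, not the actual implementation: `predict_chunk` and `rebuild_memory` stand in for the diffusion model's next-chunk prediction and its context-memory reconstruction, and the 16-frame chunk size comes from the description above.

```python
CHUNK_FRAMES = 16  # each prediction step emits one 16-frame chunk

def plan_chunks(total_frames, chunk_frames=CHUNK_FRAMES):
    """Split a target video length into the chunk boundaries the model predicts."""
    chunks, start = [], 0
    while start < total_frames:
        end = min(start + chunk_frames, total_frames)
        chunks.append((start, end))
        start = end
    return chunks

def generate_stream(total_frames, predict_chunk, rebuild_memory):
    """Sketch of the streaming loop: predict a chunk, then rebuild context memory.

    predict_chunk(n_frames, memory) -> list of frames
    rebuild_memory(history)         -> context memory for the next chunk
    """
    history, memory = [], None
    for start, end in plan_chunks(total_frames):
        frames = predict_chunk(end - start, memory)
        history.append(frames)
        # Reconstruct context memory from all past chunks so later chunks
        # stay temporally and geometrically consistent with earlier ones.
        memory = rebuild_memory(history)
    return history
```

In the real model both callbacks operate in latent space; here they are placeholders that show only the control flow.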
HY-World 1.5 runs on NVIDIA GPUs with a CUDA environment. GPU VRAM requirements differ across inference and training scenarios (using Hunyuan Video 1.5-based 125-frame generation as the reference configuration).
The framework is developed against Python 3.10. Setup consists of creating an isolated conda environment, installing the dependency libraries and dedicated tools in order, and downloading the models.
conda create --name worldplay python=3.10 -y
conda activate worldplay
pip install -r requirements.txt
Installing an attention acceleration library improves inference speed and reduces GPU VRAM usage; either Flash Attention or SageAttention is supported.
pip install flash-attn --no-build-isolation
git clone https://github.com/cooper1637/SageAttention.git
cd SageAttention
export EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 # Optional configuration
python3 setup.py install
These two packages enable Transformer quantization and FP8 GEMM, respectively.
pip install angelslim==0.2.2
git clone --recursive [email protected]:deepseek-ai/DeepGEMM.git
cd DeepGEMM
./develop.sh
./install.sh
An automated download script is provided to download all core models with one command. A Hugging Face token is required:
python download_models.py --hf_token <your_huggingface_token>
The vision encoder is the FLUX.1-Redux-dev model. You must first apply for access on Hugging Face (https://huggingface.co/black-forest-labs/FLUX.1-Redux-dev) (access is usually granted immediately), then generate a read-only token in your account settings. To skip downloading the vision encoder, run:

python download_models.py --skip_vision_encoder
Multiple pre-trained model checkpoints are provided, covering bidirectional, autoregressive, distilled, and other variants. Some models have not yet been released.
| Model Name | Description | Download Status |
|---|---|---|
| HY-World1.5-Bidirectional-480P-I2V | Bidirectional attention model with reconstructed context memory | Available |
| HY-World1.5-Autoregressive-480P-I2V | Autoregressive model with reconstructed context memory | Available |
| HY-World1.5-Autoregressive-480P-I2V-rl | Autoregressive model with RL post-training | Pending |
| HY-World1.5-Autoregressive-480P-I2V-distill | Distilled autoregressive model optimized for fast inference (4 steps) | Available |
| HY-World1.5-Autoregressive-480P-I2V-rl-distill | Distilled autoregressive model with RL post-training | Pending |
Two inference pipelines are provided, catering to high-performance and lightweight needs respectively. Core configuration and execution are handled in the run.sh file.
For the Wan-based pipeline, see wan/README.md for details. The following are the core steps for inference based on Hunyuan Video:
Update the model paths printed by the download script into the corresponding variables in run.sh:
MODEL_PATH=<path_printed_by_download_script>
AR_ACTION_MODEL_PATH=<path_printed_by_download_script>/ar_model
BI_ACTION_MODEL_PATH=<path_printed_by_download_script>/bidirectional_model
AR_DISTILL_ACTION_MODEL_PATH=<path_printed_by_download_script>/ar_distilled_action_model
Key parameters like scene description, input image, and number of frames to generate can be customized in run.sh:
| Parameter | Description |
|---|---|
| `PROMPT` | Text description of the scene. |
| `IMAGE_PATH` | Path to the input image (required for I2V tasks). |
| `NUM_FRAMES` | Number of frames to generate (default 125). Must satisfy: Bi model: `[(num_frames-1)//4+1] % 16 == 0`; AR model: `[(num_frames-1)//4+1] % 4 == 0`. |
| `N_INFERENCE_GPU` | Number of GPUs for parallel inference. |
| `POSE` | Camera trajectory: pose string (recommended for quick tests) or JSON file path. |
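The `NUM_FRAMES` constraint can be checked before launching a run. This is a small helper sketch (the function names are illustrative, not part of the repo) that applies the formulas from the table above; the default of 125 frames maps to a latent length of 32, which satisfies both the Bi and AR constraints.

```python
def latent_len(num_frames: int) -> int:
    # Frames are compressed ~4x along time; +1 accounts for the first frame.
    return (num_frames - 1) // 4 + 1

def valid_num_frames(num_frames: int, model_type: str) -> bool:
    """Check the NUM_FRAMES constraint for the chosen model type."""
    n = latent_len(num_frames)
    if model_type == 'bi':
        return n % 16 == 0  # Bi model: latent length must be a multiple of 16
    if model_type == 'ar':
        return n % 4 == 0   # AR model: latent length must be a multiple of 4
    raise ValueError(f"unknown model_type: {model_type}")
```

For example, 125 frames is valid for both model types, while 121 frames (latent length 31) is valid for neither.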
Uncomment the corresponding inference command in run.sh to select the desired model:
--action_ckpt $BI_ACTION_MODEL_PATH --model_type 'bi'
--action_ckpt $AR_ACTION_MODEL_PATH --model_type 'ar'
--action_ckpt $AR_DISTILL_ACTION_MODEL_PATH --few_step true --num_inference_steps 4 --model_type 'ar'
Supports two methods: Pose String and Custom JSON File, catering to quick testing and complex trajectory needs.
Set the POSE variable in run.sh using the action-duration format, supporting multiple action combinations:
POSE='w-31' # Move forward 31 latent steps, generating [1+31] latent values.
# Combination example: forward 3, rotate right 1, move right 4.
POSE='w-3, right-1, d-4'
Supported actions (as used in the examples above) include `w` (move forward), `d` (move right), and `right` (rotate right), each paired with a duration in latent steps.
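The action-duration format above is simple enough to parse mechanically. The following sketch (illustrative only, not code from the repo) turns a pose string into `(action, duration)` pairs:

```python
def parse_pose(pose: str):
    """Parse 'action-duration' pose strings like 'w-3, right-1, d-4'."""
    steps = []
    for part in pose.split(','):
        # rsplit keeps multi-word action names intact if they contain '-'.
        action, duration = part.strip().rsplit('-', 1)
        steps.append((action, int(duration)))
    return steps
```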
Suitable for complex trajectories. First, generate a custom trajectory file:
python generate_custom_trajectory.py
Then, set the JSON file path in run.sh:
POSE='./assets/pose/your_custom_trajectory.json'
Enhance scene description by enabling prompt rewriting with a vLLM server:
export T2V_REWRITE_BASE_URL="<your_vllm_server_base_url>"
export T2V_REWRITE_MODEL_NAME="<your_model_name>"
REWRITE=true # Set in run.sh
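Since vLLM serves an OpenAI-compatible API, the rewrite server configured by the two environment variables above can be reached at its `/chat/completions` endpoint. The sketch below only builds such a request; the exact payload WorldPlay sends (system prompt included) is an assumption for illustration.

```python
import os

def build_rewrite_request(prompt: str) -> dict:
    """Build an OpenAI-compatible chat request for the prompt-rewrite server
    configured via T2V_REWRITE_BASE_URL / T2V_REWRITE_MODEL_NAME."""
    base_url = os.environ["T2V_REWRITE_BASE_URL"].rstrip("/")
    return {
        "url": f"{base_url}/chat/completions",
        "json": {
            "model": os.environ["T2V_REWRITE_MODEL_NAME"],
            "messages": [
                # Hypothetical system prompt; the real one is internal to run.sh.
                {"role": "system",
                 "content": "Expand this scene description with concrete visual detail."},
                {"role": "user", "content": prompt},
            ],
        },
    }
```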
After completing all configurations, execute the script to start inference:
bash run.sh