HY-World 1.5: A Systematic Framework for Interactive World Modeling with Real-Time Latency and Geometric Consistency

HY-World 1.5 (WorldPlay) is a systematic and interactive world modeling framework introduced by Tencent Hunyuan. It focuses on two core characteristics: real-time low latency and long-term geometric consistency. This framework addresses the limitations of the first-generation HY-World 1.0, which was offline and lacked real-time interaction. Built upon a flow-based video diffusion model, it can generate long-horizon streaming videos at 24 FPS. It supports first-person and third-person perspective switching for both realistic and stylized scenes, enabling various applications such as 3D reconstruction, promptable events, and infinite world expansion. The framework is now fully open-source, providing model versions with different parameter counts and complete training and inference workflows.

HY-World 1.5 achieves real-time interaction and ensures long-term geometric consistency, overcoming the speed-memory trade-off present in current methods:

  1. Dual Action Representation: Enables robust action control in response to user keyboard and mouse inputs, accurately following user interaction commands.
  2. Reconstructed Context Memory: Dynamically reconstructs context from past frames, ensuring the accessibility of geometrically critical distant frames through temporal reframing, effectively mitigating memory decay.
  3. WorldCompass Reinforcement Learning Post-Training Framework: Specifically designed for long-horizon autoregressive video models, it directly enhances the model's action-following capability and visual quality.
  4. Context-Enforced Distillation Method: Tailored for memory-aware models, it aligns the memory context between teacher and student models, preserving the student model's ability to utilize long-range information. This achieves real-time speed while preventing error drift.
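The reconstructed context memory described above can be pictured with a short sketch. Everything here is illustrative: the function name, the recent/distant split, and the evenly spaced keyframe heuristic are assumptions standing in for the model's actual "temporal reframing", not the released code.

```python
def reconstruct_context(frames, n_recent=4, n_distant=4):
    """Illustrative context reconstruction: keep the most recent frames
    for temporal continuity, plus evenly spaced distant frames so that
    geometrically critical early views remain accessible instead of
    decaying out of a fixed-size memory window."""
    if len(frames) <= n_recent + n_distant:
        return list(frames)
    recent = frames[-n_recent:]          # short-term continuity
    older = frames[:-n_recent]
    stride = max(1, len(older) // n_distant)
    distant = older[::stride][:n_distant]  # long-range geometric anchors
    return distant + recent
```

The design point is that context size stays constant as the video grows, while distant frames are never entirely evicted.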

Core Inference Process

After inputting a single image or a text prompt describing the world, the model performs a next-chunk (16 video frames) prediction task, generating future videos based on user actions. During the generation of each chunk, it dynamically reconstructs context memory from past chunks to ensure long-term temporal and geometric consistency.
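The streaming loop above can be sketched as follows. The `predict_chunk` and `reconstruct_context` callables are placeholders for the model's components, not the actual HY-World 1.5 API; the 16-frame chunk size comes from the description above.

```python
CHUNK_FRAMES = 16  # frames predicted per autoregressive step

def generate_stream(predict_chunk, reconstruct_context, first_frame, actions):
    """Illustrative next-chunk prediction loop: each iteration rebuilds
    the context memory from all frames generated so far, then predicts
    the next chunk conditioned on the user's current action."""
    history = [first_frame]
    for action in actions:
        context = reconstruct_context(history)      # memory from past chunks
        history.extend(predict_chunk(context, action))  # 16 new frames
    return history
```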

System Requirements

HY-World 1.5 runs on NVIDIA GPUs with CUDA. GPU VRAM requirements differ across inference and training scenarios (the figures below use HunyuanVideo-1.5-based 125-frame generation as an example).

Inference VRAM Requirements (AR Distilled Model)

  • sp=8: 28G
  • sp=4: 34G
  • sp=1: 72G

Training VRAM Requirements

  • sp=8: 60G

Environment Setup and Dependency Installation

The framework is developed with Python 3.10. Create an isolated environment via conda, then install the dependency libraries, dedicated tools, and models in sequence.

1. Create and Activate Virtual Environment

conda create --name worldplay python=3.10 -y
conda activate worldplay
pip install -r requirements.txt

2. Install Attention Libraries (Optional, Recommended)

Improves inference speed and reduces GPU VRAM usage. Supports either Flash Attention or SageAttention.

Flash Attention

pip install flash-attn --no-build-isolation

SageAttention

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention
export EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 # Optional configuration
python3 setup.py install

3. Install AngelSlim and DeepGEMM

Used for Transformer quantization and enabling FP8 GEMM functionality, respectively.

AngelSlim

pip install angelslim==0.2.2

DeepGEMM

git clone --recursive git@github.com:deepseek-ai/DeepGEMM.git
cd DeepGEMM
./develop.sh
./install.sh

4. Download Required Models

An automated download script is provided to download all core models with one command. A Hugging Face token is required:

python download_models.py --hf_token <your_huggingface_token>

Key Notes

  • The visual encoder requires access to the restricted FLUX.1-Redux-dev model. You must first apply for access on Hugging Face (https://huggingface.co/black-forest-labs/FLUX.1-Redux-dev) (usually granted immediately), then generate a read-only token in your account settings.
  • If you do not have FLUX access, you can skip the visual encoder:
python download_models.py --skip_vision_encoder

Script Download Contents

  • HY-WorldPlay action models (each ~32GB)
  • Hunyuan Video-1.5 base models (vae, scheduler, 480p transformer)
  • Qwen2.5-VL-7B-Instruct text encoder (~15GB)
  • ByT5 encoder (byt5-small + Glyph-SDXL-v2)
  • SigLIP visual encoder (from FLUX.1-Redux-dev)

After the download completes, the script prints the model paths, which need to be added to the run.sh file.

Model Checkpoints

Multiple pre-trained model checkpoints are provided, covering bidirectional, autoregressive, distilled, and other types. Some models are not yet released.

| Model Name | Description | Download Status |
| --- | --- | --- |
| HY-World1.5-Bidirectional-480P-I2V | Bidirectional attention model with reconstructed context memory | Available |
| HY-World1.5-Autoregressive-480P-I2V | Autoregressive model with reconstructed context memory | Available |
| HY-World1.5-Autoregressive-480P-I2V-rl | Autoregressive model with RL post-training | Pending |
| HY-World1.5-Autoregressive-480P-I2V-distill | Distilled autoregressive model optimized for fast inference (4 steps) | Available |
| HY-World1.5-Autoregressive-480P-I2V-rl-distill | Distilled autoregressive model with RL post-training | Pending |

Quick Inference

Two inference pipelines are provided, catering to high-performance and lightweight needs respectively. Core configuration and execution are handled in the run.sh file.

Pipeline Selection

  1. Pipeline based on Hunyuan Video (Recommended): Uses HunyuanVideo-8B as the backbone, offering superior action control and long-term memory performance.
  2. WAN Pipeline (Lightweight): Uses WAN-5B as the backbone, suitable for GPUs with limited VRAM. Action control and long-term memory performance are slightly compromised. See wan/README.md for details.

The following are the core steps for inference based on Hunyuan Video:

1. Configure Model Paths

Update the model paths printed by the download script into the corresponding variables in run.sh:

MODEL_PATH=<path_printed_by_download_script>
AR_ACTION_MODEL_PATH=<path_printed_by_download_script>/ar_model
BI_ACTION_MODEL_PATH=<path_printed_by_download_script>/bidirectional_model
AR_DISTILL_ACTION_MODEL_PATH=<path_printed_by_download_script>/ar_distilled_action_model

2. Configure Core Parameters

Key parameters like scene description, input image, and number of frames to generate can be customized in run.sh:

| Parameter | Description |
| --- | --- |
| PROMPT | Text description of the scene. |
| IMAGE_PATH | Path to the input image (required for I2V tasks). |
| NUM_FRAMES | Number of frames to generate (default 125). Must satisfy: Bi model: ((num_frames-1)//4+1) % 16 == 0; AR model: ((num_frames-1)//4+1) % 4 == 0. |
| N_INFERENCE_GPU | Number of GPUs for parallel inference. |
| POSE | Camera trajectory: pose string (recommended for quick tests) or JSON file path. |
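The NUM_FRAMES constraint can be verified with a small helper. This is a sketch based on the formula documented above (the `//4 + 1` term reflects the VAE's temporal compression of frames into latents); the function name is hypothetical, not part of the repository.

```python
def valid_num_frames(num_frames, model_type):
    """Check the NUM_FRAMES constraint: frames are compressed into
    latents as (num_frames - 1) // 4 + 1, and the latent count must
    divide evenly by 16 for the 'bi' model or by 4 for the 'ar' model."""
    latents = (num_frames - 1) // 4 + 1
    return latents % (16 if model_type == "bi" else 4) == 0
```

For the default of 125 frames, the latent count is (125 - 1) // 4 + 1 = 32, which satisfies both the bidirectional (32 % 16 == 0) and autoregressive (32 % 4 == 0) constraints.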

3. Select Model

Uncomment the corresponding inference command in run.sh to select the desired model:

Bidirectional Model

--action_ckpt $BI_ACTION_MODEL_PATH --model_type 'bi'

Autoregressive Model

--action_ckpt $AR_ACTION_MODEL_PATH --model_type 'ar'

Distilled Model

--action_ckpt $AR_DISTILL_ACTION_MODEL_PATH --few_step true --num_inference_steps 4 --model_type 'ar'

4. Camera Trajectory Control

Supports two methods: Pose String and Custom JSON File, catering to quick testing and complex trajectory needs.

Method 1: Pose String (Recommended for Quick Tests)

Set the POSE variable in run.sh using the action-duration format, supporting multiple action combinations:

POSE='w-31' # Move forward 31 latent steps, generating [1+31] latent values.
# Combination example: forward 3, rotate right 1, move right 4.
POSE='w-3, right-1, d-4'

Supported Actions:

  • Movement: w (forward), s (backward), a (left), d (right)
  • Rotation: up (pitch up), down (pitch down), left (yaw left), right (yaw right)
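The action-duration format above can be parsed with a short helper. This is a hypothetical utility mirroring the documented syntax, not the repository's own parser.

```python
VALID_ACTIONS = {"w", "s", "a", "d", "up", "down", "left", "right"}

def parse_pose(pose):
    """Parse a pose string like 'w-3, right-1, d-4' into a list of
    (action, duration) pairs, validating each action token against the
    supported movement and rotation actions."""
    steps = []
    for token in pose.split(","):
        action, _, duration = token.strip().partition("-")
        if action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        steps.append((action, int(duration)))
    return steps
```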

Method 2: Custom JSON File

Suitable for complex trajectories. First, generate a custom trajectory file:

python generate_custom_trajectory.py

Then, set the JSON file path in run.sh:

POSE='./assets/pose/your_custom_trajectory.json'

5. Prompt Rewriting (Optional)

Enhance scene description by enabling prompt rewriting with a vLLM server:

export T2V_REWRITE_BASE_URL="<your_vllm_server_base_url>"
export T2V_REWRITE_MODEL_NAME="<your_model_name>"
REWRITE=true # Set in run.sh

6. Run Inference

After completing all configurations, execute the script to start inference:

bash run.sh

Visit Tencent-Hunyuan/HY-WorldPlay to access the source code and obtain more information.