vLLM Kunlun is a community-maintained hardware plugin for Kunlun XPUs that enables the vLLM framework to run seamlessly and efficiently on Kunlun XPU hardware. Through a pluggable hardware interface, vLLM Kunlun decouples Kunlun XPU support from the vLLM core, allowing mainstream open-source large models, including Transformer-based, Mixture-of-Experts (MoE), embedding, and multimodal LLMs, to run on this architecture.
vLLM Kunlun supports generative models such as Qwen, Llama, and GLM, as well as multimodal models such as Qianfan-VL and InternVL. Key features include quantization, LoRA, and segmented Kunlun graphs, which together deliver high inference performance on Kunlun 3 P800 hardware.
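A supported model can be served through vLLM's standard OpenAI-compatible entry point. The sketch below is illustrative: the model name is an example from the support matrix, and the flags follow upstream vLLM conventions, which this plugin may extend or restrict.

```shell
# Launch an OpenAI-compatible server on Kunlun XPU hardware.
# Model name and flags are illustrative; they follow upstream vLLM
# and may differ slightly in this plugin.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 8192 \
    --enable-lora   # optional: serve LoRA adapters alongside the base model
```

Once the server is up, any OpenAI-compatible client can send completion or chat requests to it.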
Supported generative models:

| Model | Support Status | Quantization | LoRA | Segmented Kunlun Graph | Notes |
|---|---|---|---|---|---|
| Qwen2/2.5 | ✅ | - | ✅ | ✅ | - |
| Qwen3 | ✅ | - | ✅ | ✅ | - |
| Qwen3-MoE/Coder | ✅ | ✅ | ✅ | ✅ | - |
| QwQ-32B | ✅ | - | - | ✅ | - |
| Llama2/3/3.1 | ✅ | - | - | ✅ | - |
| GLM-4.5/Air | ✅ | ✅ | ✅ | ✅ | - |
| Qwen3-Next | ⚠️ | - | - | - | Coming soon |
| GPT OSS | ⚠️ | - | - | - | Coming soon |
| DeepSeek-v3/3.2 | ⚠️ | - | - | - | Coming soon |
Supported multimodal models:

| Model | Support Status | Quantization | LoRA | Segmented Kunlun Graph | Notes |
|---|---|---|---|---|---|
| Qianfan-VL | ✅ | - | - | ✅ | - |
| Qwen2.5-VL | ✅ | - | - | ✅ | - |
| InternVL2.5/3/3.5 | ✅ | - | - | ✅ | - |
| InternS1 | ✅ | - | - | ✅ | - |
| Qwen2.5-Omni | ⚠️ | - | - | - | Coming soon |
| Qwen3-VL | ⚠️ | - | - | - | Coming soon |
| GLM-4.5V | ✅ | - | - | ✅ | - |
On the Kunlun 3 P800, the supported models deliver efficient computation. The test environment used 16 concurrent requests with input and output lengths of 2048 tokens each. Measured throughput for each model is as follows:
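A comparable run can be reproduced against a running server with vLLM's serving benchmark client. This is a sketch assuming the upstream `benchmark_serving.py` interface; the model name is illustrative and exact flag names may vary by vLLM version.

```shell
# Sketch of a throughput run matching the setup above:
# 16 concurrent requests, 2048-token input and 2048-token output.
# Assumes upstream vLLM's serving benchmark client.
python benchmarks/benchmark_serving.py \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset-name random \
    --random-input-len 2048 \
    --random-output-len 2048 \
    --max-concurrency 16 \
    --num-prompts 64
```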