YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

YingMusic-SVC is an open-source project for robust zero-shot singing voice conversion (SVC) on real-world songs. Its goal is to reproduce the timbre of a target singer accurately while preserving melody and lyrics. It addresses problems that existing systems face on real songs: harmony interference, F0 errors, and a lack of singing-specific inductive biases.

YingMusic-SVC proposes a three-stage training framework: continuous pre-training (CPT) with singing-trained modules, robust supervised fine-tuning (SFT) with F0 perturbation and harmony augmentation, and multi-reward reinforcement learning with Flow-GRPO to optimize perceptual quality. The framework also introduces singing-specific inductive biases: an RVC-based timbre shifter for timbre-content disentanglement, an F0-aware fine-grained timbre adaptor that captures dynamic vocal expression, and an energy-balanced flow-matching loss that enhances high-frequency detail.
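
To make the F0 perturbation used in the SFT stage more concrete, here is a minimal sketch in plain Python. The text does not specify the perturbation scheme, so everything below is an assumption made for illustration: the function name perturb_f0, the global detune plus per-frame jitter, the random voicing drops, and all default values are hypothetical and are not taken from the project.

```python
import random

def perturb_f0(f0_hz, max_shift_semitones=0.5, jitter_semitones=0.1,
               drop_prob=0.05, seed=None):
    """Return a noisy copy of an F0 contour (Hz per frame, 0.0 = unvoiced).

    Hypothetical scheme, not the project's: one global detune per clip,
    small per-frame jitter, and occasional voiced-to-unvoiced flips that
    imitate pitch-tracker errors.
    """
    rng = random.Random(seed)
    # One global detune for the whole clip, in semitones.
    shift = rng.uniform(-max_shift_semitones, max_shift_semitones)
    out = []
    for f in f0_hz:
        if f <= 0.0:
            out.append(0.0)  # unvoiced frames stay unvoiced
            continue
        if rng.random() < drop_prob:
            out.append(0.0)  # simulate a voicing error
            continue
        semis = shift + rng.uniform(-jitter_semitones, jitter_semitones)
        # A shift of s semitones multiplies the frequency by 2^(s/12).
        out.append(f * 2.0 ** (semis / 12.0))
    return out

contour = [0.0, 220.0, 220.0, 246.9, 0.0, 261.6]
noisy = perturb_f0(contour, seed=0)
```

Training on such perturbed contours while keeping the clean audio as the target is one plausible way to make a model tolerant of F0 extraction errors; the actual recipe may differ.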

The project provides a difficulty-graded multi-track benchmark dataset, pre-trained checkpoints for the full SVC model and the accompaniment separation model, and Gradio applications. The system performs well under complex accompaniment and harmony contamination, which supports practical deployment of singing voice conversion.

YingMusic-SVC Key Features

Three‑Stage Training Pipeline

CPT: Continuous Pre-Training with singing‑trained modules
SFT: Robust Supervised Fine-Tuning with F0 perturbation & harmony augmentation
RL (Flow‑GRPO): Multi-reward reinforcement learning for perceptual quality
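
The RL stage combines several rewards and optimizes them with a GRPO-style objective. A minimal sketch of the group-relative advantage that GRPO-family methods compute is shown below: several candidates are sampled for the same input, each gets a scalar score, and scores are normalized within the group. The reward names, weights, and numbers are hypothetical placeholders (the text does not name the rewards), and the weighted sum is only one assumed way to merge multiple rewards.

```python
from statistics import mean, pstdev

def combine_rewards(rewards, weights):
    """Weighted sum of named rewards (an assumed way to merge them)."""
    return sum(weights[name] * value for name, value in rewards.items())

def group_relative_advantages(scores, eps=1e-8):
    """Normalize scores within one sampling group: (r - mean) / (std + eps)."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population std over the group
    return [(s - mu) / (sigma + eps) for s in scores]

# Hypothetical reward names and weights, for illustration only.
weights = {"similarity": 0.5, "intelligibility": 0.3, "quality": 0.2}

# Four candidate conversions sampled for the same source/target pair.
group = [
    {"similarity": 0.80, "intelligibility": 0.90, "quality": 0.70},
    {"similarity": 0.60, "intelligibility": 0.95, "quality": 0.80},
    {"similarity": 0.85, "intelligibility": 0.70, "quality": 0.60},
    {"similarity": 0.70, "intelligibility": 0.85, "quality": 0.90},
]

scores = [combine_rewards(r, weights) for r in group]
advantages = group_relative_advantages(scores)
```

Candidates that score above the group mean receive positive advantages and are reinforced; those below the mean are discouraged. Because the normalization is relative to the group, no separate value network is needed.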

Singing-Specific Inductive Biases

RVC-based Timbre Shifter (trained on 120 singers)
F0‑Aware Fine-Grained Timbre Adaptor
Energy-balanced Flow Matching Loss (enhanced high-frequency details)
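
The exact form of the energy-balanced loss is not given here, so the following is only a sketch of one plausible reading: a flow-matching regression loss (squared error between predicted and target velocity) in which each frequency bin is weighted inversely to its average energy, so that low-energy high-frequency bins are not drowned out by high-energy low-frequency bins. The function names bin_weights and weighted_fm_loss, the inverse-energy weighting, and the toy data are all assumptions.

```python
def bin_weights(v_target, eps=1e-6):
    """Per-bin weights inversely proportional to mean energy, with mean 1."""
    n_frames = len(v_target)
    n_bins = len(v_target[0])
    energy = [sum(frame[b] ** 2 for frame in v_target) / n_frames
              for b in range(n_bins)]
    raw = [1.0 / (e + eps) for e in energy]
    scale = n_bins / sum(raw)  # normalize so the weights average to 1
    return [w * scale for w in raw]

def weighted_fm_loss(v_pred, v_target, weights):
    """Mean of weights[b] * (v_pred - v_target)^2 over frames and bins."""
    total = 0.0
    count = 0
    for p_frame, t_frame in zip(v_pred, v_target):
        for w, p, t in zip(weights, p_frame, t_frame):
            total += w * (p - t) ** 2
            count += 1
    return total / count

# Toy target velocity field: 2 frames x 3 bins, energy falling with frequency.
v_target = [[2.0, 1.0, 0.1], [2.0, 1.0, 0.1]]
v_pred = [[1.8, 0.9, 0.0], [1.8, 0.9, 0.0]]

weights = bin_weights(v_target)
balanced = weighted_fm_loss(v_pred, v_target, weights)
plain = weighted_fm_loss(v_pred, v_target, [1.0, 1.0, 1.0])
```

With uniform weights the function reduces to an ordinary mean squared error. With the inverse-energy weights, the error in the highest bin dominates the loss even though its absolute magnitude is small, which is the intuition behind emphasizing high-frequency detail. A real implementation would operate on batched tensors rather than nested lists.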

100+ multi-track studio songs are available for download.