
RVC Voice Cloning: A Complete Guide to Open-Source AI Voice

RVC voice cloning: a complete guide to open-source AI voice, covering installation, model training, inference, and best practices for realistic results.

Quick answer

RVC (Retrieval-based Voice Conversion) voice cloning is an open-source tool that trains AI voice models from short audio samples. It can convert one voice into another while preserving intonation and emotion. It is free and runs locally on your GPU.

What RVC is and why producers use it

RVC (Retrieval-based Voice Conversion) is an open-source AI framework released in 2023[1] that converts a source voice into a target voice while preserving the original performance. Unlike text-to-speech voice generation, RVC does not create a voice from scratch: it transforms one existing voice into another. The timing, intonation, emotion, and micro-variations of your original performance all carry through to the output.

For producers, this distinction matters enormously. If you record a reference melody yourself and run it through a trained RVC voice model, the resulting audio inherits your performance dynamics: your timing, your inflections, your emotion. That is fundamentally different from typing "sing this melody" into a text-to-speech tool.

RVC's underlying technology is built on three stages: a HuBERT content encoder that strips speaker identity from the audio and extracts phonetic features, a FAISS vector index that stores the vocal characteristics of the trained target voice, and a generative network that reconstructs the audio by applying the target voice's characteristics while preserving the original phonetic content.
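To make the retrieval stage concrete, here is a minimal sketch of the idea in plain Python. It is not RVC's implementation: the real pipeline works on high-dimensional HuBERT features and searches them with FAISS, while this toy uses 2-D tuples and a brute-force nearest-neighbour search. The function names `nearest_target_feature` and `retrieve_sequence` are hypothetical and exist only for illustration.

```python
import math

def nearest_target_feature(frame, target_index):
    """Return the stored target-voice vector closest to `frame` (Euclidean distance)."""
    return min(target_index, key=lambda stored: math.dist(frame, stored))

def retrieve_sequence(content_frames, target_index):
    """Swap each content frame for its nearest neighbour from the target voice."""
    return [nearest_target_feature(frame, target_index) for frame in content_frames]

# Toy 2-D "features" (assumption: real features have far more dimensions).
target_index = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]   # stands in for the FAISS index
source_frames = [(0.9, 1.2), (0.1, -0.2), (1.8, 0.4)]  # stands in for encoder output

retrieved = retrieve_sequence(source_frames, target_index)
print(retrieved)  # [(1.0, 1.0), (0.0, 0.0), (2.0, 0.5)]
```

Each source frame keeps its position in time, which is why the timing of the performance survives, while the vector itself is replaced by one that came from the target voice's training data.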

Voice cloning sits on an active legal frontier. U.S. federal law protects fixed sound recordings but does not protect the abstract qualities of a voice: a court cannot stop someone from speaking in a certain way. But state right-of-publicity laws, contracts, and copyright in recordings create a complex landscape.

Tennessee's ELVIS Act (Ensuring Likeness Voice and Image Security), signed into law on March 21, 2024 and effective July 1, 2024, is the first state law to explicitly protect individuals against unauthorized AI use of their voice.[2]

In active litigation, Lehrman & Sage v. Lovo, Inc. has shown that training an AI model on a voice actor's recordings without authorization can support claims under right-of-publicity law and copyright law.[3]

  • Clone your own voice: Entirely safe. You own your voice and can grant yourself any use. This is the most practical path for producers building a custom voice model.
  • Clone a consenting collaborator: Legal when you have clear, documented written consent that specifies how the model will be used, in what contexts, and for how long.[4]
  • Clone a public figure or recording artist: High legal risk. Even if their recordings are commercially available, using them to train a model and distribute the results raises right-of-publicity questions and potential copyright claims. Get a license or don't do it.
  • AI covers for public release: Commercially releasing an AI cover that imitates a real artist's voice without authorization is the highest-risk use case and the subject of ongoing litigation and DMCA-based takedowns.
  • Internal demos and private experimentation: Lower risk when kept private, but right-of-publicity law in some states does not require commercial use for liability. When in doubt, use your own voice.

RVC tools: which one to use

The RVC ecosystem has several UIs and forks built on the same core algorithm. The table below covers the actively maintained options as of 2026 — do not use archived projects like So-VITS-SVC for new work, as it received no security updates after the original team archived it.

| Tool | Best for | Real-time? | Platform | Status |
|---|---|---|---|---|
| RVC WebUI (official) | Training custom models, batch inference | No | Windows / Linux | Active[8] |
| Applio | Beginner-friendly local or Colab workflow | Yes (Realtime tab) | Win / Linux / Mac | Stable, security patches only[9] |
| Ultimate RVC | Advanced: FCPE pitch, autotuning, TTS | No | Win / Ubuntu | Active[10] |
| W-Okada Voice Changer | Live streaming, real-time performance | Yes | Win / Mac / Linux | Open source, active community |
| So-VITS-SVC | Legacy singing conversion | No | Win / Linux | Archived; do not use for new projects |

Applio is the recommended starting point for most producers. It wraps RVC in a clean Gradio browser UI, includes a Voice Blender for fusing two models, a real-time conversion tab, TTS support, and integrates a library of over 20,000 pre-trained community voice models via its API.[11] Its current stable branch is v3.6.2.[9]

The official RVC WebUI from RVC-Project has over 35,000 GitHub stars and is the canonical reference implementation.[8] It supports NVIDIA CUDA, AMD GPUs via DirectML (Windows) or ROCm (Linux), and Intel ARC via IPEX.[2]

What hardware you actually need

The RVC ecosystem is more accessible than most ML tools, but there are real hardware tiers that affect your workflow.

  • Inference only (using existing models): A modern CPU and any mid-range GPU will work. The official WebUI notes the architecture runs on even modest graphics cards for inference.[2] Applio confirms: "most modern computers will work just fine" for inference.[11]
  • Training a custom model locally: Applio recommends an NVIDIA RTX 20-series GPU or newer for local training.[11] Batch size of 6–8 is appropriate for an 8 GB VRAM card.
  • Training without a GPU (Google Colab): Applio and Ultimate RVC both provide ready-made Colab notebooks that run on Google's free cloud GPUs. This is the recommended path if you don't own a qualifying NVIDIA card. Colab free tier is sufficient for datasets under 30 minutes.[12]
  • Real-time conversion: The official WebUI achieves approximately 170 ms latency under standard conditions, and around 90 ms with ASIO audio hardware.[2] Real-time use demands a capable GPU.
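To put the two latency figures in musical terms, the arithmetic below converts them to samples and compares them with a note length. The 48 kHz sample rate and the 120 BPM tempo are assumptions chosen for illustration; neither comes from the RVC documentation.

```python
def latency_in_samples(latency_ms: int, sample_rate_hz: int) -> int:
    """Number of audio samples that elapse during the given latency."""
    return latency_ms * sample_rate_hz // 1000

def sixteenth_note_ms(bpm: float) -> float:
    """Duration of one sixteenth note in milliseconds at the given tempo."""
    return 60_000 / bpm / 4

# Latency figures quoted above; 48 kHz is an assumed session sample rate.
for label, ms in (("standard", 170), ("ASIO", 90)):
    samples = latency_in_samples(ms, 48_000)
    print(f"{label}: {ms} ms = {samples} samples at 48 kHz")

# Assumed tempo of 120 BPM for comparison.
print(f"sixteenth note at 120 BPM: {sixteenth_note_ms(120)} ms")
```

Under these assumptions a sixteenth note lasts 125 ms, so 170 ms of delay is longer than one sixteenth note, while 90 ms is shorter.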

Training a voice model: step-by-step workflow

Whether you use Applio or the official WebUI, the training pipeline follows the same stages. All steps below are based on the Applio training documentation.[13]

  1. Gather and clean your audio dataset
    Record or source 10–30 minutes of clean mono audio of your target voice. Aim for zero background noise, zero reverb, and no music underneath. Lossless formats (WAV or FLAC) only.[13] The more acoustic variety in the delivery (different pitches, intensities, vowels), the more robust the model. Quality here directly determines output quality, and later steps cannot compensate for a poor dataset.
  2. Split and preprocess
    Use Applio's built-in Dataset Creator or a separate tool like UVR5 (bundled in the official WebUI[2]) to strip any music bed and isolate the voice. Slice the audio into segments, then run the Preprocess step in the UI — set your target sample rate (32k, 40k, or 48k).[13]
  3. Extract features
    Select your pitch extraction algorithm. RMVPE is the recommended choice — the official WebUI notes it provides better results and faster processing with lower resource use than older Crepe-based methods.[2] The feature extractor also builds the FAISS index from your dataset at this stage.
  4. Train the model
    Set epochs to 200–400 as a starting point.[13] Enable Save Every Epoch (every 10–50 epochs) so you can compare checkpoints and roll back if the model overtrains. Monitor loss curves in TensorBoard — stop when the validation loss plateaus, not when epochs run out. Overtraining is a common mistake: the model memorizes artifacts rather than generalizing the voice.
  5. Export and generate the FAISS index
    When training completes, export the model weights (.pth file) and generate the accompanying FAISS retrieval index file. Both files are required for high-quality inference — the index is what makes RVC sound like retrieval-based conversion rather than a raw statistical map.
  6. Run inference and evaluate
    Load the model in the Inference tab. Record a test vocal (your own voice, at a neutral pitch and tempo). Adjust the pitch shift slider to account for the register difference between source and target voice. Try several pitch extraction algorithms and compare the results. A well-trained model on clean data should produce intelligible, natural-sounding conversion, though you should expect imperfections in sibilance and extreme high notes on the first pass.
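Two of the steps above lend themselves to a quick automated check: the dataset requirements in step 1 and the "stop when the loss plateaus" rule in step 4. The sketch below is a hypothetical helper, not part of Applio or the official WebUI. It assumes PCM WAV files (Python's `wave` module does not read FLAC), and the `window` and `min_improvement` thresholds are illustrative values, not recommendations from the training documentation.

```python
import wave
from pathlib import Path

def audit_dataset(folder, min_minutes=10, max_minutes=30):
    """Report non-mono files and the total duration of the PCM WAV files in `folder`."""
    problems, total_seconds = [], 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            if wav.getnchannels() != 1:
                problems.append(f"{path.name}: not mono")
            total_seconds += wav.getnframes() / wav.getframerate()
    minutes = total_seconds / 60
    if not min_minutes <= minutes <= max_minutes:
        problems.append(f"total {minutes:.1f} min is outside {min_minutes}-{max_minutes} min")
    return minutes, problems

def has_plateaued(losses, window=3, min_improvement=0.01):
    """True when the last `window` checkpoints beat the earlier best by less than `min_improvement`."""
    if len(losses) <= window:
        return False  # not enough history to judge
    best_before = min(losses[:-window])
    best_recent = min(losses[-window:])
    return best_before - best_recent < min_improvement

# Hypothetical loss values read off saved checkpoints.
checkpoint_losses = [0.92, 0.71, 0.60, 0.55, 0.549, 0.551, 0.550]
print(has_plateaued(checkpoint_losses))  # True
```

Calling `audit_dataset("my_dataset")` on a folder of recordings returns the total minutes and a list of problems to fix before preprocessing; the folder name here is a placeholder.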

Producer use cases: what RVC is actually good for

RVC's strengths and weaknesses shape which production tasks it fits. Knowing both upfront saves frustration.

Your Own Voice Model

Training a model on your own voice is the most legally clean and practically useful application. Once trained, you can: record a rough melodic idea in a single take and convert it to a cleaner version of your voice; generate harmonies by converting the same take with a pitch shift; produce consistent backing vocals without re-recording multiple passes; and keep vocal sessions private and fully offline.

Backing Vocals and Harmonies

Feed a comped lead vocal into RVC using your own trained voice model, pitch-shift the input before conversion for harmonies, then export each harmony line. This workflow sidesteps the tonal inconsistencies of recording five separate takes in different registers. Works best when your source vocal is dry and close-mic'd — wet or reverb-heavy signals confuse the pitch extractor.
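The pitch shift for each harmony line is set in semitones, and the relationship between semitones and frequency is standard equal-temperament arithmetic. The sketch below shows that conversion; the interval names and offsets are illustrative, and a fixed offset gives a parallel harmony rather than one that follows the key of the song.

```python
import math

def semitone_ratio(semitones: float) -> float:
    """Frequency ratio produced by shifting pitch by `semitones` (equal temperament)."""
    return 2 ** (semitones / 12)

def semitones_between(source_hz: float, target_hz: float) -> float:
    """Pitch shift, in semitones, that moves source_hz to target_hz."""
    return 12 * math.log2(target_hz / source_hz)

# Illustrative harmony offsets relative to the lead vocal.
harmony_offsets = {"major third up": 4, "perfect fifth up": 7, "octave down": -12}
for name, offset in harmony_offsets.items():
    print(f"{name}: {offset:+d} semitones, frequency x{semitone_ratio(offset):.3f}")

# A lead centred on 220 Hz converted to a voice centred on 440 Hz needs +12 semitones.
print(round(semitones_between(220.0, 440.0)))  # 12
```

The same `semitones_between` calculation gives a starting value for the pitch shift slider when the source and target voices sit in different registers.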

AI Covers and Demo Sketches (Private Use)

Producers sometimes use AI covers as reference sketches when pitching an arrangement to an artist — you demonstrate how a melody sits on the beat by converting it through an approximation of the target artist's vocal style. Keep these strictly internal, never upload to streaming or YouTube, and treat them as internal working files the same way you would handle an uncleared sample.

Quality and Realism Expectations

On a dataset of 20+ minutes of high-quality clean audio, RVC can produce conversion output that is convincing at a listening distance — meaning in a mix with other elements, the seams are not obvious. Up close or soloed, trained listeners will notice tonal artifacts, particularly in fast passages and extreme registers. RVC is not a replacement for a live vocal performance in a commercial release context; it is a fast prototyping and creative tool.

Getting the best output quality

Technical decisions at each stage have a compounding effect on the final output. The following practices have the most impact:

  • Source audio quality is the ceiling: RVC cannot create information that wasn't in the training data. Noisy, reverberant, or compressed training audio produces noisy, reverberant output. Record in a quiet treated space and use a clean preamp chain, because the model inherits every artifact in the dataset.
  • Pitch extraction algorithm matters: Use RMVPE for singing and melodic content. It handles vibrato and sustained notes more cleanly than older algorithms.[2] FCPE (available in Ultimate RVC) is worth testing on speech-heavy conversion.
  • Index ratio tuning: The FAISS index ratio (often labeled Feature Retrieval Ratio in the UI) controls how strongly the model pulls from your training data versus the base model. Higher values increase target voice fidelity but can introduce dataset artifacts. Start at 0.5–0.75 and tune by ear.
  • Post-processing in your DAW: RVC output almost always benefits from de-essing, high-pass filtering below 80 Hz, and gentle saturation to add presence. Treat it like any other vocal stem: it needs a chain. See how to mix vocals for a complete vocal chain walkthrough.
  • Applio's Voice Blender for character: The Voice Blender in Applio lets you interpolate between two trained models, creating a hybrid voice. This is useful for creating a custom backing-vocal character that sits differently from your lead, even when both are based on your own voice recordings.
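One way to build intuition for the index ratio is to treat it as a linear interpolation weight between the features extracted from your input and the features retrieved from the index. That reading is an assumption made for this sketch, and `blend_features` is a hypothetical function, not a call from Applio or the official WebUI.

```python
def blend_features(source, retrieved, index_ratio):
    """Linear mix: 1.0 keeps only the retrieved target features, 0.0 only the source features."""
    if not 0.0 <= index_ratio <= 1.0:
        raise ValueError("index_ratio must be between 0 and 1")
    return [index_ratio * r + (1.0 - index_ratio) * s for s, r in zip(source, retrieved)]

# Toy two-value feature frames (assumption: real frames are much larger).
source_frame = [0.0, 4.0]     # features from the input performance
retrieved_frame = [2.0, 0.0]  # nearest match from the training data

for ratio in (0.5, 0.75):
    print(ratio, blend_features(source_frame, retrieved_frame, ratio))
# 0.5 [1.0, 2.0]
# 0.75 [1.5, 1.0]
```

At 0.75 the result sits closer to the retrieved frame, which matches the trade-off described above: more of the target voice's character, and more of whatever artifacts the dataset contains.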

Quick-start decision map

Where to start depends on your hardware and your goal:

| Your situation | Recommended path |
|---|---|
| No qualifying GPU, want to try RVC now | Run Applio on Google Colab (free tier, no local setup)[12] |
| NVIDIA RTX 20-series or newer, want full control | Install Applio locally, train on your own voice data[13] |
| Want to try inference only with existing models | Use any modern computer; Applio inference is not GPU-dependent[11] |
| Need real-time conversion in a live stream or DAW | Applio Realtime tab or W-Okada Voice Changer with a dedicated GPU |
| Advanced user, want cutting-edge pitch extraction | Ultimate RVC with FCPE pitch extractor on Linux or Windows[10] |

Frequently asked questions

Is voice cloning with RVC legal?
It depends entirely on whose voice you clone. Cloning your own voice is legal. Cloning another person's voice without their explicit written consent carries legal risk under right-of-publicity law in most U.S. states, and under Tennessee's ELVIS Act, even non-commercial unauthorized voice replication can trigger civil and criminal liability.[4] Get written consent that specifies use case, territory, and duration before training on anyone else's voice.
Can I clone my own voice with RVC?
Yes, and this is the recommended use case. Record 10–30 minutes of clean, dry audio in a quiet space[13], train a model on Applio or the official RVC WebUI, and you have a reusable voice model you legally own. Producers use own-voice models for backing vocals, harmonies, and demo sketches.
Do I need a GPU to use RVC?
For inference (using an existing trained model), a modern CPU is sufficient, and most computers can run it. For training your own model, an NVIDIA RTX 20-series GPU or newer is recommended for local training.[11] Without one, use Google Colab: both Applio and Ultimate RVC provide free cloud notebooks that run on Google's GPU infrastructure.
How much audio do I need to train an RVC voice model?
The official RVC WebUI states that training is feasible with as little as 10 minutes of clean audio.[2] Applio's training guide recommends 10–30 minutes for a quality result.[13] Audio must be low-noise, dry (no reverb), and free of background music.
What is the difference between RVC WebUI and Applio?
The official RVC WebUI from RVC-Project is the canonical implementation: it exposes the full technical parameter set and supports the widest range of GPU types.[8] Applio is a fork built on RVC technology that adds a cleaner UI, real-time conversion, Voice Blender, TTS support, and access to a large community model library.[11] For most producers starting out, Applio is the better first choice.
Can I release music commercially with an RVC-generated voice?
If the voice model is trained on your own voice, yes: you own the output and can release it commercially. If the model is trained on another person's voice, you need that person's documented consent covering commercial release, and you may still need to clear underlying rights. Releasing an AI cover that imitates a real recording artist's voice without authorization is the highest-risk scenario and is the subject of active litigation and platform takedowns.[3]
How does RVC compare with ElevenLabs or other cloud voice cloning services?
RVC is a local, open-source, speech-to-speech converter: it needs an existing audio performance to convert, not text. ElevenLabs and similar services are primarily text-to-speech and handle the synthesis end-to-end in the cloud. RVC gives more control over the source performance and runs entirely offline with no subscription cost, but requires more technical setup and a GPU for training.