Voice Conversion Survey #2789

yt605155624 · 2022-12-30T10:56:04Z

yt605155624
Dec 30, 2022
Collaborator

1 Basic Concepts

1.1 By whether speech features are disentangled (e.g. disentangling timbre/content/pitch/rhythm)

Feature Disentangle

text-based VC, e.g. PPG-based VC (the PPG comes from an ASR model, which needs text to train)
text-free VC, e.g. information bottleneck, vector quantization, instance normalization, SSL (because training an SSL model requires no text)

Direct Transformation

CycleGAN family
StarGAN family

1.2 By the number of source and target speakers a single VC system supports

one-to-one

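Requires parallel data for one source-target pair, aligned with an algorithm such as DTW; the converted speech then has the same duration as the source speech. Alternatively, use a seq2seq model, which needs no explicit alignment, or CycleGAN-VC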
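To make the DTW alignment step concrete, here is a minimal sketch in plain Python. It assumes each utterance is a list of frame feature vectors (for example MFCC frames) and uses Euclidean frame distance; the function names `dtw_path` and `align_target_to_source` are illustrative and do not come from any of the systems listed here.

```python
import math


def dtw_path(src, tgt):
    """Return (total_cost, path) for two sequences of feature frames.

    path is a list of (src_index, tgt_index) pairs from the first frames
    to the last frames of both sequences.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost of aligning src[:i] with tgt[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(src[i - 1], tgt[j - 1])  # Euclidean frame distance
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Backtrack from the last cell to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path


def align_target_to_source(src, tgt):
    """Pick one target frame per source frame, so the pair has equal length."""
    _, path = dtw_path(src, tgt)
    matched = {}
    for i, j in path:
        matched[i] = j  # keep the last target frame matched to source frame i
    return [tgt[matched[i]] for i in range(len(src))]


src = [[0.0], [1.0], [2.0]]
tgt = [[0.0], [0.0], [1.0], [2.0]]
cost, path = dtw_path(src, tgt)
aligned = align_target_to_source(src, tgt)
```

After alignment, `aligned` has one target frame per source frame, which is why a frame-wise mapping trained on such pairs produces output with the same duration as the source.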

many-to-one

Usually uses PPG: concatenate a PPG extractor with a target-speaker-dependent PPG-to-acoustic synthesis model

many-to-many

text based
- PPG
- non-parallel seq2seq
text free
- autoencoder, VAE, GAN, etc.

any-to-many

A generalization of many-to-many; PPG-based approaches generalize well to unseen source speakers

any-to-any (one-shot / free shot VC)

PPG any-to-many + a speaker encoder that extracts the speaker embedding
one-shot VC

FreeVC describes text-based VC and text-free VC as follows:

A popular text-based VC approach is to use an automatic speech recognition (ASR) model to extract phonetic posteriorgram (PPG) as content representation
Typical text-free approaches include information bottleneck, vector quantization, instance normalization, etc.
However, text-free generally lags behind text-based approaches. This can be attributed to the fact that the content information they extract is more easily to have source speaker information leaked in.

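2 References

ByteDance commercial service
- https://www.volcengine.com/product/Voice-conversion
- "Conan's bow tie" becomes reality: ByteDance's SAMI speech team releases a next-generation real-time AI voice-changing solution -> an any-to-many scenario
http://yqli.tech/page/tts_paper.html
11514 A survey of voice conversion technology, by 王赟 Maigo
voice change, voice conversion, by 倦鸟余花
Voice conversion survey: An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning
Open-source voice conversion list: https://paperswithcode.com/task/voice-conversion/latest
Introduction to the phonetic posteriorgram (PPG) feature
The most complete survey of audio pre-training on the web
Need to check whether wav2vec2 and vq-wav2vec are both speaker-dependent, and whether one can be a drop-in replacement for the other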

3 Feature Disentangle

3.1 Text-Free Voice Conversion

VITS-VC

[27 Oct 2022] FreeVC

Text-Free One-Shot
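Code: https://github.com/OlaWod/FreeVC
Based on VITS; needs WavLM to extract SSL features plus a speaker encoder, which can be either pretrained or non-pretrained. The non-pretrained speaker encoder is not much worse, but for unseen-to-unseen conversion the pretrained speaker encoder works better
Proposes spectrogram-resize (SR) based data augmentation
WavLM is part of the Prior Encoder, but the model is pretrained
To be confirmed
- The paper says "SSL feature xssl containing both content information and speaker information"; does PPG also contain speaker information?
- Can ECAPA-TDNN be used as the speaker encoder?
- Need to determine whether WavLM takes part in backpropagation and gradient updates -> It does not: features are extracted with WavLM ahead of time and used as the training input, and WavLM is also needed to extract features at inference
- Why does it call itself text-free even though it uses WavLM? What is the difference between WavLM and PPG, and why is PPG text-based? -> Because PPG requires labeled data to train an ASR model, whereas WavLM is an SSL model and needs no text labels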
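The spectrogram-resize (SR) augmentation mentioned above can be sketched roughly as follows. This is only an illustration of the idea as I understand it from the FreeVC paper: stretch or squeeze the mel-spectrogram along the frequency axis, then cut the extra bins at the top (ratio > 1) or pad the top with the highest-frequency bin value plus Gaussian noise (ratio < 1). The function names, the linear interpolation, and the `noise_std` default are assumptions, not the authors' implementation; the paper additionally resynthesizes a waveform from the modified spectrogram with a vocoder, which is omitted here.

```python
import random


def resize_bins(frame, new_len):
    """Linearly interpolate one frame's frequency bins to new_len bins."""
    old_len = len(frame)
    if new_len == 1:
        return [frame[0]]
    out = []
    for k in range(new_len):
        pos = k * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append(frame[lo] * (1 - frac) + frame[hi] * frac)
    return out


def spectrogram_resize(mel, ratio, noise_std=0.01, rng=None):
    """Vertical SR on mel, a list of frames ordered low to high frequency.

    The output keeps the original number of bins per frame.
    """
    rng = rng or random.Random(0)
    n_bins = len(mel[0])
    new_len = max(1, round(n_bins * ratio))
    out = []
    for frame in mel:
        resized = resize_bins(frame, new_len)
        if new_len >= n_bins:
            resized = resized[:n_bins]  # cut redundant bins at the top
        else:
            top = resized[-1]  # highest-frequency bin value after squeezing
            padding = [top + rng.gauss(0.0, noise_std) for _ in range(n_bins - new_len)]
            resized = resized + padding
        out.append(resized)
    return out


mel = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
stretched = spectrogram_resize(mel, 1.5)
squeezed = spectrogram_resize(mel, 0.5)
```

The intent of the augmentation is to distort speaker-related cues (pitch, formant spacing) while leaving the content intact, so the content path cannot rely on them.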

[2 Jun 2021] NVC-Net

One-Shot
Code: https://github.com/sony/ai-research-code/tree/master/nvcnet
Similar to VITS in that it operates directly on the waveform; training takes 4 days on 4 V100 GPUs

[Interspeech 2022 18 Aug 2022] SRD-VC

One-Shot
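Code: https://github.com/YoungSeng/SRD-VC (67 stars) ✅
Mainly adapted from SpeechSplit
First author's CSDN blog: https://blog.csdn.net/qq_41897800/article/details/122616675
Mainly borrows SpeechSplit's pitch extraction pipeline (need to check that model's training time), which requires speaker gender information; the Mel extractor and vocoder come from AutoVC
The proposed model outperforms all the baseline models in terms of speech naturalness, and has a comparable performance with VQMIVC in terms of speaker similarity.
Considers timbre, pitch, and rhythm together. Because pitch and rhythm are content-related, the model is designed to keep pitch and rhythm from leaking into timbre, so that each representation can be synthesized controllably; at inference, only the timbre information of the target audio is used
Tang huaizhen, the author of ClsVC, is the 8th author of SRD-VC
No training procedure is given for 640000-P.ckpt, so the work cannot be reproduced
The SRD-VC authors suggest reproducing ClsVC and VQMIVC instead, because they are simpler than SRD-VC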

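[ICASSP 2022] SpeechSplit2

any-to-many; a speech feature disentanglement method
Code: https://github.com/biggytruck/SpeechSplit2 (84 stars)
It was posted to arXiv only after the SRD-VC paper came out, so SRD-VC references only SpeechSplit
Uses efficient signal processing rather than bottleneck tuning to constrain the information flow of each speech component at the autoencoder input
Both G and F can be trained

[PMLR 2020] SpeechSplit

any-to-many; a speech feature disentanglement method, and the basis of SRD-VC
Code: https://github.com/auspicious3000/SpeechSplit (481 stars)
By the AutoVC team and derived from AutoVC, although AutoVC can only convert timbre; the speaker embedding is represented as a one-hot vector (so no extra speaker encoder is needed?)
From the SRD-VC author: it is not designed for one-shot conversion (because timbre is represented as one-hot?), and relying on the bottleneck alone will not work very well, which SpeechSplit 2 also points out
The spectrogram changes in the demo clearly show the effect of each component (timbre, pitch, rhythm), which is very interesting
It is unclear what 640000-P.ckpt does or how it was trained, so SRD-VC may also be hard to reproduce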

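[31 Mar 2022] DYGANVC

many-to-many
Code: https://github.com/MingjieChen/DYGANVC (56 stars) ✅
WadaIN + AdaSpeech + VQWav2vec (fairseq) + an extra speaker_encoder
The VQWav2vec features should in principle carry the speech content information, and that information should not change with the speaker
The discriminator is the one from StarGANv2-VC
An intern candidate said in an interview that BNE-PPG-VC performs better than this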

[29 Sep 2021] ClsVC

One-Shot and many-to-many
Code: in the attachments at https://openreview.net/forum?id=xp2D-1PtLc5

[Interspeech 2021] VQMIVC

One-Shot
Code: https://github.com/Wendison/VQMIVC (259 stars)
The paper is almost entirely formulas and fairly hard to follow
Content encoder, Speaker encoder, Pitch extractor, and Decoder
The content encoder is borrowed from VectorQuantizedCPC, which also inspires the negative sampling within-utterance for CPC;
The speaker encoder is borrowed from AdaIN-VC;
The decoder is modified from AutoVC;
Estimation of mutual information is modified from CLUB;
Speech features extraction is based on espnet and Pyworld.
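As a rough illustration of the vector quantization step that VQ-based content encoders rely on, the sketch below maps each frame to its nearest codebook entry. It shows the generic nearest-neighbor lookup only; the function name `quantize`, the Euclidean distance, and the toy codebook are assumptions for illustration, and the CPC loss, mutual-information terms, and straight-through gradient used in training are not shown.

```python
import math


def quantize(frames, codebook):
    """Replace each frame with its nearest codebook vector.

    Returns (indices, quantized_frames).
    """
    indices = []
    quantized = []
    for frame in frames:
        # Index of the codebook entry with the smallest Euclidean distance.
        idx = min(range(len(codebook)), key=lambda k: math.dist(frame, codebook[k]))
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.4]]
indices, quantized = quantize(frames, codebook)
```

Because every frame collapses onto one of a small number of codes, fine speaker detail is discarded; this is also why a discrete bottleneck can hurt content, as noted for VQVC+ below.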

cascaded ASR+TTS

many-to-one
Baseline system for Voice Conversion Challenge 2020
Code: https://github.com/espnet/espnet
cascaded ASR+TTS
- Code: https://github.com/espnet/espnet/tree/master/egs/vcc20
- Autoregressive ASR plus autoregressive TTS, which is very inefficient
- Supports only many-to-one conversion
- Transformer ASR + multi-speaker, x-vector Transformer-TTS model
Transformer and Tacotron2 based parallel VC using melspectrogram (new!)

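[Interspeech 2020] SkipVQVC (VQVC+)

One-Shot
Code: https://github.com/ericwudayi/SkipVQVC
Hung-yi Lee's group
Performs well at speaker conversion, but the discrete nature of VQ severely damages the content information (as noted in the AGAIN-VC paper)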
VQVC + UNet

[31 Oct 2020] AGAIN-VC

One-Shot
Code: https://github.com/KimythAnly/AGAIN-VC
Hung-yi Lee's group; analyzes the strengths and weaknesses of AutoVC, AdaIN-VC, and VQVC+
Uses a single encoder to disentangle speaker and content
activation guidance

[27 Oct 2020] FragmentVC

One-Shot
Code: https://github.com/yistLin/FragmentVC (160 stars)
Hung-yi Lee's group
source encoder, target encoder, and decoder, with a pretrained wav2vec2 as the source encoder
attention based + UNet

[ICML 2019 14 May 2019] AutoVC

One-Shot
Code: https://github.com/auspicious3000/autovc
a careful bottleneck design is all you need
Requires an already pretrained speaker encoder

[Interspeech 2019 10 Apr 2019] AdaIN-VC

One-Shot
Code: https://github.com/jjery2243542/adaptive_voice_conversion
Hung-yi Lee's group
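The instance-normalization idea behind AdaIN-style VC can be sketched as below: normalizing each channel over time removes utterance-level statistics (which tend to carry speaker information), and adaptive instance normalization then re-applies a scale and shift that would come from the target speaker. This is a generic plain-Python sketch; where `gamma` and `beta` come from (for example a projection of a speaker encoder output) is an assumption, and the names are illustrative rather than taken from the AdaIN-VC code.

```python
import math


def instance_norm(feat, eps=1e-5):
    """Normalize each channel of feat (list of channels over time) to zero mean, unit variance."""
    out = []
    for channel in feat:
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        out.append([(x - mean) / std for x in channel])
    return out


def adain(feat, gamma, beta, eps=1e-5):
    """Instance-normalize feat, then apply per-channel scale gamma and shift beta."""
    normalized = instance_norm(feat, eps)
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(normalized, gamma, beta)
    ]


content = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]
# gamma and beta are hypothetical target-speaker parameters.
converted = adain(content, gamma=[2.0, 1.0], beta=[5.0, -1.0])
```

After `adain`, each channel has mean `beta` and standard deviation close to `gamma`, regardless of the source utterance's own statistics.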

3.2 Text-based Voice Conversion (PPG-based VC)

[12 Oct 2021] S3PRL-VC

any-to-one and any-to-any (One-Shot)
Code: https://github.com/s3prl/s3prl
No demo audio is provided

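[TASLP 2021] BNE-PPG-VC

any-to-many, any-to-any
Code: https://github.com/liusongxiang/ppg-vc
Representing speaker identity with a one-hot vector gives any-to-many; extending it with a speaker encoder and speaker embedding (GE2E) gives any-to-any
Seq2seqPR-DurIAN (an improved cascaded ASR + TTS scheme) and BNE-Seq2seqMoL VC
Judging from the demo, the any-to-many timbre is not very close to the target, and the any-to-any audio quality is worse than any-to-many; comparing same-language samples, FastSpeech VC sounds somewhat closer in timbre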
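The difference between the any-to-many and any-to-any setups can be illustrated with how the speaker is fed to the synthesis model. The sketch below assumes the speaker vector is simply concatenated to every content frame, which is one common conditioning scheme and not necessarily what BNE-PPG-VC does; `one_hot`, `condition_frames`, and the toy embedding values are hypothetical.

```python
def one_hot(speaker_id, n_speakers):
    """One-hot speaker vector: only speakers seen in training can be targets."""
    if not 0 <= speaker_id < n_speakers:
        raise ValueError("speaker_id outside the training speaker set")
    return [1.0 if k == speaker_id else 0.0 for k in range(n_speakers)]


def condition_frames(content_frames, speaker_vector):
    """Append the same utterance-level speaker vector to every content frame."""
    return [list(frame) + list(speaker_vector) for frame in content_frames]


ppg = [[0.2, 0.8], [0.5, 0.5]]

# any-to-many: the target is an index into the training speaker set.
many = condition_frames(ppg, one_hot(1, 3))

# any-to-any: the target is an embedding computed from reference audio by a
# speaker encoder (the values below are placeholders, not real embeddings).
unseen_speaker_embedding = [0.12, -0.4, 0.9]
any_to_any = condition_frames(ppg, unseen_speaker_embedding)
```

With a one-hot vector there is no way to describe a speaker outside the training set, whereas an embedding from a speaker encoder can be computed for any reference utterance.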

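[ICASSP 2021 3 Feb 2021] FastSpeech VC

any-to-many
- A separate FastSpeech2 must be trained for each target voice, and the target must have been seen in the training set; alternatively, train one multi-speaker FastSpeech (speakers identified by spk_id)
demo: https://alibabasglab.github.io/vc/ sounds good, but it requires training a FastSpeech2 for the target voice; the any-to-any quality is unknown
In FastSpeech VC, the duration predictor and the length regulator are removed from the original FastSpeech network, so that the input PPG sequence and the output LPCNet feature sequence have the same length.

Q:
1. Was the PPG extractor trained on a Chinese or an English dataset, and is it phoneme-level or character-level?
2. Was the Log-F0 feature used, and which tool extracted it?
3. Are the acoustic parameters used for PPG extraction (fs, hop_size, windows_size) the same as the TTS acoustic parameters, and was any length mapping applied (since the output mel length must match the PPG length)?
A:
1. The English scenario used an English ASR at phoneme level. Using a hidden state from the middle of the model would actually work better
2. Pitch; it seems the log was not taken, although it should have been; extracted with pyworld
3. They are the same: the ASR and TTS share the same hop_size, so the frame counts match. Both are either 10 ms or 12 ms, I do not remember which. If they differed, interpolating to align the lengths would be enough; fs and windows_size matter little, what matters is the hop size in time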
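The interpolation mentioned in answer 3 could look like the sketch below, which linearly resamples a frame sequence to a given number of frames. It is a plain-Python illustration under the assumption that the PPG is a list of per-frame vectors; the function name `resample_frames` and the hop sizes in the example are hypothetical.

```python
def resample_frames(frames, target_len):
    """Linearly interpolate a sequence of feature frames to target_len frames."""
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    src_len = len(frames)
    if src_len == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for t in range(target_len):
        # Position of output frame t on the source frame axis.
        pos = t * (src_len - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        frac = pos - lo
        out.append([a * (1 - frac) + b * frac for a, b in zip(frames[lo], frames[hi])])
    return out


# Example: PPG extracted with a 10 ms hop, acoustic features with a 20 ms hop
# (hypothetical values), so 5 PPG frames map to 3 acoustic frames.
ppg = [[1.0, 0.0], [0.75, 0.25], [0.5, 0.5], [0.25, 0.75], [0.0, 1.0]]
resampled = resample_frames(ppg, 3)
```

Each interpolated frame is a convex combination of two neighboring frames, so if the input rows are probability vectors the output rows still sum to 1.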

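[INTERSPEECH 2020 16 Oct 2020] Tacotron2 VC

any-to-many
First use Tacotron2-VC to generate English data for Chinese speakers and Chinese data for English speakers, then build code-switched cross-lingual speech synthesis with Tacotron2/TransformerTTS/FastSpeech + LPCNet respectively; the pronunciation lexicon simply concatenates the Chinese and English phoneme sets
The input of Tacotron2-VC is frame-level MFCC + LogF0
The feature we currently use for ASR is LogFbank rather than MFCC
In our code, we recommend using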

PaddleSpeech/paddlespeech/s2t/models/u2/u2.py

Line 737 in 96d76c8

decoder_out = paddle.nn.functional.log_softmax(decoder_out, axis=-1)

as the PPG, and do not recommend using

PaddleSpeech/paddlespeech/s2t/models/u2/u2.py

Line 379 in 96d76c8

ctc_probs = self.ctc.log_softmax(encoder_out) # (B, maxlen, vocab_size)

from @goat
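Both lines referenced above apply a log-softmax over the vocabulary axis, so each frame becomes a vector of log posterior probabilities. The sketch below reproduces that operation in plain Python for a single utterance, only to show what the resulting posteriorgram looks like; it is not the PaddleSpeech implementation, and `log_softmax` and `posteriorgram` here are local helper names.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one frame's logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def posteriorgram(frame_logits):
    """Apply log-softmax to every frame: a (frames, vocab_size) log posteriorgram."""
    return [log_softmax(frame) for frame in frame_logits]


# Two frames over a toy vocabulary of size 2.
ppg = posteriorgram([[0.0, 0.0], [1000.0, 0.0]])
```

Exponentiating any row of `ppg` gives probabilities that sum to 1, which is the property a downstream PPG-to-acoustic model usually relies on.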

4 Direct Transformation

[Interspeech 2021 Best Paper Award] StarGANv2-VC

non-parallel many-to-many
- source_speaker and target_speaker are each a list
Code: https://github.com/yl4579/StarGANv2-VC (278 stars)
CycleGAN VC and StarGAN VC can only convert speakers seen in the training set; CycleGAN VC is one-to-one, StarGAN VC is many-to-many
Requires pretrained F0 and ASR models (used to compute losses; the ASR module and the F0 model presumably receive no gradient backpropagation, see https://github.com/yl4579/StarGANv2-VC/blob/7ecbd68e47a59b48a5ea0d230250885e93457570/train.py#L85)
Recommended by 倦鸟余花
The dataset is VCTK, but training itself does not require paired data

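5 Listening Conclusions from Audio Demos

Judging from the SRD-VC demo:
SRDVC > ClsVC (clearer, but the intonation does not follow the source) > VQMIVC > SkipVQVC (slurred articulation) > AutoVC (hoarse) > AdaIN-VC (strained phonation)
VQMIVC is the closest in timbre similarity

5.1 One-Shot

Judging from the AGAIN-VC demo
Not very natural
Judging from the DYGANVC demo
DYGAN-VC sounds good > cascaded ASR+TTS
Judging from the AutoVC demo
AutoVC > StarGAN VC
Comparison of LIMIVC and FragmentVC
LIMIVC () > FragmentVC (hoarse)
Assessment by 良杰
SRDVC > LIMIVC > FragmentVC

Reproduce the best of SRD-VC, DYGANVC, StarGANv2-VC, VQMIVC, and LIMIVC

5.2 text-based

Judging from the BNE-PPG-VC demo
The any-to-many timbre is not very close to the target, and the any-to-any audio quality is worse than any-to-many
Judging from the FastSpeech VC demo
On same-language samples, the FastSpeech VC timbre is somewhat closer than BNE-PPG-VC

VITS-VC -> this is a GlowTTS capability
PR:
#2268
Related details:

TODO

How VC is implemented in VITS
Compare the quality of FreeVC and FastSpeech VC

Choose the best among FreeVC, FastSpeech VC, and VQMIVC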
