
Finetune CosyVoice2 Flow Module #784

Open
OswaldoBornemann opened this issue Dec 25, 2024 · 5 comments


@OswaldoBornemann

I noticed that the finetuning code for the CosyVoice2 flow module has still not been released, so I tried writing a forward function inside the CausalMaskedDiffWithXvec class myself:

# module-level imports assumed: random, torch, torch.nn.functional as F,
# Dict and Optional from typing, and the make_pad_mask helper from the repo
def forward(
    self,
    batch: dict,
    device: torch.device,
) -> Dict[str, Optional[torch.Tensor]]:
    token = batch['speech_token'].to(device)
    token_len = batch['speech_token_len'].to(device)      # [B]
    feat = batch['speech_feat'].to(device)                # [B, T_feat, 80]
    feat_len = batch['speech_feat_len'].to(device)        # [B]
    embedding = batch['embedding'].to(device)

    # xvec projection
    embedding = F.normalize(embedding, dim=1)
    embedding = self.spk_embed_affine_layer(embedding)

    # embed speech tokens and zero out the padded positions
    mask = (~make_pad_mask(token_len)).float().unsqueeze(-1).to(device)
    token = self.input_embedding(torch.clamp(token, min=0)) * mask

    # encode tokens, project to mel dimension, and upsample to the feature length
    h, h_lengths = self.encoder(token, token_len)
    h = self.encoder_proj(h)
    h, h_lengths = self.length_regulator(h, feat_len)

    # get conditions: with probability 0.5, keep a random prefix (up to 30%)
    # of the target mel as the prompt condition
    conds = torch.zeros(feat.shape, device=token.device)
    for i, j in enumerate(feat_len):
        if random.random() < 0.5:
            continue
        index = random.randint(0, int(0.3 * j))
        conds[i, :index] = feat[i, :index]
    conds = conds.transpose(1, 2)

    # align the target features with the regulated encoder output length
    mask = (~make_pad_mask(feat_len)).to(h)
    feat = F.interpolate(feat.unsqueeze(dim=1), size=h.shape[1:], mode="nearest").squeeze(dim=1)
    loss, _ = self.decoder.compute_loss(
        feat.transpose(1, 2).contiguous(),
        mask.unsqueeze(1),
        h.transpose(1, 2).contiguous(),
        embedding,
        cond=conds
    )
    return {'loss': loss}

Most of this code mirrors the MaskedDiffWithXvec class. I fine-tuned the pretrained flow model on my own dataset, and the training loss looks like this:

2024-12-25 15:28:38,069 DEBUG TRAIN Batch 0/2200 loss 0.249451 lr 0.00004404 grad_norm 2.147710 rank 1
2024-12-25 15:29:02,695 DEBUG TRAIN Batch 0/2300 loss 0.386747 lr 0.00004604 grad_norm 1.991988 rank 1
2024-12-25 15:29:02,698 DEBUG TRAIN Batch 0/2300 loss 0.425706 lr 0.00004604 grad_norm 1.991988 rank 0
2024-12-25 15:29:26,829 DEBUG TRAIN Batch 0/2400 loss 0.277764 lr 0.00004804 grad_norm 2.968434 rank 0
2024-12-25 15:29:26,846 DEBUG TRAIN Batch 0/2400 loss 0.462326 lr 0.00004804 grad_norm 2.968434 rank 1
2024-12-25 15:29:50,665 DEBUG TRAIN Batch 0/2500 loss 0.260378 lr 0.00005004 grad_norm 1.854653 rank 0
2024-12-25 15:29:50,677 DEBUG TRAIN Batch 0/2500 loss 0.628877 lr 0.00005004 grad_norm 1.854653 rank 1

Is this loss behavior normal?

@aluminumbox
Collaborator

That is not the right way to do it; you can wait for our official implementation.

@OswaldoBornemann
Author

How much difference would there be? Because after fine-tuning on my side, it does seem to work.

@aluminumbox
Collaborator

The difference is dynamic chunk training. If you only care about non-streaming inference, then this is fine.
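
(For context on the reply above: "dynamic chunk" means the encoder is trained with a chunk-causal attention mask whose chunk size is re-sampled during training, so the same model can later run at different streaming latencies. Below is a minimal illustrative sketch of such a mask, assuming a WeNet-style scheme; dynamic_chunk_mask is a hypothetical helper written for illustration, not a CosyVoice API.)

import random

import torch


def dynamic_chunk_mask(seq_len: int, max_chunk: int = 25) -> torch.Tensor:
    # Sample a chunk size for this training step; over many steps the
    # encoder sees many chunk sizes, which is what makes streaming work.
    chunk = random.randint(1, max_chunk)
    idx = torch.arange(seq_len)
    chunk_idx = idx // chunk
    # Position i may attend to position j only if j's chunk does not come
    # after i's chunk, i.e. attention is causal at chunk granularity.
    return chunk_idx.unsqueeze(1) >= chunk_idx.unsqueeze(0)


# Example: an 8-frame sequence with chunk sizes up to 4.
print(dynamic_chunk_mask(8, max_chunk=4).int())

The plain forward posted above never applies such a mask, so it only covers the non-streaming case, which is the gap the collaborator is pointing at.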

@OswaldoBornemann
Author

I plan to use this fine-tuned flow model for streaming inference later on.

@shenkunlovecoding

> That is not the right way to do it; you can wait for our official implementation.

When will that be released?
