
Finetune CosyVoice2 Flow Module #784

Open
OswaldoBornemann opened this issue Dec 25, 2024 · 5 comments


@OswaldoBornemann

I noticed that the finetuning code for the CosyVoice2 flow module has still not been released, so I tried writing a forward function inside the CausalMaskedDiffWithXvec class myself:

# module-level imports assumed: random, torch, torch.nn.functional as F,
# Dict and Optional from typing, and the make_pad_mask helper from the repo
def forward(
    self,
    batch: dict,
    device: torch.device,
) -> Dict[str, Optional[torch.Tensor]]:
    token = batch['speech_token'].to(device)
    token_len = batch['speech_token_len'].to(device)      # [B]
    feat = batch['speech_feat'].to(device)                # [B, T_feat, 80]
    feat_len = batch['speech_feat_len'].to(device)        # [B]
    embedding = batch['embedding'].to(device)

    # xvec projection
    embedding = F.normalize(embedding, dim=1)
    embedding = self.spk_embed_affine_layer(embedding)

    # embed speech tokens and zero out the padded positions
    mask = (~make_pad_mask(token_len)).float().unsqueeze(-1).to(device)
    token = self.input_embedding(torch.clamp(token, min=0)) * mask

    # encode tokens, project to mel dimension, and upsample to the feature length
    h, h_lengths = self.encoder(token, token_len)
    h = self.encoder_proj(h)
    h, h_lengths = self.length_regulator(h, feat_len)

    # get conditions: with probability 0.5, keep a random prefix (up to 30%)
    # of the target mel as the prompt condition
    conds = torch.zeros(feat.shape, device=token.device)
    for i, j in enumerate(feat_len):
        if random.random() < 0.5:
            continue
        index = random.randint(0, int(0.3 * j))
        conds[i, :index] = feat[i, :index]
    conds = conds.transpose(1, 2)

    # align the target features with the regulated encoder output length
    mask = (~make_pad_mask(feat_len)).to(h)
    feat = F.interpolate(feat.unsqueeze(dim=1), size=h.shape[1:], mode="nearest").squeeze(dim=1)
    loss, _ = self.decoder.compute_loss(
        feat.transpose(1, 2).contiguous(),
        mask.unsqueeze(1),
        h.transpose(1, 2).contiguous(),
        embedding,
        cond=conds
    )
    return {'loss': loss}

Most of this code mirrors the MaskedDiffWithXvec class. I fine-tuned the pretrained flow model on my own dataset, and the training loss looks like this:

2024-12-25 15:28:38,069 DEBUG TRAIN Batch 0/2200 loss 0.249451 lr 0.00004404 grad_norm 2.147710 rank 1
2024-12-25 15:29:02,695 DEBUG TRAIN Batch 0/2300 loss 0.386747 lr 0.00004604 grad_norm 1.991988 rank 1
2024-12-25 15:29:02,698 DEBUG TRAIN Batch 0/2300 loss 0.425706 lr 0.00004604 grad_norm 1.991988 rank 0
2024-12-25 15:29:26,829 DEBUG TRAIN Batch 0/2400 loss 0.277764 lr 0.00004804 grad_norm 2.968434 rank 0
2024-12-25 15:29:26,846 DEBUG TRAIN Batch 0/2400 loss 0.462326 lr 0.00004804 grad_norm 2.968434 rank 1
2024-12-25 15:29:50,665 DEBUG TRAIN Batch 0/2500 loss 0.260378 lr 0.00005004 grad_norm 1.854653 rank 0
2024-12-25 15:29:50,677 DEBUG TRAIN Batch 0/2500 loss 0.628877 lr 0.00005004 grad_norm 1.854653 rank 1

Is this loss behavior normal?

@aluminumbox
Collaborator

That is not the right way to do it; you can wait for our official implementation.

@OswaldoBornemann
Author

How much difference would there be? Because after fine-tuning on my side, it does seem to work.

@aluminumbox
Collaborator

The difference is dynamic chunk training. If you only care about non-streaming inference, then this is fine.
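
(For context on the reply above: "dynamic chunk" means the encoder is trained with a chunk-causal attention mask whose chunk size is re-sampled during training, so the same model can later run at different streaming latencies. Below is a minimal illustrative sketch of such a mask, assuming a WeNet-style scheme; dynamic_chunk_mask is a hypothetical helper written for illustration, not a CosyVoice API.)

import random

import torch


def dynamic_chunk_mask(seq_len: int, max_chunk: int = 25) -> torch.Tensor:
    # Sample a chunk size for this training step; over many steps the
    # encoder sees many chunk sizes, which is what makes streaming work.
    chunk = random.randint(1, max_chunk)
    idx = torch.arange(seq_len)
    chunk_idx = idx // chunk
    # Position i may attend to position j only if j's chunk does not come
    # after i's chunk, i.e. attention is causal at chunk granularity.
    return chunk_idx.unsqueeze(1) >= chunk_idx.unsqueeze(0)


# Example: an 8-frame sequence with chunk sizes up to 4.
print(dynamic_chunk_mask(8, max_chunk=4).int())

The plain forward posted above never applies such a mask, so it only covers the non-streaming case, which is the gap the collaborator is pointing at.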

@OswaldoBornemann
Author

I plan to use this fine-tuned flow model for streaming inference later on.

@shenkunlovecoding

> That is not the right way to do it; you can wait for our official implementation.

When will that be released?
