Towards Generalizable SER: Soft Labeling and Data Augmentation for Modeling Temporal Emotion Shifts in Large-Scale Multilingual Speech
Paper on arXiv.
This repository contains a model weights release and an example notebook.
We release a new model checkpoint that outperforms the previously available public model. The new checkpoint also uses Matryoshka Representation Learning (MRL), which natively allows emotions to be embedded into any number of dimensions. Please find the model under Releases.
The paper presents a novel approach to speech emotion recognition (SER) that addresses bias in cross-corpus emotion detection. Key aspects include:
- Amalgamation of 16 diverse datasets resulting in 375 hours of multilingual speech data.
- Introduction of a soft labeling system to capture gradational emotional intensities.
- Use of the Whisper encoder and a unique data augmentation method inspired by contrastive learning.
- Validation on four multilingual datasets demonstrating significant zero-shot generalization.
- Download the model weights from the Releases page.
- Clone this repository.
- See example.ipynb for examples of how to load and use the model (a rough preprocessing sketch follows below).
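As a rough orientation before opening the notebook, here is a minimal preprocessing sketch. It assumes the openai-whisper package purely for loading audio and computing log-mel features; the checkpoint-specific loading and inference code is shown in example.ipynb and should be treated as authoritative.

```python
import whisper  # openai-whisper, assumed here only for audio loading and log-mel features

# Load and resample the clip to 16 kHz mono (whisper.load_audio handles the resampling).
audio = whisper.load_audio("clip.wav")

# The Whisper encoder expects 30 s of audio: shorter clips are zero-padded, longer ones trimmed.
audio = whisper.pad_or_trim(audio)

# 80-bin log-mel spectrogram, shape (80, 3000): one frame per 10 ms (hop of 160 samples at 16 kHz).
mel = whisper.log_mel_spectrogram(audio)

# Add a batch dimension before passing the features to the encoder-based classifier.
mel = mel.unsqueeze(0)  # shape (1, 80, 3000)
```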
If your clip is shorter than 30s:
After resampling your audio to 16 kHz, take its length in samples and downsample that count by the same factor as the Whisper encoder (a hop of 160 samples for the log-mel spectrogram, followed by a stride-2 convolution):

`effective_length = clip.shape[-1] // 160 // 2`

For example, a 5-second clip has 5 * 16000 = 80,000 samples, which gives an effective_length of 250 encoder frames.
Once you run inference, average the logits over only the first `effective_length` frames, similar to:

`torch.mean(extracted_features[:, :effective_length, :], dim=1)`
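Putting the two steps together, here is a minimal sketch for a short clip. The `model` object and the shape of its output are assumptions: the checkpoint is presumed to be loaded as shown in example.ipynb and to return per-frame logits of shape (batch, frames, num_emotions).

```python
import torch
import whisper

# model: the released checkpoint, loaded as shown in example.ipynb (placeholder in this sketch).

# 16 kHz mono waveform of the short clip.
audio = whisper.load_audio("short_clip.wav")

# Number of encoder frames that correspond to real (non-padded) audio.
effective_length = audio.shape[-1] // 160 // 2

# Pad to 30 s and compute log-mel features, as in the preprocessing sketch above.
mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(audio)).unsqueeze(0)

with torch.no_grad():
    extracted_features = model(mel)  # assumed shape: (1, frames, num_emotions)

# Average only over the frames backed by real audio, ignoring the zero-padding.
clip_logits = torch.mean(extracted_features[:, :effective_length, :], dim=1)
```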
Note that you can also pool the outputs differently, for example to obtain a prediction every t frames instead of a single clip-level prediction.
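Continuing from the sketch above, one way to do this (the value of `t` and the window handling are choices, not part of the released code) is to average non-overlapping windows of t encoder frames:

```python
# Each encoder frame covers 20 ms of audio (10 ms hop, then a stride-2 convolution),
# so t = 25 frames corresponds to roughly 0.5 s per prediction.
t = 25

# Keep only the frames backed by real audio, then group them into windows of t frames.
valid = extracted_features[:, :effective_length, :]
n_windows = valid.shape[1] // t  # drop the trailing remainder so the count divides evenly
windowed = valid[:, : n_windows * t, :].reshape(valid.shape[0], n_windows, t, -1)

# One averaged prediction per window, shape (batch, n_windows, num_emotions).
per_window_logits = windowed.mean(dim=2)
```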