
Precise Use of Actual Subtitles #323

Open
iodides opened this issue Oct 7, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@iodides

iodides commented Oct 7, 2024

First of all, I want to say thanks, because I've been getting a lot of good use out of this project.

In general, for recorded videos, movies, or music there is often a fully accurate script already available. However, when Whisper WebUI converts the speech to text, it often doesn't recognize certain words and sentences perfectly, so manual correction is required.

It's difficult, even for AI or humans, to understand dialogue completely just by listening. Therefore, when an original script exists, it would be great if we could upload the script file (without timestamps) alongside the audio, and the AI could align the original text to the audio and produce the correct timings.

iodides added the enhancement label on Oct 7, 2024
@jhj0517
Owner

jhj0517 commented Oct 7, 2024

Hi. If I understand correctly, you want the transcription to only fill in the "timestamps" while keeping your own text?
I'm still considering whether I should implement this, and how.

And if hallucination is the problem, you can consider using the VAD (Voice Activity Detection) and BGM Separation filters in the WebUI.

They feed cleaner audio to Whisper, and most of the hallucinations disappear just by removing the noise from the audio.
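
For reference, a minimal sketch of the same VAD idea outside the WebUI, assuming the faster-whisper package (one of the backends the WebUI can use) and a placeholder input file:

```python
from faster_whisper import WhisperModel

# Hedged sketch: Silero VAD drops non-speech spans before decoding, so
# Whisper never sees the silence that tends to trigger hallucinations.
model = WhisperModel("large-v3")
segments, info = model.transcribe(
    "audio.mp3",                                     # placeholder file
    vad_filter=True,                                 # enable VAD pre-filtering
    vad_parameters={"min_silence_duration_ms": 500},
)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```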

@iodides
Author

iodides commented Oct 8, 2024

Yes, I have the original script for the video. The recognition results from Whisper are excellent, of course, but they are not 100% accurate.

For example, if the original script is:
Lost in the maze of broken streets,
Twelve paths ahead, where will I meet,

the result from WebUI comes out as:

🎵 Lost in a maze of broken strings 🎵
🎵 To a path ahead, where will I meet? 🎵

So, I have to compare line by line and correct the text.
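
A rough sketch of automating that line-by-line comparison with only Python's standard library; the file names are placeholders, and the matching is a naive closest-line heuristic rather than real forced alignment:

```python
import difflib
import re

def load_script_lines(path):
    # Original script: one lyric/dialogue line per row, no timestamps.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def load_srt_blocks(path):
    # Parse the WebUI's SRT into (index, timing, text) tuples.
    with open(path, encoding="utf-8") as f:
        blocks = re.split(r"\n\s*\n", f.read().strip())
    parsed = []
    for block in blocks:
        lines = block.splitlines()
        parsed.append((lines[0], lines[1], " ".join(lines[2:])))
    return parsed

def correct_srt(script_path, srt_path, out_path):
    # Keep the WebUI's timestamps, but swap each subtitle's text for the
    # script line that is most similar to what Whisper heard.
    script = load_script_lines(script_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for idx, timing, text in load_srt_blocks(srt_path):
            match = difflib.get_close_matches(text, script, n=1, cutoff=0.0)
            out.write(f"{idx}\n{timing}\n{match[0] if match else text}\n\n")

correct_srt("original_script.txt", "webui_output.srt", "corrected.srt")
```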

@iodides
Author

iodides commented Oct 8, 2024

Here is another sample.
Original script:
Even if fate decides to blind,
I’ll walk the path, leave doubt behind.
With every turn, I feel you near,
Summer’s light will reappear.

WebUI result:
Even if fate decides to bind
I'll walk the path leaped out behind
With every turn, I fear you're near
Sunrise light will reappear

In my case, it's music.

@jhj0517
Owner

jhj0517 commented Oct 9, 2024

Transcribing music is a really good use case for the Background Music Remover filter in the WebUI.
If you haven't tried it yet, I recommend using it.
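
For reference, a comparable separation step outside the WebUI can be sketched with the Demucs CLI (an assumption here; the WebUI ships its own separation models), with a placeholder file name:

```python
import subprocess

# Hedged sketch: split the track into vocals + accompaniment, then
# transcribe only the vocals stem instead of the full mix.
subprocess.run(["demucs", "--two-stems=vocals", "audio.flac"], check=True)
# Demucs writes the stems under ./separated/<model>/audio/, and the
# vocals.wav file there is what you would feed to Whisper.
```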

Original script: leave doubt behind.
WebUI result: leaped out behind

This kind of case seems to be a difficult one. You might consider using a higher beam_size (which is in the "Advanced Parameters" tab), like 10. A higher beam_size slows down the transcription, but makes it more accurate.
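
For reference, a minimal sketch of raising beam_size outside the WebUI, again assuming the faster-whisper backend and a placeholder file name:

```python
from faster_whisper import WhisperModel

# Hedged sketch: a larger beam keeps more candidate decodings per step,
# trading transcription speed for accuracy.
model = WhisperModel("large-v3")
segments, info = model.transcribe("audio.flac", beam_size=10)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```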

As for the feature itself, I see it as a very specific one; I will implement it if others want it as well!

@iodides
Author

iodides commented Oct 11, 2024

Sample Music.zip

  • Goodbye.flac : Sample music
  • Goodbye.txt : Original lyrics
  • Goodbye_webui.srt : SRT created by the WebUI
  • Goodbye_manual.srt : SRT I created manually
