Subtitle timing and synchronization issue #396

Open
borahanarslan opened this issue Nov 15, 2024 · 8 comments

@borahanarslan
Contributor

Hello, I am experiencing some issues while generating subtitles for the video attached below. Despite trying various values in the Advanced Parameters and Voice Detection sections, I am not able to achieve the desired results.

For example, in test after test the text appears either before or after the audio, or the lines are too long. Sometimes a very simple two-word subtitle stays on the screen for 30 seconds. Occasionally the uploaded file contains 2 or 3 different languages, and in such cases the behavior changes as well.

I have enabled background music removal, activated VAD, and tested with the large v2 and v3 versions. I increased the Best of and Beam Size values up to 30. I tried many parameters with the sample file I provided, but I still didn’t get the exact results I wanted. What parameters should I use? There is a link to the sample file and subtitles. Are there any settings you would recommend?

https://easyupload.io/d1w4fi (file)
https://easyupload.io/0p558m (srt)
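
A quick way to locate cues like the ones described above (a couple of words held on screen for tens of seconds) is to scan the generated SRT for cues whose duration is out of proportion to their word count. The following is a minimal sketch, not part of Whisper-WebUI: the function names and the `max_seconds_per_word` threshold of 2.0 are assumptions chosen for illustration.

```python
import re

# Matches an SRT timing line such as "00:18:27,870 --> 00:19:23,870".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_ms(h, m, s, ms):
    """Convert hour/minute/second/millisecond strings to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(text):
    """Yield (start_ms, end_ms, caption) for each cue in an SRT string."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                caption = " ".join(lines[i + 1:]).strip()
                yield to_ms(*g[:4]), to_ms(*g[4:]), caption
                break


def suspicious_cues(text, max_seconds_per_word=2.0):
    """Return cues that stay on screen too long for the words they contain.

    The threshold is a hypothetical default, not a value from the project.
    """
    flagged = []
    for start, end, caption in parse_srt(text):
        words = max(len(caption.split()), 1)  # avoid dividing by zero
        if (end - start) / 1000 / words > max_seconds_per_word:
            flagged.append((start, end, caption))
    return flagged
```

Running `suspicious_cues(open("subtitle.srt", encoding="utf-8").read())` on a generated file (the file name here is a placeholder) returns the cues worth inspecting by hand; the threshold may need tuning per language and genre.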

@borahanarslan borahanarslan added the hallucination hallucination of the models label Nov 15, 2024
@jhj0517
Owner

jhj0517 commented Nov 15, 2024

Thanks for uploading the sample! I'll test & try to find out what the problem is, and what could be better.

+) The first hallucination part is 18:27 ~ 19:21

@jhj0517
Owner

jhj0517 commented Nov 15, 2024

OK, based on the subtitle you posted, the first hallucination part is ( 00:18:27,870 --> 00:19:23,870 ).

Let's focus on removing this hallucination part.
First of all, this part is very likely to make Whisper hallucinate, because it is full of the monster's growling, gunfire, and string instruments that add tension to the scene.

So I recommend turning on VAD, and also the Background Music Separator if it gives a better result.

You can try this setting:

  1. Since the audio is full of noise, large-v2 is recommended rather than large-v3.

  2. Enable the Background Music Remover Filter.

  3. Enable VAD. I set Minimum Silence Duration (ms) to 250 specifically; all other values are the defaults.

I think I got a better result with this setting than with the previous one. At least I didn't observe repetitive phrases like "Go, go, go, go!".

@borahanarslan
Contributor Author

I will try both v2 and v3 and get back to you, thanks.

@borahanarslan
Contributor Author

Sorry, but the result is still disappointing; maybe different settings are needed. It works well for short 5-6 minute content, but it is not suitable for movies or documentaries right now. I am attaching both files. The synchronization problem persists with both v2 and v3, and toward the end it gets really ridiculous :(
Subtitle.zip

@jhj0517
Owner

jhj0517 commented Nov 16, 2024

Subtitle.zip

That's a very different result from mine. Would you copy and paste this into default_parameters.yaml and try again?
The app will automatically start with the settings below.

whisper:
  model_size: large-v2
  lang: Automatic Detection
  is_translate: false
  beam_size: 5
  log_prob_threshold: -1.0
  no_speech_threshold: 0.6
  compute_type: float16
  best_of: 5
  patience: 1.0
  condition_on_previous_text: true
  prompt_reset_on_temperature: 0.5
  initial_prompt: null
  temperature: 0.0
  compression_ratio_threshold: 2.4
  length_penalty: 1.0
  repetition_penalty: 1.0
  no_repeat_ngram_size: 0
  prefix: null
  suppress_blank: true
  suppress_tokens: '[-1]'
  max_initial_timestamp: 1.0
  word_timestamps: false
  prepend_punctuations: '"''“¿([{-'
  append_punctuations: '"''.。,,!!??::”)]}、'
  max_new_tokens: null
  chunk_length: 30
  hallucination_silence_threshold: null
  hotwords: null
  language_detection_threshold: null
  language_detection_segments: 1
  batch_size: 24
  add_timestamp: true
  file_format: SRT
vad:
  vad_filter: true
  threshold: 0.5
  min_speech_duration_ms: 250
  max_speech_duration_s: 9999
  min_silence_duration_ms: 250
  speech_pad_ms: 2000
diarization:
  is_diarize: false
  device: cuda
  hf_token: ''
bgm_separation:
  is_separate_bgm: true
  model_size: UVR-MDX-NET-Inst_HQ_4
  device: cuda
  segment_size: 256
  save_file: false
  enable_offload: true
translation:
  deepl:
    api_key: ''
    is_pro: false
    source_lang: Automatic Detection
    target_lang: English
  nllb:
    model_size: facebook/nllb-200-1.3B
    source_lang: null
    target_lang: null
    max_length: 200
  add_timestamp: true
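
For anyone who wants to reproduce these settings outside the web UI, the `whisper:` and `vad:` values above can be passed to faster-whisper directly. This is a rough sketch under assumptions: the key-to-argument mapping is a guess based on matching names, `device="cuda"` is assumed, and Whisper-WebUI may apply its VAD step differently from faster-whisper's built-in `vad_filter`. The helper names are hypothetical.

```python
# Values copied from the whisper: and vad: sections of the YAML above.
WHISPER_SETTINGS = {
    "beam_size": 5,
    "best_of": 5,
    "patience": 1.0,
    "temperature": 0.0,
    "log_prob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "compression_ratio_threshold": 2.4,
    "condition_on_previous_text": True,
    "word_timestamps": False,
}
VAD_SETTINGS = {
    "threshold": 0.5,
    "min_speech_duration_ms": 250,
    "max_speech_duration_s": 9999,
    "min_silence_duration_ms": 250,
    "speech_pad_ms": 2000,
}


def build_transcribe_kwargs(whisper=WHISPER_SETTINGS, vad=VAD_SETTINGS):
    """Build keyword arguments for WhisperModel.transcribe (assumed mapping)."""
    kwargs = dict(whisper)
    kwargs["vad_filter"] = True
    kwargs["vad_parameters"] = dict(vad)  # copy so callers can tweak safely
    return kwargs


def transcribe(audio_path, model_size="large-v2"):
    """Transcribe one file; needs the faster-whisper package and a GPU."""
    from faster_whisper import WhisperModel  # imported lazily on purpose

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, **build_transcribe_kwargs())
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

Because the settings live in plain dictionaries, a single value such as `min_silence_duration_ms` can be changed between runs to compare outputs on the same file.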

@borahanarslan
Contributor Author

whisperx

I will try. You may consider the whisperx integration. Last night I saw that creating subtitles with whisperx was much more successful.

@borahanarslan
Contributor Author

hi @jhj0517

I tried your settings with both large-v2 and large-v3, but the result is the same. I also tried other settings (background music removal, voice detection filter, advanced settings, etc.), but there was no satisfactory improvement.

1
00:00:08,620 --> 00:00:13,620
You know where I'm from, they say you blow on those

The output looks like this, but the first line of dialogue actually starts at 10 seconds. The sentence is also too long. This happens in many places.

I made a sample project with whisperx, and the result is below. It really gave better results:

1
00:00:10,651 --> 00:00:11,393
You know where I'm from?

2
00:00:12,593 --> 00:00:14,634
They say you blow on those to set your wishes free.
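
To put numbers on the difference between the two outputs above, the cue timestamps can be converted to milliseconds and compared. This is only a sketch: the helper names are made up for illustration, and treating the whisperx cue as the reference is an assumption based on the dialogue starting at about 10 seconds.

```python
def srt_time_to_ms(stamp):
    """Convert an SRT timestamp such as '00:00:10,651' to milliseconds."""
    clock, millis = stamp.strip().split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def start_offset_ms(candidate_start, reference_start):
    """Negative result: the candidate cue appears before the reference cue."""
    return srt_time_to_ms(candidate_start) - srt_time_to_ms(reference_start)


def cue_duration_ms(start, end):
    """Length of time a cue stays on screen, in milliseconds."""
    return srt_time_to_ms(end) - srt_time_to_ms(start)


# Timestamps copied from the two subtitle excerpts in this comment.
webui_cue = ("00:00:08,620", "00:00:13,620")
whisperx_cues = [
    ("00:00:10,651", "00:00:11,393"),
    ("00:00:12,593", "00:00:14,634"),
]

print(start_offset_ms(webui_cue[0], whisperx_cues[0][0]))      # -2031
print(cue_duration_ms(*webui_cue))                             # 5000
print([cue_duration_ms(*cue) for cue in whisperx_cues])        # [742, 2041]
```

So the first cue shows up about 2 seconds early and is held for a flat 5 seconds, while whisperx splits the same speech into two cues of under 1 second and about 2 seconds.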

@jhj0517
Owner

jhj0517 commented Nov 17, 2024

Both whisperX and Whisper-WebUI use faster-whisper for transcription.

whisperX probably gets better results because it uses a different VAD implementation.
As with #287, this project needs a better VAD implementation.

So I have noted it here.

@jhj0517 jhj0517 added the enhancement New feature or request label Nov 24, 2024
@jhj0517 jhj0517 added this to the vad milestone Dec 14, 2024