Subtitle timing and synchronization issue #396

borahanarslan · 2024-11-15T15:25:29Z

Hello, I am experiencing some issues while generating subtitles for the video attached below. Despite trying various values in the Advanced Parameters and Voice Detection sections, I am not able to achieve the desired results.

For example, in my tests the text appears either before or after the audio, or the lines are too long. Sometimes a very simple two-word subtitle stays on the screen for 30 seconds. Occasionally the uploaded file contains 2 or 3 different languages, and in such cases the behavior changes as well.

I have enabled background music removal, activated VAD, and tested with the large-v2 and large-v3 models. I increased the Best of and Beam Size values up to 30. I tried many parameters with the sample file, but I still didn't get the results I wanted. What parameters should I use? Are there any settings you would recommend? The sample file and the subtitles are linked below.

https://easyupload.io/d1w4fi (file)
https://easyupload.io/0p558m (srt)

jhj0517 · 2024-11-15T16:11:06Z

Thanks for uploading the sample! I'll test & try to find out what the problem is, and what could be better.

+) The first hallucination part is 18:27 ~ 19:21

jhj0517 · 2024-11-15T16:53:17Z

OK, based on the subtitle you posted, the first hallucination part is ( 00:18:27,870 --> 00:19:23,870 ).

Let's focus on removing this hallucination part.
First of all, this part is very likely to make Whisper hallucinate, because it is full of monster growls, gunfire, and string instruments that add tension to the scene.

So it is recommended to turn on the VAD, and also the Background Music Separator if it gives a better result.

You can try this setting:

Since the audio is full of noise, large-v2 is recommended rather than large-v3.
Enable the Background Music Remover Filter.
Enable VAD. I set Minimum Silence Duration (ms) to 250 specifically; all other values are the defaults.
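
For anyone who wants to try comparable settings outside the web UI, here is a minimal sketch that calls faster-whisper directly. It is not Whisper-WebUI's own code path, and it does not cover background music removal. The helper names `build_transcribe_kwargs` and `transcribe_file` are made up for illustration, and `beam_size=5` is taken to be the default.

```python
from __future__ import annotations


def build_transcribe_kwargs(min_silence_duration_ms: int = 250) -> dict:
    """Keyword arguments for faster-whisper's WhisperModel.transcribe()."""
    return {
        "beam_size": 5,
        "vad_filter": True,  # filter out non-speech audio with Silero VAD
        "vad_parameters": {"min_silence_duration_ms": min_silence_duration_ms},
    }


def transcribe_file(path: str) -> list[tuple[float, float, str]]:
    """Transcribe one file and return (start_s, end_s, text) per segment."""
    # Imported lazily: faster-whisper is a third-party package, and loading
    # large-v2 is expensive, so nothing heavy happens until this is called.
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v2", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path, **build_transcribe_kwargs())
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

Calling `transcribe_file("movie.mp4")` (a placeholder path) requires a CUDA GPU and downloads the model on first use.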

I think I got a better result with this setting than the previous one:

subtitle : https://gist.github.com/jhj0517/b9c374ea36c6e858966484835dd456a2

At least I didn't observe repetitive phrases like "Go, go, go, go!".
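
A rough way to spot this kind of output in an SRT file is to flag cues that stay on screen for a long time with very few words, or that repeat the same word many times. The sketch below is only a heuristic and not part of Whisper-WebUI; the function names and the thresholds (10 seconds, 3 words, 3 repeats) are arbitrary assumptions.

```python
import re
from collections import Counter

CUE_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(h, m, s, ms):
    """Convert the four timestamp fields to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(text):
    """Yield (start_ms, end_ms, text) for each cue in an SRT document."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIME.search(line)
            if match:
                g = match.groups()
                body = " ".join(lines[i + 1:]).strip()
                yield to_ms(*g[:4]), to_ms(*g[4:]), body
                break


def suspicious_cues(text, max_ms=10_000, min_words=3, max_repeats=3):
    """Return cues that are long but nearly empty, or highly repetitive."""
    flagged = []
    for start, end, body in parse_srt(text):
        words = re.findall(r"[^\W\d_]+", body.lower())
        too_long = end - start > max_ms and len(words) < min_words
        repeated = bool(words) and max(Counter(words).values()) > max_repeats
        if too_long or repeated:
            flagged.append((start, end, body))
    return flagged
```

Passing the contents of an `.srt` file to `suspicious_cues` returns the cues worth checking by hand; it cannot tell a real hallucination from a legitimately long or repetitive line.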

borahanarslan · 2024-11-15T17:27:22Z

I will try both v2 and v3 and get back to you, thanks.

borahanarslan · 2024-11-15T17:59:56Z

Sorry, but the result is still disappointing; maybe it needs different settings. It works well for short 5-6 minute content, but it is not ideal for movies or documentaries right now. I am attaching both files. The synchronization problem continues with both v2 and v3, and it starts to get really ridiculous toward the end :(
Subtitle.zip

jhj0517 · 2024-11-16T13:06:01Z

Subtitle.zip

That's a very different result from mine. Would you copy and paste this into default_parameters.yaml and try again?
The app will automatically start with the settings below.

whisper:
  model_size: large-v2
  lang: Automatic Detection
  is_translate: false
  beam_size: 5
  log_prob_threshold: -1.0
  no_speech_threshold: 0.6
  compute_type: float16
  best_of: 5
  patience: 1.0
  condition_on_previous_text: true
  prompt_reset_on_temperature: 0.5
  initial_prompt: null
  temperature: 0.0
  compression_ratio_threshold: 2.4
  length_penalty: 1.0
  repetition_penalty: 1.0
  no_repeat_ngram_size: 0
  prefix: null
  suppress_blank: true
  suppress_tokens: '[-1]'
  max_initial_timestamp: 1.0
  word_timestamps: false
  prepend_punctuations: '"''“¿([{-'
  append_punctuations: '"''.。,，!！?？:：”)]}、'
  max_new_tokens: null
  chunk_length: 30
  hallucination_silence_threshold: null
  hotwords: null
  language_detection_threshold: null
  language_detection_segments: 1
  batch_size: 24
  add_timestamp: true
  file_format: SRT
vad:
  vad_filter: true
  threshold: 0.5
  min_speech_duration_ms: 250
  max_speech_duration_s: 9999
  min_silence_duration_ms: 250
  speech_pad_ms: 2000
diarization:
  is_diarize: false
  device: cuda
  hf_token: ''
bgm_separation:
  is_separate_bgm: true
  model_size: UVR-MDX-NET-Inst_HQ_4
  device: cuda
  segment_size: 256
  save_file: false
  enable_offload: true
translation:
  deepl:
    api_key: ''
    is_pro: false
    source_lang: Automatic Detection
    target_lang: English
  nllb:
    model_size: facebook/nllb-200-1.3B
    source_lang: null
    target_lang: null
    max_length: 200
  add_timestamp: true
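
The `vad` section above controls how detected speech is cut and padded before transcription. As a rough illustration of what `min_silence_duration_ms` and `speech_pad_ms` mean, here is a toy model in Python. It is a sketch under assumptions, not the project's VAD code: the function name `pad_speech_spans` is made up, and the rule for sharing a short gap between neighbouring segments is one possible choice.

```python
def pad_speech_spans(spans, speech_pad_ms=2000, min_silence_ms=250):
    """Toy model: merge spans split by short silences, then pad each side.

    spans: sorted, non-overlapping (start_ms, end_ms) pairs of detected speech.
    """
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] < min_silence_ms:
            merged[-1][1] = end  # silence too short to end the segment
        else:
            merged.append([start, end])

    for i, seg in enumerate(merged):
        if i == 0:
            seg[0] = max(0, seg[0] - speech_pad_ms)
        if i + 1 < len(merged):
            nxt = merged[i + 1]
            gap = nxt[0] - seg[1]
            # Not enough silence for full padding on both sides: share it.
            pad = min(speech_pad_ms, gap // 2)
            seg[1] += pad
            nxt[0] -= pad
        else:
            seg[1] += speech_pad_ms
    return [tuple(seg) for seg in merged]
```

In this model, with `speech_pad_ms=2000`, speech detected at 10.0 s yields a segment that starts at 8.0 s, and two utterances one second apart end up back to back.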

borahanarslan · 2024-11-16T17:11:06Z

whisperx

I will try. You may consider the whisperx integration. Last night I saw that creating subtitles with whisperx was much more successful.

borahanarslan · 2024-11-17T13:06:30Z

Hi @jhj0517

I tried your settings with both large-v2 and large-v3, but the result is the same. I also tried other settings (background music removal, voice detection filter, advanced settings, etc.), but there was no satisfactory improvement.

1
00:00:08,620 --> 00:00:13,620
You know where I'm from, they say you blow on those

The output looks like this, but the first line of dialogue actually starts at 10 seconds. The sentence is also too long. The same thing happens in many places.

I made a sample project with whisperx and the result was as shown below; it really gave better results.

1
00:00:10,651 --> 00:00:11,393
You know where I'm from?

2
00:00:12,593 --> 00:00:14,634
They say you blow on those to set your wishes free.
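
To put a number on the difference between the two outputs, the cue times above can be compared directly. A small sketch (the helper name `srt_to_ms` is made up for illustration):

```python
import re


def srt_to_ms(stamp: str) -> int:
    """Convert an SRT timestamp such as '00:00:08,620' to milliseconds."""
    match = re.fullmatch(r"(\d+):(\d{2}):(\d{2}),(\d{3})", stamp)
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    h, m, s, ms = map(int, match.groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


# First cue from the Whisper-WebUI output and from the whisperx output.
webui_start, webui_end = srt_to_ms("00:00:08,620"), srt_to_ms("00:00:13,620")
whisperx_start = srt_to_ms("00:00:10,651")

start_offset_ms = whisperx_start - webui_start  # how early the first cue appears
webui_duration_ms = webui_end - webui_start     # how long it stays on screen

print(start_offset_ms, webui_duration_ms)  # 2031 5000
```

So the first cue appears about 2 seconds earlier than in the whisperx output and stays on screen for 5 seconds. That offset is close to the `speech_pad_ms: 2000` value in the configuration posted earlier in this thread, but whether the two are related has not been confirmed here.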

jhj0517 · 2024-11-17T13:37:33Z

Both whisperX and Whisper-WebUI use faster-whisper for transcription.

whisperX probably gets better results because it uses a different VAD implementation.
Just like #287, this project needs a better VAD implementation.

So I noted it here now.

borahanarslan added the "hallucination" label (hallucination of the models) Nov 15, 2024

borahanarslan assigned jhj0517 Nov 15, 2024

jhj0517 added the "enhancement" label (New feature or request) Nov 24, 2024

jhj0517 added this to the vad milestone Dec 14, 2024

jhj0517 mentioned this issue Dec 14, 2024

no hallucinations on the first generation. #421

Open
