VAD Parameters
jhj0517 edited this page Jun 26, 2024
Silero VAD is currently implemented only for faster-whisper, so it is available only when you use faster-whisper.
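If you call faster-whisper directly from Python, VAD is switched on per call. The sketch below assumes faster-whisper's `WhisperModel.transcribe`, which accepts `vad_filter` and a `vad_parameters` dict whose keys match the parameter names in the table below. The model size, the audio path, the helper name `transcribe_with_vad`, and the parameter values are illustrative assumptions. `window_size_samples` is left out because it may not be accepted by every faster-whisper release.

```python
# Illustrative values only (assumptions); 0.5 is the "lazy" threshold from the table.
VAD_PARAMETERS = {
    "threshold": 0.5,
    "min_speech_duration_ms": 250,
    "min_silence_duration_ms": 1000,
    "speech_pad_ms": 400,
}


def transcribe_with_vad(audio_path: str, model_size: str = "small"):
    """Hypothetical helper: transcribe with faster-whisper, running Silero VAD first."""
    # Imported lazily so this module still loads when faster-whisper is not installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(
        audio_path,
        vad_filter=True,  # VAD is disabled unless explicitly enabled
        vad_parameters=VAD_PARAMETERS,
    )
    # `segments` is a lazy generator; iterating it runs the transcription.
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

For example, `transcribe_with_vad("audio.mp3")` (placeholder path) returns `(start, end, text)` tuples for the speech that survived the VAD filter.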
Parameter | Description |
---|---|
vad_filter | Enables the Silero VAD filter. It is disabled by default, so set it to true if you want to use VAD. |
threshold | Silero VAD outputs a speech probability for each audio chunk; chunks with a probability ABOVE this value are treated as SPEECH. It is best to tune this parameter separately for each dataset, but the "lazy" value of 0.5 works well for most datasets. A lower value makes VAD more sensitive, so quiet sounds are treated as speech rather than as silence. |
min_speech_duration_ms | Final speech chunks shorter than min_speech_duration_ms are discarded. |
max_speech_duration_s | Maximum duration of a speech chunk, in seconds. A chunk longer than max_speech_duration_s is split at the timestamp of its last silence lasting more than 100 ms, if there is one, to prevent aggressive cutting. Otherwise, it is split abruptly just before max_speech_duration_s. |
min_silence_duration_ms | At the end of each speech chunk, VAD waits for min_silence_duration_ms of silence before closing the chunk. Shorter pauses stay inside the chunk. |
window_size_samples | Audio is fed to the Silero VAD model in chunks of window_size_samples samples. WARNING: Silero VAD models were trained with 512, 1024, and 1536 samples at a 16000 Hz sample rate. Other values may degrade model performance. |
speech_pad_ms | Final speech chunks are padded by speech_pad_ms on each side. |
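To show how threshold, min_silence_duration_ms, min_speech_duration_ms, and speech_pad_ms interact, here is a simplified pure-Python sketch that turns per-window speech probabilities into (start_sample, end_sample) segments. The function name `speech_segments` and its default values are hypothetical. This is not the actual Silero or faster-whisper code, which differs in details and also enforces max_speech_duration_s (omitted here).

```python
from typing import List, Tuple


def speech_segments(
    probs: List[float],
    window_size_samples: int = 512,
    sample_rate: int = 16000,
    threshold: float = 0.5,
    min_speech_duration_ms: int = 250,
    min_silence_duration_ms: int = 100,
    speech_pad_ms: int = 30,
) -> List[Tuple[int, int]]:
    """Hypothetical sketch: per-window probabilities -> (start_sample, end_sample) segments."""
    ms = sample_rate // 1000  # samples per millisecond (16 at 16 kHz)
    min_speech = min_speech_duration_ms * ms
    min_silence = min_silence_duration_ms * ms
    pad = speech_pad_ms * ms
    total = len(probs) * window_size_samples

    raw: List[Tuple[int, int]] = []
    start = None          # first sample of the current speech run
    silence_start = None  # first sample of a pending silence inside that run
    for i, p in enumerate(probs):
        pos = i * window_size_samples
        if p > threshold:  # strictly ABOVE threshold counts as speech
            if start is None:
                start = pos
            silence_start = None  # speech resumed: cancel the pending split
        elif start is not None:
            if silence_start is None:
                silence_start = pos
            # Close the chunk only once the silence has lasted min_silence_duration_ms.
            if pos + window_size_samples - silence_start >= min_silence:
                raw.append((start, silence_start))
                start = silence_start = None
    if start is not None:  # audio ended during speech or during a too-short pause
        raw.append((start, silence_start if silence_start is not None else total))

    segments: List[Tuple[int, int]] = []
    for s, e in raw:
        if e - s < min_speech:  # drop chunks shorter than min_speech_duration_ms
            continue
        s, e = max(0, s - pad), min(total, e + pad)  # pad speech_pad_ms on each side
        if segments and s <= segments[-1][1]:  # merge if padding made chunks overlap
            segments[-1] = (segments[-1][0], max(segments[-1][1], e))
        else:
            segments.append((s, e))
    return segments
```

With 512-sample windows at 16 kHz, each probability covers 32 ms. A 64 ms burst of speech is discarded by the default min_speech_duration_ms of 250, and a probability of exactly 0.5 never counts as speech at the default threshold. Merging padded chunks that overlap is one simple choice; real implementations may instead split the gap between them.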