Commit
Enabling word-level timestamps for Wav2Vec 2.0 (#3627)
Summary:

# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?

Fixes #3371.

Currently, the output from Wav2Vec 2.0 decoding does not contain word-level start/end times, which can be useful for certain applications of ASR. Based on the discussion [here](flashlight/flashlight#618), they could be computed based on the output from the Flashlight decoder.

For the KenLM decoder, we could first obtain the frame number corresponding to each non-blank token. Next, the timestamp of each character could be computed as `segment_start + frame_no/total_frames * segment_duration`. Finally, the start and end time of each word could be calculated based on the timestamp of the word boundary characters.

In order to enable this, the frame number of each non-blank character is returned as a result of KenLM decoding. This is similar to the `timesteps` output from the [ctcdecode](https://github.com/parlance/ctcdecode#outputs-from-the-decode-method) library.

## PR review

alexeib

Pull Request resolved: #3627

Reviewed By: michaelauli

Differential Revision: D29282488

Pulled By: alexeib

fbshipit-source-id: b5fe64bf50abd7ef8e9539f4e338937c866eb0ca
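
The timestamp computation described in the summary can be sketched in a few lines of Python. This is a minimal illustration, not the code from the PR: the function names `char_time` and `word_timestamps`, the `"|"` word separator, and the choice to use the last character's frame as the word end time are all assumptions made for the example. It takes the per-character frame numbers (the new decoder output) and applies `segment_start + frame_no/total_frames * segment_duration` to the first and last character of each word.

```python
from typing import List, Tuple


def char_time(frame_no: int, total_frames: int,
              segment_start: float, segment_duration: float) -> float:
    """Formula from the summary: map a frame index to a time in seconds."""
    return segment_start + frame_no / total_frames * segment_duration


def word_timestamps(tokens: List[str], frames: List[int], total_frames: int,
                    segment_start: float, segment_duration: float,
                    word_sep: str = "|") -> List[Tuple[str, float, float]]:
    """Group non-blank tokens into words and return (word, start, end).

    Assumptions (hypothetical, for illustration only):
      * `tokens` and `frames` are parallel lists of decoded non-blank
        characters and the frame number at which each was emitted;
      * `word_sep` marks word boundaries;
      * a word starts at its first character and ends at its last one.
    """
    words = []
    chars: List[str] = []
    first = last = 0

    def flush() -> None:
        # Emit the word collected so far, if any.
        if chars:
            start = char_time(first, total_frames, segment_start, segment_duration)
            end = char_time(last, total_frames, segment_start, segment_duration)
            words.append(("".join(chars), start, end))
            chars.clear()

    for tok, frame in zip(tokens, frames):
        if tok == word_sep:
            flush()
            continue
        if not chars:
            first = frame  # frame of the word's first character
        chars.append(tok)
        last = frame       # frame of the most recent character

    flush()  # the final word may not be followed by a separator
    return words


# Example: a 2-second segment starting at 1.0 s, decoded over 20 frames.
example = word_timestamps(
    tokens=list("HI|YOU"),
    frames=[2, 4, 6, 10, 12, 14],
    total_frames=20,
    segment_start=1.0,
    segment_duration=2.0,
)
for word, start, end in example:
    print(f"{word}: {start:.2f}s to {end:.2f}s")
```

Because the frame numbers come from the decoder's best path, the resulting times are only as precise as the frame resolution of the acoustic model; the sketch does no smoothing or padding around word boundaries.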