[S2T] DeepSpeech model data processing error #3938

wangdach · 2024-12-06T09:13:43Z

Describe the bug

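Environment 1: PaddleSpeech develop branch

Based on the PaddleSpeech develop branch, I set up a Paddle environment with Python 3.10 and the whl package from the Paddle dev branch.
I followed the instructions under PaddleSpeech/examples/librispeech/asr0/ to run the data processing step.
The final subset, train-500 (train-other-500 in the log), fails with an error: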

/workspace/PaddleSpeech/examples/librispeech/asr0 {develop *} bash run.sh --stage 0 --stop_stage 0
checkpoint name deepspeech2
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/test-clean.
Creating manifest data/manifest.test-clean ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/dev-clean.
Creating manifest data/manifest.dev-clean ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/train-clean-100.
Creating manifest data/manifest.train-clean-100 ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/test-other.
Creating manifest data/manifest.test-other ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/dev-other.
Creating manifest data/manifest.dev-other ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/train-clean-360.
Creating manifest data/manifest.train-clean-360 ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/train-other-500.
Creating manifest data/manifest.train-other-500 ...
Creating manifest data/manifest.train-other-500 ...
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/workspace/PaddleSpeech/dataset/librispeech/librispeech.py", line 146, in prepare_dataset
    create_manifest(target_dir, manifest_path)
  File "/workspace/PaddleSpeech/dataset/librispeech/librispeech.py", line 98, in create_manifest
    audio_data, samplerate = soundfile.read(audio_filepath)
  File "/usr/local/lib/python3.10/dist-packages/soundfile.py", line 285, in read
    with SoundFile(file, 'r', samplerate, channels,
  File "/usr/local/lib/python3.10/dist-packages/soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/usr/local/lib/python3.10/dist-packages/soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening '/workspace/PaddleSpeech/dataset/librispeech/train-other-500/LibriSpeech/train-other-500/5480/41791/5480-41791-0000.flac': System error.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/PaddleSpeech/dataset/librispeech/librispeech.py", line 186, in <module>
    main()
  File "/workspace/PaddleSpeech/dataset/librispeech/librispeech.py", line 180, in main
    pool.starmap(prepare_dataset, tasks)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
soundfile.LibsndfileError: Error opening '/workspace/PaddleSpeech/dataset/librispeech/train-other-500/LibriSpeech/train-other-500/5480/41791/5480-41791-0000.flac': System error.
Prepare LibriSpeech failed. Terminated.
λ tjdm-isa-ai-p800node13 /workspace/PaddleSpeech/examples/librisp

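Environment 2: PaddleSpeech tag v2.1.1

For comparison, I tried the same steps on PaddleSpeech tag v2.1.1, which of course only works with an older Paddle version. The problem did not occur there: data processing succeeded and the deepspeech model trained normally.
I then exported the librispeech manifest data produced there to environment 1. Starting training there fails with an error.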

# Launch command
python3 -u  /workspace/PaddleSpeech/paddlespeech/s2t/exps/deepspeech2/bin/train.py --nxpu 1 --ngpu 0 --config conf/deepspeech2.yaml --output exp/deepspeech2 --seed 0


wangdach · 2024-12-06T09:20:06Z

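Additional information:
The error is raised in feat_dim_and_vocab_size in PaddleSpeech/paddlespeech/s2t/io/dataloader.py.
1. I looked through the data processing code of PaddleSpeech tag v2.1.1 and found no feat_dim or data_json code, so the two versions may differ considerably.
2. Inspecting with pdb shows that the keys input and output do not exist.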
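
The key check in item 2 can also be done without pdb. The following is a minimal sketch, assuming the manifest is a JSON-lines file (one JSON object per line) and that the loader expects top-level `input` and `output` keys, as the pdb session suggests. The function name `missing_manifest_keys` is hypothetical and not part of PaddleSpeech.

```python
import json


def missing_manifest_keys(manifest_path, required=("input", "output"), limit=5):
    """Return (line_number, missing_keys, present_keys) for the first few bad entries.

    Assumption: one JSON object per line; blank lines are skipped.
    """
    problems = []
    with open(manifest_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            missing = [key for key in required if key not in entry]
            if missing:
                # Record which keys the entry does have, to compare formats.
                problems.append((lineno, missing, sorted(entry)))
                if len(problems) >= limit:
                    break
    return problems
```

Calling `missing_manifest_keys("data/manifest.train-clean-100")` on a manifest from each environment and comparing the reported `present_keys` would show how the two manifest formats differ.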

Ray961123 · 2024-12-16T09:30:57Z

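Hello, and thank you for your interest in the PaddleSpeech open source project. We are sorry for the poor development experience. Maintainer capacity for the project is currently limited, so you can try to resolve this yourself by modifying the PaddleSpeech source code, or ask other developers in the open source community for help. PaddlePaddle open source community channel: 飞桨AI Studio星河社区-人工智能学习与实训社区 (PaddlePaddle AI Studio, an AI learning and training community)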

zxcd · 2024-12-17T09:25:55Z

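Are you sure this is not a data problem? Also, which version is v2.1.1? PaddleSpeech has no branch with that name.
Judging from the error, it is most likely caused by
/workspace/PaddleSpeech/dataset/librispeech/train-other-500/LibriSpeech/train-other-500/5480/41791/5480-41791-0000.flac being missing or corrupted.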
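
A quick way to tell "missing" from "corrupted" is to look at the file on disk before involving any audio library. The sketch below uses only the standard library and relies on the fact that a FLAC stream starts with the four bytes `fLaC`. The helper name `diagnose_flac` is hypothetical, and an "ok" result only means the header looks right; a file truncated further in would still pass this check.

```python
import os

FLAC_MAGIC = b"fLaC"  # 4-byte marker at the start of a FLAC stream


def diagnose_flac(path):
    """Classify a file as missing, not readable, empty, bad header, or ok."""
    if not os.path.exists(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "not readable"
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as f:
        # Only the header is inspected; the audio frames are not decoded.
        if f.read(4) != FLAC_MAGIC:
            return "bad header"
    return "ok"
```

Running `diagnose_flac(...)` on the path from the traceback would show whether the file is absent, unreadable, or present with a damaged header.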

megemini · 2024-12-17T14:13:44Z

@wangdach You could try reading this flac file manually:

>>> import soundfile
>>> soundfile.read('5480-41791-0000.flac')
(array([ 6.10351562e-05,  6.10351562e-05, -9.15527344e-05, ...,
       -1.12915039e-03, -4.57763672e-04,  4.88281250e-04]), 16000)
>>>

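Check whether it can be read normally and whether the file is complete.

I tested on paddle 3.0 beta2 and the file read normally; I did not run into this problem.

The train-other-500 dataset is fairly large, so perhaps something went wrong during extraction and the file was corrupted?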
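
If a single file reads fine, the same manual check can be applied to the whole subset. This is a minimal sketch, assuming the audio files end in `.flac` under the dataset directory; `find_unreadable` is a hypothetical helper, not part of PaddleSpeech. By default it calls `soundfile.read`, the same call that fails in the traceback, and it runs in a single process rather than through the multiprocessing pool used by librispeech.py.

```python
from pathlib import Path


def find_unreadable(root, reader=None, pattern="*.flac"):
    """Try to read every matching file under root; return (path, error) pairs."""
    if reader is None:
        import soundfile  # imported lazily; only needed for the default reader
        reader = soundfile.read
    failures = []
    for path in sorted(Path(root).rglob(pattern)):
        try:
            reader(str(path))  # decodes the whole file, so truncation is caught too
        except Exception as exc:
            failures.append((str(path), f"{type(exc).__name__}: {exc}"))
    return failures
```

For example, `find_unreadable("/workspace/PaddleSpeech/dataset/librispeech/train-other-500")` lists every file that cannot be decoded. An empty list would point away from corrupted data and toward the environment instead.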

wangdach added Bug S2T asr/st labels Dec 6, 2024

wangdach assigned zh794390558 Dec 6, 2024
