[S2T] deepspeech model data processing fails with an error #3938

Open
wangdach opened this issue Dec 6, 2024 · 4 comments

wangdach commented Dec 6, 2024

Describe the bug

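Environment 1: PaddleSpeech develop branch

I set up a paddle environment based on the PaddleSpeech develop branch: py3.10 plus a whl package from the paddle develop branch.
I ran the data processing step following the instructions under PaddleSpeech/examples/librispeech/asr0/.
The final step, train-500 (train-other-500 in the log), fails with an error.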

/workspace/PaddleSpeech/examples/librispeech/asr0 {develop *} bash run.sh --stage 0 --stop_stage 0
checkpoint name deepspeech2
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/test-clean.
Creating manifest data/manifest.test-clean ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/dev-clean.
Creating manifest data/manifest.dev-clean ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/train-clean-100.
Creating manifest data/manifest.train-clean-100 ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/test-other.
Creating manifest data/manifest.test-other ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/dev-other.
Creating manifest data/manifest.dev-other ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/train-clean-360.
Creating manifest data/manifest.train-clean-360 ...
Skip downloading and unpacking. Data already exists in /workspace/PaddleSpeech/dataset/librispeech/train-other-500.
Creating manifest data/manifest.train-other-500 ...
Creating manifest data/manifest.train-other-500 ...
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/workspace/PaddleSpeech/dataset/librispeech/librispeech.py", line 146, in prepare_dataset
    create_manifest(target_dir, manifest_path)
  File "/workspace/PaddleSpeech/dataset/librispeech/librispeech.py", line 98, in create_manifest
    audio_data, samplerate = soundfile.read(audio_filepath)
  File "/usr/local/lib/python3.10/dist-packages/soundfile.py", line 285, in read
    with SoundFile(file, 'r', samplerate, channels,
  File "/usr/local/lib/python3.10/dist-packages/soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/usr/local/lib/python3.10/dist-packages/soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening '/workspace/PaddleSpeech/dataset/librispeech/train-other-500/LibriSpeech/train-other-500/5480/41791/5480-41791-0000.flac': System error.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/PaddleSpeech/dataset/librispeech/librispeech.py", line 186, in <module>
    main()
  File "/workspace/PaddleSpeech/dataset/librispeech/librispeech.py", line 180, in main
    pool.starmap(prepare_dataset, tasks)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
soundfile.LibsndfileError: Error opening '/workspace/PaddleSpeech/dataset/librispeech/train-other-500/LibriSpeech/train-other-500/5480/41791/5480-41791-0000.flac': System error.
Prepare LibriSpeech failed. Terminated.
λ tjdm-isa-ai-p800node13 /workspace/PaddleSpeech/examples/librisp

image

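Environment 2: PaddleSpeech tag v2.1.1

For comparison, I tried the same thing on PaddleSpeech tag v2.1.1, which of course only works with an older paddle version. The problem did not occur there: data processing succeeded and the deepspeech model trained normally.
I then exported the librispeech manifest data produced there to Environment 1. Starting training in Environment 1 with that data raises an error.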

# Launch command
python3 -u  /workspace/PaddleSpeech/paddlespeech/s2t/exps/deepspeech2/bin/train.py --nxpu 1 --ngpu 0 --config conf/deepspeech2.yaml --output exp/deepspeech2 --seed 0

image

wangdach (Author) commented Dec 6, 2024

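Additional notes:
The error is raised in feat_dim_and_vocab_size in PaddleSpeech/paddlespeech/s2t/io/dataloader.py.
1. I looked through the data processing code in PaddleSpeech tag v2.1.1 and found no feat_dim or data_json code, so the two versions may differ substantially.
2. Inspecting with pdb shows that the keys input and output do not exist.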
image
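
A quick way to confirm the missing keys without pdb is to scan the manifest directly. The following is a minimal sketch, assuming the manifest stores one JSON object per line; `missing_keys` is a hypothetical helper for illustration, not part of PaddleSpeech, and the key names `input` and `output` are taken from the pdb observation above.

```python
import json


def missing_keys(manifest_path, required=("input", "output")):
    """Return (line_number, missing_key_list) for every manifest entry
    that lacks one of the required top-level keys.

    Assumption: the manifest holds one JSON object per line.
    """
    problems = []
    with open(manifest_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            missing = [key for key in required if key not in entry]
            if missing:
                problems.append((lineno, missing))
    return problems
```

If every entry is reported, the manifest was probably written in a different layout than the one the develop-branch dataloader reads, which would be consistent with the version difference suspected in point 1.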

@Ray961123

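Hello, and thank you for your interest in the PaddleSpeech open-source project. Sorry for the poor development experience. Maintainer capacity for the project is currently limited, so you can either try to resolve this yourself by modifying the PaddleSpeech source code, or ask other developers in the open-source community for help. PaddlePaddle open-source community channel: PaddlePaddle AI Studio (飞桨AI Studio星河社区), a community for AI learning and hands-on training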

zxcd (Collaborator) commented Dec 17, 2024

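Are you sure this is not a data problem? Also, which version is v2.1.1? paddlespeech has no such branch.
Judging from the error, it looks like it is caused by
/workspace/PaddleSpeech/dataset/librispeech/train-other-500/LibriSpeech/train-other-500/5480/41791/5480-41791-0000.flac being missing or corrupted.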
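
The "missing or corrupted" hypothesis can be narrowed down with the standard library alone. This is a rough sketch, not a full validator: `check_flac` is a hypothetical helper, and it relies only on the fact that a native FLAC stream begins with the four bytes `fLaC`. A file truncated later in the stream would still pass, and a file with a prepended tag block would be reported as `bad-header`.

```python
import os

FLAC_MAGIC = b"fLaC"  # first four bytes of a native FLAC stream


def check_flac(path):
    """Classify a path as 'missing', 'empty', 'bad-header' or 'ok'.

    This only inspects existence, size and the stream marker; it does
    not decode any audio.
    """
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as f:
        head = f.read(4)
    if head != FLAC_MAGIC:
        return "bad-header"
    return "ok"
```

Calling `check_flac` on the path from the traceback separates a missing or empty file from one that exists but may be damaged further in.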

@megemini (Contributor)

@wangdach You could try reading this flac file manually:

>>> import soundfile
>>> soundfile.read('5480-41791-0000.flac')
(array([ 6.10351562e-05,  6.10351562e-05, -9.15527344e-05, ...,
       -1.12915039e-03, -4.57763672e-04,  4.88281250e-04]), 16000)
>>> 

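See whether it can be read normally and whether the file is complete.

I tested on paddle 3.0 beta2 and the file read fine; I did not run into this problem.

The train-other-500 dataset is fairly large, so perhaps the extraction failed and the file got corrupted?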
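
If the extraction did go wrong, more than one file may be affected, so it can help to scan the whole directory instead of testing a single file. The sketch below is illustrative only: `find_unreadable` and its `opener` parameter are hypothetical names, and by default it uses `soundfile.info`, which reads the header without decoding the audio. Passing `soundfile.read` as `opener` would decode each file fully, at the cost of a much slower scan.

```python
from pathlib import Path


def find_unreadable(root, pattern="*.flac", opener=None):
    """Return (path, error_message) for every matching file under root
    that the opener fails on.

    opener defaults to soundfile.info (header-only check); any callable
    taking a path string can be supplied instead.
    """
    if opener is None:
        import soundfile  # third-party; imported lazily
        opener = soundfile.info
    bad = []
    for path in sorted(Path(root).rglob(pattern)):
        try:
            opener(str(path))
        except Exception as exc:  # collect every failure, keep scanning
            bad.append((str(path), str(exc)))
    return bad
```

Running this serially also avoids the multiprocessing pool, so if every file opens here but the pooled run in librispeech.py still fails, the data itself is less likely to be the cause.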
