[BUG] Unable to process PDF files containing Traditional Chinese characters, reporting encoding errors #2129

Open
2 of 9 tasks
ulnit opened this issue Nov 25, 2024 · 1 comment
Labels
bug Something isn't working

Comments

ulnit commented Nov 25, 2024

Pre-check

  • I have searched the existing issues and none cover this bug.

Description

10:36:32.953 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449514-0.PDF
10:36:32.954 [INFO ] private_gpt.components.ingest.ingest_component - Ingesting file_name=11449527-0.PDF
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/site-packages/injector/__init__.py", line 800, in get
return self._context[key]
~~~~~~~~~~~~~^^^^^
KeyError: <class 'private_gpt.server.ingest.ingest_service.IngestService'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 74, in transform_file_into_documents
documents = IngestionHelper._load_file_to_documents(file_name, file_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 92, in _load_file_to_documents
return string_reader.load_data([file_data.read_text()])
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/pathlib.py", line 1059, in read_text
return f.read()
^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/data/software/private-gpt/scripts/ingest_folder.py", line 121, in <module>
worker.ingest_folder(root_path, args.ignored)
File "/data/software/private-gpt/scripts/ingest_folder.py", line 57, in ingest_folder
self._ingest_all(self._files_under_root_folder)
File "/data/software/private-gpt/scripts/ingest_folder.py", line 61, in _ingest_all
self.ingest_service.bulk_ingest([(str(p.name), p) for p in files_to_ingest])
File "/data/software/private-gpt/private_gpt/server/ingest/ingest_service.py", line 87, in bulk_ingest
documents = self.ingest_component.bulk_ingest(files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_component.py", line 279, in bulk_ingest
self._ingest_work_pool.starmap(self.ingest, files)
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 375, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/software/private-gpt/private_gpt/components/ingest/ingest_component.py", line 264, in ingest
documents = self._file_to_documents_work_pool.apply(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 360, in apply
return self.apply_async(func, args, kwds).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/dbgpt_env/lib/python3.11/multiprocessing/pool.py", line 774, in get
raise self._value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
make: *** [Makefile:52:ingest] error 1
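
The byte and offset in the error are consistent with a binary PDF being read as text rather than parsed as a PDF: PDF files commonly start with a version header followed by a comment line of high-bit bytes. A minimal sketch that reproduces the same message, assuming a header of that shape (the sample bytes are illustrative and were not taken from the files in this report):

```python
# Illustrative PDF-style header: "%PDF-1.4", CRLF, then a comment line of
# high-bit bytes. These sample bytes are an assumption, not the actual file.
sample = b"%PDF-1.4\r\n%\xb5\xb5\xb5\xb5\r\n"


def try_decode(data: bytes) -> str:
    """Return the decoded text, or the UnicodeDecodeError message on failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)


# Decoding fails at the first high-bit byte, which sits at offset 11.
message = try_decode(sample)
print(message)
```

If this is what happens here, the Chinese text inside the document may not be the trigger; any PDF routed to the plain-text path would likely fail the same way.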

Steps to Reproduce

#PGPT_PROFILES=ollama-pg make ingest /data/private_gpt_data/s_reports/s_hk_reports/ -- --watch

Expected Behavior

The files are ingested normally.

Actual Behavior

Ingestion aborts with UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
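
The traceback ends in file_data.read_text(), and the log shows file names with an uppercase .PDF suffix. One possible explanation, stated here as an assumption and not confirmed against the private-gpt source, is that the reader is selected by a case-sensitive extension lookup, so .PDF misses the PDF reader and falls through to the plain-text path. A sketch of a case-insensitive lookup; READERS, pick_reader, and the reader names are hypothetical placeholders:

```python
from pathlib import Path

# Hypothetical extension-to-reader table; the names are placeholders and
# do not come from the private-gpt source.
READERS = {".pdf": "pdf_reader", ".docx": "docx_reader"}


def pick_reader(file_name: str) -> str:
    """Choose a reader by file suffix, ignoring case; default to plain text."""
    suffix = Path(file_name).suffix.lower()
    return READERS.get(suffix, "text_fallback")


print(pick_reader("11449514-0.PDF"))
```

Renaming the files to a lowercase .pdf extension before running the ingest would be a quick way to test this hypothesis without changing any code.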

Environment

CPU Python 3.11.10

Additional Information

No response

Version

No response

Setup Checklist

  • Confirm that you have followed the installation instructions in the project’s documentation.
  • Check that you are using the latest version of the project.
  • Verify disk space availability for model storage and data processing.
  • Ensure that you have the necessary permissions to run the project.

NVIDIA GPU Setup Checklist

  • Check that all CUDA dependencies are installed and are compatible with your GPU (refer to CUDA's documentation)
  • Ensure an NVIDIA GPU is installed and recognized by the system (run nvidia-smi to verify).
  • Ensure proper permissions are set for accessing GPU resources.
  • Docker users - Verify that the NVIDIA Container Toolkit is configured correctly (e.g. run sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi)
ulnit added the bug label on Nov 25, 2024

yaziciali commented Nov 28, 2024

I have a similar issue:
Generating embeddings: 0it [00:00, ?it/s]
Traceback (most recent call last):
File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 122, in <module>
worker.ingest_folder(root_path, args.ignored)
File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 58, in ingest_folder
self._ingest_all(self._files_under_root_folder)
File "/Users/user/AI/private-gpt/scripts/ingest_folder.py", line 62, in _ingest_all
self.ingest_service.bulk_ingest([(str(p.name), p) for p in files_to_ingest])
File "/Users/user/AI/private-gpt/private_gpt/server/ingest/ingest_service.py", line 87, in bulk_ingest
documents = self.ingest_component.bulk_ingest(files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_component.py", line 132, in bulk_ingest
documents = IngestionHelper.transform_file_into_documents(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 74, in transform_file_into_documents
documents = IngestionHelper._load_file_to_documents(file_name, file_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/AI/private-gpt/private_gpt/components/ingest/ingest_helper.py", line 92, in _load_file_to_documents
return string_reader.load_data([file_data.read_text()])
^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/pathlib.py", line 1059, in read_text
return f.read()
^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 14: invalid start byte
make: *** [ingest] Error 1

private-gpt % cat version.txt
0.6.2
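
For files that really are text but are not UTF-8 encoded (Traditional Chinese text is often stored as Big5), one possible workaround is to try a short list of encodings before giving up. A minimal sketch, assuming a hypothetical helper named decode_lenient that is not part of private-gpt; the encoding list is also an assumption and should match the actual data:

```python
def decode_lenient(data: bytes, encodings=("utf-8", "big5")) -> str:
    """Decode bytes by trying each encoding in order.

    Hypothetical helper for illustration. If no encoding succeeds, decode
    as UTF-8 and substitute U+FFFD for every undecodable byte.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: never raise, but the result may contain replacement marks.
    return data.decode("utf-8", errors="replace")


# Big5-encoded text is not valid UTF-8, so the second encoding is used.
text = decode_lenient("中文".encode("big5"))
print(text)
```

This only helps for genuine text files. Binary formats such as PDF would decode to garbage this way and need a format-specific parser instead.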
