Introduce support for PDFs #7318

Open
yabramuvdi opened this issue Dec 10, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@yabramuvdi

Feature request

The idea (discussed on the Discord server with @lhoestq) is to add a Pdf feature type alongside Image, Audio, and Video. For example, the recently added Video type defines how to decode a video file, stored as a dictionary like {"path": ..., "bytes": ...}, into a VideoReader using decord. We want to do the same for PDFs and get back a pypdfium2.PdfDocument.
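
To make the proposal concrete, here is a minimal sketch of the decoding step only, not the actual `datasets` implementation. The names `decode_pdf_example` and `_open_with_pypdfium2` are hypothetical, and preferring `bytes` over `path` when both are present is an assumption. The only library call it relies on is `pypdfium2.PdfDocument(...)`, which accepts either a file path or raw bytes.

```python
from typing import Any, Callable, Optional


def _open_with_pypdfium2(source: Any) -> Any:
    # Imported lazily so this module still loads when pypdfium2 is not installed.
    import pypdfium2

    # PdfDocument accepts a file path or the raw bytes of a PDF.
    return pypdfium2.PdfDocument(source)


def decode_pdf_example(
    value: dict, opener: Optional[Callable[[Any], Any]] = None
) -> Any:
    """Hypothetical helper: turn {"path": ..., "bytes": ...} into a document.

    `opener` is an illustrative hook so another backend can be swapped in;
    it defaults to pypdfium2.
    """
    opener = opener or _open_with_pypdfium2
    pdf_bytes = value.get("bytes")
    path = value.get("path")
    # Assumption: embedded bytes win over the path when both are set.
    if pdf_bytes is not None:
        return opener(pdf_bytes)
    if path is not None:
        return opener(path)
    raise ValueError("A PDF example needs at least one of 'path' or 'bytes'.")
```

With pypdfium2 installed, `decode_pdf_example({"path": "report.pdf", "bytes": None})` would return a `pypdfium2.PdfDocument`; passing a different `opener` is one way to try another PDF library without changing the stored format.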

Motivation

In many cases PDFs contain very valuable information beyond text (e.g. images, figures). Support for PDFs would help create datasets where all the information is preserved.

Your contribution

I can start the implementation of the Pdf type :)

@yabramuvdi added the enhancement (New feature or request) label Dec 10, 2024
@yabramuvdi
Author

#self-assign

@lhoestq
Member

lhoestq commented Dec 10, 2024

Awesome! Let me know if you have any questions or if I can help :)

cc @AndreaFrancis as well for viz

@lhoestq
Member

lhoestq commented Dec 10, 2024

Other candidate libraries for the Pdf type: PyMuPDF, pypdf, and pdfplumber

EDIT: PyMuPDF looks like a good choice in terms of maturity, performance, and versatility, BUT the license may be an issue; pypdf, pypdfium2, or pdfplumber are good options imo

@AndreaFrancis
Contributor

Related to #7058

@yabramuvdi
Author

PyMuPDF is AGPL licensed, so we can't use it. I will move forward with pdfplumber.

@yabramuvdi
Author

Hi both! I have opened a pull request with a first basic implementation of the Pdf feature. I closely followed what I saw in the Video and Image features. It is my first time contributing, so any comments are very welcome. I think it would be useful to outline together what else we could implement (e.g. enabling parsing of the PDF). Thanks :)


3 participants