Introduce support for PDFs #7318
Comments
#self-assign |
Awesome! Let me know if you have any questions or if I can help :) cc @AndreaFrancis as well for viz |
Other candidate libraries for the Pdf type: PyMuPDF, pypdf, and pdfplumber. EDIT: PyMuPDF looks like a good choice when it comes to maturity + performance + versatility, BUT the license may be an issue; pypdf, pypdfium2, and pdfplumber are good options imo |
Related to #7058 |
PyMuPDF is AGPL licensed, so we can't use it. I will move forward with pdfplumber. |
Hi both! I have made a pull request with a first basic implementation of the Pdf feature. I closely followed what I saw in the Video and Image features. It is my first time contributing, so any comments are very welcome. I think it would be useful to outline together what else we could implement (e.g. enabling parsing of the PDF). Thanks :) |
Feature request
The idea (discussed in the Discord server with @lhoestq) is to have a Pdf type like Image/Audio/Video. For example, Video was recently added and defines how to decode a video file, encoded as a dictionary like {"path": ..., "bytes": ...}, into a VideoReader using decord. We want to do the same with PDFs and get a pypdfium2.PdfDocument.
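A rough sketch of what the decoding step could look like, assuming the same {"path": ..., "bytes": ...} layout as the other features. The names `pdf_source` and `decode_pdf` are hypothetical and only for illustration, not an existing `datasets` API. The only library call used is `pypdfium2.PdfDocument(...)`, which accepts either a file path or raw bytes.

```python
from typing import Dict, Optional, Union


def pdf_source(value: Dict[str, Optional[Union[str, bytes]]]) -> Union[str, bytes]:
    """Pick what to hand to the PDF reader (hypothetical helper).

    In-memory bytes are preferred when present; otherwise fall back to the path.
    """
    if value.get("bytes") is not None:
        return value["bytes"]
    if value.get("path") is not None:
        return value["path"]
    raise ValueError("A PDF should have one of 'path' or 'bytes', but both are None.")


def decode_pdf(value: Dict[str, Optional[Union[str, bytes]]]):
    """Decode an encoded PDF dict into a pypdfium2.PdfDocument (sketch only)."""
    # Imported lazily so that pypdfium2 is only required when decoding.
    import pypdfium2 as pdfium

    return pdfium.PdfDocument(pdf_source(value))
```

With something like this, `decode_pdf({"path": "report.pdf", "bytes": None})` would return a `PdfDocument`, and `len(...)` on it gives the number of pages. Whether bytes or path should take priority is a design choice left open here.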
Motivation
In many cases PDFs contain very valuable information beyond text (e.g. images, figures). Support for PDFs would help create datasets where all the information is preserved.
Your contribution
I can start the implementation of the Pdf type :)