Introduce support for PDFs #7318

yabramuvdi · 2024-12-10T16:59:48Z

Feature request

The idea (discussed in the Discord server with @lhoestq) is to have a Pdf type like Image/Audio/Video. For example, Video was recently added and defines how to decode a video file encoded in a dictionary like {"path": ..., "bytes": ...} as a VideoReader using decord. We want to do the same with PDFs and get a pypdfium2.PdfDocument.
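
To make the proposal concrete, here is a rough sketch of what the encode/decode round trip could look like. The class name `PdfSketch` and its method names are hypothetical, chosen only to mirror the {"path": ..., "bytes": ...} convention described above; this is not the actual `datasets` implementation. The only third-party call used is `pypdfium2.PdfDocument(...)`, which accepts either a file path or raw bytes, and it is imported lazily so the encoding side works without it installed.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class PdfSketch:
    """Hypothetical stand-in for the proposed Pdf feature (illustration only)."""

    def encode_example(self, value: Union[str, bytes, bytearray, dict]) -> dict:
        # Normalize any supported input into the {"path", "bytes"} storage dict.
        if isinstance(value, str):
            return {"path": value, "bytes": None}
        if isinstance(value, (bytes, bytearray)):
            return {"path": None, "bytes": bytes(value)}
        if isinstance(value, dict):
            return {"path": value.get("path"), "bytes": value.get("bytes")}
        raise TypeError(f"Unsupported PDF input type: {type(value).__name__}")

    def decode_example(self, value: dict):
        # Imported lazily so that encoding does not require pypdfium2.
        import pypdfium2

        # Prefer in-memory bytes; otherwise fall back to the file path.
        if value.get("bytes") is not None:
            return pypdfium2.PdfDocument(value["bytes"])
        if value.get("path") is not None:
            return pypdfium2.PdfDocument(value["path"])
        raise ValueError("A PDF example needs at least one of 'path' or 'bytes'.")
```

With pypdfium2 installed, `PdfSketch().decode_example(PdfSketch().encode_example("report.pdf"))` would return a `pypdfium2.PdfDocument` for a file at that (made-up) path. Preferring bytes over the path is an assumption of this sketch.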

Motivation

In many cases PDFs contain very valuable information beyond text (e.g. images, figures). Support for PDFs would help create datasets where all the information is preserved.

Your contribution

I can start the implementation of the Pdf type :)

yabramuvdi · 2024-12-10T17:00:49Z

#self-assign

lhoestq · 2024-12-10T17:30:45Z

Awesome! Let me know if you have any questions or if I can help :)

cc @AndreaFrancis as well for viz

lhoestq · 2024-12-10T18:08:00Z

Other candidate libraries for the Pdf type: PyMuPDF, pypdf and pdfplumber

EDIT: PyMuPDF looks like a good choice when it comes to maturity + performance + versatility, BUT the license may be an issue; pypdf, pypdfium2 or pdfplumber are good options imo

AndreaFrancis · 2024-12-11T16:34:43Z

Related to #7058

yabramuvdi · 2024-12-12T14:01:19Z

PyMuPDF is AGPL licensed, so we can't use it. I will move forward with pdfplumber.

yabramuvdi · 2024-12-12T18:38:11Z

Hi both! I have made a pull request with a first basic implementation of the Pdf feature. I closely followed what I saw in the Video and Image features. It is my first time contributing, so any comments are very welcome. I think it would be useful to outline together what additional things we can implement (e.g. enabling parsing of the PDF). Thanks :)

yabramuvdi added the enhancement New feature or request label Dec 10, 2024

yabramuvdi mentioned this issue Dec 12, 2024

Introduce pdf support (#7318) #7325

Open
