Help in debugging conversion of a PDF to text #641

Open
bkosowski opened this issue Dec 20, 2024 · 0 comments

I'm trying to convert a PDF file (free and openly available):
Alejandro Villamor - Subtracting Suffering - An Anti-Aggregationist Approach to Suffering in Nature (2024).pdf

Using the following command:

docling --device cuda --num-threads 8 --table-mode accurate  --ocr-lang en --from pdf --to text --ocr --verbose "Alejandro Villamor - Subtracting Suffering - An Anti-Aggregationist Approach to Suffering in Nature (2024).pdf" --debug-visualize-ocr --debug-visualize-cells --debug-visualize-layout

with these versions:

docling --version
Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

The debug images are generated, but they don't tell me much. I'm attaching examples for cells, layout, and OCR.
Cells:
cells_page_00001
Layout:
postprocessed_layout_page_00001
OCR:
ocr_page_00001

The generated file:
Alejandro Villamor - Subtracting Suffering - An Anti-Aggregationist Approach to Suffering in Nature (2024).txt

The problem: on the second page of the PDF, just above the rule separating the main text from the footnotes, there is this text:

Ontological Prevalence of Suffering in Nature. There is an ontological prevalence of suffering over welfare in nature. That is, the net sum or iterative comparison (one by one) of

This fragment has not been converted, so it is missing from the generated text file. (This is just one example of text from the PDF that is missing from the output.)
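Since this is only one of several missing passages, a small script can list which known fragments are absent from the converted text. This is a minimal sketch in plain Python, not part of docling: the function names `normalize` and `missing_fragments` are hypothetical, and the `converted_text` string is a short stand-in for the real contents of the generated .txt file.

```python
def normalize(text: str) -> str:
    """Lower-case the text and collapse every run of whitespace to one space."""
    return " ".join(text.split()).lower()


def missing_fragments(converted_text: str, fragments: list[str]) -> list[str]:
    """Return the fragments that do not occur in the converted text."""
    haystack = normalize(converted_text)
    return [f for f in fragments if normalize(f) not in haystack]


# Stand-in for the contents of the generated .txt file (assumption); in
# practice, read the file with pathlib.Path(...).read_text(encoding="utf-8").
converted_text = "1. Introduction\nRHV, 2024, No 26,243-267\n"

# Passages copied by hand from the PDF.
fragments = [
    "1. Introduction",
    "Ontological Prevalence of Suffering in Nature. There is an ontological "
    "prevalence of suffering over welfare in nature.",
]

for fragment in missing_fragments(converted_text, fragments):
    print("MISSING:", fragment)
```

The comparison ignores case and line breaks, so a passage that was merely re-wrapped during conversion is not reported as missing. It does not handle words hyphenated across lines.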

I investigated further by rendering the PDF pages to PNG images with ImageMagick:

convert -background white -alpha remove -density 300 +antialias -interpolate Nearest -quality 90 "Alejandro Villamor - Subtracting Suffering - An Anti-Aggregationist Approach to Suffering in Nature (2024).pdf" /mnt/d/imgs/page-%d.png

and then running EasyOCR manually on the second page (page-1.png, since the numbering starts at 0):

easyocr --download_enabled True --detector True --decoder beamsearch --workers 4 --paragraph True --lang en --gpu False --verbose True -f "D:\imgs\page-1.png"

After a long while, it produced the following output:

D:\AI\Tools\docling\venv\Lib\site-packages\easyocr\utils.py:221: RuntimeWarning: overflow encountered in scalar add
  curr.entries[labeling].prTotal += prBlank + prNonBlank
D:\AI\Tools\docling\venv\Lib\site-packages\easyocr\utils.py:248: RuntimeWarning: overflow encountered in scalar add
  curr.entries[newLabeling].prNonBlank += prNonBlank
D:\AI\Tools\docling\venv\Lib\site-packages\easyocr\utils.py:249: RuntimeWarning: overflow encountered in scalar add
  curr.entries[newLabeling].prTotal += prNonBlank
D:\AI\Tools\docling\venv\Lib\site-packages\easyocr\utils.py:219: RuntimeWarning: overflow encountered in scalar add
  curr.entries[labeling].prNonBlank += prNonBlank
[[[656, 289], [1821, 289], [1821, 420], [656, 420]], 'Subtracting Suffering: An Anti-Aggregationist A Alejandro Villamor Iglesias']
[[[1141, 504], [1338, 504], [1338, 560], [1141, 560]], 'Resumen']
[[[344, 576], [2137, 576], [2137, 1422], [344, 1422]], 'En los ultimos anos, cada vez es ma prevalencia del sufrimiento sobre el b Esta creencia suele coincid Una axiologia sensocentrista segun la cual lo moralmente relevante placer y dolor: Esta combinacion conduce tiene una enorme relevancia moral, Est y argumenta, en su lugar, que podria no ser coherente: La afirmacion de que existe una prevalencia ontologica, en abstracto, del s embargo, no sucede lo mismo al respecto de su puede considerar que un calculo agregacionista sea moralmente valioso, estri pues no hay sujeto que lo sienta. No obstante, podria mantenerse la necesidad d una intervencion positiva en la naturaleza  Palabras clave: agregacionismo,  antiagregacionismo, etica animal, sufrimiento animal, intervencionismo.']
[[[351, 1537], [666, 1537], [666, 1586], [351, 1586]], '1. Introduction']
[[[344, 1605], [2141, 1605], [2141, 2277], [344, 2277]], "In recent decades; more and more p suffering of non-human animals. In aca phenomenon translates into a growing theoretical interest in the suffering of wild animals (e.g;: Dawkins, 1995; Rolston III, 1992 Horta, 2010a, 2010b, 2015; Faria, 2016; Villamor, 2 Although not a necessary condition;' most of these authors maintain that som that suffering predominates over well-being aggregationist  component? into their  theories, these positions  conduct a controversial inference from the following statement: Ontological Prevalence of Suffering in Nature. There is an ontological prevalence of suffering over welfare in nature: That i comparison (one by one) of"]
[[[344, 2398], [2135, 2398], [2135, 2883], [344, 2883]], "It is important to remember that there is no relation of necessity between consequentialism and aggregationism: Some   theories, such as Maximin or Leximin, are clearly  consequentialists but not aggregationists (Hirose, 2015, 30-31) Likewise, as Hirose has shown, a be present in deontological theories such as Scanlon's th 2 Even though the consequences could be s a conception of additive aggregation. As Larry Temkin emphasize for example, one might have principles o on weighted  totals, like  prioritarianism, OI on the highest or best   achievements, like some forms   of perfectionism, 0 on the wellbeing of those who are worst off, like max"]
[[[1000, 2925], [1479, 2925], [1479, 2974], [1000, 2974]], 'RHV, 2024, No 26,243-267']
[[[1025, 3043], [1054, 3043], [1054, 3058], [1025, 3058]], 'CC']
[[[1074, 3032], [1473, 3032], [1473, 3083], [1074, 3083]], 'CC BY-NC-ND BY Nc ND']
[[[1200, 3157], [1279, 3157], [1279, 3206], [1200, 3206]], '244']

As can be seen, the missing fragment is present in the OCR output, apart from a few recognition errors:

Ontological Prevalence of Suffering in Nature. There is an ontological prevalence of suffering over welfare in nature: That i comparison (one by one) of
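To compare this with the layout and cell debug images, it helps to know which OCR box holds the fragment and where it sits on the page. The following is a minimal sketch that parses EasyOCR's printed result lines (they are valid Python literals) and converts a pixel box to PDF points, assuming the 300 DPI used in the `convert` command above. The helper names are hypothetical, and the second result line in the sample is truncated (marked with `[...]`).

```python
import ast


def parse_easyocr_lines(lines):
    """Parse lines such as "[[[x, y], ...], 'text']" into (box, text) pairs."""
    results = []
    for line in lines:
        line = line.strip()
        if not line.startswith("[[["):
            continue  # skip warnings and other non-result lines
        try:
            box, text = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # skip lines that are not well-formed result literals
        results.append((box, text))
    return results


def find_fragment(results, fragment):
    """Return the (box, text) pairs whose text contains the fragment."""
    needle = " ".join(fragment.split()).lower()
    return [(box, text) for box, text in results
            if needle in " ".join(text.split()).lower()]


def box_to_points(box, dpi=300):
    """Convert a pixel box to (x0, y0, x1, y1) in points (1 pt = 1/72 inch).

    The origin stays at the top-left corner, as in the image.
    """
    scale = 72.0 / dpi
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return (min(xs) * scale, min(ys) * scale, max(xs) * scale, max(ys) * scale)


# Sample taken from the output above; the last text is truncated ("[...]").
raw_output = r"""
D:\AI\Tools\docling\venv\Lib\site-packages\easyocr\utils.py:221: RuntimeWarning: overflow encountered in scalar add
[[[351, 1537], [666, 1537], [666, 1586], [351, 1586]], '1. Introduction']
[[[344, 1605], [2141, 1605], [2141, 2277], [344, 2277]], '[...] Ontological Prevalence of Suffering in Nature. There is an ontological prevalence of suffering over welfare in nature: That i comparison (one by one) of']
"""

results = parse_easyocr_lines(raw_output.splitlines())
for box, text in find_fragment(results, "Ontological Prevalence of Suffering in Nature"):
    print(box_to_points(box), text[:60])
```

If the layout debug image shows no region covering that area, or a region of a type that is left out of the text export, that would point to the layout or assembly step rather than OCR. This is only a way to narrow the search, not a confirmed diagnosis.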

So the OCR itself seems to work. Some other step in the conversion drops the text, but I don't know which one.

Because of that, I don't know where I should post a specific bug report: here or in a dependent project. Could you please help?
