Arabic OCR is not working #601

Alla-Abdella · 2024-12-16T02:31:59Z

I used the code below to parse an Arabic document:

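from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR and table-structure recognition.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Use Tesseract with English and Arabic.
options = TesseractOcrOptions()
options.lang = ['eng', 'ara']
pipeline_options.ocr_options = options

doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,
        ),
    },
)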

Here are the results, which are completely off:


cau-git · 2024-12-16T07:12:42Z

@Alla-Abdella Can you please re-check this after adding options.force_full_page_ocr = True? We need to be sure it is actually using OCR and not preferring content encoded in the PDF.

nikos-livathinos · 2024-12-16T08:50:29Z

@Alla-Abdella Have you installed the Tesseract language packs?
https://tesseract-ocr.github.io/tessdoc/Installation.html
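
One way to check for the Arabic pack is to parse the output of `tesseract --list-langs`. The following is a minimal sketch, not part of docling: the helper names are hypothetical, and it assumes the `tesseract` CLI on PATH reads the same tessdata directory as the Tesseract binding that docling uses.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def missing_languages(wanted, installed):
    """Return the requested language codes that are not installed, in order."""
    return [lang for lang in wanted if lang not in installed]


def installed_tesseract_languages():
    """Ask the tesseract CLI which language packs it can see."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Fall back to stderr in case a build prints the list there.
    return parse_list_langs(result.stdout or result.stderr)
```

Calling `missing_languages(['eng', 'ara'], installed_tesseract_languages())` returns the requested codes that have no installed pack. If `ara` is in that list, the Arabic traineddata file still needs to be installed.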

Alla-Abdella · 2024-12-16T17:42:11Z

@cau-git
ValueError: "TesseractOcrOptions" object has no field "force_full_page_ocr"
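
This error suggests that the installed docling release predates the `force_full_page_ocr` field. That is an assumption, since the thread does not state the version. If so, upgrading docling would be the direct fix. As a rough sketch, a script can also guard the assignment so it runs on both old and new releases. `set_if_supported` and `FakeOcrOptions` are hypothetical names used only for illustration; the stand-in class mimics the error message shown above.

```python
def set_if_supported(options, name, value):
    """Try to set an option; return False if this version rejects the field."""
    try:
        setattr(options, name, value)
    except (AttributeError, ValueError):
        # Pydantic-style models raise ValueError for unknown fields,
        # which matches the error message reported above.
        return False
    return True


class FakeOcrOptions:
    """Hypothetical stand-in for an options model from an older release."""

    _fields = {"lang"}

    def __init__(self):
        object.__setattr__(self, "lang", [])

    def __setattr__(self, name, value):
        if name not in self._fields:
            raise ValueError(
                f'"{type(self).__name__}" object has no field "{name}"'
            )
        object.__setattr__(self, name, value)


options = FakeOcrOptions()
options.lang = ["eng", "ara"]
if not set_if_supported(options, "force_full_page_ocr", True):
    print("force_full_page_ocr is not available in this version")
```

With a real `TesseractOcrOptions` instance in place of the stand-in, the same call would set the field when the installed version supports it and report its absence otherwise.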

Alla-Abdella added the "bug" label (Something isn't working) on Dec 16, 2024
