@Alla-Abdella Can you please re-check this after adding options.force_full_page_ocr = True? We need to be sure it is actually using OCR and not preferring content encoded in the PDF.
I used the code below to parse an Arabic document:
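from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Tesseract OCR with the English and Arabic language packs
options = TesseractOcrOptions()
options.lang = ["eng", "ara"]
pipeline_options.ocr_options = options

doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF],
    # Assumed: this argument was cut off in the original post; without it
    # the pipeline_options built above are never passed to the converter.
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)},
)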
Here are the results, which are completely off: