update: change pdf text parser to pymupdf4llm #139

tungsten106 · 2024-12-19T08:54:28Z

Using pymupdf4llm instead of pdfminer to parse pdf contents into markdown formats, as suggested by #131.

Pros and Cons:

pdfminer extracts text only, so the generated files have no headings, titles, etc. pymupdf4llm, however, produces well-structured markdown, including different heading levels, code blocks, and images (which can be saved to a specific path, although that is not included in this commit).
However, pymupdf4llm can easily emit lines of digits that belong to plots, and it can create tables that do not exist in the source. This is a common problem for most PDF parsers, except those that use OCR models (such as marker and MinerU).

alphaleadership · 2024-12-19T10:39:40Z

maybe this can patch #142

l-lumin

I think it's better to let the user choose the engine rather than replacing it

afourney · 2024-12-19T17:54:56Z

I think it's better to let the user choose the engine rather than replacing it

I agree. There are pros and cons to each. The main thing is to allow a common interface.

Can you propose an interface for this? One option is to just call register_page_converter externally, and the precedence logic would give precedence to whichever converter is registered later. (see here for example: https://github.com/microsoft/markitdown/blob/925c4499f72757abcf6cb521ee10e4844967af3d/src/markitdown/_markitdown.py#L1269C1-L1287C1)
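
The precedence idea in the first option can be illustrated with a small, self-contained sketch. It does not use MarkItDown's real classes or its `register_page_converter` method: `ConverterRegistry`, `register`, and the two toy converter functions are hypothetical names. The only behavior taken from the comment above is that the converter registered later is tried first.

```python
from typing import Callable, List, Optional

# A converter takes a file path and returns Markdown text, or None if it
# cannot handle the file. This signature is an assumption for illustration.
Converter = Callable[[str], Optional[str]]


class ConverterRegistry:
    """Toy registry in which later registrations take precedence."""

    def __init__(self) -> None:
        self._converters: List[Converter] = []

    def register(self, converter: Converter) -> None:
        # Insert at the front so the most recently registered converter
        # is tried first.
        self._converters.insert(0, converter)

    def convert(self, path: str) -> Optional[str]:
        for converter in self._converters:
            result = converter(path)
            if result is not None:
                return result
        return None


def pdfminer_like(path: str) -> Optional[str]:
    # Hypothetical stand-in for a pdfminer-based converter.
    return "plain text" if path.endswith(".pdf") else None


def pymupdf4llm_like(path: str) -> Optional[str]:
    # Hypothetical stand-in for a pymupdf4llm-based converter.
    return "# markdown" if path.endswith(".pdf") else None


registry = ConverterRegistry()
registry.register(pdfminer_like)     # registered first (the default)
registry.register(pymupdf4llm_like)  # registered later, so it wins
pdf_result = registry.convert("paper.pdf")
txt_result = registry.convert("notes.txt")
```

With this ordering a user would opt in to a different engine simply by registering it after the defaults, without any new parameter on `convert`.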

Another option would be to see which dependencies are installed (though this is more opaque)
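
The dependency-detection option could look roughly like the sketch below. The helper name `pick_installed_engine` and the preference order are assumptions for illustration; only `importlib.util.find_spec` is a real standard-library call. It also shows why this approach is more opaque: the engine that runs depends on what happens to be installed.

```python
import importlib.util
from typing import Optional, Sequence


def pick_installed_engine(candidates: Sequence[str]) -> Optional[str]:
    """Return the first top-level module name that is importable, else None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# The preference order here is an assumption: try pymupdf4llm, then pdfminer.
engine = pick_installed_engine(["pymupdf4llm", "pdfminer"])
```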


tungsten106 · 2024-12-24T07:14:39Z

I have added a parameter pdf_engine to let the user choose engine. For example,

from markitdown import MarkItDown

markitdown = MarkItDown()
source = "https://arxiv.org/pdf/2308.08155v2.pdf"
markitdown.convert(source, pdf_engine="pymupdf4llm")  # use pymupdf4llm
markitdown.convert(source, pdf_engine="pdfminer")  # use pdfminer

l-lumin

Could you add a test for pymupdf4llm?

l-lumin · 2024-12-24T10:10:14Z

src/markitdown/_markitdown.py

+        else:
+            return None     # unknown method


This part doesn't return anything. Could you update it? Maybe add a warning message?
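
One way to act on this suggestion is to emit a warning before returning `None`. The sketch below is a stand-alone illustration, not the PR's code: `convert_pdf` and `SUPPORTED_ENGINES` are hypothetical names, and the returned string is a placeholder for real converter output.

```python
import warnings
from typing import Optional

SUPPORTED_ENGINES = ("pdfminer", "pymupdf4llm")


def convert_pdf(local_path: str, pdf_engine: str = "pdfminer") -> Optional[str]:
    """Toy stand-in for the converter; returns placeholder text only."""
    if pdf_engine not in SUPPORTED_ENGINES:
        # Tell the caller why nothing was converted instead of failing silently.
        warnings.warn(
            f"Unknown pdf_engine {pdf_engine!r}; expected one of {SUPPORTED_ENGINES}.",
            UserWarning,
            stacklevel=2,
        )
        return None
    return f"[{pdf_engine}] contents of {local_path}"
```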

l-lumin · 2024-12-24T10:25:46Z

src/markitdown/_markitdown.py

    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
+        """


I think pdf_engine should be a named parameter in the function instead of being accessed via kwargs. This provides more clarity and ensures better handling of default values.

Thanks for your suggestion. I have made it a named parameter (I originally meant to align with the other converters' parameter definitions) and added an exception for the case where pdf_engine is not valid. The new test cases can be seen in my new commit.


l-lumin · 2024-12-25T10:37:03Z

tests/test_markitdown.py

+import sys
+sys.path.insert(0, "/home/yxl/Projects/markitdown/src")


I don't think you need this. Let me know if you need help setting up the tests!

I will remove this; it isn't used.

l-lumin · 2024-12-25T10:37:24Z

src/markitdown/_markitdown.py

+        elif pdf_engine == "pymupdf4llm":
+            text_content = pymupdf4llm.to_markdown(local_path, show_progress=False)
+        else:
+            # return None     # unknown method


remove this

l-lumin · 2024-12-25T10:41:16Z

src/markitdown/_markitdown.py

    """

    def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
+        """


This is optional, but adding one more named parameter could make this more customizable, like:

def convert(self, local_path, engine: Literal['pdfminer', 'pymupdf4llm']='pdfminer', engine_kwargs=None, **kwargs) -> Union[None, DocumentConverterResult]:
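
A minimal sketch of how such a signature might forward `engine_kwargs` is shown below. The two `fake_*` functions are hypothetical stand-ins for the real parsers, and `ValueError` is used only to keep the example self-contained.

```python
from typing import Any, Dict, Literal, Optional


def fake_pdfminer(path: str, **options: Any) -> str:
    # Hypothetical stand-in for a pdfminer call.
    return f"pdfminer({path}, {sorted(options)})"


def fake_pymupdf4llm(path: str, **options: Any) -> str:
    # Hypothetical stand-in for a pymupdf4llm call.
    return f"pymupdf4llm({path}, {sorted(options)})"


def convert(
    local_path: str,
    engine: Literal["pdfminer", "pymupdf4llm"] = "pdfminer",
    engine_kwargs: Optional[Dict[str, Any]] = None,
) -> str:
    # Default to None and copy into a fresh dict per call, so calls never
    # share a mutable default argument.
    options = dict(engine_kwargs or {})
    if engine == "pdfminer":
        return fake_pdfminer(local_path, **options)
    if engine == "pymupdf4llm":
        return fake_pymupdf4llm(local_path, **options)
    raise ValueError(f"Unsupported engine: {engine!r}")


with_images = convert(
    "paper.pdf", engine="pymupdf4llm", engine_kwargs={"write_images": True}
)
```

Keeping the engine-specific options in one dictionary means the converter's own signature does not have to grow each time an engine gains a new option.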

l-lumin · 2024-12-25T10:46:27Z

src/markitdown/_markitdown.py

+            # return None     # unknown method
+            raise FileConversionException("'pdf_engine' not valid. Please choose between ['pdfminer', 'pymupdf4llm'].")


I'd suggest checking the engine first. You could use _engines to define the allowed engines:

_engines: Mapping[str, Any] = {
    "pdfminer": pdfminer,
    "pymupdf4llm": pymupdf4llm,
}

###

if engine is not None and engine not in self._engines:
    raise
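
Filled out as a runnable sketch, the mapping approach might look like the following. `PdfConverterSketch` and the two `_extract_with_*` helpers are hypothetical. The PR raises `FileConversionException`; plain `ValueError` is used here only so the example has no dependencies.

```python
from typing import Any, Callable, Dict, Mapping, Optional


def _extract_with_pdfminer(path: str, **options: Any) -> str:
    # Hypothetical stand-in for the pdfminer code path.
    return f"pdfminer:{path}"


def _extract_with_pymupdf4llm(path: str, **options: Any) -> str:
    # Hypothetical stand-in for the pymupdf4llm code path.
    return f"pymupdf4llm:{path}"


class PdfConverterSketch:
    # Adding an engine means adding one entry to this mapping.
    _engines: Mapping[str, Callable[..., str]] = {
        "pdfminer": _extract_with_pdfminer,
        "pymupdf4llm": _extract_with_pymupdf4llm,
    }

    def convert(
        self,
        local_path: str,
        engine: str = "pdfminer",
        engine_kwargs: Optional[Dict[str, Any]] = None,
    ) -> str:
        # Validate first, before any file is opened or parsed.
        if engine not in self._engines:
            raise ValueError(
                f"'engine' not valid. Please choose between {sorted(self._engines)}."
            )
        return self._engines[engine](local_path, **(engine_kwargs or {}))
```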

That is a good idea; it makes adding further engines easier. I have made those changes in the latest commit. An example of using engine_kwargs for PDF image extraction can also be found in the test file.


l-lumin · 2024-12-26T06:54:33Z

@tungsten106
Thank you for your contribution! It looks great so far.
Just one more thing: when the tests run, files are generated. Could you add the following to the .gitignore file inside tests/ and modify the tests to write their output into the out/ folder?
Also, do you think there's any way to make the tests run faster?

out/
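
A small helper along these lines would keep generated files under the ignored folder. This is only a sketch: `output_path` is a hypothetical name, and a temporary directory stands in for the real tests/ directory so the example can run anywhere.

```python
import tempfile
from pathlib import Path


def output_path(tests_dir: Path, filename: str) -> Path:
    """Return tests_dir/out/filename, creating the out/ folder if needed."""
    out_dir = tests_dir / "out"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename


# A temporary directory stands in for tests/ in this illustration.
with tempfile.TemporaryDirectory() as tmp:
    target = output_path(Path(tmp), "result.md")
    target.write_text("# converted\n", encoding="utf-8")
    written = target.read_text(encoding="utf-8")
    parent_name = target.parent.name
```

In a real test module the first argument would more likely be `Path(__file__).parent`.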

l-lumin · 2024-12-26T06:55:50Z

#139 (comment)
answer this if you have time

tungsten106 · 2024-12-26T07:28:16Z

@microsoft-github-policy-service agree

tungsten106 · 2024-12-26T08:09:32Z

I have updated that.
For test speed, have you tried to use pytest-xdist to run test_markitdown.py in parallel?

pip install pytest-xdist

# let it decide
pytest -n auto tests/test_markitdown.py
# or using specific cpu numbers, like 8
pytest -n 8 tests/test_markitdown.py

l-lumin · 2024-12-26T08:20:46Z

Thank you. You can run tests in parallel without using pytest-xdist; simply run hatch test -p. I want to discuss ways to improve the speed of test_markitdown_pdf

(hatch-test.py3.13) root@e2c718eb6604:/workspaces/markitdown# hatch test -p
========================================================================================================== test session starts ==========================================================================================================
platform linux -- Python 3.13.1, pytest-8.3.4, pluggy-1.5.0
rootdir: /workspaces/markitdown
configfile: pyproject.toml
plugins: rerunfailures-14.0, mock-3.14.0, anyio-4.7.0, xdist-3.6.1
8 workers [5 items]     
ss...

tungsten106 · 2024-12-26T09:17:01Z

pymupdf4llm.to_markdown is probably slow because of the package's internal processing.
We could use a smaller test PDF, since the original article has 43 pages. One solution is to pass pages=[i for i in range(10)] to pymupdf4llm, or page_numbers=[i for i in range(10)] to pdfminer.
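
Because the two engines name the page-limiting argument differently, a test could build it per engine. The helper below is a hypothetical sketch: `first_pages_kwargs` is not part of the PR, and it only constructs the keyword argument (using the parameter names quoted above) without calling either library.

```python
from typing import Any, Dict


def first_pages_kwargs(engine: str, n_pages: int) -> Dict[str, Any]:
    """Build the keyword argument that limits parsing to the first n_pages."""
    pages = list(range(n_pages))  # zero-based page indices
    if engine == "pymupdf4llm":
        return {"pages": pages}         # name used by pymupdf4llm.to_markdown
    if engine == "pdfminer":
        return {"page_numbers": pages}  # name used by pdfminer's extract_text
    raise ValueError(f"Unsupported engine: {engine!r}")
```

The result could then be passed through as `engine_kwargs`, so the test parses ten pages instead of all 43.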

l-lumin · 2024-12-26T09:43:48Z

Works great. I just tested it: pages=range(10) is enough, and the same goes for page_numbers. To run a single test, you can use the hatch command below, so there is no need to comment out the other tests like this:

    # test_markitdown_remote()
    # test_markitdown_local()
    # test_markitdown_exiftool()
    # test_markitdown_deprecation()
    # test_markitdown_llm()

hatch test tests/test_markitdown.py::test_markitdown_pdf

alphaleadership · 2024-12-26T17:06:51Z

Maybe add a CLI option.
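
As a rough sketch of what such an option might look like, the example below uses `argparse`. The `--pdf-engine` flag is hypothetical and is not an existing markitdown command-line option; the parser here is a stand-alone illustration rather than the project's actual CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Convert a file to Markdown.")
    parser.add_argument("filename", help="path or URL of the document to convert")
    parser.add_argument(
        "--pdf-engine",
        choices=["pdfminer", "pymupdf4llm"],
        default="pdfminer",
        help="PDF parsing engine (hypothetical flag)",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["paper.pdf", "--pdf-engine", "pymupdf4llm"])
```

Using `choices` lets argparse reject an unknown engine before any conversion starts.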

update: change pdf text parser to pymupdf4llm

b3f7e00

l-lumin reviewed Dec 19, 2024

tungsten106 and others added 5 commits December 23, 2024 10:05

Merge remote-tracking branch 'origin/main' into dev

df5f14e

update: add parameter "method" for PdfConverter

263e0b5

Merge branch 'main' into dev

d4d11a8

Merge branch 'dev' of https://github.com/tungsten106/markitdown into dev

797e0d4

update: changed "method" parameter fro PdfConverter to "pdf_engine" for better user instruction. Add examples for PdfConverter.convert() calling.

ba5df9b

l-lumin suggested changes Dec 24, 2024

update: adding named parameter pdf_engine to .conver(); adding test cases for pdf. Raised exceptions when pdf_engine is not valid.

e808548

l-lumin reviewed Dec 25, 2024

l-lumin mentioned this pull request Dec 26, 2024

Bug, Suggestion: Improve Markdown Conversion, Format Support, and Rich Content Extraction #216 (Open)

update: Add `engine_kwargs` for customize parameters. Update PdfConverter engines calling method for easier to add more engines. Examples of using `engine_kwargs` to extract pdf images added

565ef05

update: adding tests/out folder to save test files

bd95fb0

l-lumin mentioned this pull request Dec 26, 2024

[bug] Markitdown failed to convert pdf that contains image #217 (Open)

update: use smaller test-pdf size

8482477

bugfix: fixing test_markitdown.py; updating exception messages

f07ea3e
