[FEATURE REQUEST] MHTML Support #228

Open
spencerthayer opened this issue Dec 28, 2024 · 0 comments

It would be incredibly helpful if MarkItDown could support the conversion of MHTML files to Markdown. MHTML (MIME HTML) is a common format for saving a web page, together with embedded assets such as images and styles, in a single file. Adding support for MHTML would enhance MarkItDown’s utility, especially for users working with offline web archives or needing to extract text and structure from web-based documents.


Why It’s Useful

  • Web Page Archiving: Many users save web pages as MHTML files for offline access, and extracting meaningful content from these files is a frequent need.
  • Consistency with HTML Support: Since MarkItDown already supports HTML, extending this to MHTML would align with its existing functionality.
  • Expanding Use Cases: This feature would open up new workflows for researchers, content managers, and developers working with archived web content.

Proposed Functionality

  1. Input: Allow .mhtml files as valid inputs for the markitdown command and Python API.
  2. Conversion Process:
    • Extract the HTML content from the MHTML container.
    • Resolve or strip embedded resources (e.g., images, CSS) so that they do not clutter the Markdown output.
    • Process the HTML content using the existing pipeline for HTML conversion.
  3. Output: A Markdown file or string, similar to the handling of other file types.
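
As a rough illustration of the extraction step, here is a minimal sketch that assumes the MHTML file is a standard MIME multipart/related message, which Python's standard-library email package can parse. The helper name extract_html_from_mhtml and the sample document are hypothetical and not part of MarkItDown; the returned HTML string would then go through the existing HTML conversion path.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML document (hypothetical helper)."""
    message = email.message_from_bytes(raw, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (quoted-printable,
            # base64) and decodes the bytes using the part's declared charset.
            return part.get_content()
    raise ValueError("no text/html part found in MHTML input")


# Small hand-written MHTML sample, used only to exercise the helper.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUND"; type="text/html"\r\n'
    b"\r\n"
    b"--BOUND\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: http://example.com/\r\n"
    b"\r\n"
    b"<html><body><h1>Hello</h1><p>caf=C3=A9</p></body></html>\r\n"
    b"--BOUND\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: http://example.com/pixel.png\r\n"
    b"\r\n"
    b"iVBORw0KGgo=\r\n"
    b"--BOUND--\r\n"
)

html = extract_html_from_mhtml(SAMPLE)
print(html)
```

Taking the first text/html part is a simplification: a saved page that contains frames may carry several HTML parts, and a fuller implementation would probably need to pick the root document explicitly.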

Examples

CLI:

markitdown path-to-file.mhtml > document.md

Python API:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("example.mhtml")
print(result.text_content)

Challenges & Considerations

  • Parsing Embedded Resources: Properly handling and optionally excluding embedded resources could require additional tooling.
  • Dependencies: Adding support for MHTML might introduce new dependencies for handling MIME-encapsulated data.
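
To give a sense of what handling embedded resources might involve, the sketch below inventories the non-HTML parts of an MHTML file by their Content-Location header, again using only the standard-library email package. The function name list_embedded_resources and the sample data are hypothetical; a converter could use such a map either to rewrite image links or to skip the resources entirely.

```python
import email
from email import policy


def list_embedded_resources(raw: bytes) -> dict[str, str]:
    """Map each non-HTML part's Content-Location to its content type (hypothetical helper)."""
    message = email.message_from_bytes(raw, policy=policy.default)
    resources = {}
    for part in message.walk():
        # Skip the multipart container itself and the HTML document.
        if part.is_multipart() or part.get_content_type() == "text/html":
            continue
        location = part.get("Content-Location")
        if location is not None:
            resources[str(location)] = part.get_content_type()
    return resources


# Small hand-written MHTML sample, used only to exercise the helper.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUND"; type="text/html"\r\n'
    b"\r\n"
    b"--BOUND\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Location: http://example.com/\r\n"
    b"\r\n"
    b"<html><body><img src='pixel.png'></body></html>\r\n"
    b"--BOUND\r\n"
    b"Content-Type: text/css\r\n"
    b"Content-Location: http://example.com/style.css\r\n"
    b"\r\n"
    b"body { margin: 0 }\r\n"
    b"--BOUND\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: http://example.com/pixel.png\r\n"
    b"\r\n"
    b"iVBORw0KGgo=\r\n"
    b"--BOUND--\r\n"
)

resources = list_embedded_resources(SAMPLE)
for location, content_type in resources.items():
    print(content_type, location)
```

Because this relies only on the standard library, basic parsing may not require a new dependency, although that would need to be checked against real browser-generated files.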

I believe this feature would significantly enhance MarkItDown's capabilities and appeal to a broader user base. Thank you for considering this request!
