[FEATURE REQUEST] MHTML Support #228

spencerthayer · 2024-12-28T21:07:38Z

It would be incredibly helpful if MarkItDown could support the conversion of MHTML files to Markdown. MHTML (MIME HTML) files are a common format for saving web pages and preserving their structure, including embedded assets like images and styles. Adding support for MHTML would enhance MarkItDown’s utility, especially for users working with offline web archives or needing to extract text and structure from web-based documents.

Why It’s Useful

Web Page Archiving: Many users save web pages as MHTML files for offline access, and extracting meaningful content from these files is a frequent need.
Consistency with HTML Support: Since MarkItDown already supports HTML, extending this to MHTML would align with its existing functionality.
Expanding Use Cases: This feature would open up new workflows for researchers, content managers, and developers working with archived web content.

Proposed Functionality

Input: Allow .mhtml files as valid inputs for the markitdown command and Python API.
Conversion Process:
- Extract the HTML content from the MHTML container.
- Resolve embedded resources (e.g., images, CSS) to ensure a clean Markdown output.
- Process the HTML content using the existing pipeline for HTML conversion.
Output: A Markdown file or string, similar to the handling of other file types.
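
As a rough illustration of the first conversion step, here is a minimal sketch that pulls the HTML part out of an MHTML container using only Python's standard `email` package. It assumes the file is a well-formed MIME `multipart/related` message whose main document is the first `text/html` part. The function name `extract_html_from_mhtml` is hypothetical and not part of MarkItDown.

```python
from email import message_from_bytes, policy
from typing import Optional


def extract_html_from_mhtml(raw: bytes) -> Optional[str]:
    """Return the first text/html part of an MHTML document, or None.

    Hypothetical helper for illustration; not an existing MarkItDown API.
    """
    # policy.default yields EmailMessage objects, whose get_content()
    # undoes the transfer encoding (quoted-printable, base64) and decodes
    # the bytes using the charset declared on the part.
    msg = message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            return part.get_content()
    return None
```

The returned string could then be handed to whatever HTML conversion path MarkItDown already uses; how that hand-off would be wired in is left open here.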

Examples

CLI:

markitdown path-to-file.mhtml > document.md

Python API:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("example.mhtml")
print(result.text_content)

Challenges & Considerations

Parsing Embedded Resources: Properly handling and optionally excluding embedded resources could require additional tooling.
Dependencies: Adding support for MHTML might introduce new dependencies for handling MIME encapsulated data.
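
On both points, the standard library may already cover a good part of the work, since MHTML is a MIME multipart message. The sketch below inventories the embedded resources so that a converter could decide which ones to keep or drop. It is an assumption-laden illustration: `list_embedded_resources` is a hypothetical name, and it assumes each resource carries a `Content-Location` header (some files identify parts by `Content-ID` instead).

```python
from email import message_from_bytes, policy
from typing import List, Tuple


def list_embedded_resources(raw: bytes) -> List[Tuple[str, str]]:
    """Return (content_type, location) for every non-HTML leaf part.

    Hypothetical helper for illustration; not an existing MarkItDown API.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    resources = []
    for part in msg.walk():
        # Skip multipart containers and the HTML document itself.
        if part.is_multipart() or part.get_content_type() == "text/html":
            continue
        # Parts without a Content-Location header get an empty string.
        location = str(part.get("Content-Location", "")).strip()
        resources.append((part.get_content_type(), location))
    return resources
```

A caller could filter this list by content type, for example ignoring `text/css` parts entirely and keeping only images referenced from the HTML.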

I believe this feature would significantly enhance MarkItDown's capabilities and appeal to a broader user base. Thank you for considering this request!
