It would be very helpful if MarkItDown could convert MHTML files to Markdown. MHTML (MIME HTML) is a common format for saving a web page as a single file, preserving its structure along with embedded assets such as images and styles. Supporting it would make MarkItDown more useful for anyone working with offline web archives or needing to extract text and structure from web-based documents.
Why It’s Useful
Web Page Archiving: Many users save web pages as MHTML files for offline access, and extracting meaningful content from these files is a frequent need.
Consistency with HTML Support: Since MarkItDown already supports HTML, extending this to MHTML would align with its existing functionality.
Expanding Use Cases: This feature would open up new workflows for researchers, content managers, and developers working with archived web content.
Proposed Functionality
Input: Allow .mhtml files as valid inputs for the markitdown command and Python API.
Conversion Process:
Extract the HTML content from the MHTML container.
Resolve embedded resources (e.g., images, CSS) to ensure a clean Markdown output.
Process the HTML content using the existing pipeline for HTML conversion.
Output: A Markdown file or string, similar to the handling of other file types.
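To make the first conversion step concrete, here is a rough sketch of pulling the HTML part out of an MHTML container with Python's standard `email` package. An MHTML file is a MIME `multipart/related` message, so the stdlib parser can read it. The function name `extract_html_from_mhtml` and the hand-written `SAMPLE` archive are only illustrative and not part of MarkItDown; a real implementation would also need to decide how to handle the embedded resources.

```python
from email import message_from_bytes, policy


def extract_html_from_mhtml(data: bytes) -> str:
    """Return the first text/html part of an MHTML (multipart/related) archive."""
    message = message_from_bytes(data, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes the charset.
            return part.get_content()
    raise ValueError("no text/html part found in MHTML data")


# Tiny hand-written archive standing in for a browser-saved .mhtml file.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: https://example.com/\r\n"
    b"\r\n"
    b"<html><body><h1>Hello</h1><p>caf=C3=A9</p></body></html>\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: https://example.com/logo.png\r\n"
    b"\r\n"
    b"iVBORw0KGgo=\r\n"
    b"--BOUNDARY--\r\n"
)

html = extract_html_from_mhtml(SAMPLE)
print(html)
```

The extracted string could then be handed to the existing HTML conversion path, which is the third step listed above.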
Examples
CLI:
markitdown path-to-file.mhtml > document.md
Python API:
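A minimal sketch of what the equivalent Python call might look like, assuming `.mhtml` inputs were accepted by the existing `MarkItDown().convert()` entry point and returned a result exposing `text_content`, as other formats do today. The helper name `mhtml_to_markdown` is hypothetical, and MHTML support itself is the feature being requested here, not something that exists yet.

```python
def mhtml_to_markdown(path: str) -> str:
    """Convert an MHTML file to Markdown (hypothetical: assumes MHTML support)."""
    # Imported inside the function so the sketch loads without the package installed.
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert(path)  # same call used for other file types
    return result.text_content


if __name__ == "__main__":
    print(mhtml_to_markdown("path-to-file.mhtml"))
```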
Challenges & Considerations
References
I believe this feature would significantly enhance MarkItDown's capabilities and appeal to a broader user base. Thank you for considering this request!