Cannot parse HTML5 #146

Open
aminya opened this issue Jun 13, 2020 · 8 comments

aminya commented Jun 13, 2020

I am trying to parse the attached HTML file with readhtml, but it emits the following warnings:
a.zip

┌ Warning: XMLError: Tag nav invalid from HTML parser (code: 801, line: 7136)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag header invalid from HTML parser (code: 801, line: 7157)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag nav invalid from HTML parser (code: 801, line: 7158)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag article invalid from HTML parser (code: 801, line: 7169)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag header invalid from HTML parser (code: 801, line: 7190)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag section invalid from HTML parser (code: 801, line: 7193)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
┌ Warning: XMLError: Tag footer invalid from HTML parser (code: 801, line: 7203)
└ @ EzXML C:\Users\yahyaaba\.julia\packages\EzXML\ZNwhK\src\error.jl:95
aminya changed the title from "Tag nav invalid from HTML parser" to "Cannot parse HTML5" on Jun 25, 2020

aminya commented Jun 25, 2020

I opened an issue in html5ever. If they provide libxml2 bindings, we can use those:
servo/html5ever#423

Otherwise, we can use gumbo-libxml:
https://github.com/sevenval/gumbo-libxml


aminya commented Jun 25, 2020

It might be easier to use Gumbo.jl instead and convert its output:
https://github.com/JuliaWeb/Gumbo.jl


XinyuWuu commented Oct 9, 2021

@aminya
I have run into the same problem. EzXML cannot parse my HTML file correctly. I can search for nodes, but when I export the modified document to a file, the output contains many mistakes. I have gone through the links you provided, but I know very little about how an HTML file is parsed, and I cannot tell whether a solution exists yet. Can it be fixed now? Should I build https://github.com/sevenval/gumbo-libxml or something similar? I would really appreciate any guidance.


aminya commented Oct 9, 2021

Yes! Please check
JuliaWeb/Gumbo.jl#85


XinyuWuu commented Oct 9, 2021

I still don't know how to do it. How can I convert a Gumbo.HTMLDocument to an EzXML.Document? I have imported both Gumbo and EzXML, but that didn't fix the problem.


aminya commented Oct 9, 2021

It is not possible directly. It needs some work as mentioned in the issue.
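
For illustration only, here is a minimal sketch of what such a manual conversion could look like. It is not the approach from the linked issue, and `to_ezxml` is a hypothetical helper name, not part of either package. It assumes the Gumbo tree contains only HTMLElement and HTMLText nodes, and it copies tags, attributes, and text while ignoring doctype details.

```julia
import Gumbo
import EzXML

# Hypothetical helper (not provided by Gumbo.jl or EzXML.jl):
# recursively copy a Gumbo element into a new EzXML element node.
function to_ezxml(node::Gumbo.HTMLElement)
    el = EzXML.ElementNode(String(Gumbo.tag(node)))
    for (name, value) in Gumbo.attrs(node)
        el[name] = value              # copy each attribute
    end
    for child in node.children
        EzXML.link!(el, to_ezxml(child))  # append the converted child
    end
    return el
end

# Text nodes carry their content in the `text` field.
to_ezxml(node::Gumbo.HTMLText) = EzXML.TextNode(node.text)

# Wrap the converted root element in a fresh EzXML HTML document.
function to_ezxml(doc::Gumbo.HTMLDocument)
    out = EzXML.HTMLDocument()
    EzXML.setroot!(out, to_ezxml(doc.root))
    return out
end

# Both packages define parsehtml, so the calls are qualified.
gdoc = Gumbo.parsehtml("<html><body><nav><a href=\"/\">Home</a></nav></body></html>")
xdoc = to_ezxml(gdoc)

# The HTML5 tag is now reachable through EzXML's XPath search.
link = findfirst("//nav/a", xdoc)
println(EzXML.nodecontent(link))
```

Because Gumbo does the parsing, the HTML5 tag names never pass through the libxml2 HTML parser, so the "Tag ... invalid" warnings should not arise on this path.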

@lolbinarycat

I believe the solution to the "unknown tag name" problem is to pass HTML_PARSE_RECOVER to htmlParseMemory.

I believe this is what xmllint --html does, as it has no problem parsing HTML5 tag names.

@lolbinarycat

Correction: they are warnings, not errors, so they can be suppressed by passing noerror=true.
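
Since the messages appear as ordinary Julia log warnings (the "┌ Warning" lines in the original report), another way to silence them is Julia's standard Logging library. The sketch below uses that route instead of the noerror keyword, which is taken from the comment above and has not been verified here. The inline HTML string is an illustrative stand-in for the real file.

```julia
import EzXML
using Logging

# Illustrative input containing an HTML5 tag that older libxml2 builds warn about.
html = "<html><body><nav>menu</nav></body></html>"

# NullLogger discards every log record emitted inside the block,
# so any parser warnings are dropped while the document is still returned.
doc = with_logger(NullLogger()) do
    EzXML.parsehtml(html)
end

# The <nav> element is present in the tree and can be searched as usual.
nav = findfirst("//nav", doc)
println(EzXML.nodecontent(nav))
```

Note that this hides all log messages produced during parsing, not only the unknown-tag warnings, so it is best kept tightly scoped around the parse call.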
