Convert HTML file with table(s) to DataFrame #71
Comments
I wrote this code (which may help those looking for a similar feature), but it is just a (very) quick implementation that probably won't work with more complex HTML pages containing tables.
Hi Sébastien, Thanks for opening the issue, I agree this would be a good thing to have. I'd rather not have a dependency on DataFrames in this package, since it's a large dependency that's not necessary for Gumbo's core functionality. My impression is that the best way to do this would be to implement the Tables.jl interface for I'm not sure when I'll have time to do this, but I don't think it would be very difficult; if someone else wants to take a crack at it I'd happily accept a pull request. I'm happy to add a dependency on Tables.jl, since it's pretty small.
I really like the idea of implementing the Tables.jl interface for
Yeah, it sounds like a great idea. Happy to help support however I can here. Currently, Tables.jl doesn't have a concept of streaming multiple tables at a time, but as long as there's a way to "select" a single table tag and "stream" that, then it should work pretty well. Happy to chat on Slack if anyone wants to brainstorm this.
@quinnj yeah, I think we're on the same page. I'm imagining that it's up to the user to locate a single I'm actually pretty excited about this idea, since this is a feature request that's come up before, and I love the smooth interoperability between the whole ecosystem that packages like Tables can provide! I'll try to find time to work on it soon, I'll ask on Slack if I get stuck with anything Tables related.
Cool, yeah, just let me know if you run into any issues. Just to get the ball rolling, some things to think about include:
Tables.istable(::Type{<:HTMLTable}) = true
Tables.rowaccess(::Type{<:HTMLTable}) = true
Tables.rows(table::HTMLTable) = table
Anyway, hopefully that gets the ball rolling and again, just let me know if you run into any issues.
Thanks! That all makes sense. I agree there are some tricky parts and some places that'll have to use heuristics and guessing (for schemas, types, etc.). I think it's fine to just "do our best" and then people can clean things up themselves if they end up with messy data. The only thing I'm curious about is what the utility of the wrapper type ( |
The main decision there is whether you're comfortable defining |
Ahh, that makes sense. I didn't realize the Tables interface required overriding the Base
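To make the wrapper-type option concrete, here is a minimal sketch of one way it could look. It is not Gumbo's implementation: the `HTMLTable` struct below and its `header` and `cells` fields are hypothetical, and it assumes the cell text has already been extracted from the `<table>` element. Unlike the snippet above, where `Tables.rows` returns the table itself, this version returns a vector of `NamedTuple`s, so no Base iteration methods have to be defined on any Gumbo type.

```julia
using Tables

# Hypothetical wrapper (not part of Gumbo): column names plus rows of cell strings.
struct HTMLTable
    header::Vector{Symbol}
    cells::Vector{Vector{String}}
end

Tables.istable(::Type{HTMLTable}) = true
Tables.rowaccess(::Type{HTMLTable}) = true

# Each row becomes a NamedTuple keyed by the header names.
# Assumes every row has exactly length(header) cells.
function Tables.rows(t::HTMLTable)
    names = Tuple(t.header)
    return [NamedTuple{names}(Tuple(r)) for r in t.cells]
end

# Any Tables.jl consumer can now read the wrapper, for example:
t = HTMLTable([:name, :qty], [["apple", "3"], ["pear", "5"]])
cols = Tables.columntable(t)
```

Every value stays a `String` here; guessing column types (the "heuristics" mentioned earlier) would be a separate step layered on top.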
Any update on this? It would be great to get a Table from |
If we fix #85, we can just use AcuteML which already supports Tables.jl. |
Nothing yet?
Hello,
I have an HTML file with a table and would like to convert it to a Julia DataFrame.
I was looking for a function similar to Python Pandas' read_html function (which directly outputs a list of DataFrames). Unfortunately, I don't see a similar function in the Julia ecosystem.
In the Gumbo docs I was looking for an example of iterating over the rows and columns of each table.
Here is a basic HTML source file with 2 tables
I'm not sure if such an example should be part of Gumbo or Cascadia or even EzXML.jl
Anyway, none of these projects shows an example with HTML tables, so there is probably room for doc improvement.
Kind regards
PS : related SO post https://stackoverflow.com/questions/42915962/extracting-and-constructing-tables-from-html-files-using-julia
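For readers looking for the row/column iteration asked about above, here is a minimal sketch using only Gumbo. The sample HTML string and the helper names `findall_tag`, `celltext`, and `table_rows` are made up for illustration; they are not part of Gumbo's API. The sketch assumes simple tables without nesting, `rowspan`, or `colspan`.

```julia
using Gumbo

# Collect every descendant element of `node` with tag `t` (depth-first).
function findall_tag(node::HTMLNode, t::Symbol, acc = HTMLElement[])
    if node isa HTMLElement
        tag(node) == t && push!(acc, node)
        for child in node.children
            findall_tag(child, t, acc)
        end
    end
    return acc
end

# Concatenate all text found inside a node.
celltext(node::HTMLText) = node.text
celltext(node::HTMLElement) = join(celltext(c) for c in node.children)
celltext(::HTMLNode) = ""

# One vector of cell strings per <tr>; header (<th>) and data (<td>) cells alike.
function table_rows(table::HTMLElement)
    return [[String(strip(celltext(c))) for c in tr.children
             if c isa HTMLElement && tag(c) in (:td, :th)]
            for tr in findall_tag(table, :tr)]
end

# Hypothetical sample input with two tables.
html = """
<html><body>
<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>
<table><tr><th>x</th></tr><tr><td>9</td></tr></table>
</body></html>
"""

doc = parsehtml(html)
tables = [table_rows(tbl) for tbl in findall_tag(doc.root, :table)]
```

Because `findall_tag` searches all descendants, the rows of a nested table would also be picked up by its outer table; real pages may need a stricter traversal.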