Some article texts are not fully downloaded. #563

Closed
AndyTheFactory opened this issue Oct 24, 2023 · 2 comments
Labels: documentation, enhancement, sites not working

Comments

@AndyTheFactory
Owner

Issue by Jimchoo91
Wed Aug 31 15:11:57 2022
Originally opened as codelucas/newspaper#950


Hi, I have only found this on one website so far, but when I try to download the full text from an article on the BBC, it only returns a snippet.

Here is an example website:

https://www.bbc.co.uk/news/world-48810070

Any idea why? Thanks.

@AndyTheFactory
Owner Author

Comment by bstivers
Sat Sep 17 08:36:14 2022


While it's more than a snippet, the full text of articles from Politico doesn't get pulled either.

I believe the core issue is that the code used to parse these websites is very old (the last commit to the main code is 4+ years old), so it no longer handles the HTML source properly after website updates. Big-name websites change their layouts far more often than once every 4 years.

I am sure this library was great in its heyday, but it's nearly unusable now except on smaller websites that haven't changed in the last 5 years. That doesn't leave many, given that even WordPress-based websites have changed quite a bit.

@AndyTheFactory
Owner Author

Comment by johnbumgarner
Fri Dec 30 18:09:55 2022


The library has lots of limitations because the code base is old. You can parse the BBC site text with some additional code. Here is a document that I wrote on using the library. I will update it in the coming days with the code to extract the BBC text.
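
As a rough illustration of what such additional code might look like (this is not the code from the document mentioned above), here is a minimal sketch of a fallback extractor. It assumes the article body sits in `<p>` elements inside an `<article>` element, which is an assumption about the page markup and may not hold for the BBC or any other site. `ArticleParagraphs`, `extract_article_text`, and `full_text` are hypothetical names introduced here. Only `Article`, `download()`, `parse()`, `.html`, and `.text` come from the library itself.

```python
from html.parser import HTMLParser


class ArticleParagraphs(HTMLParser):
    """Collect the text of every <p> found inside an <article> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.article_depth = 0      # > 0 while inside an <article>
        self.in_paragraph = False   # True while inside a <p> within <article>
        self.buffer = []            # text chunks of the current paragraph
        self.paragraphs = []        # finished paragraphs, in document order

    def _flush(self):
        # Collapse runs of whitespace and keep non-empty paragraphs only.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.paragraphs.append(text)
        self.buffer = []
        self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            if self.in_paragraph:   # previous <p> was never closed
                self._flush()
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self._flush()
        elif tag == "article" and self.article_depth > 0:
            if self.in_paragraph:
                self._flush()
            self.article_depth -= 1

    def handle_data(self, data):
        if self.in_paragraph:
            self.buffer.append(data)


def extract_article_text(html):
    """Return the <article> paragraphs of an HTML string, blank-line separated."""
    parser = ArticleParagraphs()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


def full_text(url):
    """Hypothetical helper: return whichever extraction yields more text."""
    from newspaper import Article  # lazy import; needs the library installed

    article = Article(url)
    article.download()
    article.parse()
    fallback = extract_article_text(article.html)
    return fallback if len(fallback) > len(article.text) else article.text
```

Calling `full_text("https://www.bbc.co.uk/news/world-48810070")` would then return the longer of the two extractions. Whether the fallback actually recovers the missing text depends entirely on how the page is marked up.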

@AndyTheFactory added the documentation, enhancement, and sites not working labels on Oct 25, 2023
@AndyTheFactory added this to the First release milestone on Oct 25, 2023