Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[XHamster] Overhaul existing extractors and add playlist extractors #32579

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

dirkf
Copy link
Contributor

@dirkf dirkf commented Oct 2, 2023

Boilerplate: own code, new features+improvement

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

This PR fixes and updates the existing XHamster[Embed,User] IEs and adds some new playlist extractors, including a pseudo-URL scheme xhsearch...: like ytsearch....

Specifically:

  • a default UA 'Mozilla' is set to bypass a possible captcha page (resolves [xhamster] An extractor error has occurred. (caused by KeyError('videoModel',) #32539)
  • the list of domains is updated to include those domains listed as trusted in the page that are aliased to xhamster.com, but excluding domains that redirect to xhamster (eg xhday.com) (fixes [REQUEST] support xhamster.com alternative domain #31023)
  • video extraction is re-factored and made safer using traverse_obj()
  • playlist extractors are added for Creator (aka Pornstar, Celebrity), Category (aka Tag), and Search pages (resolves [xhamster] support creator page #31371): a URL with a specified page is extracted as a single page playlist with (p{n}) appended to title; otherwise next page continuations are followed with (all) appended; any additional qualifications are also added to the title (eg category search hawaiian (fps=60,all))
  • a pseudo-URL scheme xhsearch... allows searching from the yt-dl command line, eg youtube-dl "xhsearchall:no sex"; if a result count is specified like xhsearch15:... (first 15) is added to the search term as the title.

For testing the playlists, a previous performance enhancement that limited test playlist processing to the playlist_mincount if specified is now only applied if other playlist counts are not being tested.

dirkf added 5 commits October 2, 2023 02:38
* include domains listed as trusted in page, aliased to xhamster.com
* excluding domains that redirect to xhamster (eg xhday.com)
* re-factor extraction code
* use traverse_obj()
…count tested

* eg not when `playlist_count` is specified
* avoid `playlist_mincount` if a `lambda` test may test the count
* re-factor existing playlist extraction
 - for a URL with specified page, extract that oage only with `(p{n})` appended to title
 - otherwise follow next page continuations with `(all)` appended
* add XHamsterCreatorIE for Creator/Pornstar/Celebrity pages
* add XHamsterCategoryIE for Category pages
* add XHamsterSearchIE for search pages with search term as base title
* add XHamsterSearchKeyIE to support search with xhsearch[n]: pseudo-URL scheme
@dirkf
Copy link
Contributor Author

dirkf commented Oct 3, 2023

Channel support is needed too.

One point to be considered there is a general policy issue. Where different versions and subsets of a playlist can be extracted, eg different sorts, 1 page vs all pages, various filters, should the playlist ID reflect these differences, or should that just be, say, in the title?

I'd also welcome comments on this decorator that I'm proposing to add to the utils module (along with yt-dlp's classproperty: when is the func argument of its __new__() None?):

class classpropinit(classproperty):
    """ A Python fubar: parent class vars are not in scope when the
        `class suite` is evaluated, so disallowing `childvar = fn(parentvar)`.
        Instead, the parent class has to be mentioned redundantly and
        unmaintainably, since the current class isn't yet bound. 
        This decorator evaluates a class method and assigns its result
        in place of the method.

        class child(parent):
            # before
            childvar = fn(parent.parentvar)
            # now
            @classpropinit
            def childvar(cls):
                return fn(cls.parentvar)
            # or
            childvar = classpropinit(lambda cls: fn(cls.parentvar))
    """
...

@dirkf dirkf mentioned this pull request Oct 3, 2023
9 tasks
@dirkf dirkf force-pushed the df-xhamster-ovrhaul branch from 19cf05a to b2b622a Compare October 4, 2023 00:57
@Grub4K
Copy link
Contributor

Grub4K commented Oct 4, 2023

Where different versions and subsets of a playlist can be extracted, eg different sorts, 1 page vs all pages, various filters, should the playlist ID reflect these differences, or should that just be, say, in the title?

In my opinion, a generic version of the playlist should always be extracted. That would allow filtering after extraction using flags, and disambiguate between titles, playlist_id and similar. If you can give specific cases to look at I can see better what you mean though.

when is the func argument of its __new__() None?

@classpropinit()
def func(cls):
    ...

Not as useful here, but generally considered good practice for consistency with

@decorator(option=value)
def func(...):
    ...

@dirkf
Copy link
Contributor Author

dirkf commented Oct 4, 2023

Thanks, so this use of __new__() is a pattern to handle invoking a decorator in case nothing, or None, was specified to be decorated, and then the result is a variant of the decorator with the other specified parameters baked in. The 2/3-compatible port that I committed here should handle that.

For playlist examples, consider the test URLs here. XH, like some other sites (YT less so), supports subset playlist URLs with additional path components and/or query parameters (XVideos also uses fragment tags).

If such a URL is specified, that subset playlist, filtered and/or sorted as specified, must be what is wanted: then shouldn't the user be able to distinguish it using the playlist ID (and not just the title as implemented here)?

Or else should the whole playlist be extracted regardless of the specific URL? This certainly wouldn't be right for search URLs.

Here are the test URLs for XHamsterCategoryIE:

  • https://xhamster.com/tags/hawaiian: use wants entire playlist matching tag/category hawaiian, including continuation pages
  • https://xhamster.com/categories/aruban: user wants entire playlist matching tag/category aruban, which is a single page without continuations
  • https://xhamster.com/categories/hawaiian/4k: user wants all videos matching tag/category hawaiian that are available in 4k resolution
  • https://xhamster.com/tags/hawaiian?fps=60: user wants all videos matching tag/category hawaiian that are available in 60fps resolution.

Another example that should be added:

  • https://xhamster.com/tags/hawaiian/best/3: user wants the third page of videos matching tag/category hawaiian sorted by user rating (I guess) instead of XH's default.

So, in the last case, the PR would currently return ID hawaiian and title hawaiian (best,p3). Should the ID actually be hawaiian/best/p3 (say), or for the previous example hawaiian/fps=60

@Grub4K
Copy link
Contributor

Grub4K commented Oct 4, 2023

The id should be unique, ideally with minimal processing. Using the path for that should work and require no further code. It doesnt matter much since the video ID is the important part.

Extracting only the page and filters requested makes sense from a ux perspective as well imo. I am unsure if changing the title that way is the best, but honestly also have no better idea for what else to do. Its probably fine, video title and id matter more in this regard anyways.

@Laharah
Copy link

Laharah commented Jan 26, 2024

Should comment that I cloned this PR and it still would not download either running the folder directly or after building it into a wheel. It may have broken again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants