[XHamster] Overhaul existing extractors and add playlist extractors #32579

dirkf · 2023-10-02T16:40:21Z

Boilerplate: own code, new features+improvement

Description of your pull request and other information

This PR fixes and updates the existing XHamster[Embed,User] IEs and adds some new playlist extractors, including a pseudo-URL scheme `xhsearch...:` modelled on `ytsearch...:`.

Specifically:

a default UA 'Mozilla' is set to bypass a possible captcha page (resolves [xhamster] An extractor error has occurred. (caused by KeyError('videoModel',) #32539)
the list of domains is updated to include the domains that the page lists as trusted and that are aliased to xhamster.com, while excluding domains that merely redirect to xhamster (eg xhday.com) (fixes [REQUEST] support xhamster.com alternative domain #31023)
video extraction is re-factored and made safer using traverse_obj()
playlist extractors are added for Creator (aka Pornstar, Celebrity), Category (aka Tag), and Search pages (resolves [xhamster] support creator page #31371). A URL with a specified page is extracted as a single-page playlist with (p{n}) appended to the title; otherwise next-page continuations are followed and (all) is appended. Any additional qualifications are also added to the title (eg, for a category search, hawaiian (fps=60,all))
a pseudo-URL scheme xhsearch... allows searching from the yt-dl command line, eg youtube-dl "xhsearchall:no sex"; if a result count is specified, as in xhsearch15:..., then (first 15) is appended to the search term to form the title.
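
The `xhsearch...:` convention in the last item can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the regular expression follows the `ytsearch` prefix convention (empty, a positive number, or `all`), while the helper name `parse_search_key`, the single-result default for a bare key, and the `(all)` title suffix for searches are hypothetical.

```python
import re

SEARCH_KEY = 'xhsearch'
# The prefix is empty, a positive integer, or 'all', as with ytsearch
PSEUDO_URL_RE = re.compile(
    r'%s(?P<prefix>|[1-9][0-9]*|all):(?P<query>[\s\S]+)' % SEARCH_KEY)


def parse_search_key(pseudo_url):
    """Hypothetical helper: return (query, limit, title) for a pseudo-URL."""
    m = PSEUDO_URL_RE.match(pseudo_url)
    if m is None:
        raise ValueError('not a %s pseudo-URL: %r' % (SEARCH_KEY, pseudo_url))
    prefix, query = m.group('prefix', 'query')
    if prefix == 'all':
        # No limit; the "(all)" suffix for searches is an assumption
        return query, None, '%s (all)' % query
    if prefix == '':
        # Assumption: a bare key fetches one result and leaves the title as-is
        return query, 1, query
    limit = int(prefix)
    # "(first N)" is the suffix described in the PR text
    return query, limit, '%s (first %d)' % (query, limit)
```

With these assumptions, `parse_search_key('xhsearch15:hawaiian')` yields the query `hawaiian`, a limit of 15 and the title `hawaiian (first 15)`.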

To test the playlists, an earlier performance optimisation, which limited test playlist processing to playlist_mincount entries when that was specified, is now applied only if no other playlist count is being tested.
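
The rule for the test limit might look something like the sketch below. The helper name `playlist_entry_limit` is hypothetical and the test harness is reduced to a plain dict; the case where a `lambda` in the test may check the count is not modelled here.

```python
def playlist_entry_limit(test_case):
    """Hypothetical helper: how many entries a test must process.

    Returns None when the whole playlist has to be extracted.
    """
    mincount = test_case.get('playlist_mincount')
    if mincount is None:
        # Nothing to limit against
        return None
    if 'playlist_count' in test_case:
        # An exact count can only be checked against the full playlist
        return None
    # Only the minimum is tested, so extraction can stop early
    return mincount
```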

* re-factor existing playlist extraction
  - for a URL with specified page, extract that page only with `(p{n})` appended to title
  - otherwise follow next page continuations with `(all)` appended
* add XHamsterCreatorIE for Creator/Pornstar/Celebrity pages
* add XHamsterCategoryIE for Category pages
* add XHamsterSearchIE for search pages with search term as base title
* add XHamsterSearchKeyIE to support search with xhsearch[n]: pseudo-URL scheme

dirkf · 2023-10-03T23:29:13Z

Channel support is needed too.

One point to be considered there is a general policy issue. Where different versions and subsets of a playlist can be extracted, eg different sorts, 1 page vs all pages, various filters, should the playlist ID reflect these differences, or should that just be, say, in the title?

I'd also welcome comments on this decorator that I'm proposing to add to the utils module (along with yt-dlp's classproperty: when is the func argument of its __new__() None?):

class classpropinit(classproperty):
    """ A Python fubar: parent class vars are not in scope when the
        `class suite` is evaluated, so disallowing `childvar = fn(parentvar)`.
        Instead, the parent class has to be mentioned redundantly and
        unmaintainably, since the current class isn't yet bound. 
        This decorator evaluates a class method and assigns its result
        in place of the method.

        class child(parent):
            # before
            childvar = fn(parent.parentvar)
            # now
            @classpropinit
            def childvar(cls):
                return fn(cls.parentvar)
            # or
            childvar = classpropinit(lambda cls: fn(cls.parentvar))
    """
...
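
One way such a decorator could behave is sketched below. This is not the code committed in the PR: it is a Python 3.6+ illustration that relies on `__set_name__`, the `classproperty` base is a minimal stand-in, and `Parent`, `Child` and the example values are hypothetical.

```python
class classproperty(object):
    """Minimal stand-in: call func with the class on each attribute access."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        return self.func(owner)


class classpropinit(classproperty):
    """On first access, compute the value and store it in place of the method."""

    def __set_name__(self, owner, name):
        # Python 3.6+: records the attribute name, so lambdas work too
        self.name = name

    def __get__(self, instance, owner):
        value = self.func(owner)
        # Rebinding the name on the class replaces this descriptor for `owner`
        setattr(owner, self.name, value)
        return value


class Parent(object):
    parentvar = ('a.example', 'b.example')


class Child(Parent):
    @classpropinit
    def childvar(cls):
        return '|'.join(cls.parentvar)

    othervar = classpropinit(lambda cls: len(cls.parentvar))
```

After the first read of `Child.childvar`, the class attribute is the plain computed string, so later accesses do not call the function again.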


Grub4K · 2023-10-04T05:15:40Z

> Where different versions and subsets of a playlist can be extracted, eg different sorts, 1 page vs all pages, various filters, should the playlist ID reflect these differences, or should that just be, say, in the title?

In my opinion, a generic version of the playlist should always be extracted. That would allow filtering after extraction using flags, and disambiguate between titles, playlist_id and similar. If you can give specific cases to look at I can see better what you mean though.

> when is the func argument of its __new__() None?

@classpropinit()
def func(cls):
    ...

Not as useful here, but generally considered good practice for consistency with

@decorator(option=value)
def func(...):
    ...

dirkf · 2023-10-04T12:30:24Z

Thanks, so this use of __new__() is a pattern to handle invoking a decorator in case nothing, or None, was specified to be decorated, and then the result is a variant of the decorator with the other specified parameters baked in. The 2/3-compatible port that I committed here should handle that.
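
As a rough sketch of that pattern (modelled on the idea rather than copied from either project), a `classproperty` whose `__new__()` returns a partially applied class when no function is given could look like this. The `cache` option and the example class `A` are assumptions for illustration.

```python
import functools


class classproperty(object):
    def __new__(cls, func=None, *args, **kwargs):
        if func is None:
            # Used as @classproperty() or @classproperty(cache=True):
            # return the decorator itself with the options baked in
            return functools.partial(cls, *args, **kwargs)
        return super(classproperty, cls).__new__(cls)

    def __init__(self, func, cache=False):
        self.func = func
        self._cache = {} if cache else None

    def __get__(self, instance, owner):
        if self._cache is None:
            return self.func(owner)
        if owner not in self._cache:
            self._cache[owner] = self.func(owner)
        return self._cache[owner]


calls = []


class A(object):
    @classproperty
    def plain(cls):
        return cls.__name__.lower()

    @classproperty()
    def empty(cls):
        return 'ok'

    @classproperty(cache=True)
    def cached(cls):
        calls.append(cls)
        return len(calls)
```

Because the `partial` object returned for the option-only call is not an instance of the class, `__init__()` is skipped at that point and runs only once the function is finally supplied.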

For playlist examples, consider the test URLs here. XH, like some other sites (YT less so), supports subset playlist URLs with additional path components and/or query parameters (XVideos also uses fragment tags).

If such a URL is specified, that subset playlist, filtered and/or sorted as specified, must be what is wanted: then shouldn't the user be able to distinguish it using the playlist ID (and not just the title as implemented here)?

Or else should the whole playlist be extracted regardless of the specific URL? This certainly wouldn't be right for search URLs.

Here are the test URLs for XHamsterCategoryIE:

https://xhamster.com/tags/hawaiian: user wants entire playlist matching tag/category hawaiian, including continuation pages
https://xhamster.com/categories/aruban: user wants entire playlist matching tag/category aruban, which is a single page without continuations
https://xhamster.com/categories/hawaiian/4k: user wants all videos matching tag/category hawaiian that are available in 4k resolution
https://xhamster.com/tags/hawaiian?fps=60: user wants all videos matching tag/category hawaiian that are available at 60fps.

Another example that should be added:

https://xhamster.com/tags/hawaiian/best/3: user wants the third page of videos matching tag/category hawaiian sorted by user rating (I guess) instead of XH's default.

So, in the last case, the PR would currently return ID hawaiian and title hawaiian (best,p3). Should the ID actually be hawaiian/best/p3 (say), or, for the previous example, hawaiian/fps=60?
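
To make the two options concrete, here is a small sketch that derives both a path-style ID and the qualified title from a category URL. The function `describe_playlist` and its parsing rules are hypothetical, not the PR's implementation; it assumes the path is `/<tags|categories>/<name>/...` and that a trailing numeric component is the page number.

```python
from urllib.parse import parse_qsl, urlparse


def describe_playlist(url):
    """Hypothetical helper: return (playlist_id, title) for a category URL."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split('/') if p]
    if len(parts) < 2:
        raise ValueError('unexpected category URL: %r' % url)
    # parts[0] is 'tags' or 'categories'; parts[1] is the category name
    base, extra = parts[1], parts[2:]
    page = None
    if extra and extra[-1].isdigit():
        # Assumption: a trailing number selects a single page
        page = int(extra.pop())
    quals = extra + ['%s=%s' % kv for kv in parse_qsl(parsed.query)]
    page_tag = 'p%d' % page if page else None
    title = '%s (%s)' % (base, ','.join(quals + [page_tag or 'all']))
    playlist_id = '/'.join([base] + quals + ([page_tag] if page_tag else []))
    return playlist_id, title
```

For `https://xhamster.com/tags/hawaiian/best/3` this gives the ID `hawaiian/best/p3` and the title `hawaiian (best,p3)`; for `https://xhamster.com/tags/hawaiian?fps=60` it gives `hawaiian/fps=60` and `hawaiian (fps=60,all)`.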

Grub4K · 2023-10-04T18:45:34Z

The ID should be unique, ideally with minimal processing. Using the path for that should work and require no further code. It doesn't matter much since the video ID is the important part.

Extracting only the page and filters requested makes sense from a UX perspective as well imo. I am unsure whether changing the title that way is best, but honestly I have no better idea for what else to do. It's probably fine; video title and ID matter more in this regard anyway.

Laharah · 2024-01-26T00:48:58Z

I should mention that I cloned this PR and it still would not download, either when running from the source folder directly or after building it into a wheel. It may have broken again.

dirkf added 5 commits October 2, 2023 02:38

[XHamster] Set default UA 'Mozilla' to bypass captcha page

296e436

Resolves ytdl-org#32539

[XHamster] Update domain list

bafa9d7

* include domains listed as trusted in page, aliased to xhamster.com
* excluding domains that redirect to xhamster (eg xhday.com)

[XHamster] Revise video extraction

6845e4e

* re-factor extraction code
* use traverse_obj()

[test] Only limit playlist test when playlist_mincount is the only count tested

d912aa0

* eg not when `playlist_count` is specified
* avoid `playlist_mincount` if a `lambda` test may test the count

dirkf mentioned this pull request Oct 3, 2023

xHamster creator yt-dlp/yt-dlp#5232

Closed

9 tasks

dirkf added 6 commits October 4, 2023 00:59

[test] pl_counts

3a31e52

[utils] Add classproperty() decorator from yt-dlp

e6c95bd

[utils] Add classpropinit() decorator for easier use of inherited class vars

d0762cf

[XHamster] Move domain list to base class and introduce classpropinit

44a30c6

[XHamster] Add extraction of user's favorites

71aae1d

[XHamster] Add channel extraction

b2b622a

dirkf force-pushed the df-xhamster-ovrhaul branch from 19cf05a to b2b622a Compare October 4, 2023 00:57

bashonly mentioned this pull request Oct 17, 2023

[xhamster] An extractor error has occurred. (caused by KeyError('videoModel')) yt-dlp/yt-dlp#8369

Open

11 tasks
