Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added the ability to track progress of individual file downloads in snapshot_download() through inner_tqdm_class addition #2718

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

kevinMEH
Copy link

1: Fixed a bug regarding the type of the tqdm_class argument of snapshot_download().

2: Added the ability to track progress of individual file downloads in snapshot_download().

Users can now pass an inner_tqdm_class argument to snapshot_download(), which will be eventually passed onto http_get(), which will use that class if available to instantiate its instance of tqdm.

This enables the user to track the progress of individual file downloads, rather than overall progress (tracked through the regular tqdm_class argument passed into snapshot_download()).

Example helpful scenario:

In my Local-LLM project, I would like to relay the progress of model downloads to the user. Presently, there is no mechanism for such behavior, but through this addition, I can now pass in a custom tqdm class into inner_tqdm_class and receive progress updates through my custom class.

Note: The code below is still a work in progress, but should be sufficient to help illustrate the utility of the addition.

image

Feel free to edit.

Before the fix, tqdm_class accepts an instance of tqdm, not a
class of tqdm.
The inner_tqdm_class argument is passed onto the individual calls of
http_get inside file_download.py and gives the user the ability to
monitor the progress of individual file downloads.
This enables the user to use the regular ThreadPoolExecutor function to
execute _inner_hf_hub_download() in parallel rather than use tqdm's
thread_map(), which calls ThreadPoolExecutor.
@kevinMEH
Copy link
Author

I have made an additional change (commit b282938)

This change lets the user directly use ThreadPoolExecutor to execute file downloads in parallel instead of through tqdm's thread_map.

I propose this change because when using tqdm in a multithread / multiprocess environment, the following warning will pop up:

UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown: {'/mp-91jplfuc'}

One can imagine how seeing such a warning upon every run of their otherwise pristinely programmed application could easily drive an otherwise perfectly sane man to their limits.

Therefore, I propose the following change to skip at least the outer execution of tqdm.

PS: For anyone seeking knowledge on how to avoid the inner execution of tqdm (for the individual file downloads), you can do so by providing your own custom progress bar class: Only update(), enter(), exit() and init() are used by http_get().

image

Once you've replaced the inner_tqdm_class and skipped the outer tqdm execution, you will no longer get the semaphore leak warning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant