Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/sync retry download #23

Merged
merged 5 commits into from
Dec 25, 2024

Conversation

benedikt-schaber
Copy link
Contributor

@benedikt-schaber benedikt-schaber commented Sep 11, 2023

What this PR does

This PR is related to #18. It does not support resuming downloads. Rather it adds retry capabilities to the sync API. It adds the max_retries field to ApiBuilder and Api that is then used in download.
We stream the download to a tempfile, and, if we encounter an error retry at most max_retries times, using range requests to only request the data we do not have already.
We use a new struct ResumableReader to keep track of the number of bytes we downloaded successfully, it is basically just a reduced version of Progressbar.
We assume all bytes that we did receive to be correct.

Left ToDo

  • Use Progressbar with ResumableReader (easy)
  • Proper Testing

Testing

Testing this functionality properly is a bit harder, since we need to ensure the failure of the request. Further, it would be nice to test that we indeed do not continue from the start of the file.

There are different ways we could go about achieving this, here are some I evaluated:

  • mock a http server
  • use a mocking library to mock ureq client and response
  • use something like a hexagonal architecture approach to reduce required mocking

Using a mock server would be simplest, however many existing frameworks, like wiremock or httpmock do not support us returning an ok, in the range case 206, request but then streaming and failing the body if I understood that correctly. So I think we would be forced to implement a proper serve on our own.

Second would be a bit annoying, since we not only use the client, but then also the response (although we just turn that into a Reader), so it would be doable. We of course would not test if the Reader we get from the request truly behaves like we expect it (and like our mock does), but I think this is acceptable.

Third would be much like second although reducing what we need to mock, but also changing some things about the architecture which may or may not be acceptable.

So, this is the main area I am looking to for some feedback, since I think whatever choice is taken, it will have an effect on the rest of the project.

Additionally if we also want to test that it does indeed not just always resume from the start we would also potentially get some issues with std::io::copy since we do not know its buffer size. But that should be rather easy to resolve, so I will look into it once a basic testing strategy is outlined.

@benedikt-schaber benedikt-schaber marked this pull request as ready for review September 11, 2023 08:44
src/api/sync.rs Outdated
Comment on lines 403 to 464
let mut reader = reader::ResumableReader::new(reader, current);
std::io::copy(&mut reader, file)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this is only linearly increasing, couldn't we just Seek into the file ?
Making ResumableReader not necessary (less code is almost always better) ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@McPatate FYI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. file is already at the right position, but we could SeekFrom::End(0) to get the length and use that as current. I will implement the change later today

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to SeekFrom::End I think the response Content-Length should already give that information.
And even then you don't need it either since you would be using increasing Seek.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh I just realized what you meant. Seek::End would be equal to the partial-length, I see, yes that works too (actually wouldn't the cursor already be there naturally given io::copy 's nature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cursor is already there, but we need to get its byte position to plug it into the RANGE header. We could also use SeekFrom::Current(0) (or the equivalent .stream_position()). SeekFrom::End(0) would have the advantage, that we could also use it with .incomplete files directly once we implement that, however, that would also be easily achieved by just doing it once in the beginning.

SeekFrom::Current(0) might have slightly better performance, I would have to check how it is implemented, so we might prefer it.

Or am I missing another method of retrieving the necessary byte offset?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. Maybe file.metadata() then ? I don't know which is the simplest.
Also is there a way to fuse the retry loop with the outer "regular" logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also excuse the late response. file.metadata() would require us to first call file.sync_all() if I am not mistaken, so I do not think that it is desirable. I have used stream_position for now (see next commit).
I'll look into combining the logic this evening.

@Narsil Narsil force-pushed the feat/sync-retry-download branch from 6b7d7e9 to 733600d Compare December 25, 2024 23:40
Copy link
Collaborator

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After waaaayy too long, sorry about that, LGTM.

@Narsil Narsil merged commit edda880 into huggingface:main Dec 25, 2024
3 checks passed
@benedikt-schaber
Copy link
Contributor Author

Hey, thanks for coming back to it.
I had a very busy time back then, so definitely not the best communication from my part, Sorry.
I hope I can contribute more and more efficiently in the future.

@benedikt-schaber benedikt-schaber deleted the feat/sync-retry-download branch December 26, 2024 08:05
@Narsil
Copy link
Collaborator

Narsil commented Dec 27, 2024

No, totally my bad, better late than never in any case.

Thanks a lot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants