Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: iOS app crash caused by the request operation canceling #48350

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

zhouzh1
Copy link

@zhouzh1 zhouzh1 commented Dec 20, 2024

Summary:

Currently we observed many iOS app crashes caused by the [RCTFileRequestHanlder invalidate] method, just as the below screenshot.
image

Changelog:

[IOS] [FIXED] - app crash caused by the [RCTFileRequestHanlder invalidate] method

Test Plan:

I am not able to reproduce this issue locally either, so the changes in this PR are totally from my inference, I am not sure if it really makes sense, so please help take a deeper look, thanks.

@facebook-github-bot facebook-github-bot added CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Shared with Meta Applied via automation to indicate that an Issue or Pull Request has been shared with the team. labels Dec 20, 2024
@zhouzh1
Copy link
Author

zhouzh1 commented Dec 20, 2024

@mojodna @jsierles @augustl Could you help take a look at this PR? Thanks.

@augustl
Copy link
Contributor

augustl commented Dec 20, 2024

@zhouzh1 hi! Out of curiosity, why did you tag me in this PR? I keep getting tagged in React Native PRs for some reason, and I'm not sure why :)

Copy link
Contributor

@cipolleschi cipolleschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is proper fix for the issue.
Could you share more about the crash?
Is there an error message?
Can you share the full crashlog?
Is it happening in production only or also in debug?
Is the app in background or in foreground?

Comment on lines +28 to +35
if (_fileQueue) {
for (NSOperation *operation in _fileQueue.operations) {
if ([operation isKindOfClass:[NSOperation class]] && !operation.isCancelled && !operation.isFinished) {
[operation cancel];
}
}
_fileQueue = nil;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary. You can only add NSOperation to an NSOperationQueue and cancellAllOperations runs the same code you are manually writing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cipolleschi Seems that the cancellAllOperations won't do the status checks for the operation, and in the crash report of my iOS app, the stack trace exactly tells me the crash point is just in the cancellAllOperations internal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the official Apple docs for cancelAllOperations.

This method calls the cancel method on all operations currently in the queue.
Canceling the operations does not automatically remove them from the queue or stop those that are currently executing. For operations that are queued and waiting execution, the queue must still attempt to execute the operation before recognizing that it is canceled and moving it to the finished state. For operations that are already executing, the operation object itself must check for cancellation and stop what it is doing so that it can move to the finished state. In both cases, a finished (or canceled) operation is still given a chance to execute its completion block before it is removed from the queue.

And this is the docs of NSOperation cancel.

In any case, calling cancel on an already cancelled or finished operation does not crash the app.

I believe that the crash is happening inside one of the operations that are being cancelled and that's why the crash reporter reports the crash there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that even though we already put assurance for the corresponding code to make it only executed on the main thread or another sole thread, e.g. the JS thread, but because of the nature of object pointer reference, the operation object could still be shared among multiple thread contexts, then the situation you said above happens.

if (_didInvalidate) {
return;
}
RCTUnsafeExecuteOnMainQueueSync(^{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could potentially deadlock. We should not run the unsafe variant of this method. Can you change it with RCTExecuteOnMainQueue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.


RCTAssertMainQueue();
RCTLogInfo(@"Invalidating %@ (parent: %@, executor: %@)", self, _parentBridge, [self executorClass]);
RCTAssertMainQueue();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that confuses me is that, in your stacktrace, the crash is happening in Thread 26... but this assert should force the app to be on the main thread, which is not the Thread 26... how's this possible?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cipolleschi Good question, that's because of actually the RCTAssertMainQueue only takes effects in dev build, when in the release build, it does nothing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I added the RCTUnsafeExecuteOnMainQueueSync wrapper to ensure the code to run on the main thread.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image As you can see, when the `NS_BLOCK_ASSERTIONS` is defined, the `RCTAssert` macro is actually empty, and generally the `NS_BLOCK_ASSERTIONS` is defined in release build.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good explanation, but then we should see crashes in development happening because of the assertion. And IIUC, the app does not crash in development, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By looking at the crash log, the JS thread is triggering the invalidation. I think that this is the root of the problem: after the JS thread detect the invalidation, we should jump on the UI thread to invalidate everything...

Copy link
Author

@zhouzh1 zhouzh1 Dec 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't encounter this crash in development, but I am not sure if it happens in it, you know, it's a occasional issue on itself.

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 20, 2024

@zhouzh1 hi! Out of curiosity, why did you tag me in this PR? I keep getting tagged in React Native PRs for some reason, and I'm not sure why :)

Hey @augustl, sorry, I thought you're a memeber of the react-native dev team, because your id was displayed in the hint list when I was typing the @ char in the comment box. 😸

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 20, 2024

@cipolleschi Put more information here for your reference.

  1. I only observed this crash in production by sentry, have not seen it in dev.
  2. According the sentry breadcrumbs, seems that most of the events happened after the app was pushed to the background state.
  3. You can see the relevant error messages that we collected by sentry on the below screenshot.
image 4. Posted a complete crash report. [79c34beca2e542da9ab6df2ad54ec60d-symbolicated.crash.zip](https://github.com/user-attachments/files/18211158/79c34beca2e542da9ab6df2ad54ec60d-symbolicated.crash.zip)

@RSNara
Copy link
Contributor

RSNara commented Dec 20, 2024

@zhouzh1, this diff assumes that the crash occurs because file reader invalidation should occur on the main thread. Have we validated that assumption? If not, we should! If so, then, why not just just surgically modify the file reader as opposed to the bridge? As it stands, your changes will cause a lot of code to execute on the main thread, which could have significant perf implications.

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 21, 2024

@zhouzh1, this diff assumes that the crash occurs because file reader invalidation should occur on the main thread. Have we validated that assumption? If not, we should! If so, then, why not just just surgically modify the file reader as opposed to the bridge? As it stands, your changes will cause a lot of code to execute on the main thread, which could have significant perf implications.

@RSNara As you can see from the code and the above conversation with @cipolleschi , there is already a RCTAssertMainQueue() invocation in the start of the invalidation, that means the original intention is exactly needing the corresponding code running on the main queue? Just exactly because of accidentally the official dev team didn't notice the RCTAssertMainQueue won't take effect in release build on most of the cases (it's just my guess and inference, if the dev team knew that and did it intentionally, please ignore).

@RSNara
Copy link
Contributor

RSNara commented Dec 23, 2024

@zhouzh1, this diff assumes that the crash occurs because file reader invalidation should occur on the main thread. Have we validated that assumption? If not, we should! If so, then, why not just just surgically modify the file reader as opposed to the bridge? As it stands, your changes will cause a lot of code to execute on the main thread, which could have significant perf implications.

@RSNara As you can see from the code and the above conversation with @cipolleschi , there is already a RCTAssertMainQueue() invocation in the start of the invalidation, that means the original intention is exactly needing the corresponding code running on the main queue? Just exactly because of accidentally the official dev team didn't notice the RCTAssertMainQueue won't take effect in release build on most of the cases (it's just my guess and inference, if the dev team knew that and did it intentionally, please ignore).

I just took a look!

RCTCxxBridge invalidate should only ever be called from RCTBridge. And RCTBridge invalidate and reload (but not dealloc) schedules RCTCxxBridge invalidate on the main thread:

https://github.com/facebook/react-native/blob/main/packages/react-native/React/Base/RCTBridge.mm?fbclid=IwZXh0bgNhZW0CMTEAAR0vNpP3fC5jmfsIbD3EGeU8ynLhumgdGU2JgStLpsMNIT8VNT9PqeT3N2I_aem__qaj5Hd0GAQmvSVLs7dQZA#L368

So, it's very curious that you're running into this issue. Could it be that in your code, you're relying on the dealloc method of RCTBridge? And that just synchronously deallocates the RCTCxxBridge on the current (i.e: potentially non-main) thread?

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 24, 2024

@zhouzh1, this diff assumes that the crash occurs because file reader invalidation should occur on the main thread. Have we validated that assumption? If not, we should! If so, then, why not just just surgically modify the file reader as opposed to the bridge? As it stands, your changes will cause a lot of code to execute on the main thread, which could have significant perf implications.

@RSNara As you can see from the code and the above conversation with @cipolleschi , there is already a RCTAssertMainQueue() invocation in the start of the invalidation, that means the original intention is exactly needing the corresponding code running on the main queue? Just exactly because of accidentally the official dev team didn't notice the RCTAssertMainQueue won't take effect in release build on most of the cases (it's just my guess and inference, if the dev team knew that and did it intentionally, please ignore).

I just took a look!

RCTCxxBridge invalidate should only ever be called from RCTBridge. And RCTBridge invalidate and reload (but not dealloc) schedules RCTCxxBridge invalidate on the main thread:

https://github.com/facebook/react-native/blob/main/packages/react-native/React/Base/RCTBridge.mm?fbclid=IwZXh0bgNhZW0CMTEAAR0vNpP3fC5jmfsIbD3EGeU8ynLhumgdGU2JgStLpsMNIT8VNT9PqeT3N2I_aem__qaj5Hd0GAQmvSVLs7dQZA#L368

So, it's very curious that you're running into this issue. Could it be that in your code, you're relying on the dealloc method of RCTBridge? And that just synchronously deallocates the RCTCxxBridge on the current (i.e: potentially non-main) thread?

Your curiosity is mine as well. Before I submitted this PR, I was also suspecting if there is a certain place in my code or 3rd-party library code where the RCTBridge or the RCTCxxBridge deallocation is invoked explicitly, but I didn't manage to find it. However anyway, we always need to ensure the RCTCxxBridge invalidation to be run on the main thread, is it right? If so, it makes sense to wrap it with the RCTExecuteOnMainQueue?

@cipolleschi
Copy link
Contributor

@zhouzh1 is your app using Expo? I wonder if Expo does something under the hood to try and manage the lifecycle of the Bridge. Similarly, there might be libraries that attempt to do the same. If they connect some private API like the reload one to some JS function, it might happen that the invalidation process starts from the JS Thread instead of from the main one. 🤔

This is an hypothesis that we need to validate, thought.. It would be helpful to know what dependencies are you using. Also, are you using any crash reporting solution like Sentry? those product usually allow you to leave breadcrumbs that can be used to investigate crashes. What was the user doing in the app when the crash occurred? What was the last action issued?

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 24, 2024

@zhouzh1 is your app using Expo? I wonder if Expo does something under the hood to try and manage the lifecycle of the Bridge. Similarly, there might be libraries that attempt to do the same. If they connect some private API like the reload one to some JS function, it might happen that the invalidation process starts from the JS Thread instead of from the main one. 🤔

This is an hypothesis that we need to validate, thought.. It would be helpful to know what dependencies are you using. Also, are you using any crash reporting solution like Sentry? those product usually allow you to leave breadcrumbs that can be used to investigate crashes. What was the user doing in the app when the crash occurred? What was the last action issued?

Yes, our app is using the expo and many of its associated libraries (e.g. the expo-updates, expo-camera, and so on), I think what you said above makes sense.
According to the Sentry breadcrumbs we collected, I found most of the crash events occurring after the app was pushed to the background state, and meanwhile, there were some requests in pending.
image

@zhouzh1
Copy link
Author

zhouzh1 commented Dec 25, 2024

@cipolleschi Any ideas about the above information I provided?

@cipolleschi
Copy link
Contributor

It partially makes sense. Moving an app to background triggers the events on the main thread. So the whole invalidation should already happen in the right thread.
But from your stack trace, the invalidation seems to arrive from a different thread. That's what we do not understand.

Also, the purpose of the invalidation and requestCancelling, is exactly to avoid those kind of crashes... As soon as the app goes in background, the operation are cancelled and they should not be executed... :/


Any chance that you can try and prepare a repro using this template, so that we can investigate this problem with more ease?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Shared with Meta Applied via automation to indicate that an Issue or Pull Request has been shared with the team.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants