Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/lwip: fix race in async sock API with event #21093

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

maribu
Copy link
Member

@maribu maribu commented Dec 16, 2024

Contribution description

In TCP server mode, the sock_tcp_t sockets are managed by the network stack and can be reused if a previous connection is no longer in used. However, an event may still be posted in the event queue when the socket is reused. Wiping it will result in the next pointer in that event to be NULL, which will cause the event handler fetching that event to crash.

This adds an event_cancel() at two places:

  1. Just before reusing the socket
  2. During sock_tcp_disconnect()

The former catches issues in server mode e.g. when a connect has been closed (e.g. due to timeout) and is reused before a pending event (e.g. a timeout event) has been processed.

The letter may be an issue on client side. E.g. when sock_tcp_t was allocated on stack and goes out of scope after sock_tcp_disconnect but before the event handler was run.

Testing procedure

I was able to see this bug in the wild in my work-in-progress PR that adds CoAP over TCP to nanocoap. But that code is not in a well shape now and has other issues.

A test that puts an async TCP server with lots of connections through the paces would likely be a good demonstrator.

Issues/PRs references

None

In TCP server mode, the sock_tcp_t sockets are managed by the network
stack and can be reused if a previous connection is no longer in used.
However, an event may still be posted in the event queue when the socket
is reused. Wiping it will result in the `next` pointer in that event to
be NULL, which will cause the event handler fetching that event to
crash.

This adds an `event_cancel()` at two places:
1. Just before reusing the socket
2. During sock_tcp_disconnect()

The former catches issues in server mode e.g. when a connect has been
closed (e.g. due to timeout) and is reused before a pending event (e.g.
a timeout event) has been processed.

The letter may be an issue on client side. E.g. when `sock_tcp_t` was
allocated on stack and goes out of scope after `sock_tcp_disconnect`
but before the event handler was run.
@maribu maribu added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Dec 16, 2024
@github-actions github-actions bot added Area: network Area: Networking Area: pkg Area: External package ports labels Dec 16, 2024
@maribu
Copy link
Member Author

maribu commented Dec 16, 2024

I'm not super confident in this PR in regard to regressions. My WIP CoAP over TCP server still behaves oddly. I was unable to crash it anymore with this applied.

@riot-ci
Copy link

riot-ci commented Dec 16, 2024

Murdock results

✔️ PASSED

8ac3a43 pkg/lwip: fix race in async sock API with event

Success Failures Total Runtime
10249 0 10249 16m:38s

Artifacts

@benpicco benpicco requested a review from yarrick December 16, 2024 20:19
@maribu
Copy link
Member Author

maribu commented Dec 17, 2024

Aaand if seen another segfault, again with the event queue pointing to the event in a sock_tcp_t and the next ptr in there being NULL.

@yarrick
Copy link
Contributor

yarrick commented Dec 19, 2024

I don't really have any insight in how this part of the code works.

@miri64
Copy link
Member

miri64 commented Dec 20, 2024

Aaand if seen another segfault, again with the event queue pointing to the event in a sock_tcp_t and the next ptr in there being NULL.

Since you are working on network interface code and the segfault happens a few layers up, maybe the packet times out and is deleted while it is still in the queue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: network Area: Networking Area: pkg Area: External package ports CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants