Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AP_ESC_Telem: fix RPM timeout race #28999

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

robertlong13
Copy link
Collaborator

Alternative approach to #28987.

We don't need to worry about explicitly checking the timestamp in most places; they can just use the data_valid flag directly. This eliminates the race without harming the performance of get_rpm at all (in fact, it improves it, even if just barely). The only places that need to actually check the timestamp are update, where the data_valid gets cleared on timeout, and are_motors_running, which has a stricter timeout than data_valid does. These two run much less frequently than get_rpm, so the extra copying shouldn't be a problem.

This might change the timing of when the data is considered stale by a few microseconds, since it's checking the timestamp at 100Hz instead of every time we get RPM, but that should be absolutely fine. The other change you might notice is that previously rpm_data_within_timeout explicitly checked if last_update_us was zero and I don't. That check was unnecessary in there though, because data_valid cannot possibly be true if last_update_us is zero.

I actually really like this approach. I'll probably do something similar to solve the TelemetryData::stale race. And if we don't mind the signed-int stuff in my other PR, then I'll take this idea over to that PR.

This approach does not solve the potential race in the slew calculation, but I'm guessing that's probably fine. If it was a problem, we probably would have noticed by now. My other approach didn't fully fix the slew calculation either (if you interrupted the backend after it updated prev_rpm and rpm, but before it updates last_update_us, then you'll get a little jump; that should be very rare though).

@robertlong13 robertlong13 requested a review from andyp1per January 3, 2025 07:09
Copy link
Collaborator

@andyp1per andyp1per left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertlong13 I am happier with this approach, but I need to point out that this code is very, very subtle with all sorts of bad things happening (crashed aircraft) if we get it wrong. @magicrub originally did the rpm_data_within_timeout() changes which fundamentally relies on the 0 value for the timeout so we should get him to review and also check his PR. We should also be striving for the smallest change possible that is demonstrable correct.

libraries/AP_ESC_Telem/AP_ESC_Telem.cpp Outdated Show resolved Hide resolved
libraries/AP_ESC_Telem/AP_ESC_Telem.cpp Outdated Show resolved Hide resolved
libraries/AP_ESC_Telem/AP_ESC_Telem.cpp Show resolved Hide resolved
@andyp1per
Copy link
Collaborator

We also need a comment about the 100Hz update in AP_Vehicle as we are now heavily reliant on that. Probably also needs a comment in AP_Vehicle.

The RPM telemetry data structure can get updated by another thread at
any time. This can cause (now - last_update_us) to be negative, which
looks like the data is nearly UINT32_MAX microseconds stale. To fix
this, we copy the last_update_us value before we get the current time
so it's guaranteed to be positive.

We also minimize the number of places we explicitly check the timestamp
and rely on the `data_valid` where possible to minimize the performance
impact of this change.
@robertlong13 robertlong13 force-pushed the pr/fix-rpm-timeout-race-2 branch from 9bcaec6 to 8ab9ff9 Compare January 3, 2025 12:31
@robertlong13
Copy link
Collaborator Author

We also need a comment about the 100Hz update in AP_Vehicle as we are now heavily reliant on that. Probably also needs a comment in AP_Vehicle.

I wouldn't say we're heavily reliant on 100Hz. It definitely needs to be faster than 0.0002Hz, but even if it was called as slow as 10Hz (which is what it used to be, but I doubt we're going back to that, and we definitely won't be going slower than it), our arbitrary 1s timeout would be worst-case 1.1s, and average 1.05s.

I'll chuck a comment in AP_Vehicle anyway.

@robertlong13
Copy link
Collaborator Author

robertlong13 commented Jan 3, 2025

@magicrub originally did the rpm_data_within_timeout() changes which fundamentally relies on the 0 value for the timeout so we should get him to review and also check his PR.

Really? I only saw the PR from @WickedShell, which is why I marked him for review (though I forgot to mark him for this second one, thanks for reminding me).

@robertlong13 robertlong13 changed the title Alt: AP_ESC_Telem: fix RPM timeout race AP_ESC_Telem: fix RPM timeout race Jan 3, 2025
@andyp1per
Copy link
Collaborator

Really? I only saw the PR from @WickedShell, which is why I marked him for review (though I forgot to mark him for this second one, thanks for reminding me).

Sorry my mistake - yes @WickedShell - there were issues with motors in transition IIRC

Copy link
Collaborator

@andyp1per andyp1per left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ok to me, but I would like @WickedShell to verify

Copy link
Contributor

@WickedShell WickedShell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been awhile since I've dove through this, so I'm not 100% confident, but it generally seems fine.

It's worth noting that all the other accessors (get_active_esc_mask/get_max_rpm_esc etc) all got more expensive as they are now all getting their own copy of time. (I know getting the timestamp in milliseconds was surprisingly expensive, I'm not sure if the micros one is as expensive though).

As an aside all of the time management stuff is a here be dragons area, I don't think this makes it any worse then it was.

@andyp1per
Copy link
Collaborator

Given the sensitivity of the code I think we will need to fly before merging

@robertlong13
Copy link
Collaborator Author

It's worth noting that all the other accessors (get_active_esc_mask/get_max_rpm_esc etc) all got more expensive as they are now all getting their own copy of time.

It's the other way around. Those accessors got less expensive, as now they don't need a copy of time, they just check the data_valid flag.

The only functions that got any more expensive are are_motors_running and update

@WickedShell
Copy link
Contributor

It's worth noting that all the other accessors (get_active_esc_mask/get_max_rpm_esc etc) all got more expensive as they are now all getting their own copy of time.

It's the other way around. Those accessors got less expensive, as now they don't need a copy of time, they just check the data_valid flag.

The only functions that got any more expensive are are_motors_running and update

Right, I was referring to anything leaning against rpm_data_within_timeout but then I managed to get confused while jumping around through things here, and thought it had been left in more places then it was. Sorry!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants