AP_ESC_Telem: fix RPM timeout race #28999

robertlong13 · 2025-01-03T07:09:20Z

Alternative approach to #28987.

We don't need to worry about explicitly checking the timestamp in most places; they can just use the data_valid flag directly. This eliminates the race without harming the performance of get_rpm at all (in fact, it improves it, even if just barely). The only places that need to actually check the timestamp are update, where the data_valid gets cleared on timeout, and are_motors_running, which has a stricter timeout than data_valid does. These two run much less frequently than get_rpm, so the extra copying shouldn't be a problem.

This might change the timing of when the data is considered stale by a few microseconds, since it's checking the timestamp at 100Hz instead of every time we get RPM, but that should be absolutely fine. The other change you might notice is that previously rpm_data_within_timeout explicitly checked if last_update_us was zero and I don't. That check was unnecessary in there though, because data_valid cannot possibly be true if last_update_us is zero.

I actually really like this approach. I'll probably do something similar to solve the TelemetryData::stale race. And if we don't mind the signed-int stuff in my other PR, then I'll take this idea over to that PR.

This approach does not solve the potential race in the slew calculation, but I'm guessing that's probably fine. If it was a problem, we probably would have noticed by now. My other approach didn't fully fix the slew calculation either (if you interrupted the backend after it updated prev_rpm and rpm, but before it updates last_update_us, then you'll get a little jump; that should be very rare though).

andyp1per

@robertlong13 I am happier with this approach, but I need to point out that this code is very, very subtle with all sorts of bad things happening (crashed aircraft) if we get it wrong. @magicrub originally did the rpm_data_within_timeout() changes which fundamentally relies on the 0 value for the timeout so we should get him to review and also check his PR. We should also be striving for the smallest change possible that is demonstrable correct.

libraries/AP_ESC_Telem/AP_ESC_Telem.cpp

andyp1per · 2025-01-03T09:41:44Z

We also need a comment about the 100Hz update in AP_Vehicle as we are now heavily reliant on that. Probably also needs a comment in AP_Vehicle.

The RPM telemetry data structure can get updated by another thread at any time. This can cause (now - last_update_us) to be negative, which looks like the data is nearly UINT32_MAX microseconds stale. To fix this, we copy the last_update_us value before we get the current time so it's guaranteed to be positive. We also minimize the number of places we explicitly check the timestamp and rely on the `data_valid` where possible to minimize the performance impact of this change.

robertlong13 · 2025-01-03T12:40:45Z

We also need a comment about the 100Hz update in AP_Vehicle as we are now heavily reliant on that. Probably also needs a comment in AP_Vehicle.

I wouldn't say we're heavily reliant on 100Hz. It definitely needs to be faster than 0.0002Hz, but even if it was called as slow as 10Hz (which is what it used to be, but I doubt we're going back to that, and we definitely won't be going slower than it), our arbitrary 1s timeout would be worst-case 1.1s, and average 1.05s.

I'll chuck a comment in AP_Vehicle anyway.

robertlong13 · 2025-01-03T12:51:14Z

@magicrub originally did the rpm_data_within_timeout() changes which fundamentally relies on the 0 value for the timeout so we should get him to review and also check his PR.

Really? I only saw the PR from @WickedShell, which is why I marked him for review (though I forgot to mark him for this second one, thanks for reminding me).

andyp1per · 2025-01-03T15:47:00Z

Really? I only saw the PR from @WickedShell, which is why I marked him for review (though I forgot to mark him for this second one, thanks for reminding me).

Sorry my mistake - yes @WickedShell - there were issues with motors in transition IIRC

andyp1per

This looks ok to me, but I would like @WickedShell to verify

WickedShell

It's been awhile since I've dove through this, so I'm not 100% confident, but it generally seems fine.

It's worth noting that all the other accessors (get_active_esc_mask/get_max_rpm_esc etc) all got more expensive as they are now all getting their own copy of time. (I know getting the timestamp in milliseconds was surprisingly expensive, I'm not sure if the micros one is as expensive though).

As an aside all of the time management stuff is a here be dragons area, I don't think this makes it any worse then it was.

andyp1per · 2025-01-03T19:36:23Z

Given the sensitivity of the code I think we will need to fly before merging

robertlong13 · 2025-01-03T21:55:56Z

It's worth noting that all the other accessors (get_active_esc_mask/get_max_rpm_esc etc) all got more expensive as they are now all getting their own copy of time.

It's the other way around. Those accessors got less expensive, as now they don't need a copy of time, they just check the data_valid flag.

The only functions that got any more expensive are are_motors_running and update

WickedShell · 2025-01-03T23:06:44Z

It's worth noting that all the other accessors (get_active_esc_mask/get_max_rpm_esc etc) all got more expensive as they are now all getting their own copy of time.

It's the other way around. Those accessors got less expensive, as now they don't need a copy of time, they just check the data_valid flag.

The only functions that got any more expensive are are_motors_running and update

Right, I was referring to anything leaning against rpm_data_within_timeout but then I managed to get confused while jumping around through things here, and thought it had been left in more places then it was. Sorry!

robertlong13 requested a review from andyp1per January 3, 2025 07:09

andyp1per requested changes Jan 3, 2025

View reviewed changes

libraries/AP_ESC_Telem/AP_ESC_Telem.cpp Outdated Show resolved Hide resolved

libraries/AP_ESC_Telem/AP_ESC_Telem.cpp Outdated Show resolved Hide resolved

libraries/AP_ESC_Telem/AP_ESC_Telem.cpp Show resolved Hide resolved

robertlong13 force-pushed the pr/fix-rpm-timeout-race-2 branch from 9bcaec6 to 8ab9ff9 Compare January 3, 2025 12:31

AP_Vehicle: add comment for ESC Telem update rate

6d50a36

robertlong13 requested a review from WickedShell January 3, 2025 12:51

robertlong13 changed the title ~~Alt: AP_ESC_Telem: fix RPM timeout race~~ AP_ESC_Telem: fix RPM timeout race Jan 3, 2025

andyp1per approved these changes Jan 3, 2025

View reviewed changes

WickedShell approved these changes Jan 3, 2025

View reviewed changes

andyp1per added BUG NeedsTesting labels Jan 3, 2025

tpwrules mentioned this pull request Jan 5, 2025

AP_Scripting: add bindings for servo telemetry #28857

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AP_ESC_Telem: fix RPM timeout race #28999

AP_ESC_Telem: fix RPM timeout race #28999

robertlong13 commented Jan 3, 2025

andyp1per left a comment

andyp1per commented Jan 3, 2025

robertlong13 commented Jan 3, 2025

robertlong13 commented Jan 3, 2025 •

edited

Loading

andyp1per commented Jan 3, 2025

andyp1per left a comment

WickedShell left a comment

andyp1per commented Jan 3, 2025

robertlong13 commented Jan 3, 2025

WickedShell commented Jan 3, 2025

AP_ESC_Telem: fix RPM timeout race #28999

Are you sure you want to change the base?

AP_ESC_Telem: fix RPM timeout race #28999

Conversation

robertlong13 commented Jan 3, 2025

andyp1per left a comment

Choose a reason for hiding this comment

andyp1per commented Jan 3, 2025

robertlong13 commented Jan 3, 2025

robertlong13 commented Jan 3, 2025 • edited Loading

andyp1per commented Jan 3, 2025

andyp1per left a comment

Choose a reason for hiding this comment

WickedShell left a comment

Choose a reason for hiding this comment

andyp1per commented Jan 3, 2025

robertlong13 commented Jan 3, 2025

WickedShell commented Jan 3, 2025

robertlong13 commented Jan 3, 2025 •

edited

Loading