AP_ESC_Telem: fix RPM timeout race #28999
base: master
Conversation
@robertlong13 I am happier with this approach, but I need to point out that this code is very, very subtle, with all sorts of bad things happening (crashed aircraft) if we get it wrong. @magicrub originally did the rpm_data_within_timeout() changes, which fundamentally rely on the 0 value for the timeout, so we should get him to review this and also check his PR. We should also be striving for the smallest change possible that is demonstrably correct.
We also need a comment about the 100Hz update, as we are now heavily reliant on that. Probably also needs a comment in AP_Vehicle.
The RPM telemetry data structure can get updated by another thread at any time. This can cause `(now - last_update_us)` to be negative, which looks like the data is nearly UINT32_MAX microseconds stale. To fix this, we copy the `last_update_us` value before we get the current time, so the difference is guaranteed to be non-negative. We also minimize the number of places we explicitly check the timestamp and rely on the `data_valid` flag where possible, to minimize the performance impact of this change.
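To make the failure mode concrete, here is a minimal standalone sketch of the pattern being described. The names (`RpmSlot`, `is_fresh_*`, the `micros` callback) are illustrative, not the actual AP_ESC_Telem types; in ArduPilot the clock would come from `AP_HAL::micros()`.

```cpp
#include <cstdint>
#include <atomic>

struct RpmSlot {
    std::atomic<uint32_t> last_update_us{0};   // written by the telemetry backend thread
};

// Racy pattern: if the backend writes last_update_us between capturing 'now_us'
// and reading the field, last_update_us can be newer than 'now_us', and the
// unsigned subtraction wraps to nearly UINT32_MAX, making fresh data look
// roughly 71 minutes stale.
bool is_fresh_racy(const RpmSlot &slot, uint32_t now_us, uint32_t timeout_us)
{
    return (now_us - slot.last_update_us.load()) < timeout_us;
}

// Fixed pattern from the commit message: snapshot last_update_us *before*
// sampling the clock, so 'now' is taken at or after the snapshot and the
// difference cannot wrap.
bool is_fresh_fixed(const RpmSlot &slot, uint32_t (*micros)(), uint32_t timeout_us)
{
    const uint32_t last = slot.last_update_us.load();
    const uint32_t now  = micros();
    return (now - last) < timeout_us;
}
```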
Force-pushed from 9bcaec6 to 8ab9ff9
I wouldn't say we're heavily reliant on 100Hz. It definitely needs to be faster than 0.0002Hz, but even if it were called as slowly as 10Hz (which is what it used to be, but I doubt we're going back to that, and we definitely won't be going slower than it), our arbitrary 1s timeout would be worst-case 1.1s and average 1.05s. I'll chuck a comment in AP_Vehicle anyway.
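For reference, a tiny sketch of the arithmetic behind those figures (purely illustrative, not ArduPilot code): with a timeout checked at rate `f`, staleness is detected somewhere between the timeout itself and one check period (`1/f`) later.

```cpp
#include <cstdio>

// Worst-case and average staleness-detection delay for a given timeout and check rate.
static void staleness_bounds(double timeout_s, double rate_hz)
{
    const double period  = 1.0 / rate_hz;
    const double worst   = timeout_s + period;        // check fires just before expiry
    const double average = timeout_s + period / 2.0;  // expiry uniformly placed within a period
    std::printf("%6.1f Hz: worst %.3f s, average %.3f s\n", rate_hz, worst, average);
}

int main()
{
    staleness_bounds(1.0, 10.0);   // 1.100 s worst, 1.050 s average (the figures quoted above)
    staleness_bounds(1.0, 100.0);  // 1.010 s worst, 1.005 s average
    return 0;
}
```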
Really? I only saw the PR from @WickedShell, which is why I marked him for review (though I forgot to mark him for this second one, thanks for reminding me).
Sorry, my mistake - yes, @WickedShell - there were issues with motors in transition, IIRC.
This looks ok to me, but I would like @WickedShell to verify
It's been a while since I've dived through this, so I'm not 100% confident, but it generally seems fine.
It's worth noting that all the other accessors (get_active_esc_mask/get_max_rpm_esc, etc.) got more expensive, as they are now all getting their own copy of the time. (I know getting the timestamp in milliseconds was surprisingly expensive; I'm not sure if the micros one is as expensive, though.)
As an aside, all of the time management stuff is a here-be-dragons area; I don't think this makes it any worse than it was.
Given the sensitivity of the code, I think we will need to fly this before merging.
It's the other way around. Those accessors got less expensive: now they don't need a copy of the time, they just check the `data_valid` flag. The only functions that got any more expensive are `update` and `are_motors_running`.
Right, I was referring to anything leaning against
Alternative approach to #28987.

We don't need to worry about explicitly checking the timestamp in most places; they can just use the `data_valid` flag directly. This eliminates the race without harming the performance of `get_rpm` at all (in fact, it improves it, even if just barely). The only places that need to actually check the timestamp are `update`, where `data_valid` gets cleared on timeout, and `are_motors_running`, which has a stricter timeout than `data_valid` does. These two run much less frequently than `get_rpm`, so the extra copying shouldn't be a problem.

This might change the timing of when the data is considered stale by a few milliseconds, since we're checking the timestamp at 100Hz instead of every time we get RPM, but that should be absolutely fine.

The other change you might notice is that previously `rpm_data_within_timeout` explicitly checked whether `last_update_us` was zero, and I don't. That check was unnecessary there, because `data_valid` cannot possibly be true if `last_update_us` is zero.

I actually really like this approach. I'll probably do something similar to solve the `TelemetryData::stale` race. And if we don't mind the signed-int stuff in my other PR, then I'll take this idea over to that PR.

This approach does not solve the potential race in the `slew` calculation, but I'm guessing that's probably fine; if it were a problem, we probably would have noticed by now. My other approach didn't fully fix the slew calculation either (if you interrupt the backend after it has updated `prev_rpm` and `rpm`, but before it updates `last_update_us`, you'll get a little jump; that should be very rare, though).
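To sketch the shape this description implies, here is a simplified single-ESC version with illustrative names and an assumed 1s timeout; the real AP_ESC_Telem code is per-ESC and differs in detail, and thread-safety details (atomic/volatile access) are omitted here.

```cpp
#include <cstdint>

struct RpmData {
    float    rpm;
    uint32_t last_update_us;   // written by the telemetry backend thread
    bool     data_valid;       // cleared by update() on timeout
};

static constexpr uint32_t RPM_TIMEOUT_US = 1000000;  // assumed 1s staleness timeout

// Called at ~100Hz from the vehicle loop: the only place the staleness
// timeout is applied, so everywhere else can consume the flag cheaply.
void update(RpmData &d, uint32_t (*micros)())
{
    const uint32_t last = d.last_update_us;   // snapshot before reading the clock
    const uint32_t now  = micros();
    if (d.data_valid && (now - last) > RPM_TIMEOUT_US) {
        d.data_valid = false;
    }
}

// Hot-path accessor: no clock read, no timestamp math, just the flag.
bool get_rpm(const RpmData &d, float &rpm_out)
{
    if (!d.data_valid) {
        return false;
    }
    rpm_out = d.rpm;
    return true;
}

// Stricter check: needs its own (shorter) timeout, so it still snapshots the
// timestamp, but it runs far less often than get_rpm(). Gating on data_valid
// first is also why no explicit zero check on last_update_us is needed.
bool are_motors_running(const RpmData &d, uint32_t strict_timeout_us, uint32_t (*micros)())
{
    if (!d.data_valid) {
        return false;
    }
    const uint32_t last = d.last_update_us;
    const uint32_t now  = micros();
    return (now - last) < strict_timeout_us && d.rpm > 0;
}
```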