Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

peterbarker
Copy link
Contributor

Co-authored-by: Leonard Hall [email protected]

the cross-product code can produce something slightly negative. Stop passing that into safe_sqrt; it isn't that safe on SITL!

@khancyr
Copy link
Contributor

khancyr commented Jan 1, 2025

Shouldn't be this done in safe_sqrt ? I would assume that we have some other place that can have this issue

@peterbarker
Copy link
Contributor Author

Shouldn't be this done in safe_sqrt ? I would assume that we have some other place that can have this issue

Well... we really want to know about where this actually does happen. This PR isn't really a satisfactory solution to what we're seeing IMO - it would be nice to trace back the maths to work out how the rhs ens up larger than the lhs in here. #28969 is the issue if you missed it

@IamPete1
Copy link
Member

IamPete1 commented Jan 1, 2025

I made a Geogebra to visualise the maths.

https://www.geogebra.org/calculator/ktxy2uyf

I don't see any issue with the maths. Must be floating point stuff.

If I were to guess it would be vel.is_zero(). It checks the individual X and Y components are not zero. This does not mean that the length is none zero (when using floats). Squaring a very small number which is not zero might result in a number which is zero. So you get zeros before the addition and square root that would make the number larger again.

Maybe we need a new method length_is_zero we could compare components against sqrt(FLT_EPSILON) rather than FLT_EPSILON.

@tpwrules

This comment was marked as outdated.

@tpwrules
Copy link
Contributor

tpwrules commented Jan 1, 2025

Pete is close, the problem is imprecision in Vector2f.limit_length() on line 387.

The vector is (99.9791489, -557.290039) and the length limit is 566.187256. length() of the vector is 566.187256 so limit_length() doesn't do anything and the safe_sqrt runs. But length_squared() is 320568.031250 while accel_max*accel_max is 320568.000000, a change of the least significant bit. The difference in these quantities is exactly the value that was passed to safe_sqrt. Lucky shot!

  • We could implement a MAX with 0, that seems okay
  • We could adjust the safe_sqrt call to not square accel_max, but that would require an extra square root
  • We could adjust limit_length to check the squared length, or have an alternate method for where we care about the squared limit

I'm not sure how to guarantee the vector length quantity is actually <= the limit after the math in limit_length() no matter how it's done. I am guessing that could have errors too if they are very close. Anybody have any idea what rounding mode we use?

@tpwrules
Copy link
Contributor

tpwrules commented Jan 1, 2025

Doing some research and fiddling suggests this is the correct solution. We need to audit other uses of limit_length() to make sure they won't explode if the vector ends up slightly over the limit.

It's not feasible due to the vagaries of floating point math to guarantee it's always equal or under.

@@ -387,7 +387,7 @@ bool limit_accel_xy(const Vector2f& vel, Vector2f& accel, float accel_max)
if (accel_cross.limit_length(accel_max)) {
accel_dir = 0.0;
} else {
float accel_max_dir = safe_sqrt(sq(accel_max) - accel_cross.length_squared());
float accel_max_dir = safe_sqrt(MAX(0, sq(accel_max) - accel_cross.length_squared()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
float accel_max_dir = safe_sqrt(MAX(0, sq(accel_max) - accel_cross.length_squared()));
// limit_length can't absolutely guarantee this subtraction won't be slightly negative
float accel_max_dir = safe_sqrt(MAX(0, sq(accel_max) - accel_cross.length_squared()));

@IamPete1
Copy link
Member

IamPete1 commented Jan 1, 2025

  • We could adjust limit_length to check the squared length, or have an alternate method for where we care about the squared limit

We could limit to the smaller of the the existing and the squared so we always conservative.

On the whole I tend to think know we know what is going on we probably don't need to panic. I'd even be tempted not to fix this case so we can see if a recent change had made this more likely.

We do have a test for the function, maybe we should add some cases there.

TEST(Vector2Test, length)
{
EXPECT_FLOAT_EQ(25, Vector2f(3, 4).length_squared());
Vector2f v_float1(1.0f, 1.0f);
EXPECT_TRUE(v_float1.limit_length(1.0f));
EXPECT_FALSE(Vector2f(-0.0f, 0.0f).limit_length(1.0f));
}

@peterbarker
Copy link
Contributor Author

Closes #28969

@tpwrules
Copy link
Contributor

tpwrules commented Jan 1, 2025

I don't actually think limit_length() can be entirely fixed unfortunately. I think it's better to just be mindful of its limitations.

And it has happened again here: https://github.com/ArduPilot/ardupilot/actions/runs/12572260649/job/35044375709?pr=28857

Co-authored-by: Leonard Hall <[email protected]>

the  cross-product code can produce something slightly negative.  Stop passing that into safe_sqrt; it isn't *that* safe on SITL!
@peterbarker peterbarker added the WIP label Jan 2, 2025
in this corner case we *do* need to scale the vector
this can avoid problems where floating point precision means that post-square-rooting the lengths are considered equal.  Subtracting the squares after making such a check can lead to attempting to take the square root of a negative number
@peterbarker peterbarker force-pushed the pr/prevent-fpe-control branch from 6d764d3 to a73e086 Compare January 2, 2025 05:22
@peterbarker
Copy link
Contributor Author

I've written a test for this now so other people can play with things in gdb easily.

I've also pushed up a patch reverting the original fix and proposing an alternative fix. Basically stop using the apples comparison to determine if we do the oranges maths - use a single fruit in both cases.

@peterbarker
Copy link
Contributor Author

And it has happened again here: https://github.com/ArduPilot/ardupilot/actions/runs/12572260649/job/35044375709?pr=28857

Well, that's coincidence... third is presumed to be enemy action...

I've had a quick look through the stack and code changes and haven't found any smoking gun. @robertlong13 changed things up in this test a bit recently be adding in the ICE frames. Is it possible that the slightly different mission is causing this? 3982d57

@tpwrules
Copy link
Contributor

tpwrules commented Jan 2, 2025

So this both fixes the problem and doesn't. The problem is fixed because we won't execute a square root with a negative number; we now trust that the limit function says it wanted to limit the vector and then set accel_dir to 0 instead of calling safe_sqrt.

But it doesn't fix the problem because the vector is still in some sense too long. Calling accel_cross.limit_length_squared(accel_max) again will still return true. The test (len_squared <= max_length_squared) is still false, but len == max_length so x and y get multiplied by precisely one and the vector doesn't change at all despite the function returning that it tried to change it.

I think adding a MAX(0, <old value>) to the safe_sqrt now is the best solution to stop CI flapping, then do some more detailed studies of who uses limit_length and what are the problems for going over.

The current fix is sort of worse because now limit_length_squared is lying. It said it limited the vector but didn't. I haven't been able to prove limit_length can't lie, it's not trivial to make it do so, but I suspect it's possible with subnormals or something.

Is it possible that the slightly different mission is causing this?

Yes, I expect so. It just happens to sometimes hit this exact point now. Here's the enemy action: https://github.com/ArduPilot/ardupilot/actions/runs/12581405966/job/35065761383?pr=28991 . The numbers are again identical.

@tpwrules
Copy link
Contributor

tpwrules commented Jan 4, 2025

Another failure: https://github.com/ArduPilot/ardupilot/actions/runs/12591859248/job/35097592628?pr=28996

Going to try to replicate locally and see if that mission change was the instigator.

Still prefer the MAX fix.

@tpwrules
Copy link
Contributor

tpwrules commented Jan 4, 2025

I was able to replicate it (running only the FlyEachFrame test and quadplane-tilt frame) but sometimes it took 50-60 tries and sometimes it was one or two. It would be hard to say what particular commit did it. I imagine it's kind of dependent on system load and stuff too.

@tpwrules
Copy link
Contributor

tpwrules commented Jan 5, 2025

Now it's infected copter: https://github.com/ArduPilot/ardupilot/actions/runs/12616584539/job/35158079748?pr=28857

accel_cross = {x = 290.641205, y = -485.89682}, accel_max = 566.187256

@peterbarker
Copy link
Contributor Author

Now it's infected copter: https://github.com/ArduPilot/ardupilot/actions/runs/12616584539/job/35158079748?pr=28857

We're probably going to need fire by the end of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants