AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

peterbarker · 2025-01-01T10:53:26Z

Co-authored-by: Leonard Hall [email protected]

the cross-product code can produce something slightly negative. Stop passing that into safe_sqrt; it isn't that safe on SITL!

khancyr · 2025-01-01T13:06:44Z

Shouldn't be this done in safe_sqrt ? I would assume that we have some other place that can have this issue

peterbarker · 2025-01-01T14:07:04Z

Shouldn't be this done in safe_sqrt ? I would assume that we have some other place that can have this issue

Well... we really want to know about where this actually does happen. This PR isn't really a satisfactory solution to what we're seeing IMO - it would be nice to trace back the maths to work out how the rhs ens up larger than the lhs in here. #28969 is the issue if you missed it

IamPete1 · 2025-01-01T17:48:37Z

I made a Geogebra to visualise the maths.

https://www.geogebra.org/calculator/ktxy2uyf

I don't see any issue with the maths. Must be floating point stuff.

If I were to guess it would be vel.is_zero(). It checks the individual X and Y components are not zero. This does not mean that the length is none zero (when using floats). Squaring a very small number which is not zero might result in a number which is zero. So you get zeros before the addition and square root that would make the number larger again.

Maybe we need a new method length_is_zero we could compare components against sqrt(FLT_EPSILON) rather than FLT_EPSILON.

tpwrules · 2025-01-01T18:37:14Z

Pete is close, the problem is imprecision in Vector2f.limit_length() on line 387.

The vector is (99.9791489, -557.290039) and the length limit is 566.187256. length() of the vector is 566.187256 so limit_length() doesn't do anything and the safe_sqrt runs. But length_squared() is 320568.031250 while accel_max*accel_max is 320568.000000, a change of the least significant bit. The difference in these quantities is exactly the value that was passed to safe_sqrt. Lucky shot!

We could implement a MAX with 0, that seems okay
We could adjust the safe_sqrt call to not square accel_max, but that would require an extra square root
We could adjust limit_length to check the squared length, or have an alternate method for where we care about the squared limit

I'm not sure how to guarantee the vector length quantity is actually <= the limit after the math in limit_length() no matter how it's done. I am guessing that could have errors too if they are very close. Anybody have any idea what rounding mode we use?

tpwrules · 2025-01-01T20:18:29Z

Doing some research and fiddling suggests this is the correct solution. We need to audit other uses of limit_length() to make sure they won't explode if the vector ends up slightly over the limit.

It's not feasible due to the vagaries of floating point math to guarantee it's always equal or under.

tpwrules · 2025-01-01T20:20:08Z

libraries/AP_Math/control.cpp

@@ -387,7 +387,7 @@ bool limit_accel_xy(const Vector2f& vel, Vector2f& accel, float accel_max)
            if (accel_cross.limit_length(accel_max)) {
                accel_dir = 0.0;
            } else {
-                float accel_max_dir = safe_sqrt(sq(accel_max) - accel_cross.length_squared());
+                float accel_max_dir = safe_sqrt(MAX(0, sq(accel_max) - accel_cross.length_squared()));


Suggested change

float accel_max_dir = safe_sqrt(MAX(0, sq(accel_max) - accel_cross.length_squared()));

// limit_length can't absolutely guarantee this subtraction won't be slightly negative

float accel_max_dir = safe_sqrt(MAX(0, sq(accel_max) - accel_cross.length_squared()));

IamPete1 · 2025-01-01T21:12:27Z

We could adjust limit_length to check the squared length, or have an alternate method for where we care about the squared limit

We could limit to the smaller of the the existing and the squared so we always conservative.

On the whole I tend to think know we know what is going on we probably don't need to panic. I'd even be tempted not to fix this case so we can see if a recent change had made this more likely.

We do have a test for the function, maybe we should add some cases there.

ardupilot/libraries/AP_Math/tests/test_vector2.cpp

Lines 125 to 131 in 5726c03

    
           TEST(Vector2Test, length) 
        
           { 
        
               EXPECT_FLOAT_EQ(25, Vector2f(3, 4).length_squared()); 
        
               Vector2f v_float1(1.0f, 1.0f); 
        
               EXPECT_TRUE(v_float1.limit_length(1.0f)); 
        
               EXPECT_FALSE(Vector2f(-0.0f, 0.0f).limit_length(1.0f)); 
        
           }

peterbarker · 2025-01-01T21:22:17Z

Closes #28969

tpwrules · 2025-01-01T22:36:50Z

I don't actually think limit_length() can be entirely fixed unfortunately. I think it's better to just be mindful of its limitations.

And it has happened again here: https://github.com/ArduPilot/ardupilot/actions/runs/12572260649/job/35044375709?pr=28857

Co-authored-by: Leonard Hall <[email protected]> the cross-product code can produce something slightly negative. Stop passing that into safe_sqrt; it isn't *that* safe on SITL!

…r case

in this corner case we *do* need to scale the vector

this can avoid problems where floating point precision means that post-square-rooting the lengths are considered equal. Subtracting the squares after making such a check can lead to attempting to take the square root of a negative number

peterbarker · 2025-01-02T05:28:15Z

I've written a test for this now so other people can play with things in gdb easily.

I've also pushed up a patch reverting the original fix and proposing an alternative fix. Basically stop using the apples comparison to determine if we do the oranges maths - use a single fruit in both cases.

peterbarker · 2025-01-02T05:41:44Z

And it has happened again here: https://github.com/ArduPilot/ardupilot/actions/runs/12572260649/job/35044375709?pr=28857

Well, that's coincidence... third is presumed to be enemy action...

I've had a quick look through the stack and code changes and haven't found any smoking gun. @robertlong13 changed things up in this test a bit recently be adding in the ICE frames. Is it possible that the slightly different mission is causing this? 3982d57

tpwrules · 2025-01-02T16:51:56Z

So this both fixes the problem and doesn't. The problem is fixed because we won't execute a square root with a negative number; we now trust that the limit function says it wanted to limit the vector and then set accel_dir to 0 instead of calling safe_sqrt.

But it doesn't fix the problem because the vector is still in some sense too long. Calling accel_cross.limit_length_squared(accel_max) again will still return true. The test (len_squared <= max_length_squared) is still false, but len == max_length so x and y get multiplied by precisely one and the vector doesn't change at all despite the function returning that it tried to change it.

I think adding a MAX(0, <old value>) to the safe_sqrt now is the best solution to stop CI flapping, then do some more detailed studies of who uses limit_length and what are the problems for going over.

The current fix is sort of worse because now limit_length_squared is lying. It said it limited the vector but didn't. I haven't been able to prove limit_length can't lie, it's not trivial to make it do so, but I suspect it's possible with subnormals or something.

Is it possible that the slightly different mission is causing this?

Yes, I expect so. It just happens to sometimes hit this exact point now. Here's the enemy action: https://github.com/ArduPilot/ardupilot/actions/runs/12581405966/job/35065761383?pr=28991 . The numbers are again identical.

tpwrules · 2025-01-04T05:06:26Z

Another failure: https://github.com/ArduPilot/ardupilot/actions/runs/12591859248/job/35097592628?pr=28996

Going to try to replicate locally and see if that mission change was the instigator.

Still prefer the MAX fix.

tpwrules · 2025-01-04T21:32:38Z

I was able to replicate it (running only the FlyEachFrame test and quadplane-tilt frame) but sometimes it took 50-60 tries and sometimes it was one or two. It would be hard to say what particular commit did it. I imagine it's kind of dependent on system load and stuff too.

tpwrules · 2025-01-05T05:41:52Z

Now it's infected copter: https://github.com/ArduPilot/ardupilot/actions/runs/12616584539/job/35158079748?pr=28857

accel_cross = {x = 290.641205, y = -485.89682}, accel_max = 566.187256

peterbarker · 2025-01-05T09:37:31Z

Now it's infected copter: https://github.com/ArduPilot/ardupilot/actions/runs/12616584539/job/35158079748?pr=28857

We're probably going to need fire by the end of this.

This comment was marked as outdated.

Sign in to view

tpwrules reviewed Jan 1, 2025

View reviewed changes

peterbarker added 2 commits January 2, 2025 15:36

AP_Math: add test for FPE in control code

c77bef7

AP_Math: prevent FPE in SITL when limitting accel, vectors

2ba0458

Co-authored-by: Leonard Hall <[email protected]> the cross-product code can produce something slightly negative. Stop passing that into safe_sqrt; it isn't *that* safe on SITL!

peterbarker added the WIP label Jan 2, 2025

peterbarker added 3 commits January 2, 2025 15:41

AP_Math: tests: we no longer expect and exception when limiting corne…

ea6ad30

…r case

AP_Math: tests: correct tested return value for limit_accel_xy

ba62e86

in this corner case we *do* need to scale the vector

AP_Math: create and use limit_length_squared

a73e086

this can avoid problems where floating point precision means that post-square-rooting the lengths are considered equal. Subtracting the squares after making such a check can lead to attempting to take the square root of a negative number

peterbarker force-pushed the pr/prevent-fpe-control branch from 6d764d3 to a73e086 Compare January 2, 2025 05:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

peterbarker commented Jan 1, 2025

khancyr commented Jan 1, 2025

peterbarker commented Jan 1, 2025

IamPete1 commented Jan 1, 2025

This comment was marked as outdated.

tpwrules commented Jan 1, 2025 •

edited

Loading

tpwrules commented Jan 1, 2025

tpwrules Jan 1, 2025

IamPete1 commented Jan 1, 2025

peterbarker commented Jan 1, 2025

tpwrules commented Jan 1, 2025

peterbarker commented Jan 2, 2025

peterbarker commented Jan 2, 2025

tpwrules commented Jan 2, 2025 •

edited

Loading

tpwrules commented Jan 4, 2025

tpwrules commented Jan 4, 2025

tpwrules commented Jan 5, 2025

peterbarker commented Jan 5, 2025

	float accel_max_dir = safe_sqrt(MAX(0, sq(accel_max) - accel_cross.length_squared()));
	// limit_length can't absolutely guarantee this subtraction won't be slightly negative
	float accel_max_dir = safe_sqrt(MAX(0, sq(accel_max) - accel_cross.length_squared()));

AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

Are you sure you want to change the base?

AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

Conversation

peterbarker commented Jan 1, 2025

khancyr commented Jan 1, 2025

peterbarker commented Jan 1, 2025

IamPete1 commented Jan 1, 2025

This comment was marked as outdated.

tpwrules commented Jan 1, 2025 • edited Loading

tpwrules commented Jan 1, 2025

tpwrules Jan 1, 2025

Choose a reason for hiding this comment

IamPete1 commented Jan 1, 2025

peterbarker commented Jan 1, 2025

tpwrules commented Jan 1, 2025

peterbarker commented Jan 2, 2025

peterbarker commented Jan 2, 2025

tpwrules commented Jan 2, 2025 • edited Loading

tpwrules commented Jan 4, 2025

tpwrules commented Jan 4, 2025

tpwrules commented Jan 5, 2025

peterbarker commented Jan 5, 2025

tpwrules commented Jan 1, 2025 •

edited

Loading

tpwrules commented Jan 2, 2025 •

edited

Loading