-
-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace known-problematic variance algorithm (for column_variance
)
#1320
base: main
Are you sure you want to change the base?
Conversation
The existing algorithm for `column_variance` uses the textbook formula (`E[X^2]` - E[X]^2), which is well-established to have numerical issues. While the intention (traversal of the elements in column-major order) of the extant algorithm is apparent, we should not sacrifice precision when we do not need to -- the two-pass algorithm for variance (N.B. the existing algorithm is already a two-pass algorithm, anyway) using the formula `E[(x - E[x])(x - E[x]])` can be substituted without issue. Notably, the other variance implementations in the `statistics` module use `E[(x -E[x])(x - E[x]])`. Loss of precision aside, keeping the existing implementation of `column_variance` causes the obvious absurdity: ```rust use nalgebra::Matrix2x3; let m = Matrix2x3::new(1.0, 2.0, 3.0, 4.0, 5.0, 6.0); assert_ne!(m.column_variance().transpose(), m.transpose().row_variance()); ``` We can eliminate both the loss of precision the glaring inconsistency by switching to the implementation provided by this PR. For a comprehensive analysis of variance algorithms, see this [reference](https://ds.ifi.uni-heidelberg.de/files/Team/eschubert/publications/SSDBM18-covariance-authorcopy.pdf), in particular, Table 2. The "two-pass" described in the paper is the implementation given in this PR. In terms of simplicity (hence, easier to maintain), "two-pass" is a suitable choice; in terms of runtime performance and precision, it is a good balance (c.f. Youngs & Cramer and "textbook"). Furthermore, it is consistent with the variance algorithm used in the other "*variance" algorithms in the `statistics` module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! We might be able to use compress_columns
here, to make this effectively just a clone of the implementation of variance
-- however I think that might introduce unnecessary column clones when dealing with an unconstrained scalar type and dynamic-sized columns, so I think this element-wise approach makes sense.
I considered the |
Stats aren't really my forte, but looks good to me. However, I think if there's an improvement in precision, it would be good to actually try to verify this with a test that failed before and passes now. I'm also concerned about the general lack of tests for this method. There appears only to be the single doc test in its documentation. This is not the fault of @andrewjradcliffe, of course, but it would be nice to have a few more tests that ensure that we're not introducing any new bugs here. |
Indeed, it would be nice to have test coverage for the statistics-related methods -- though, computation of variance is really the only place mistakes can occur. I take it the proposal is for me to write a set of tests which fail under the current algorithm -- due to loss of precision -- but succeed with the new? |
Well, I think ideally we would already have tests for:
Unfortunately we don't have this at the moment, so we can only hope that the existing implementation works correctly for these cases. However, I'm a little reluctant to merge a new implementation without these tests, at the very least to make sure we don't regress on any of these cases. Would you be able to add these tests?
I think, if the claim is that the new implementation has higher precision, it would indeed be prudent to somehow try to verify that it actually delivers this — at least for one test case. |
The existing algorithm for
column_variance
uses the textbook formula (E[X^2]
- E[X]^2), which is well-established to have numerical issues. While the intention (traversalof the elements in column-major order) of the extant algorithm is apparent, we should not sacrifice precision when we do not need to -- the two-pass algorithm for variance (N.B. the existing algorithm is already a two-pass algorithm, anyway) using the formula
E[(x - E[x])(x - E[x]])
can be substituted without issue.Notably, the other variance implementations in the
statistics
module useE[(x -E[x])(x - E[x]])
. Loss of precision aside, keeping the existing implementation ofcolumn_variance
causes the obvious absurdity:We can eliminate both the loss of precision the glaring inconsistency by switching to the implementation provided by this PR.
For a comprehensive analysis of variance algorithms, see this reference, in particular, Table 2. The "two-pass" described in the paper is the implementation given in this PR. In terms of simplicity (hence, easier to maintain), "two-pass" is a suitable choice; in terms of runtime performance and precision, it is a good balance (c.f. Youngs & Cramer and "textbook"). Furthermore, it is consistent with the variance algorithm used in the other "*variance" algorithms in the
statistics
module.