roachtest/mixedversion: upgrade plan should be bounded #138014
Comments
cc @cockroachdb/test-eng
If the `MVT_DRY_RUN_MODE` env. var. is set, print the mixedversion test plan and exit.

Resolves: cockroachdb#138014
Epic: none
Release note: None
Previously, the mixedversion framework did not bound the total number of steps in a test plan. Since steps are generated according to different pseudo-random distributions, the total number of resulting steps can vary significantly. E.g., for `tpcc/mixed-headroom/n5cpu16`, the smallest test plan has 14 steps, whereas the largest, based on a sampling of 1_000_000 valid test plans, has 135 steps!

Running time grows with the size of the test plan, so high variability in plan size translates into high variability in running time. Thus, a very large test plan can cause a test to time out by exceeding its max running time. That is the case for `tpcc/mixed-headroom/n5cpu16`.

This PR adds an option, namely `MaxNumPlanSteps`, which enforces an upper bound. If a generated test plan exceeds the specified value, a new one is generated until the resulting length is within the bound.

This PR also adds a primitive `dry-run` mode, which can be useful for debugging test plans. If the `MVT_DRY_RUN_MODE` env. var. is set, print the mixedversion test plan and exit.

Resolves: cockroachdb#138014
Informs: cockroachdb#137332
Epic: none
Release note: None
Zooming in, here are the top-5 smallest and largest plan lengths,
|
The smallest (possible) plan is
The largest (sampled) plan is
|
From the CDF plot, the upper-bound of Out of 10 full runs, the plan exceeded the upper-bound (i.e.,
We can see that
|
Currently, the mixedversion framework planner doesn't bound the (total) number of steps. This can lead to flakes, wherein an otherwise stable test like `tpcc/mixed-headroom/n5cpu16` can exceed its execution time limit, e.g., `7h` in case of [1].

Indeed, as we illustrate below, the length of an execution plan can vary quite a bit. Thus, a conservative default, e.g., `50`, would enforce an upper bound on the resulting plan, preventing timeouts due to excessive running time.

[1] #137332 (comment)
Jira issue: CRDB-45848