
roachtest/mixedversion: upgrade plan should be bounded #138014

Open
srosenberg opened this issue Dec 26, 2024 · 5 comments · May be fixed by #137963

srosenberg commented Dec 26, 2024

Currently, the mixedversion framework planner doesn't bound the (total) number of steps in a generated test plan. This can lead to flakes, wherein an otherwise stable test like tpcc/mixed-headroom/n5cpu16 exceeds its execution time limit (e.g., 7h in the case of [1]).
Indeed, as illustrated below, the length of an execution plan can vary quite a bit. Thus, a conservative default, e.g., 50, would enforce an upper bound on the resulting plan and prevent timeouts due to excessive running time.

[1] #137332 (comment)

Jira issue: CRDB-45848

@srosenberg srosenberg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-testeng TestEng Team labels Dec 26, 2024

blathers-crl bot commented Dec 26, 2024

cc @cockroachdb/test-eng

srosenberg added a commit to srosenberg/cockroach that referenced this issue Dec 28, 2024
If `MVT_DRY_RUN_MODE` env. var. is set, print
the mixedversion test plan and exit.

Resolves: cockroachdb#138014
Epic: none
Release note: None
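
For illustration, a minimal, self-contained sketch of the dry-run gate described in the commit above; `printPlan` is a hypothetical stand-in for the framework's plan pretty-printer, not the actual mixedversion API:

```go
package main

import (
	"fmt"
	"os"
)

// printPlan stands in for the mixedversion planner's pretty-printer.
func printPlan() {
	fmt.Println(`Plan:
├── start cluster at version "v24.3.0" (1)
└── ...`)
}

func main() {
	// Dry-run gate: if the env var is set, print the generated test plan
	// and exit without running it.
	if os.Getenv("MVT_DRY_RUN_MODE") != "" {
		printPlan()
		return
	}
	fmt.Println("running the mixed-version test ...")
}
```
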
srosenberg added a commit to srosenberg/cockroach that referenced this issue Dec 29, 2024
Previously, the mixedversion framework did not bound
the total number of steps in a test plan. Since steps
are generated according to different pseudo-random
distributions, the total number of resulting steps
can vary significantly.
E.g., for `tpcc/mixed-headroom/n5cpu16`, the smallest
test plan has 14 steps, whereas the largest, based
on a sampling of 1_000_000 valid test plans, has
135 steps!

The running time of a test is directly proportional
to the size of its test plan, so high variability in
plan size translates directly into high variability
in running time. Thus, a very large test plan can
cause a test to time out by exceeding its max running
time. That is the case for `tpcc/mixed-headroom/n5cpu16`.

This PR adds an option, namely `MaxNumPlanSteps`,
which enforces an upper bound. If a generated
test plan exceeds the specified value, a new
one is generated until the resulting length
is within the specification.

This PR also adds a primitive `dry-run` mode,
which can be useful for debugging test plans.
If `MVT_DRY_RUN_MODE` env. var. is set, print
the mixedversion test plan and exit.

Resolves: cockroachdb#138014
Informs: cockroachdb#137332
Epic: none
Release note: None
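
For illustration, a minimal, self-contained sketch of the bounding behavior described in the commit message; `generatePlanLength` and `boundedPlanLength` are hypothetical stand-ins, not the actual mixedversion planner API:

```go
package main

import (
	"fmt"
	"math/rand"
)

// generatePlanLength stands in for the pseudo-random planner; the observed
// range for tpcc/mixed-headroom/n5cpu16 was 14 to 135 steps.
func generatePlanLength(rng *rand.Rand) int {
	return 14 + rng.Intn(122)
}

// boundedPlanLength regenerates the plan until it is no longer than
// maxNumPlanSteps, mirroring the MaxNumPlanSteps option described above.
func boundedPlanLength(rng *rand.Rand, maxNumPlanSteps int) (length, retries int) {
	for {
		length = generatePlanLength(rng)
		if length <= maxNumPlanSteps {
			return length, retries
		}
		retries++
	}
}

func main() {
	rng := rand.New(rand.NewSource(42))
	length, retries := boundedPlanLength(rng, 70)
	if retries > 0 {
		fmt.Printf("generated a smaller (%d) test plan after %d retries\n", length, retries)
	}
	fmt.Println("plan length:", length)
}
```
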
@srosenberg

Below we can see the distribution of all plan lengths after 1_000_000 runs,

export MVT_DRY_RUN=true
roachtest run --count 1000000 --local --cockroach `pwd`/cockroach --workload `pwd`/bin/workload 'tpcc/mixed-headroom/n5cpu16'
[CDF plot of plan lengths]

@srosenberg

Zooming in, here are the top-5 smallest and largest plan lengths,

grep -B 1 "separator" /tmp/out |awk '{print $NF}' |tr -d ")" |tr -d "(" |grep -E "[0-9]+"|sort -u |sort -n |head -5
14
15
16
17
18
grep -B 1 "separator" /tmp/out |awk '{print $NF}' |tr -d ")" |tr -d "(" |grep -E "[0-9]+"|sort -u |sort -n |tail -5
131
132
133
134
135

@srosenberg

The smallest (possible) plan is 14 steps,

Seed:               -4434193504661377170
Upgrades:           v24.3.0 → <current>
Deployment mode:    system-only
Plan:
├── start cluster at version "v24.3.0" (1)
├── wait for all nodes (:1-4) to acknowledge cluster version '24.3' on system tenant (2)
├── run startup hooks concurrently
│   ├── run "maybe enable tenant features", after 3s delay (3)
│   ├── run "load TPCC dataset", after 0s delay (4)
│   └── run "load bank dataset", after 50ms delay (5)
└── upgrade cluster from "v24.3.0" to "<current>"
   ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (6)
   ├── upgrade nodes :1-4 from "v24.3.0" to "<current>"
   │   ├── restart node 2 with binary version <current> (7)
   │   ├── run "TPCC workload" (8)
   │   ├── restart node 1 with binary version <current> (9)
   │   ├── restart node 4 with binary version <current> (10)
   │   └── restart node 3 with binary version <current> (11)
   ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (12)
   ├── wait for all nodes (:1-4) to acknowledge cluster version <current> on system tenant (13)
   └── run "check TPCC workload" (14)

The largest (sampled) plan is 135 steps,

Seed:               -4434193504661377170
Upgrades:           v23.2.17 → v24.1.5 → v24.3.1 → <current>
Deployment mode:    separate-process
Mutators:           cluster_setting[kv.expiration_leases_only.enabled], cluster_setting[storage.ingest_split.enabled], cluster_setting[kv.snapshot_receiver.excise.enabled], cluster_setting[storage.sstable.compression_algorithm]
Plan:
├── install fixtures for version "v23.2.17" (1)
├── start cluster at version "v23.2.17" (2)
├── wait for all nodes (:1-4) to acknowledge cluster version '23.2' on system tenant (3)
├── start separate process virtual cluster mixed-version-tenant-gnpkm with binary version v23.2.17 (4)
├── wait for all nodes (:1-4) to acknowledge cluster version '23.2' on mixed-version-tenant-gnpkm tenant (5)
├── set cluster setting "spanconfig.tenant_limit" to '50000' on mixed-version-tenant-gnpkm tenant (6)
├── disable KV and tenant(SQL) rate limiter on mixed-version-tenant-gnpkm tenant (7)
├── set cluster setting "server.secondary_tenants.authorization.mode" to 'allow-all' on system tenant (8)
├── delete all-tenants override for the `version` key (9)
├── run startup hooks concurrently
│   ├── run "maybe enable tenant features", after 500ms delay (10)
│   ├── run "load TPCC dataset", after 50ms delay (11)
│   └── run "load bank dataset", after 500ms delay (12)
├── upgrade cluster from "v23.2.17" to "v24.1.5"
│   ├── upgrade storage cluster
│   │   ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (13)
│   │   ├── upgrade nodes :1-4 from "v23.2.17" to "v24.1.5"
│   │   │   ├── restart system server on node 4 with binary version v24.1.5 (14)
│   │   │   ├── run "TPCC workload" (15)
│   │   │   ├── restart system server on node 1 with binary version v24.1.5 (16)
│   │   │   ├── restart system server on node 3 with binary version v24.1.5 (17)
│   │   │   └── restart system server on node 2 with binary version v24.1.5 (18)
│   │   ├── downgrade nodes :1-4 from "v24.1.5" to "v23.2.17"
│   │   │   ├── restart system server on node 4 with binary version v23.2.17 (19)
│   │   │   ├── restart system server on node 1 with binary version v23.2.17 (20)
│   │   │   ├── restart system server on node 2 with binary version v23.2.17 (21)
│   │   │   ├── run "TPCC workload" (22)
│   │   │   └── restart system server on node 3 with binary version v23.2.17 (23)
│   │   ├── upgrade nodes :1-4 from "v23.2.17" to "v24.1.5"
│   │   │   ├── restart system server on node 3 with binary version v24.1.5 (24)
│   │   │   ├── restart system server on node 4 with binary version v24.1.5 (25)
│   │   │   ├── set cluster setting "storage.sstable.compression_algorithm" to 'zstd' on system tenant (26)
│   │   │   ├── run "TPCC workload" (27)
│   │   │   ├── restart system server on node 1 with binary version v24.1.5 (28)
│   │   │   └── restart system server on node 2 with binary version v24.1.5 (29)
│   │   ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (30)
│   │   ├── set cluster setting "storage.ingest_split.enabled" to 'false' on system tenant (31)
│   │   ├── run "TPCC workload" (32)
│   │   ├── wait for all nodes (:1-4) to acknowledge cluster version '24.1' on system tenant (33)
│   │   ├── run following steps concurrently
│   │   │   ├── set cluster setting "kv.expiration_leases_only.enabled" to 'false' on system tenant, after 10ms delay (34)
│   │   │   └── reset cluster setting "storage.ingest_split.enabled" on system tenant, after 50ms delay (35)
│   │   └── run "check TPCC workload" (36)
│   └── upgrade tenant
│      ├── upgrade nodes :1-4 from "v23.2.17" to "v24.1.5"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.1.5 (37)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.1.5 (38)
│      │   ├── run "TPCC workload" (39)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.1.5 (40)
│      │   └── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.1.5 (41)
│      ├── downgrade nodes :1-4 from "v24.1.5" to "v23.2.17"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v23.2.17 (42)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v23.2.17 (43)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v23.2.17 (44)
│      │   └── restart mixed-version-tenant-gnpkm server on node 1 with binary version v23.2.17 (45)
│      ├── upgrade nodes :1-4 from "v23.2.17" to "v24.1.5"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.1.5 (46)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.1.5 (47)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.1.5 (48)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.1.5 (49)
│      │   └── run "TPCC workload" (50)
│      ├── run following steps concurrently
│      │   ├── set `version` to '24.1' on mixed-version-tenant-gnpkm tenant, after 500ms delay (51)
│      │   └── run "TPCC workload", after 10ms delay (52)
│      ├── wait for all nodes (:1-4) to acknowledge cluster version '24.1' on mixed-version-tenant-gnpkm tenant (53)
│      └── run "check TPCC workload" (54)
├── upgrade cluster from "v24.1.5" to "v24.3.1"
│   ├── upgrade storage cluster
│   │   ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (55)
│   │   ├── upgrade nodes :1-4 from "v24.1.5" to "v24.3.1"
│   │   │   ├── restart system server on node 3 with binary version v24.3.1 (56)
│   │   │   ├── run following steps concurrently
│   │   │   │   ├── run "TPCC workload", after 500ms delay (57)
│   │   │   │   ├── set cluster setting "kv.snapshot_receiver.excise.enabled" to 'false' on system tenant, after 500ms delay (58)
│   │   │   │   └── set cluster setting "kv.expiration_leases_only.enabled" to 'true' on system tenant, after 18s delay (59)
│   │   │   ├── restart system server on node 2 with binary version v24.3.1 (60)
│   │   │   ├── restart system server on node 1 with binary version v24.3.1 (61)
│   │   │   └── restart system server on node 4 with binary version v24.3.1 (62)
│   │   ├── downgrade nodes :1-4 from "v24.3.1" to "v24.1.5"
│   │   │   ├── restart system server on node 2 with binary version v24.1.5 (63)
│   │   │   ├── run "TPCC workload" (64)
│   │   │   ├── restart system server on node 1 with binary version v24.1.5 (65)
│   │   │   ├── restart system server on node 3 with binary version v24.1.5 (66)
│   │   │   └── restart system server on node 4 with binary version v24.1.5 (67)
│   │   ├── upgrade nodes :1-4 from "v24.1.5" to "v24.3.1"
│   │   │   ├── restart system server on node 2 with binary version v24.3.1 (68)
│   │   │   ├── restart system server on node 4 with binary version v24.3.1 (69)
│   │   │   ├── restart system server on node 3 with binary version v24.3.1 (70)
│   │   │   ├── restart system server on node 1 with binary version v24.3.1 (71)
│   │   │   └── run "TPCC workload" (72)
│   │   ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (73)
│   │   ├── run "TPCC workload" (74)
│   │   ├── wait for all nodes (:1-4) to acknowledge cluster version '24.3' on system tenant (75)
│   │   ├── set cluster setting "storage.sstable.compression_algorithm" to 'snappy' on system tenant (76)
│   │   └── run "check TPCC workload" (77)
│   └── upgrade tenant
│      ├── upgrade nodes :1-4 from "v24.1.5" to "v24.3.1"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.3.1 (78)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.3.1 (79)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.3.1 (80)
│      │   ├── run "TPCC workload" (81)
│      │   └── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.3.1 (82)
│      ├── downgrade nodes :1-4 from "v24.3.1" to "v24.1.5"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.1.5 (83)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.1.5 (84)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.1.5 (85)
│      │   └── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.1.5 (86)
│      ├── upgrade nodes :1-4 from "v24.1.5" to "v24.3.1"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.3.1 (87)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.3.1 (88)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.3.1 (89)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.3.1 (90)
│      │   └── run "TPCC workload" (91)
│      ├── run following steps concurrently
│      │   ├── set `version` to '24.3' on mixed-version-tenant-gnpkm tenant, after 18s delay (92)
│      │   └── run "TPCC workload", after 18s delay (93)
│      ├── wait for all nodes (:1-4) to acknowledge cluster version '24.3' on mixed-version-tenant-gnpkm tenant (94)
│      └── run "check TPCC workload" (95)
└── upgrade cluster from "v24.3.1" to "<current>"
   ├── upgrade storage cluster
   │   ├── reset cluster setting "kv.expiration_leases_only.enabled" on system tenant (96)
   │   ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (97)
   │   ├── upgrade nodes :1-4 from "v24.3.1" to "<current>"
   │   │   ├── restart system server on node 4 with binary version <current> (98)
   │   │   ├── restart system server on node 1 with binary version <current> (99)
   │   │   ├── run "TPCC workload" (100)
   │   │   ├── restart system server on node 3 with binary version <current> (101)
   │   │   └── restart system server on node 2 with binary version <current> (102)
   │   ├── downgrade nodes :1-4 from "<current>" to "v24.3.1"
   │   │   ├── restart system server on node 3 with binary version v24.3.1 (103)
   │   │   ├── restart system server on node 1 with binary version v24.3.1 (104)
   │   │   ├── run "TPCC workload" (105)
   │   │   ├── restart system server on node 4 with binary version v24.3.1 (106)
   │   │   └── restart system server on node 2 with binary version v24.3.1 (107)
   │   ├── upgrade nodes :1-4 from "v24.3.1" to "<current>"
   │   │   ├── restart system server on node 1 with binary version <current> (108)
   │   │   ├── restart system server on node 3 with binary version <current> (109)
   │   │   ├── restart system server on node 2 with binary version <current> (110)
   │   │   ├── restart system server on node 4 with binary version <current> (111)
   │   │   └── run "TPCC workload" (112)
   │   ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (113)
   │   ├── reset cluster setting "storage.sstable.compression_algorithm" on system tenant (114)
   │   ├── wait for all nodes (:1-4) to acknowledge cluster version <current> on system tenant (115)
   │   └── run "check TPCC workload" (116)
   └── upgrade tenant
      ├── upgrade nodes :1-4 from "v24.3.1" to "<current>"
      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version <current> (117)
      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version <current> (118)
      │   ├── run "TPCC workload" (119)
      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version <current> (120)
      │   └── restart mixed-version-tenant-gnpkm server on node 4 with binary version <current> (121)
      ├── downgrade nodes :1-4 from "<current>" to "v24.3.1"
      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.3.1 (122)
      │   ├── run "TPCC workload" (123)
      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.3.1 (124)
      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.3.1 (125)
      │   └── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.3.1 (126)
      ├── upgrade nodes :1-4 from "v24.3.1" to "<current>"
      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version <current> (127)
      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version <current> (128)
      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version <current> (129)
      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version <current> (130)
      │   └── run "TPCC workload" (131)
      ├── run following steps concurrently
      │   ├── set `version` to <current> on mixed-version-tenant-gnpkm tenant, after 0s delay (132)
      │   └── run "TPCC workload", after 18s delay (133)
      ├── wait for all nodes (:1-4) to acknowledge cluster version <current> on mixed-version-tenant-gnpkm tenant (134)
      └── run "check TPCC workload" (135)

@srosenberg

From the CDF plot, an upper bound of 70 covers roughly 80% of all plan lengths, so it seems like a reasonable limit. The expected number of retries, when a generated plan exceeds the bound, is fairly small, somewhere between 1 and 3.
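
As a rough sanity check on that estimate, a minimal sketch under the assumption (taken from the CDF above) that roughly 80% of generated plans fit within the bound:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Assumption, read off the CDF above: ~80% of generated plans fit
	// within the bound of 70 steps.
	p := 0.8

	// Once an oversized plan is generated, each regeneration independently
	// succeeds with probability p, so the number of attempts is geometric:
	// 1/p attempts on average, and P(more than k retries) = (1-p)^k.
	fmt.Printf("expected attempts after an oversized plan: %.2f\n", 1/p)
	for k := 1; k <= 3; k++ {
		fmt.Printf("P(more than %d retries) = %.4f\n", k, math.Pow(1-p, k))
	}
}
```
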

In 4 out of 10 full runs, the plan exceeded the upper bound (i.e., 70) and was regenerated, ending up with a smaller plan,

 grep WARN artifacts/tpcc/mixed-headroom/n5cpu16/run_?/test.log
artifacts/tpcc/mixed-headroom/n5cpu16/run_1/test.log:[mixed-version-test] 2024/12/30 21:18:27 mixedversion.go:872: WARNING: generated a smaller (65) test plan after 1 retries
artifacts/tpcc/mixed-headroom/n5cpu16/run_2/test.log:[mixed-version-test] 2024/12/30 21:18:38 mixedversion.go:872: WARNING: generated a smaller (40) test plan after 2 retries
artifacts/tpcc/mixed-headroom/n5cpu16/run_3/test.log:[mixed-version-test] 2024/12/30 21:18:17 mixedversion.go:872: WARNING: generated a smaller (63) test plan after 1 retries
artifacts/tpcc/mixed-headroom/n5cpu16/run_6/test.log:[mixed-version-test] 2024/12/30 23:31:06 mixedversion.go:872: WARNING: generated a smaller (50) test plan after 1 retries

We can see that the 65-step plan resulted in the longest running time, namely 23024.43s, while the 40-step plan resulted in the shortest, namely 7322.53s.

artifacts/tpcc/mixed-headroom/n5cpu16/run_1/test.log:2024/12/31 03:42:16 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#1 (23024.43s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_2/test.log:2024/12/30 23:20:45 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#2 (7322.53s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_5/test.log:2024/12/31 02:09:35 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#5 (10091.87s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_6/test.log:2024/12/31 03:41:27 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#6 (15016.34s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_7/test.log:2024/12/31 03:32:04 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#7 (11087.49s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_8/test.log:2024/12/31 06:01:30 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#8 (13877.50s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_9/test.log:2024/12/31 06:53:45 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#9 (12065.47s)
