
roachtest/mixedversion: upgrade plan should be bounded #138014

Open
srosenberg opened this issue Dec 26, 2024 · 5 comments · May be fixed by #137963

srosenberg commented Dec 26, 2024

Currently, the mixedversion framework planner doesn't bound the (total) number of steps in a generated test plan. This can lead to flakes, wherein an otherwise stable test like tpcc/mixed-headroom/n5cpu16 exceeds its execution time limit (e.g., 7h in the case of [1]).
Indeed, as illustrated below, the length of an execution plan can vary quite a bit. Thus, a conservative default, e.g., 50, would enforce an upper bound on the resulting plan and prevent timeouts due to excessive running time.

[1] #137332 (comment)

Jira issue: CRDB-45848

@srosenberg srosenberg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-testeng TestEng Team labels Dec 26, 2024

blathers-crl bot commented Dec 26, 2024

cc @cockroachdb/test-eng

srosenberg added a commit to srosenberg/cockroach that referenced this issue Dec 28, 2024
If `MVT_DRY_RUN_MODE` env. var. is set, print
the mixedversion test plan and exit.

Resolves: cockroachdb#138014
Epic: none
Release note: None
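
For illustration, a minimal, self-contained sketch of the dry-run gate described in the commit above; `printPlan` is a hypothetical stand-in for the framework's plan pretty-printer, not the actual mixedversion API:

```go
package main

import (
	"fmt"
	"os"
)

// printPlan stands in for the mixedversion planner's pretty-printer.
func printPlan() {
	fmt.Println(`Plan:
├── start cluster at version "v24.3.0" (1)
└── ...`)
}

func main() {
	// Dry-run gate: if the env var is set, print the generated test plan
	// and exit without running it.
	if os.Getenv("MVT_DRY_RUN_MODE") != "" {
		printPlan()
		return
	}
	fmt.Println("running the mixed-version test ...")
}
```
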
srosenberg added a commit to srosenberg/cockroach that referenced this issue Dec 29, 2024
Previously, the mixedversion framework did not bound
the total number of steps in a test plan. Since steps
are generated according to different pseudo-random
distributions, the total number of resulting steps
can vary significantly.
E.g., for `tpcc/mixed-headroom/n5cpu16`, the smallest
test plan has 14 steps, whereas the largest, based
on a sampling of 1_000_000 valid test plans, has
135 steps!

The running time of a test is directly proportional
to the size of its test plan, so high variability in
plan size translates directly into high variability
in running time. Thus, a very large test plan can
cause a test to time out by exceeding its max running
time. That is the case for `tpcc/mixed-headroom/n5cpu16`.

This PR adds an option, namely `MaxNumPlanSteps`,
which enforces an upper bound. If a generated
test plan exceeds the specified value, a new
one is generated until the resulting length
is within the specification.

This PR also adds a primitive `dry-run` mode,
which can be useful for debugging test plans.
If `MVT_DRY_RUN_MODE` env. var. is set, print
the mixedversion test plan and exit.

Resolves: cockroachdb#138014
Informs: cockroachdb#137332
Epic: none
Release note: None
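
For illustration, a minimal, self-contained sketch of the bounding behavior described in the commit message; `generatePlanLength` and `boundedPlanLength` are hypothetical stand-ins, not the actual mixedversion planner API:

```go
package main

import (
	"fmt"
	"math/rand"
)

// generatePlanLength stands in for the pseudo-random planner; the observed
// range for tpcc/mixed-headroom/n5cpu16 was 14 to 135 steps.
func generatePlanLength(rng *rand.Rand) int {
	return 14 + rng.Intn(122)
}

// boundedPlanLength regenerates the plan until it is no longer than
// maxNumPlanSteps, mirroring the MaxNumPlanSteps option described above.
func boundedPlanLength(rng *rand.Rand, maxNumPlanSteps int) (length, retries int) {
	for {
		length = generatePlanLength(rng)
		if length <= maxNumPlanSteps {
			return length, retries
		}
		retries++
	}
}

func main() {
	rng := rand.New(rand.NewSource(42))
	length, retries := boundedPlanLength(rng, 70)
	if retries > 0 {
		fmt.Printf("generated a smaller (%d) test plan after %d retries\n", length, retries)
	}
	fmt.Println("plan length:", length)
}
```
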
@srosenberg

Below we can see the distribution of all plan lengths after 1_000_000 runs,

export MVT_DRY_RUN=true
roachtest run --count 1000000 --local --cockroach `pwd`/cockroach --workload `pwd`/bin/workload 'tpcc/mixed-headroom/n5cpu16'
[CDF plot of plan lengths]

@srosenberg

Zooming in, here are the top-5 smallest and largest plan lengths,

grep -B 1 "separator" /tmp/out |awk '{print $NF}' |tr -d ")" |tr -d "(" |grep -E "[0-9]+"|sort -u |sort -n |head -5
14
15
16
17
18
grep -B 1 "separator" /tmp/out |awk '{print $NF}' |tr -d ")" |tr -d "(" |grep -E "[0-9]+"|sort -u |sort -n |tail -5
131
132
133
134
135

@srosenberg

The smallest (possible) plan is 14 steps,

Seed:               -4434193504661377170
Upgrades:           v24.3.0 → <current>
Deployment mode:    system-only
Plan:
├── start cluster at version "v24.3.0" (1)
├── wait for all nodes (:1-4) to acknowledge cluster version '24.3' on system tenant (2)
├── run startup hooks concurrently
│   ├── run "maybe enable tenant features", after 3s delay (3)
│   ├── run "load TPCC dataset", after 0s delay (4)
│   └── run "load bank dataset", after 50ms delay (5)
└── upgrade cluster from "v24.3.0" to "<current>"
   ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (6)
   ├── upgrade nodes :1-4 from "v24.3.0" to "<current>"
   │   ├── restart node 2 with binary version <current> (7)
   │   ├── run "TPCC workload" (8)
   │   ├── restart node 1 with binary version <current> (9)
   │   ├── restart node 4 with binary version <current> (10)
   │   └── restart node 3 with binary version <current> (11)
   ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (12)
   ├── wait for all nodes (:1-4) to acknowledge cluster version <current> on system tenant (13)
   └── run "check TPCC workload" (14)

The largest (sampled) plan is 135 steps,

Seed:               -4434193504661377170
Upgrades:           v23.2.17 → v24.1.5 → v24.3.1 → <current>
Deployment mode:    separate-process
Mutators:           cluster_setting[kv.expiration_leases_only.enabled], cluster_setting[storage.ingest_split.enabled], cluster_setting[kv.snapshot_receiver.excise.enabled], cluster_setting[storage.sstable.compression_algorithm]
Plan:
├── install fixtures for version "v23.2.17" (1)
├── start cluster at version "v23.2.17" (2)
├── wait for all nodes (:1-4) to acknowledge cluster version '23.2' on system tenant (3)
├── start separate process virtual cluster mixed-version-tenant-gnpkm with binary version v23.2.17 (4)
├── wait for all nodes (:1-4) to acknowledge cluster version '23.2' on mixed-version-tenant-gnpkm tenant (5)
├── set cluster setting "spanconfig.tenant_limit" to '50000' on mixed-version-tenant-gnpkm tenant (6)
├── disable KV and tenant(SQL) rate limiter on mixed-version-tenant-gnpkm tenant (7)
├── set cluster setting "server.secondary_tenants.authorization.mode" to 'allow-all' on system tenant (8)
├── delete all-tenants override for the `version` key (9)
├── run startup hooks concurrently
│   ├── run "maybe enable tenant features", after 500ms delay (10)
│   ├── run "load TPCC dataset", after 50ms delay (11)
│   └── run "load bank dataset", after 500ms delay (12)
├── upgrade cluster from "v23.2.17" to "v24.1.5"
│   ├── upgrade storage cluster
│   │   ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (13)
│   │   ├── upgrade nodes :1-4 from "v23.2.17" to "v24.1.5"
│   │   │   ├── restart system server on node 4 with binary version v24.1.5 (14)
│   │   │   ├── run "TPCC workload" (15)
│   │   │   ├── restart system server on node 1 with binary version v24.1.5 (16)
│   │   │   ├── restart system server on node 3 with binary version v24.1.5 (17)
│   │   │   └── restart system server on node 2 with binary version v24.1.5 (18)
│   │   ├── downgrade nodes :1-4 from "v24.1.5" to "v23.2.17"
│   │   │   ├── restart system server on node 4 with binary version v23.2.17 (19)
│   │   │   ├── restart system server on node 1 with binary version v23.2.17 (20)
│   │   │   ├── restart system server on node 2 with binary version v23.2.17 (21)
│   │   │   ├── run "TPCC workload" (22)
│   │   │   └── restart system server on node 3 with binary version v23.2.17 (23)
│   │   ├── upgrade nodes :1-4 from "v23.2.17" to "v24.1.5"
│   │   │   ├── restart system server on node 3 with binary version v24.1.5 (24)
│   │   │   ├── restart system server on node 4 with binary version v24.1.5 (25)
│   │   │   ├── set cluster setting "storage.sstable.compression_algorithm" to 'zstd' on system tenant (26)
│   │   │   ├── run "TPCC workload" (27)
│   │   │   ├── restart system server on node 1 with binary version v24.1.5 (28)
│   │   │   └── restart system server on node 2 with binary version v24.1.5 (29)
│   │   ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (30)
│   │   ├── set cluster setting "storage.ingest_split.enabled" to 'false' on system tenant (31)
│   │   ├── run "TPCC workload" (32)
│   │   ├── wait for all nodes (:1-4) to acknowledge cluster version '24.1' on system tenant (33)
│   │   ├── run following steps concurrently
│   │   │   ├── set cluster setting "kv.expiration_leases_only.enabled" to 'false' on system tenant, after 10ms delay (34)
│   │   │   └── reset cluster setting "storage.ingest_split.enabled" on system tenant, after 50ms delay (35)
│   │   └── run "check TPCC workload" (36)
│   └── upgrade tenant
│      ├── upgrade nodes :1-4 from "v23.2.17" to "v24.1.5"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.1.5 (37)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.1.5 (38)
│      │   ├── run "TPCC workload" (39)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.1.5 (40)
│      │   └── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.1.5 (41)
│      ├── downgrade nodes :1-4 from "v24.1.5" to "v23.2.17"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v23.2.17 (42)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v23.2.17 (43)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v23.2.17 (44)
│      │   └── restart mixed-version-tenant-gnpkm server on node 1 with binary version v23.2.17 (45)
│      ├── upgrade nodes :1-4 from "v23.2.17" to "v24.1.5"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.1.5 (46)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.1.5 (47)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.1.5 (48)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.1.5 (49)
│      │   └── run "TPCC workload" (50)
│      ├── run following steps concurrently
│      │   ├── set `version` to '24.1' on mixed-version-tenant-gnpkm tenant, after 500ms delay (51)
│      │   └── run "TPCC workload", after 10ms delay (52)
│      ├── wait for all nodes (:1-4) to acknowledge cluster version '24.1' on mixed-version-tenant-gnpkm tenant (53)
│      └── run "check TPCC workload" (54)
├── upgrade cluster from "v24.1.5" to "v24.3.1"
│   ├── upgrade storage cluster
│   │   ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (55)
│   │   ├── upgrade nodes :1-4 from "v24.1.5" to "v24.3.1"
│   │   │   ├── restart system server on node 3 with binary version v24.3.1 (56)
│   │   │   ├── run following steps concurrently
│   │   │   │   ├── run "TPCC workload", after 500ms delay (57)
│   │   │   │   ├── set cluster setting "kv.snapshot_receiver.excise.enabled" to 'false' on system tenant, after 500ms delay (58)
│   │   │   │   └── set cluster setting "kv.expiration_leases_only.enabled" to 'true' on system tenant, after 18s delay (59)
│   │   │   ├── restart system server on node 2 with binary version v24.3.1 (60)
│   │   │   ├── restart system server on node 1 with binary version v24.3.1 (61)
│   │   │   └── restart system server on node 4 with binary version v24.3.1 (62)
│   │   ├── downgrade nodes :1-4 from "v24.3.1" to "v24.1.5"
│   │   │   ├── restart system server on node 2 with binary version v24.1.5 (63)
│   │   │   ├── run "TPCC workload" (64)
│   │   │   ├── restart system server on node 1 with binary version v24.1.5 (65)
│   │   │   ├── restart system server on node 3 with binary version v24.1.5 (66)
│   │   │   └── restart system server on node 4 with binary version v24.1.5 (67)
│   │   ├── upgrade nodes :1-4 from "v24.1.5" to "v24.3.1"
│   │   │   ├── restart system server on node 2 with binary version v24.3.1 (68)
│   │   │   ├── restart system server on node 4 with binary version v24.3.1 (69)
│   │   │   ├── restart system server on node 3 with binary version v24.3.1 (70)
│   │   │   ├── restart system server on node 1 with binary version v24.3.1 (71)
│   │   │   └── run "TPCC workload" (72)
│   │   ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (73)
│   │   ├── run "TPCC workload" (74)
│   │   ├── wait for all nodes (:1-4) to acknowledge cluster version '24.3' on system tenant (75)
│   │   ├── set cluster setting "storage.sstable.compression_algorithm" to 'snappy' on system tenant (76)
│   │   └── run "check TPCC workload" (77)
│   └── upgrade tenant
│      ├── upgrade nodes :1-4 from "v24.1.5" to "v24.3.1"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.3.1 (78)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.3.1 (79)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.3.1 (80)
│      │   ├── run "TPCC workload" (81)
│      │   └── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.3.1 (82)
│      ├── downgrade nodes :1-4 from "v24.3.1" to "v24.1.5"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.1.5 (83)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.1.5 (84)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.1.5 (85)
│      │   └── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.1.5 (86)
│      ├── upgrade nodes :1-4 from "v24.1.5" to "v24.3.1"
│      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.3.1 (87)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.3.1 (88)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.3.1 (89)
│      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.3.1 (90)
│      │   └── run "TPCC workload" (91)
│      ├── run following steps concurrently
│      │   ├── set `version` to '24.3' on mixed-version-tenant-gnpkm tenant, after 18s delay (92)
│      │   └── run "TPCC workload", after 18s delay (93)
│      ├── wait for all nodes (:1-4) to acknowledge cluster version '24.3' on mixed-version-tenant-gnpkm tenant (94)
│      └── run "check TPCC workload" (95)
└── upgrade cluster from "v24.3.1" to "<current>"
   ├── upgrade storage cluster
   │   ├── reset cluster setting "kv.expiration_leases_only.enabled" on system tenant (96)
   │   ├── prevent auto-upgrades on system tenant by setting `preserve_downgrade_option` (97)
   │   ├── upgrade nodes :1-4 from "v24.3.1" to "<current>"
   │   │   ├── restart system server on node 4 with binary version <current> (98)
   │   │   ├── restart system server on node 1 with binary version <current> (99)
   │   │   ├── run "TPCC workload" (100)
   │   │   ├── restart system server on node 3 with binary version <current> (101)
   │   │   └── restart system server on node 2 with binary version <current> (102)
   │   ├── downgrade nodes :1-4 from "<current>" to "v24.3.1"
   │   │   ├── restart system server on node 3 with binary version v24.3.1 (103)
   │   │   ├── restart system server on node 1 with binary version v24.3.1 (104)
   │   │   ├── run "TPCC workload" (105)
   │   │   ├── restart system server on node 4 with binary version v24.3.1 (106)
   │   │   └── restart system server on node 2 with binary version v24.3.1 (107)
   │   ├── upgrade nodes :1-4 from "v24.3.1" to "<current>"
   │   │   ├── restart system server on node 1 with binary version <current> (108)
   │   │   ├── restart system server on node 3 with binary version <current> (109)
   │   │   ├── restart system server on node 2 with binary version <current> (110)
   │   │   ├── restart system server on node 4 with binary version <current> (111)
   │   │   └── run "TPCC workload" (112)
   │   ├── allow upgrade to happen on system tenant by resetting `preserve_downgrade_option` (113)
   │   ├── reset cluster setting "storage.sstable.compression_algorithm" on system tenant (114)
   │   ├── wait for all nodes (:1-4) to acknowledge cluster version <current> on system tenant (115)
   │   └── run "check TPCC workload" (116)
   └── upgrade tenant
      ├── upgrade nodes :1-4 from "v24.3.1" to "<current>"
      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version <current> (117)
      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version <current> (118)
      │   ├── run "TPCC workload" (119)
      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version <current> (120)
      │   └── restart mixed-version-tenant-gnpkm server on node 4 with binary version <current> (121)
      ├── downgrade nodes :1-4 from "<current>" to "v24.3.1"
      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version v24.3.1 (122)
      │   ├── run "TPCC workload" (123)
      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version v24.3.1 (124)
      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version v24.3.1 (125)
      │   └── restart mixed-version-tenant-gnpkm server on node 4 with binary version v24.3.1 (126)
      ├── upgrade nodes :1-4 from "v24.3.1" to "<current>"
      │   ├── restart mixed-version-tenant-gnpkm server on node 1 with binary version <current> (127)
      │   ├── restart mixed-version-tenant-gnpkm server on node 2 with binary version <current> (128)
      │   ├── restart mixed-version-tenant-gnpkm server on node 3 with binary version <current> (129)
      │   ├── restart mixed-version-tenant-gnpkm server on node 4 with binary version <current> (130)
      │   └── run "TPCC workload" (131)
      ├── run following steps concurrently
      │   ├── set `version` to <current> on mixed-version-tenant-gnpkm tenant, after 0s delay (132)
      │   └── run "TPCC workload", after 18s delay (133)
      ├── wait for all nodes (:1-4) to acknowledge cluster version <current> on mixed-version-tenant-gnpkm tenant (134)
      └── run "check TPCC workload" (135)

@srosenberg

From the CDF plot, an upper bound of 70 covers roughly 80% of all plan lengths, so it seems like a reasonable limit. The expected number of retries, when a generated plan exceeds the bound, is fairly small, somewhere between 1 and 3.
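
As a rough sanity check on that estimate, a minimal sketch under the assumption (taken from the CDF above) that roughly 80% of generated plans fit within the bound:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Assumption, read off the CDF above: ~80% of generated plans fit
	// within the bound of 70 steps.
	p := 0.8

	// Once an oversized plan is generated, each regeneration independently
	// succeeds with probability p, so the number of attempts is geometric:
	// 1/p attempts on average, and P(more than k retries) = (1-p)^k.
	fmt.Printf("expected attempts after an oversized plan: %.2f\n", 1/p)
	for k := 1; k <= 3; k++ {
		fmt.Printf("P(more than %d retries) = %.4f\n", k, math.Pow(1-p, k))
	}
}
```
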

In 4 out of 10 full runs, the plan exceeded the upper bound (i.e., 70) and was regenerated, ending up with a smaller plan,

 grep WARN artifacts/tpcc/mixed-headroom/n5cpu16/run_?/test.log
artifacts/tpcc/mixed-headroom/n5cpu16/run_1/test.log:[mixed-version-test] 2024/12/30 21:18:27 mixedversion.go:872: WARNING: generated a smaller (65) test plan after 1 retries
artifacts/tpcc/mixed-headroom/n5cpu16/run_2/test.log:[mixed-version-test] 2024/12/30 21:18:38 mixedversion.go:872: WARNING: generated a smaller (40) test plan after 2 retries
artifacts/tpcc/mixed-headroom/n5cpu16/run_3/test.log:[mixed-version-test] 2024/12/30 21:18:17 mixedversion.go:872: WARNING: generated a smaller (63) test plan after 1 retries
artifacts/tpcc/mixed-headroom/n5cpu16/run_6/test.log:[mixed-version-test] 2024/12/30 23:31:06 mixedversion.go:872: WARNING: generated a smaller (50) test plan after 1 retries

We can see that the 65-step plan resulted in the longest running time, namely 23024.43s, while the 40-step plan resulted in the shortest, namely 7322.53s.

artifacts/tpcc/mixed-headroom/n5cpu16/run_1/test.log:2024/12/31 03:42:16 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#1 (23024.43s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_2/test.log:2024/12/30 23:20:45 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#2 (7322.53s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_5/test.log:2024/12/31 02:09:35 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#5 (10091.87s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_6/test.log:2024/12/31 03:41:27 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#6 (15016.34s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_7/test.log:2024/12/31 03:32:04 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#7 (11087.49s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_8/test.log:2024/12/31 06:01:30 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#8 (13877.50s)
artifacts/tpcc/mixed-headroom/n5cpu16/run_9/test.log:2024/12/31 06:53:45 test_runner.go:1205: --- PASS: tpcc/mixed-headroom/n5cpu16#9 (12065.47s)
