You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Setup: basic hub spoke with leafnodes.
hub is a single jetstream enabled server in its own domain, so are the leaf nodes.
On the hub, there is a stream named match_events that sources from the match_events stream in each leaf node.
Symptoms:
Sourcing from match_events on one leafnode stopped working.
Despite new messages being added to it, no data is sent to match_events in the hub.
Cause:
I believe we ran into a crash during recovery that sounds similar to one fixed in a release candidate of 2.10.23.
In order to recover from this issue, we renamed the data directory, which resulted in data loss but allowed the system to become functional again.
On the leaf node with the issue, sequence numbers in the stream are less than 80.
Create time is right after the time when the jetstream data directory got renamed and everything started to seemingly function again.
On the hub node, stream info showed the edge connected with low lag.
But when subscribing to the consumer create message of that leaf, we noticed that the requested start seq no is 995 and hence way bigger than 80.
Meaning 995 was the sequence number the previous incarnation of match_events had on that leaf and 79 is the value of the current incarnation
This would explain why there seems to be no data flow ... yet.
Failed workaround:
We edited the match_events stream in the hub to remove the source corresponding to the leaf.
Then we re-added it. This did not seem to work as we recorded the consumer create messages and the sequence number remained at the old value it had before the source got removed & re-added.
I then edited the source to set the op_start_seq to 1. This didn't work either, as presumably a higher sequence number was still in memory.
A few hours later we restarted the hub server. After that the sequence number was reset to one.
Logs:
Hub consumer create messages:
[#13] Received on "$JS.edge-652efb2da425e854a1e8d203.API.CONSUMER.CREATE.match_events" with reply "$JSC.R.FEkNK9Is"
{"stream_name":"match_events","config":{"deliver_policy":"by_start_sequence","opt_start_seq":995,"ack_policy":"none","ack_wait":79200000000000,"max_deliver":1,"replay_policy":"instant","idle_heartbeat":1000000000,"flow_control":true,"deliver_subject":"deliver.edge-652efb2da425e854a1e8d203.match_events.S.l2lkXkVm","inactive_threshold":10000000000,"num_replicas":0,"direct":true},"action":""}
[#21] Received on "$JS.edge-652efb2da425e854a1e8d203.API.CONSUMER.CREATE.match_events" with reply "$JSC.R.3L70SiTW"
{"stream_name":"match_events","config":{"deliver_policy":"by_start_sequence","opt_start_seq":995,"ack_policy":"none","ack_wait":79200000000000,"max_deliver":1,"replay_policy":"instant","idle_heartbeat":1000000000,"flow_control":true,"deliver_subject":"deliver.edge-652efb2da425e854a1e8d203.match_events.S.qyIUgE5D","inactive_threshold":10000000000,"num_replicas":0,"direct":true},"action":""}
Leafnode match_events stream:
> nats stream ls -a
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Streams │
├─────────────────────┬────────────────────────────────────────────┬─────────────────────┬───────────┬─────────┬──────────────┤
│ Name │ Description │ Created │ Messages │ Size │ Last Message │
├─────────────────────┼────────────────────────────────────────────┼─────────────────────┼───────────┼─────────┼──────────────┤
...
│ match_events │ │ 2024-11-27 17:53:49 │ 79 │ 15 MiB │ 25m10s │
...
╰─────────────────────┴────────────────────────────────────────────┴─────────────────────┴───────────┴─────────┴──────────────╯
> nats stream info match_events
Information for Stream match_events created 2024-11-27 17:53:49
Subjects: faceio.match.>
Replicas: 1
Storage: File
Options:
Retention: Limits
Acknowledgments: true
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: unlimited
Maximum Bytes: unlimited
Maximum Age: 7d0h0m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: edge-652efb2da425e854a1e8d203
Leader: edge-652efb2da425e854a1e8d203
State:
Messages: 79
Bytes: 15 MiB
First Sequence: 1 @ 2024-11-27 18:13:14
Last Sequence: 79 @ 2024-12-02 17:57:52
Active Consumers: 1
Number of Subjects: 79
>
Hub consumer create messages for failed workaround:
The first message was received after re-creating the source.
The second message was received after editing opt_start_seq of the source.
In both cases, opt_start_seq in the request remains at the last old value.
>nats sub "$JS.edge-652efb2da425e854a1e8d203.API.CONSUMER.CREATE.match_events"
10:44:28 Subscribing on $JS.edge-652efb2da425e854a1e8d203.API.CONSUMER.CREATE.match_events
[#1] Received on "$JS.edge-652efb2da425e854a1e8d203.API.CONSUMER.CREATE.match_events" with reply "$JSC.R.XoBsIZ61"
{"stream_name":"match_events","config":{"deliver_policy":"by_start_sequence","opt_start_seq":995,"ack_policy":"none","ack_wait":79200000000000,"max_deliver":1,"replay_policy":"instant","idle_heartbeat":1000000000,"flow_control":true,"deliver_subject":"deliver.edge-652efb2da425e854a1e8d203.match_events.S.58t1cta4","inactive_threshold":10000000000,"num_replicas":0,"direct":true},"action":""}
[#2] Received on "$JS.edge-652efb2da425e854a1e8d203.API.CONSUMER.CREATE.match_events" with reply "$JSC.R.h5dA0pYu"
{"stream_name":"match_events","config":{"deliver_policy":"by_start_sequence","opt_start_seq":995,"ack_policy":"none","ack_wait":79200000000000,"max_deliver":1,"replay_policy":"instant","idle_heartbeat":1000000000,"flow_control":true,"deliver_subject":"deliver.edge-652efb2da425e854a1e8d203.match_events.S.j4aQssSi","inactive_threshold":10000000000,"num_replicas":0,"direct":true},"action":""}
Hub consumer create messages after hub nats server restart show opt_start_seq=1.
[#7] Received on "$JS.edge-652efb2da425e854a1e8d203.API.CONSUMER.CREATE.match_events" with reply "$JSC.R.snChCWwx"
{"stream_name":"match_events","config":{"deliver_policy":"by_start_sequence","opt_start_seq":995,"ack_policy":"none","ack_wait":79200000000000,"max_deliver":1,"replay_policy":"instant","idle_heartbeat":1000000000,"flow_control":true,"deliver_subject":"deliver.edge-652efb2da425e854a1e8d203.match_events.S.6U6FCNWD","inactive_threshold":10000000000,"num_replicas":0,"direct":true},"action":""}
15:19:43 Disconnected due to: read tcp 127.0.0.1:32592->127.0.0.1:4222: wsarecv: An existing connection was forcibly closed by the remote host., will attempt reconnect
15:20:09 Reconnected [nats://ff-queue-user:[email protected]:4222]
[#8] Received on "$JS.edge-652efb2da425e854a1e8d203.API.CONSUMER.CREATE.match_events" with reply "$JSC.R.hwd0hutm"
{"stream_name":"match_events","config":{"deliver_policy":"by_start_sequence","opt_start_seq":1,"ack_policy":"none","ack_wait":79200000000000,"max_deliver":1,"replay_policy":"instant","idle_heartbeat":1000000000,"flow_control":true,"deliver_subject":"deliver.edge-652efb2da425e854a1e8d203.match_events.S.ZoqTXdlB","inactive_threshold":10000000000,"num_replicas":0,"direct":true},"action":""}
Hub match_events stream taken right after consumer create [#2] from above, showing start seqno 1
Information for Stream match_events created 2024-11-15 10:45:14
Subjects: faceio.match.>
Replicas: 1
Storage: File
Options:
Retention: Limits
Acknowledgments: true
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: unlimited
Maximum Bytes: unlimited
Maximum Age: 7d0h0m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Replication:
Sources: ...
match_events, Start Seq: 1, API Prefix: $JS.edge-652efb2da425e854a1e8d203.API, Delivery Prefix: deliver.edge-652efb2da425e854a1e8d203.match_events
...
Source Information:
...
Stream Name: match_events
Lag: 0
Last Seen: 934ms
Ext. API Prefix: $JS.edge-652efb2da425e854a1e8d203.API
Ext. Delivery Prefix: deliver.edge-652efb2da425e854a1e8d203.match_events
...
State:
Messages: 40,303
Bytes: 7.4 GiB
First Sequence: 505,520 @ 2024-11-26 11:02:59
Last Sequence: 545,822 @ 2024-12-03 11:02:56
Active Consumers: 1
Number of Subjects: 37,538
Leafnode nats-server logs showing restart loop:
nats server basically starts and exits right away. (within same second. different from being killed because it's taking too long)
The problem is that jetstream at the core does not differentiate between the old and new version of the stream match_events.
If the source stream where to receive and keep track of the remote stream's create time and included that time in the consumer create, the remote server can check if the time submitted differs from the create time of the stream and return an error instead.
This error could be the signal for the hub jetstream to start over by resetting the sequence number, obtain and update the new timestamp, create the new consumer again.
This probably has implications for mirroring as well. There a differing timestamp should probably result in deleting the current content prior to starting at 0 again.
Server and client version
2.10.20 & 2.10.23
Host environment
No response
Steps to reproduce
No response
The text was updated successfully, but these errors were encountered:
wallyqs
changed the title
Jetstream stream source stalls/is out of sync after sourced stream is re-created for unrelated reason
Jetstream stream source stalls/is out of sync after sourced stream is re-created for unrelated reason [v2.10.20]
Dec 11, 2024
wallyqs
changed the title
Jetstream stream source stalls/is out of sync after sourced stream is re-created for unrelated reason [v2.10.20]
JetStream stream source stalls/is out of sync after sourced stream is re-created for unrelated reason [v2.10.20]
Dec 11, 2024
My earlier attempt at a workaround included a nats-server restart that appeared to have fixed it after re-creating the source.
Now I believe this was just a timing coincidence as that leaf nodes data probably was expired by ttl in the meantime.
What works reliably is delete the source, purge that leafs data, create the source again.
Observed behavior
Setup: basic hub spoke with leafnodes.
hub is a single jetstream enabled server in its own domain, so are the leaf nodes.
On the hub, there is a stream named match_events that sources from the match_events stream in each leaf node.
Symptoms:
Sourcing from match_events on one leafnode stopped working.
Despite new messages being added to it, no data is sent to match_events in the hub.
Cause:
I believe we ran into a crash during recovery that sounds similar to one fixed in a release candidate of 2.10.23.
In order to recover from this issue, we renamed the data directory, which resulted in data loss but allowed the system to become functional again.
On the leaf node with the issue, sequence numbers in the stream are less than 80.
Create time is right after the time when the jetstream data directory got renamed and everything started to seemingly function again.
On the hub node, stream info showed the edge connected with low lag.
But when subscribing to the consumer create message of that leaf, we noticed that the requested start seq no is 995 and hence way bigger than 80.
Meaning 995 was the sequence number the previous incarnation of match_events had on that leaf and 79 is the value of the current incarnation
This would explain why there seems to be no data flow ... yet.
Failed workaround:
We edited the match_events stream in the hub to remove the source corresponding to the leaf.
Then we re-added it. This did not seem to work as we recorded the consumer create messages and the sequence number remained at the old value it had before the source got removed & re-added.
I then edited the source to set the op_start_seq to 1. This didn't work either, as presumably a higher sequence number was still in memory.
A few hours later we restarted the hub server. After that the sequence number was reset to one.
Logs:
Hub consumer create messages:
Leafnode match_events stream:
Hub consumer create messages for failed workaround:
The first message was received after re-creating the source.
The second message was received after editing opt_start_seq of the source.
In both cases, opt_start_seq in the request remains at the last old value.
Hub consumer create messages after hub nats server restart show opt_start_seq=1.
Hub match_events stream taken right after consumer create [#2] from above, showing start seqno 1
Leafnode nats-server logs showing restart loop:
nats server basically starts and exits right away. (within same second. different from being killed because it's taking too long)
Expected behavior
The problem is that jetstream at the core does not differentiate between the old and new version of the stream match_events.
If the source stream where to receive and keep track of the remote stream's create time and included that time in the consumer create, the remote server can check if the time submitted differs from the create time of the stream and return an error instead.
This error could be the signal for the hub jetstream to start over by resetting the sequence number, obtain and update the new timestamp, create the new consumer again.
This probably has implications for mirroring as well. There a differing timestamp should probably result in deleting the current content prior to starting at 0 again.
Server and client version
2.10.20 & 2.10.23
Host environment
No response
Steps to reproduce
No response
The text was updated successfully, but these errors were encountered: