Out forward stuck establishing connection #4396
Comments
Thanks for your report. We may need to do some digging. |
We're encountering a similar issue while distributing 80,000 events per second to two separate systems. Have there been any updates or workarounds identified for this? |
No. I'll see if I can reproduce it. |
Yes. We see it with the forwarder and aggregator configurations below. Both run in a Multi Process Workers environment, with 4 workers on each node.
<match udp.input.**>
@type forward
require_ack_response true
heartbeat_type udp
<buffer>
@type memory
flush_interval 1s
flush_thread_count 10
chunk_limit_size 50m
queue_limit_length 500
chunk_limit_size 100m
overflow_action drop_oldest_chunk
retry_max_interval 10m
retry_forever true
delayed_commit_timeout 100
</buffer>
<server>
host 10.10.1.2
port 24224
weight 60
</server>
<server>
host 10.10.1.3
port 24224
weight 60
</server>
</match>
<source>
@type forward
port 24224
bind 0.0.0.0
tag udp.forward
</source>
It appears that when distributing events, Node 1 receives events via UDP and then shares them with the other nodes using the forward plugin. However, after approximately 10 seconds, fluentd enters a stalled state where it no longer accepts new incoming events and only forwards heartbeats. I haven't seen any error logs related to this behaviour. After checking the network calls with strace we are suspecting this
If there's any other information needed on this issue, please let me know. |
did you manage to replicate the issue @daipom ? |
Sorry, I haven't made time for this.
Does this mean that this issue reproduces within the same machine (setting Or, does this reproduce only under the following infrastructure (where
|
This setup is currently deployed on an ESXi host. We have three nodes deployed independently, and we have been able to reproduce the issue by continuously streaming across the nodes. The actual flow will be like this I have three questions:
|
I'm observing the same issue in our setup on macOS. It usually occurs when the MacBook lid is closed. Based on the logs, I noticed that the MacBook wakes up from time to time to do certain things, and during these wake-ups fluentd is more likely to enter the infinite loop in establish_connection. I'm using TLS with heartbeat disabled. I've added some extra error logging to the
When it enters the infinite loop, the error stays the same, but the retry count increases forever:
My current workaround probably doesn't address the actual issue (why
|
Describe the bug
I encountered a critical issue with Fluentd where it gets stuck in an infinite loop while establishing connections between out_forward and in_forward servers. Our infrastructure has an unstable connection, which may make this bug occur more frequently. The bug caused Fluentd to stop sending data (and heartbeats) even after the underlying network infrastructure was back online. Interestingly, killing the socket resolved the issue and caused the thread to continue sending data.
The simplest way I can describe the bug is that fluentd suddenly stops sending any data to the remote server, with no errors in the logs, and if I check inside the container I can see an open socket in the "ESTABLISHED" state to the remote server.
After some digging in the code, I think I might have found the reason for this behaviour.
On every heartbeat/write, a new socket is created and connect(dest_addr) is called, which does the TLS handshake. After that, establish_connection is called to do the hello, ping, pong protocol.
It seems that if the connection between fluentd and the destination server drops after the TLS handshake, by the time we reach establish_connection, fluentd doesn't detect the drop or time out the socket and, as far as I can tell, gets stuck in this loop, where the socket is still considered up but will never contain any data.
It's important to note that in our infrastructure the out_forward and in_forward servers don't connect directly to each other. There are a few components in between them, so if the overall connection drops it doesn't necessarily mean that the socket will drop, and we have to rely on options like timeouts.
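To make the failure mode concrete, here is a minimal sketch of a polling handshake loop that gives up after a deadline. It is a simplified model written for illustration, not Fluentd's actual establish_connection code. The names wait_for_handshake, HandshakeTimeout, timeout and poll_interval are all hypothetical, and it assumes a plain (non-TLS) socket.

```ruby
require "socket"

# Hypothetical error type for this sketch; not part of Fluentd.
class HandshakeTimeout < StandardError; end

# Hypothetical helper: polls `sock` until the given block returns true for
# the bytes received so far, or raises HandshakeTimeout after `timeout` seconds.
def wait_for_handshake(sock, timeout:, poll_interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  received = +""
  loop do
    begin
      received << sock.read_nonblock(512)
      return received if yield(received)
    rescue IO::WaitReadable
      # No data yet. If the peer has silently disappeared, this branch is hit
      # on every iteration, so only the deadline check below ends the loop.
    end
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise HandshakeTimeout, "no handshake reply within #{timeout}s"
    end
    sleep poll_interval
  end
end

# Demo with a local socket pair: the peer never answers, so the loop times out.
if __FILE__ == $PROGRAM_NAME
  client, _silent_peer = UNIXSocket.pair
  begin
    wait_for_handshake(client, timeout: 0.5) { |buf| buf.include?("PONG") }
  rescue HandshakeTimeout => e
    puts "gave up: #{e.message}"
  end
end
```

A closed peer surfaces as EOFError from read_nonblock and propagates to the caller. The case described above is the harder one, where the socket stays open but no bytes ever arrive. On a TLS socket, read_nonblock can also raise IO::WaitWritable, which this sketch does not handle.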
It would seem that the sockopts RCVTIMEO/SNDTIMEO should cause an exception to be raised, breaking the loop, but as far as I can tell those socket options are problematic in Ruby:
https://stackoverflow.com/questions/9853516/set-socket-timeout-in-ruby-via-so-rcvtimeo-socket-option#:~:text=Based%20on%20my%20testing%2C%20and%20Jesse%20Storimer%27s%20excellent%20ebook%20on%20%22Working%20with%20TCP%20Sockets%22%20(in%20Ruby)%2C%20the%20timeout%20socket%20options%20do%20not%20work%20in%20Ruby%201.9%20(and%2C%20I%20presume%202.0%20and%202.1).%20Jesse%20says%3A
https://www.ruby-forum.com/t/how-to-set-socket-option-so-rcvtimeo/77758/8
https://stackoverflow.com/questions/57895907/ruby-2-5-set-timeout-on-tcp-sockets-read
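A commonly used alternative to the socket-level timeout options in Ruby is to wait for readiness with IO.select, which accepts a timeout in seconds and returns nil when it expires. The sketch below shows that pattern on a plain socket. The helper name read_with_timeout is hypothetical, and this is not how Fluentd currently reads from the socket.

```ruby
require "socket"
require "timeout"

# Hypothetical helper: reads up to `max_bytes` from `sock`, raising
# Timeout::Error if nothing becomes readable within `timeout` seconds.
def read_with_timeout(sock, max_bytes, timeout)
  # IO.select returns nil if the timeout elapses with no readable socket.
  readable, = IO.select([sock], nil, nil, timeout)
  raise Timeout::Error, "no data within #{timeout}s" if readable.nil?

  # The socket was reported readable; read_nonblock raises EOFError if the
  # peer closed the connection, and may still raise IO::WaitReadable in
  # rare cases, which a caller could treat as "try again".
  sock.read_nonblock(max_bytes)
end

# Demo with a local socket pair.
if __FILE__ == $PROGRAM_NAME
  a, b = UNIXSocket.pair
  b.write("hello")
  puts read_with_timeout(a, 16, 1)
  begin
    read_with_timeout(a, 16, 0.2) # nothing more was written
  rescue Timeout::Error => e
    puts "timed out: #{e.message}"
  end
end
```

With TLS sockets, OpenSSL can hold already-decrypted data in its own buffer, so readiness of the underlying socket alone may not tell the whole story. Treat this as an illustration of the timeout mechanism only.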
To Reproduce
I'm not sure how to consistently reproduce the issue. It seems to depend on when the connection was dropped and where establish_connection is at that moment.
Expected behavior
I expect the socket to time out if it fails to establish a connection after some time. Perhaps using IO.select.
Your Environment
- Fluentd version: 1.16.2
Your Configuration
Your Error Log
Additional context
No response