
Timeout in liveness / readiness probes leading to outage #13680

Open
1 task done
rodolfobrunner opened this issue Sep 16, 2024 · 3 comments

Comments

rodolfobrunner commented Sep 16, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

Kong 3.6.1

Current Behavior

When Kong loses connectivity to the database, the first few requests to /status and /metrics take too long to return. Since Kubernetes uses /status for its readiness / liveness probes, and those requests timed out, it assumed the pod was unhealthy.

Until Kubernetes decided to restart the container, Kong was proxying requests without a problem.

After looking at kong.api.routes.health.lua, we realized that the kong.db:connect() call used to test connectivity is executed synchronously and without an explicit timeout, leading to a latency of 5+ seconds on the first requests.
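For illustration, the pattern looks roughly like the sketch below (a simplification, not the verbatim Kong source): the handler calls kong.db:connect() inline, so when the database is down the probe request blocks until the driver itself gives up.

-- Simplified sketch of the /status check described above (not verbatim Kong source).
-- kong.db:connect() runs inline in the request handler, so a down database stalls
-- the probe until the driver-level timeout (pg_timeout, default 5000 ms) fires.
local function check_db_connectivity()
  local ok, err = kong.db:connect()  -- synchronous; no per-call timeout argument
  if not ok then
    ngx.log(ngx.ERR, "failed to connect to PostgreSQL during /status endpoint check: ", err)
    return false
  end
  return true
end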

The sequence of events was:

Several "harmless" entries of:
[error] 1322#0: *664431 [lua] health.lua:61: fn(): failed to connect to PostgreSQL during /status endpoint check: [PostgreSQL error] timeout, client: 10.145.65.245, server: kong_status, request: "GET /status HTTP/1.1", host: "10.145.43.33:8802"
[error] 1322#0: *660572 [kong] exporter.lua:353 prometheus: failed to reach database while processing/metrics endpoint: timeout, client: 10.145.46.50, server: kong_status, request: "GET /metrics HTTP/1.1", host: "10.145.43.33:8802"

Followed by a surprise SIGQUIT:

2024/09/16 07:03:42 [notice] 1#0: exit
2024/09/16 07:03:42 [notice] 1#0: worker process 1322 exited with code 0
2024/09/16 07:03:42 [notice] 1#0: signal 17 (SIGCHLD) received from 1322
2024/09/16 07:03:41 [notice] 1322#0: exit
2024/09/16 07:03:41 [notice] 1322#0: exiting
2024/09/16 07:03:38 [notice] 1#0: signal 29 (SIGIO) received
2024/09/16 07:03:38 [notice] 1#0: worker process 1323 exited with code 0
2024/09/16 07:03:38 [notice] 1#0: signal 17 (SIGCHLD) received from 1323
2024/09/16 07:03:37 [notice] 1323#0: exit
2024/09/16 07:03:37 [notice] 1323#0: exiting
2024/09/16 07:03:37 [notice] 1323#0: gracefully shutting down
2024/09/16 07:03:37 [notice] 1322#0: gracefully shutting down
2024/09/16 07:03:37 [notice] 1#0: signal 3 (SIGQUIT) received from 1990, shutting down

Following the apparent shutdown, we see:
[Screenshot 2024-09-16 at 14 09 53]

Please note that the pod was not able to die even though it received a SIGQUIT, and it stayed that way even after it regained DB connectivity. The likely reason is a CrashLoopBackOff during shutdown, as Kong tried to save some state to a database it could not reach.

Expected Behavior

Kong had no problem routing requests without DB connectivity until it was taken out of the cluster due to the /status timeout.

As the DB is not in the hot path, we did not expect an outage on a DB restart.

In our opinion, the kong.db:connect() connectivity test should either be done asynchronously or have a timeout.
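A minimal sketch of the asynchronous variant we have in mind (hypothetical, not existing Kong code): refresh the connectivity state on a background timer and let /status read the cached flag, so the probe never blocks on the driver timeout.

-- Hypothetical sketch of an asynchronous connectivity check (not existing Kong code).
-- A background timer refreshes a cached flag; the /status handler reads the flag and
-- returns immediately instead of blocking on kong.db:connect().
local db_reachable = true

local function refresh_db_reachability(premature)
  if premature then
    return
  end
  local ok = kong.db:connect()  -- still blocking, but off the request path
  db_reachable = not not ok
end

ngx.timer.every(5, refresh_db_reachability)  -- re-check every 5 seconds

local function status_handler()
  if not db_reachable then
    return kong.response.exit(503, { message = "database unreachable" })
  end
  return kong.response.exit(200, { database = { reachable = true } })
end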

Steps To Reproduce

  1. Set up Kong in traditional mode
  2. Disable the DB
  3. Make a request to any route (should work)
  4. Make a request to /status (should show a latency spike for the first attempts; see the example below)
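For example (hypothetical commands, assuming the default proxy port 8000 and the status listener port 8802 from the logs above):

time curl -s http://localhost:8000/some-route   # proxy listener: still answers quickly
time curl -s http://localhost:8802/status       # status listener: first attempts stall ~5 s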

Anything else?

No response

chobits (Contributor) commented Sep 20, 2024

hi,

I just want to briefly mention the timeout issue.

After looking at kong.api.routes.health.lua, we realized that the kong.db:connect() call used to test connectivity is executed synchronously and without an explicit timeout, leading to a latency of 5+ seconds on the first requests.

In our opinion, the kong.db:connect() connectivity test should either be done asynchronously or have a timeout.

I think this behaviour is expected; the timeout is configured through the pg_timeout option in the kong.conf file. See the default value:

#pg_timeout = 5000               # Defines the timeout (in ms), for connecting,
                                 # reading and writing.
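For example (illustrative value), the bound can be changed in kong.conf or, equivalently, through the standard KONG_-prefixed environment variable:

pg_timeout = 2000                # kong.conf: give up on the DB after 2 s
KONG_PG_TIMEOUT=2000             # equivalent environment variable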

rodolfobrunner (Author) commented Sep 20, 2024

Thank you for the reply @chobits
I tried to follow the code, and connect / reconnect appeared to receive a config that was not being passed in kong.db:connect(). I now realize it is passed implicitly, with self as the first argument.

In your opinion, do you think it might be problematic that the Kong chart's liveness / readiness probe timeout has the same default value as pg_timeout?

I am not sure about touching pg_timeout, as I don't know exactly what Kong is doing under the hood and what its needs are; but since it is bounded to 5 seconds, I can raise the probe timeouts instead.
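For what it's worth, a sketch of that mitigation in the pod spec (illustrative values; the port is the status listener from the logs above):

readinessProbe:
  httpGet:
    path: /status
    port: 8802
  timeoutSeconds: 6      # > pg_timeout (5000 ms), so a DB-induced stall does not fail the probe
  periodSeconds: 10
  failureThreshold: 3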


regnaio commented Dec 29, 2024

I see this
