Agent stops taking jobs after server throws 5XX errors #4446

Open
3 tasks done
aaronriedel opened this issue Nov 23, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@aaronriedel
aaronriedel commented Nov 23, 2024

Component

agent

Describe the bug

When the server (running in Kubernetes) restarts, my Docker agent refuses to take new jobs until it is restarted. The agent logs show several 5XX errors while the server reboots. After that, the agent shows as online in the UI but does not take jobs.

Agent logs: See below

Steps to reproduce

  1. Install Woodpecker server in Kubernetes
  2. Install agent on a separate server using Docker
  3. Kill the server so that it recreates
  4. Trigger pipeline that would use the docker agent
  5. See it pending

Expected behavior

The agent should properly reconnect to the Server via gRPC after the server restarts.
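
As an illustration of the expected behavior, here is a minimal sketch of a polling loop that retries with exponential backoff instead of exiting on the first failed call. It is not Woodpecker's actual agent code: `TransientRpcError`, `poll_with_backoff`, and `fetch_job` are hypothetical names, and the error class only stands in for gRPC failures such as `Unavailable` or `Unknown`.

```python
import time


class TransientRpcError(Exception):
    """Hypothetical stand-in for a gRPC failure such as Unavailable or Unknown."""


def poll_with_backoff(fetch_job, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call fetch_job until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_job()
        except TransientRpcError:
            if attempt == max_attempts:
                raise  # give up only once the retry budget is spent
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait at max_delay
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting. A long-running agent would more likely retry indefinitely with a capped delay than use a fixed attempt budget.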

System Info

Server:
{"source":"https://github.com/woodpecker-ci/woodpecker","version":"2.7.3"}

Helm values:

---
server:
  ingress:
    # -- Enable the ingress for the server component
    enabled: true
    # -- Add annotations to the ingress
    annotations:
      # kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: "true"
    hosts:
      - host: woodpecker.example.com
        paths:
          - path: /
            backend:
              serviceName: woodpecker-svc
              servicePort: 80
    tls:
      - hosts:
          - woodpecker.example.com
        secretName: woodpecker-tls-key
  statefulSet:
    replicaCount: 1
  env:
    WOODPECKER_ADMIN: 'aaron'
    WOODPECKER_HOST: 'https://woodpecker.example.com'
    WOODPECKER_OPEN: true
    WOODPECKER_FORGEJO: true
    WOODPECKER_FORGEJO_URL: 'https://git.example.com'
    WOODPECKER_LOG_LEVEL: "error"
  extraSecretNamesForEnvFrom:
    - woodpecker-forgejo

gRPC Ingress:

---
apiVersion: v1
kind: Service
metadata:
  name: woodpecker-grpc
  namespace: woodpecker
  annotations:
    traefik.ingress.kubernetes.io/service.serversscheme: h2c
spec:
  selector:
    app.kubernetes.io/instance: woodpecker
    app.kubernetes.io/name: server
  ports:
    - name: grpc
      protocol: TCP
      port: 9000
      targetPort: grpc
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/tls-acme: "true"
    traefik.ingress.kubernetes.io/loadbalancer.server.scheme: h2c
    traefik.ingress.kubernetes.io/service.serversscheme: h2c
  name: woodpecker-grpc
  namespace: woodpecker
spec:
  rules:
    - host: "woodpecker-grpc.apps.example.com"
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: woodpecker-grpc
                port:
                  name: grpc
  tls:
    - hosts:
        - woodpecker-grpc.apps.example.com
      secretName: woodpecker-grpc-tls-key

docker-compose config for agent:

services:
  woodpecker-agent-1:
    container_name: woodpecker-agent-1
    image: woodpeckerci/woodpecker-agent:latest
    command: agent
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - WOODPECKER_SERVER=woodpecker-grpc.apps.example.com:443
      - WOODPECKER_AGENT_SECRET=${WOODPECKER_AGENT_SECRET}
      - WOODPECKER_MAX_WORKFLOWS=4
      - WOODPECKER_FILTER_LABELS="backend=docker"
      - WOODPECKER_BACKEND_DOCKER_ENABLE_IPV6=true
      - WOODPECKER_GRPC_SECURE=true
      - WOODPECKER_GRPC_VERIFY=true
    labels:
      - "com.centurylinklabs.watchtower.enable=true"

Additional context

Agent logs:

{"level":"info","time":"2024-11-23T08:44:52Z","message":"starting Woodpecker agent with version '2.7.3' and backend 'docker' using platform 'linux/amd64' running up to 4 pipelines in parallel"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:26:59Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:00Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:01Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:02Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:04Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:06Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:12Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:19Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"grpc error: next(): code: Unknown"}
{"level":"error","error":"rpc error: code = Unknown desc = unexpected HTTP status code received from server: 500 (Internal Server Error); malformed header: missing HTTP content-type","time":"2024-11-23T14:27:21Z","message":"runner done with error"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:24Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:34Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:39Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:53Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:00Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:15Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:29Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:40Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:54Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:29:02Z","message":"grpc error: report_health(): code: Unavailable"}
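
To make the log easier to read, the entries can be grouped by level and message. The following is a small sketch, assuming one JSON object per line as in the output above; `summarize` is a hypothetical helper name, not part of Woodpecker.

```python
import json
from collections import Counter


def summarize(log_lines):
    """Count log entries by (level, message), skipping blank or non-JSON lines."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log entry
        if isinstance(entry, dict):
            counts[(entry.get("level"), entry.get("message"))] += 1
    return counts
```

Applied to the log above, this groups the entries into one startup message, 18 `report_health()` warnings, four `next()` errors, and four `runner done with error` entries. The last number equals `WOODPECKER_MAX_WORKFLOWS=4`, which may indicate that every runner loop exited while health reporting kept going. That is an inference from the log, not a confirmed diagnosis.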

Validations

  • Read the docs.
  • Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
  • Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]
@aaronriedel aaronriedel added the bug Something isn't working label Nov 23, 2024
@zc-devs
Contributor

zc-devs commented Nov 23, 2024

Does it work if you deploy an agent in Kubernetes (direct Agent-Server connection, not via Traefik)?

JFYI, that is my IngressRoute, which worked a couple of months ago:

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: woodpecker-server
spec:
  entryPoints:
  - websecure
  routes:
  - kind: Rule
    match: Host(`wp.domain.tld`)
    services:
    - name: woodpecker-server
      port: http
  - kind: Rule
    match: Host(`wp.domain.tld`) && Headers(`Content-Type`, `application/grpc`)
    services:
    - name: woodpecker-server
      port: grpc
      scheme: h2c

However, I didn't restart the server, if I remember correctly.

@aaronriedel
Author

The Kubernetes agents work fine and are not affected by the problem. It is very likely that the 5XX errors mostly come from Traefik. However, I would also expect the agent to not poop itself when there are errors for a few seconds.

Matching on the content type is a good hint; I might implement this. I currently don't use IngressRoute objects and instead configure normal Ingresses with annotations.

@zc-devs
Contributor

zc-devs commented Nov 23, 2024

received unexpected content-type "text/plain; charset=utf-8"
errors come from Traefik

I think so and I had this.

The agent should properly reconnect

{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:24Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:34Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:39Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:27:53Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:00Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:15Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:29Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:40Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:28:54Z","message":"grpc error: report_health(): code: Unavailable"}
{"level":"warn","error":"rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 503 (Service Unavailable); transport: received unexpected content-type \"text/plain; charset=utf-8\"","time":"2024-11-23T14:29:02Z","message":"grpc error: report_health(): code: Unavailable"}

It seems to be trying.

Do you have 2 ingresses: one for HTTP, another for gRPC? Could you show the HTTP one?

@pat-s pat-s added the forge/gitea gitea forge related label Nov 24, 2024
@pat-s
Contributor

pat-s commented Nov 24, 2024

Accidentally added the label. Can't remove it anymore :/

@qwerty287 qwerty287 removed the forge/gitea gitea forge related label Nov 25, 2024