[BUG] salt.utils.event get_event on master receiving missing data #67094

Open · david-pulkowski opened this issue Dec 11, 2024 · 0 comments
Labels: Bug (broken, incorrect, or confusing behavior), needs-triage

Description
This has been happening only recently, starting about one month ago.
There is an intermittent issue on our salt-hubs (masters); two of our three environments are affected.
Salt jobs are sent from the hub to the minions and are received and performed on the minion. On the final send back to the hub, we can see the job succeeded on the minion and the hub receives the event, but the data needed is missing.

The minion and the hub then error out with timeouts.
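
For reference, the hub side consumes the bus with the standard salt.utils.event pattern. A minimal listener along these lines (the /etc/salt/master path is an assumption; this is a sketch, not our exact tooling) can be used to watch the same salt/job tags the hub waits on, where a salt/job/<jid>/ret/<minion> event carrying the 'return' key is expected shortly after the */new event:

# Sketch: watch the master event bus for job events.
# Assumes the default /etc/salt/master config path.
import salt.config
import salt.utils.event

opts = salt.config.client_config("/etc/salt/master")

event_bus = salt.utils.event.get_event(
    "master",
    sock_dir=opts["sock_dir"],
    opts=opts,
    listen=True,
)

# The ret events should carry the minion's 'return' data,
# which is what appears to go missing.
while True:
    evt = event_bus.get_event(wait=10, tag="salt/job", full=True)
    if evt:
        print(evt["tag"], evt["data"])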

Setup

# HUB config #
worker_threads: 24
gather_job_timeout: 15

user: service-user
show_jid: True
timeout: 60
roots_update_interval: 30
top_file_merging_strategy: same

pillarenv_from_saltenv: true
pillar_source_merging_strategy: none

gitfs_provider: gitpython
gitfs_global_lock: False
git_pillar_global_lock: False

gitfs_saltenv_whitelist:
  - dev
  - qa

fileserver_backend:
  - gitfs
  - roots

# Filesystem locations #
file_roots:
  base:
    - /srv/salt
  qa:
    - /srv/salt/qa
    - /mnt/environment/qa
  dev:
    - /srv/salt/dev
    - /mnt/environment/dev


# Pillar file locations #
pillar_roots:
  base:
    - /srv/pillar
  qa:
    - /srv/pillar/qa
  dev:
    - /srv/pillar/dev

ext_pillar:
  - git:
    - dev [email protected]:salt/pillar.git:
      - env: dev

# Salt file locations #
gitfs_remotes:
  - [email protected]:salt/salt.git

# ACL's #
publisher_acl:
  service-user:
    - test.*
    - state.apply

# LOGGING #
# The level of messages to send to the log file.
# One of 'garbage', 'trace', 'debug', 'info', 'warning', 'error', 'critical'.

log_granular_levels:
  'salt.loaded.int.returner.appoptics_return': 'critical'

log_level: info
log_level_logfile: trace
log_fmt_console: '%(colorlevel)s %(colormsg)s %(colorname)s %(colorprocess)s'
log_fmt_logfile: '%(asctime)s,%(msecs)03d %(colorname)s %(colorlevel)s %(colormsg)s'

# Minion config #
master: salt-hub

grains:
  owner: devOps

top_file_merging_strategy: same
pillarenv_from_saltenv: true
saltenv: dev

Please be as specific as possible and give set-up details.

  • [x] on-prem machine
  • [x] VM (Virtualbox, KVM, etc. please specify): KVM
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit
  • jails if it is FreeBSD
  • classic packaging
  • [x] onedir packaging
  • used bootstrap to install

Steps to Reproduce the behavior
While it is an intermittent issue, here is the output of a simple test.ping from the hub to the minion.

The minion receives and performs the job:

2024-12-10 15:22:50,819 [salt.minion      ] [INFO    ] Returning information for job: 20241210212250747583
2024-12-10 15:22:50,819 [salt.minion      ] [TRACE   ] Return data: {'success': True, 'return': True, 'retcode': 0, 'jid': '20241210212250747583', 'fun': 'test.ping', 'fun_args': [], 'user': 'sudo_user'}
2024-12-10 15:22:50,820 [salt.utils.event ] [DEBUG   ] SaltEvent PUB socket URI: /var/run/salt/minion/minion_event_66df7bdfff_pub.ipc
2024-12-10 15:22:50,820 [salt.utils.event ] [DEBUG   ] SaltEvent PULL socket URI: /var/run/salt/minion/minion_event_66df7bdfff_pull.ipc
2024-12-10 15:22:50,821 [salt.transport.ipc] [TRACE   ] IPCClient: Connecting to socket: /var/run/salt/minion/minion_event_66df7bdfff_pull.ipc
2024-12-10 15:22:50,821 [salt.transport.ipc] [TRACE   ] IPCServer: Handling connection to address: <socket.socket fd=48, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, laddr=/var/run/salt/minion/minion_event_66df7bdfff_pull.ipc>
2024-12-10 15:22:50,822 [salt.utils.event ] [DEBUG   ] Sending event: tag = __master_req_channel_payload/salt-hub; data = {'cmd': '_return', 'id': 'salt-minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20241210212250747583', 'fun': 'test.ping', 'fun_args': [], 'user': 'sudo_user', '_stamp': '2024-12-10T21:22:50.821984'}
2024-12-10 15:22:50,822 [salt.transport.ipc] [DEBUG   ] Closing IPCMessageClient instance
2024-12-10 15:22:50,822 [salt.minion      ] [TRACE   ] ret_val = True
2024-12-10 15:22:50,823 [salt.minion      ] [DEBUG   ] minion return: {'success': True, 'return': True, 'retcode': 0, 'jid': '20241210212250747583', 'fun': 'test.ping', 'fun_args': [], 'user': 'sudo_user'}
2024-12-10 15:22:50,823 [salt.transport.ipc] [TRACE   ] Client disconnected from IPC /var/run/salt/minion/minion_event_66df7bdfff_pull.ipc
2024-12-10 15:22:50,823 [salt.minion      ] [DEBUG   ] Minion of 'salt-hub' is handling event tag '__master_req_channel_payload/salt-hub'
2024-12-10 15:22:50,823 [salt.minion      ] [DEBUG   ] Minion return retry timer set to 7 seconds (randomized)
2024-12-10 15:22:50,823 [salt.channel.client] [TRACE   ] ReqChannel send crypt load={'cmd': '_return', 'id': 'salt-minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20241210212250747583', 'fun': 'test.ping', 'fun_args': [], 'user': 'sudo_user', '_stamp': '2024-12-10T21:22:50.821984'}
2024-12-10 15:22:51,622 [salt.utils.schedule] [TRACE   ] ==== evaluating schedule now None =====
2024-12-10 15:22:51,622 [salt.utils.process] [DEBUG   ] Subprocess ProcessPayload(jid=20241210212250747583) cleaned up
2024-12-10 15:22:52,622 [salt.utils.schedule] [TRACE   ] ==== evaluating schedule now None =====
2024-12-10 15:22:53,622 [salt.utils.schedule] [TRACE   ] ==== evaluating schedule now None =====
2024-12-10 15:22:54,622 [salt.utils.schedule] [TRACE   ] ==== evaluating schedule now None =====
2024-12-10 15:22:55,622 [salt.utils.schedule] [TRACE   ] ==== evaluating schedule now None =====
2024-12-10 15:22:56,622 [salt.utils.schedule] [TRACE   ] ==== evaluating schedule now None =====
2024-12-10 15:22:57,622 [salt.utils.schedule] [TRACE   ] ==== evaluating schedule now None =====
2024-12-10 15:22:57,825 [salt.channel.client] [TRACE   ] Failed to send msg SaltReqTimeoutError('Message timed out')
2024-12-10 15:22:57,825 [salt.channel.client] [TRACE   ] ReqChannel send crypt load={'cmd': '_return', 'id': 'salt-minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20241210212250747583', 'fun': 'test.ping', 'fun_args': [], 'user': 'sudo_user', '_stamp': '2024-12-10T21:22:50.821984', 'nonce': '8af1100041c04d4f9b082f05b01e667c'}

The hub receives the event, but with the return data missing:

2024-12-10 15:22:50,875 [salt.client      ] [DEBUG   ] get_iter_returns for jid 20241210212250747583 sent to {'salt-minion'} will timeout at 15:23:50.875442
2024-12-10 15:22:50,875 [salt.utils.event ] [TRACE   ] Get event. tag: salt/job/20241210212250747583
2024-12-10 15:22:50,876 [salt.utils.event ] [TRACE   ] _get_event() waited 0 seconds and received nothing
2024-12-10 15:22:50,886 [salt.utils.event ] [TRACE   ] Get event. tag: salt/job/20241210212250747583
2024-12-10 15:22:50,888 [salt.utils.event ] [TRACE   ] get_event() received = {'data': {'jid': '20241210212250747583', 'tgt_type': 'glob', 'tgt': 'salt-minion', 'user': 'sudo_user', 'fun': 'test.ping', 'arg': [], 'minions': ['salt-minion'], 'missing': [], '_stamp': '2024-12-10T21:22:50.748600'}, 'tag': 'salt/job/20241210212250747583/new'}
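
To check whether the return ever makes it into the master's job cache despite the truncated event, something along these lines can be run on the hub (a sketch only, assuming the default master config path; it is the API-side equivalent of salt-run jobs.lookup_jid 20241210212250747583):

# Sketch: look the jid up in the master's job cache.
# Assumes the default /etc/salt/master config path.
import salt.config
import salt.runner

opts = salt.config.master_config("/etc/salt/master")
runner = salt.runner.RunnerClient(opts)

ret = runner.cmd("jobs.lookup_jid", ["20241210212250747583"])
print(ret)  # an empty result would mean the return never landed in the cache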

We also noticed this can cause minions to lose their connection with the hub at some point (tested with auth_safemode: True and ping_interval: 1 on the minions).

Setting --batch-size 1 on the hub at least helps target only the minions that are connected, but the timeout issues still occur.
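
The --batch-size flag corresponds to LocalClient batching on the API side; a rough sketch of the same workaround (default config path assumed, not our production tooling):

# Sketch: run test.ping one minion at a time,
# mirroring `salt --batch-size 1 '*' test.ping`.
import salt.client

local = salt.client.LocalClient("/etc/salt/master")

for ret in local.cmd_batch("*", "test.ping", batch="1"):
    print(ret)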

Expected behavior
The minion's 'ReqChannel send' payload should match what the hub's get_event() receives.

Screenshots
NA

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
          Salt: 3006.9

Python Version:
        Python: 3.10.14 (main, Jun 26 2024, 11:44:37) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.16.0
      cherrypy: 18.6.1
  cryptography: 42.0.5
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: 4.0.11
     gitpython: 3.1.40
        Jinja2: 3.1.4
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.17.0
         smmap: 5.0.1
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: rhel 9.5 Plow
        locale: utf-8
       machine: x86_64
       release: 5.14.0-503.16.1.el9_5.x86_64
        system: Linux
       version: Red Hat Enterprise Linux 9.5 Plow

Additional context
We are mainly trying to understand how we can debug this further and uncover the root cause.
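So far the main lever has been raising log levels (the trace logfile output above). Pointers to anything more targeted would be appreciated, for example watching the bus with salt-run state.event pretty=True during a reproduction, or specific log_granular_levels entries (e.g. salt.transport, salt.channel, salt.utils.event set to trace) that would show where the payload is being dropped.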
