Netgate Discussion Forum

    NRPE3 process 100% CPU load

    • junicast

      Hi,

      We run a CARP setup with two dedicated Xeon Servers and pfSense 2.5.1.
      For HA there's a dedicated 1G copper link. Everything else is connected over a 2x10G LACP LAGG. For every downstream interface there's a VLAN interface on top of that LAGG.
      We monitor both nodes individually with Icinga2 and NRPE. We recently switched from Nagios to Icinga2 and also from IPv4 addresses to IPv6 addresses. The problem started occurring after this switch.

      Only the secondary node is affected: after a couple of minutes the NRPE3 process itself goes to 100% CPU load. After some more time, additional NRPE processes spawn, each also at 100% CPU. After a few hours the load climbs above 5. I have never waited much longer, but I assume it would keep growing.

      pfsense_nrpe.png

      The NRPE process looks like this:

      nagios  36022 100.0  0.0  16144  6192  -  R    13:40       7:27.85 /usr/local/sbin/nrpe3 -d -c /usr/local/etc/nrpe.cfg
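
To keep an eye on how many runaway processes have piled up, a small parser over `ps aux` output can help. This is only a sketch: the names `ProcSample`, `parse_ps_line` and `busy_nrpe` and the 90% threshold are made up for illustration, and it assumes the eleven-column `ps aux` layout shown in the line above (USER, PID, %CPU, %MEM, VSZ, RSS, TT, STAT, STARTED, TIME, COMMAND).

```python
from typing import Iterable, List, NamedTuple


class ProcSample(NamedTuple):
    """One row of `ps aux` output, reduced to the fields of interest."""
    user: str
    pid: int
    cpu_percent: float
    command: str


def parse_ps_line(line: str) -> ProcSample:
    # Split into at most 11 fields so the command keeps its own spaces.
    fields = line.split(None, 10)
    if len(fields) < 11:
        raise ValueError(f"unexpected ps line: {line!r}")
    return ProcSample(
        user=fields[0],
        pid=int(fields[1]),
        cpu_percent=float(fields[2]),
        command=fields[10],
    )


def busy_nrpe(lines: Iterable[str], threshold: float = 90.0) -> List[ProcSample]:
    """Return nrpe3 processes whose %CPU is at or above the threshold."""
    hits = []
    for line in lines:
        if not line.strip():
            continue
        try:
            sample = parse_ps_line(line)
        except ValueError:
            # Header row or a line that does not match the assumed layout.
            continue
        if "nrpe3" in sample.command and sample.cpu_percent >= threshold:
            hits.append(sample)
    return hits
```

Feeding it the output of `ps aux` line by line and logging `len(busy_nrpe(lines))` over time would show whether the number of spinning processes keeps growing.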
      

      The config file looks like this:

      log_facility=daemon
      pid_file=/var/run/nrpe3.pid
      server_port=5666
      nrpe_user=nagios
      nrpe_group=nagios
      allowed_hosts=2a00:1234:0:106::61
      dont_blame_nrpe=0
      debug=0
      command_timeout=60
      connection_timeout=300
      command[check_users]=/usr/local/libexec/nagios/check_users -w 5 -c 10 
      command[check_load]=/usr/local/libexec/nagios/check_load -w 15,10,5 -c 30,25,20 
      command[check_root]=/usr/local/libexec/nagios/check_disk -w 20% -c 10% -p /
      command[check_var]=/usr/local/libexec/nagios/check_disk -w 20% -c 10% -p /var/run
      command[check_zombie_procs]=/usr/local/libexec/nagios/check_procs -w 5 -c 10 -s Z
      command[check_total_procs]=/usr/local/libexec/nagios/check_procs -w 150 -c 200 
      command[check_swap]=/usr/local/libexec/nagios/check_swap -w 50% -c 25% 
      command[check_synclink]=/usr/local/libexec/nagios/check_ping -w 10,2% -c 20,5% -H 192.168.15.2
      server_address=2a00:1234:0:16::3
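
Since the problem appeared after the move to IPv6, it may be worth confirming what NRPE is actually configured to bind to and accept. The following is a minimal sketch, not an official NRPE tool: `parse_nrpe_cfg`, `ip_version` and `address_versions` are hypothetical helper names, and the parser only handles the simple `key=value` and `command[name]=...` lines seen in the file above.

```python
import ipaddress
import re
from typing import Dict, Optional, Tuple

COMMAND_RE = re.compile(r"^command\[(?P<name>[^\]]+)\]$")


def parse_nrpe_cfg(text: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split an nrpe.cfg-style text into plain settings and check commands."""
    settings: Dict[str, str] = {}
    commands: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Only the first '=' separates key and value.
        key, sep, value = line.partition("=")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        match = COMMAND_RE.match(key)
        if match:
            commands[match.group("name")] = value
        else:
            settings[key] = value
    return settings, commands


def ip_version(value: str) -> Optional[int]:
    """Return 4 or 6 for an address or prefix, None for anything else."""
    try:
        return ipaddress.ip_network(value, strict=False).version
    except ValueError:
        return None  # e.g. a hostname


def address_versions(settings: Dict[str, str]) -> dict:
    """Report the IP version of server_address and of each allowed host."""
    allowed = [h.strip() for h in settings.get("allowed_hosts", "").split(",") if h.strip()]
    server = settings.get("server_address")
    return {
        "server_address": ip_version(server) if server else None,
        "allowed_hosts": [ip_version(h) for h in allowed],
    }
```

For the configuration above this would be expected to report version 6 for both `server_address` and the single `allowed_hosts` entry, which at least rules out a leftover IPv4 value.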
      

      It looks like the icinga2 daemon sometimes gets a timeout when connecting to the nrpe process, and a new nrpe3 process seems to be spawned each time that happens.
      I wasn't able to find anything related to that.

      • junicast

        Update:
        It seems an asymmetric routing problem leads to this behavior: because of it, TCP sessions die after 30 seconds. We will fix this by using a different IPv6 address for the check.

        Yet I believe this should never lead to the nrpe service going haywire.
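
For comparing candidate check addresses from the monitoring host, a plain TCP probe against the NRPE port can serve as a first test. This is a rough sketch under stated assumptions: `probe_tcp` is a made-up helper, it only tests whether the TCP handshake completes within a timeout, and it does not speak the NRPE protocol. A successful handshake would therefore not by itself rule out sessions dying later because of asymmetric routing.

```python
import socket
import time
from typing import Tuple


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> Tuple[bool, float]:
    """Try a TCP connect and return (succeeded, seconds elapsed)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and works for IPv4 and IPv6.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections and resolution failures.
        return False, time.monotonic() - start
```

Called as `probe_tcp("2a00:1234:0:16::3", 5666)` from the Icinga2 host, it returns whether the connect succeeded and how long it took; running it against each address under consideration gives a quick comparison.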

        • stephenw10 (Netgate Administrator)

          You could be hitting the route-to/reply-to bug that was fixed in 2.5.2:
          https://docs.netgate.com/pfsense/en/latest/releases/2-5-2.html#rules-nat
          https://redmine.pfsense.org/issues/11805

          I agree the nrpe service should not behave like that, though. That's probably an upstream bug.

          Steve

          Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.