Netgate Discussion Forum

    2.4.5-p1 HA carp setup on Hyper-V, high hvevent0 CPU usage

      rgijsen
      last edited by rgijsen

      We have an HA setup with CARP running on Hyper-V, and we've had it like that for years without any real issues. We are running 2.4.5-p1 on Server 2019 (core) with the Hyper-V role, with some VLAN trunk ports, and functionally everything works well. I noticed though that the primary pfSense node consumes a lot of CPU in kernel{hvevent0}. top extract:

      PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
      11 root 155 ki31 0K 64K RUN 1 513:56 87.41% [idle{idle: cpu1}]
      11 root 155 ki31 0K 64K RUN 3 505:40 86.37% [idle{idle: cpu3}]
      11 root 155 ki31 0K 64K CPU2 2 499:32 85.76% [idle{idle: cpu2}]
      11 root 155 ki31 0K 64K CPU0 0 489:10 70.66% [idle{idle: cpu0}]
      0 root -92 - 0K 800K - 0 73:35 26.15% [kernel{hvevent0}]
      32077 www 27 0 22016K 11672K CPU1 1 37:05 12.46% /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cf
      0 root -92 - 0K 800K - 2 21:36 8.74% [kernel{hvevent2}]
      0 root -92 - 0K 800K - 3 20:04 7.87% [kernel{hvevent3}]

      In this copy/paste the kernel{hvevent0} thread is at about 26% CPU, but I've seen it over 30% plenty of times as well. The second node (running on another Hyper-V box, of course) does not have that problem, even when it's made active and has work to do. Both hosts running these VMs are on the same patch level, KB4577668 as of now. As far as I'm aware nothing obvious has been changed in the pfSense VMs apart from the operational firewall rule changes we need from time to time.

      As per https://forum.netgate.com/topic/121407/solved-with-workaround-higher-cpu-load-after-upgrade-to-2-4-on-hyper-v/3 and https://redmine.pfsense.org/issues/6882 I did some checks, but neither VM has a CD drive attached at all, nor do they have SNMP enabled. I tried disabling the Hyper-V integration services from the Hyper-V side, powered down the primary pfSense VM and restarted it, and tried enabling and disabling SNMP just to make sure that wasn't the issue. No cigar.
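      For reference, this is roughly how I verified those workarounds from the pfSense shell (a minimal sketch; it assumes a virtual DVD drive would show up as a CAM cd device and that pfSense's SNMP service runs as bsnmpd):

      # Would list a virtual CD/DVD drive as cd0 if one were attached
      camcontrol devlist

      # Check whether the SNMP daemon (bsnmpd on pfSense) is running
      ps ax | grep '[b]snmpd'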

      How to troubleshoot this further?

        stephenw10 Netgate Administrator
        last edited by

        I would check vmstat -i and ps -auxwwd to see if anything is showing high there.

        It may not show anything though. That seems to be load servicing the hv_vmbus.
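        For reference, a sketch of those checks from the shell (top -aSH, used elsewhere in this thread, includes system processes and shows threads individually so kernel{hvevent0} is visible):

        # Interrupt counters and rates, including the per-CPU Hyper-V VMBus channels
        vmstat -i

        # Full process tree, including kernel processes
        ps -auxwwd

        # Per-thread view with system processes, so kernel{hvevent0} shows up separately
        top -aSH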

        Steve

          rgijsen
          last edited by rgijsen

          This post is deleted!
            stephenw10 Netgate Administrator
            last edited by

            Yes, that's exactly what I would try: fail over to the secondary and see if the load exists there when it is master.

            Steve

              rgijsen
              last edited by

              As stated before, when I do a CARP failover to the second node, or even shut down the primary node, the second node shows practically no load at all. However, with the secondary node being active, the primary still shows that load on the hvevent process.
              So my next step, I guess, will be to mount the pfSense disk in a new VM with the same NIC config and see if anything's wrong from that perspective. If that fails, I can still try rebuilding the whole VM.
              Or are there any other clues?

                rgijsen
                last edited by rgijsen

                I had to delete my second post as it turns out I forgot to mask the IP addresses 😖

                I'll post it again below, but mind that it actually is a response to stephenw10's first reply.

                vmstat -i

                interrupt                          total       rate
                cpu0:timer                      18539648        160
                cpu1:timer                      14713803        127
                cpu2:timer                      15964083        138
                cpu3:timer                      15701489        136
                cpu0:hyperv                    478580870       4130
                cpu1:hyperv                     99633883        860
                cpu2:hyperv                     99907288        862
                cpu3:hyperv                     90973736        785
                Total                          834014800       7198
                

                ps -auxwwd

                USER    PID  %CPU %MEM    VSZ   RSS TT  STAT STARTED       TIME COMMAND
                root      0  88.8  0.1      0   800  -  DLs  Wed01    499:07.14 [kernel]
                root     11 292.9  0.0      0    64  -  RNL  Wed01   6830:15.44 - [idle]
                root     12   1.0  0.0      0   224  -  WL   Wed01      8:17.97 - [intr]
                root      1   0.0  0.1   5016   848  -  ILs  Wed01      0:00.01 - /sbin/init --
                www   32077  16.0  0.8  22016 11932  -  Ss   Wed01    124:37.03 |-- /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D
                root    344   0.0  1.7  94900 25024  -  Ss   Wed01      0:02.65 |-- php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
                root  10455   0.0  2.6  99124 39452  -  S    23:31      0:41.73 | |-- php-fpm: pool nginx (php-fpm)
                root  55168   0.0  2.6  99124 39256  -  S    23:34      0:41.27 | |-- php-fpm: pool nginx (php-fpm)
                root  98139   0.0  2.6  99124 39256  -  S    23:08      0:42.40 | `-- php-fpm: pool nginx (php-fpm)
                root    359   0.0  0.2   6756  2640  -  INs  Wed01      0:00.02 |-- /usr/local/sbin/check_reload_status
                root    361   0.0  0.2   6756  2396  -  IN   Wed01      0:00.00 | `-- check_reload_status: Monitoring daemon of check_reload_status
                root    414   0.0  0.3   9160  4984  -  Is   Wed01      0:00.06 |-- /sbin/devd -q -f /etc/pfSense-devd.conf
                root   7792   0.0  0.4  14460  6796  -  Is   Wed01      0:00.00 |-- /usr/local/sbin/mpd5 -b -d /var/etc/l2tp-vpn -p /var/run/l2tp-vpn.pid -s l2tps l2tps
                root   8312   0.0  0.2   6380  2376  -  Is   Wed01      0:00.18 |-- /usr/sbin/cron -s
                root  13350   0.0  0.4  12472  5948  -  Ss   Wed01      0:08.50 |-- /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.pid
                proxy 14438   0.0  0.1   6268  2212  -  Ss   Wed01      0:02.25 |-- /usr/sbin/ftp-proxy -a <masked ip> -v
                root  46084   0.0  0.5  21636  7396  -  Is   11:02      0:00.00 |-- nginx: master process /usr/local/sbin/nginx -c /var/etc/nginx-webConfigurator.conf (nginx)
                root  46259   0.0  0.6  23684  8360  -  S    11:02      0:14.40 | |-- nginx: worker process (nginx)
                root  46466   0.0  0.5  23684  8280  -  I    11:02      0:03.23 | `-- nginx: worker process (nginx)
                root  49364   0.0  0.1   6208  2220  -  Is   11:07      0:04.66 |-- /usr/sbin/hv_kvp_daemon
                root  49818   0.0  0.1   6196  2004  -  Is   11:07      0:00.00 |-- /usr/sbin/hv_vss_daemon
                root  51968   0.0  0.2   6976  2612  -  IN   00:01      0:00.00 |-- /bin/sh /etc/rc.update_pkg_metadata
                root  52083   0.0  0.1   4144  1824  -  INC  00:01      0:00.00 | `-- sleep 78926
                root  60998   0.0  0.2   8168  3736  -  Is   Wed01      0:00.00 |-- /usr/local/libexec/ipsec/starter --daemon charon
                root  61228   0.0  1.1  55900 17208  -  Is   Wed01      0:10.75 | `-- /usr/local/libexec/ipsec/charon --use-syslog
                root  73334   0.0  0.5  12676  7344  -  Is   11:03      0:00.00 |-- /usr/sbin/sshd
                root   7545   0.0  0.5  12968  7828  -  Ss   09:53      0:00.27 | `-- sshd: <user>@pts/0 (sshd)
                root  37777   0.0  0.2   6976  2640  0  Is   09:53      0:00.00 |   `-- /bin/sh /etc/rc.initial
                root  48222   0.0  0.2   7284  3484  0  S    09:53      0:00.03 |     `-- /bin/tcsh
                root   2424   0.0  0.2   6824  2672  0  R+   10:00      0:00.00 |       `-- ps -auxwwd
                root  77295   0.0  0.2   6908  2344  -  Is   Wed01      0:15.63 |-- /usr/local/bin/dpinger -S -r 0 -i <gateway> -B <masked ip> -p /var/run/dpinger_<gateway>~<masked ip>~<masked ip>.pid -u /var/run/dpinger_<gateway>~<masked ip>~<masked ip>.sock -C /etc/rc.gateway_alarm -d 1 -s 500 -l 2000 -t 60000 -A 1000 -D 500 -L 20 <masked ip>
                root  86712   0.0  0.4  10312  5692  -  Ss   Wed01    208:14.88 |-- /usr/local/sbin/openvpn --config /var/etc/openvpn/server1.conf
                root  88850   0.0  0.2   6968  2808  -  Ss   Wed01      0:52.38 |-- /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
                root  90697   0.0  0.2   6412  2476  -  Ss   Wed01      0:40.06 |-- /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -l /tmp/haproxy_chroot/var/run/log -P /var/run/syslog.pid -f /etc/syslog.conf
                root  43415   0.0  0.2   6976  2488  -  Is   10:59      0:00.00 | `-- /bin/sh /usr/local/sbin/sshguard -i /var/run/sshguard.pid
                root  43635   0.0  0.1   6196  1876  -  I    10:59      0:00.00 |   |-- /bin/cat
                root  43907   0.0  0.3  12016  5184  -  IC   10:59      0:00.00 |   |-- /usr/local/libexec/sshg-parser
                root  44184   0.0  0.2   6536  2372  -  IC   10:59      0:00.05 |   |-- /usr/local/libexec/sshg-blocker
                root  44469   0.0  0.2   6976  2488  -  I    10:59      0:00.00 |   `-- /bin/sh /usr/local/sbin/sshguard -i /var/run/sshguard.pid
                root  44518   0.0  0.2   6976  2476  -  I    10:59      0:00.00 |     `-- /bin/sh /usr/local/libexec/sshg-fw-pf
                root  91192   0.0  0.1   6192  1896  -  Is   Wed01      0:00.00 |-- /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/bin/ping_hosts.sh
                root  91342   0.0  0.1   6192  1912  -  I    Wed01      0:00.04 | `-- minicron: helper /usr/local/bin/ping_hosts.sh  (minicron)
                root  91357   0.0  0.1   6192  1896  -  Is   Wed01      0:00.00 |-- /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts
                root  91586   0.0  0.1   6192  1912  -  I    Wed01      0:00.00 | `-- minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts  (minicron)
                root  91798   0.0  0.1   6192  1896  -  Is   Wed01      0:00.00 |-- /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data
                root  92231   0.0  0.1   6192  1912  -  I    Wed01      0:00.00 | `-- minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data  (minicron)
                root  95054   0.0  1.9 241920 28616  -  Is   Wed01      0:13.11 |-- /usr/local/sbin/filterdns -p /var/run/filterdns.pid -i 300 -c /var/etc/filterdns.conf -d 1
                root  47601   0.0  0.2   6976  2536 v0- SN   11:02      0:15.75 |-- /bin/sh /var/db/rrd/updaterrd.sh
                root  91893   0.0  0.1   4144  1824  -  SNC  10:00      0:00.00 | `-- sleep 60
                root  69685   0.0  0.1   6316  2032 v0  Is+  11:26      0:00.00 |-- /usr/libexec/getty Pc ttyv0
                root  61754   0.0  0.1   6316  2032 v1  Is+  Wed01      0:00.00 |-- /usr/libexec/getty Pc ttyv1
                root  62014   0.0  0.1   6316  2032 v2  Is+  Wed01      0:00.00 |-- /usr/libexec/getty Pc ttyv2
                root  62204   0.0  0.1   6316  2032 v3  Is+  Wed01      0:00.00 |-- /usr/libexec/getty Pc ttyv3
                root  62552   0.0  0.1   6316  2032 v4  Is+  Wed01      0:00.00 |-- /usr/libexec/getty Pc ttyv4
                root  62819   0.0  0.1   6316  2032 v5  Is+  Wed01      0:00.00 |-- /usr/libexec/getty Pc ttyv5
                root  62865   0.0  0.1   6316  2032 v6  Is+  Wed01      0:00.00 |-- /usr/libexec/getty Pc ttyv6
                root  63124   0.0  0.1   6316  2032 v7  Is+  Wed01      0:00.00 `-- /usr/libexec/getty Pc ttyv7
                root      2   0.0  0.0      0    16  -  DL   Wed01      0:00.00 - [crypto]
                root      3   0.0  0.0      0    16  -  DL   Wed01      0:00.00 - [crypto returns 0]
                root      4   0.0  0.0      0    16  -  DL   Wed01      0:00.00 - [crypto returns 1]
                root      5   0.0  0.0      0    16  -  DL   Wed01      0:00.00 - [crypto returns 2]
                root      6   0.0  0.0      0    16  -  DL   Wed01      0:00.00 - [crypto returns 3]
                root      7   0.0  0.0      0    32  -  DL   Wed01      0:00.00 - [cam]
                root      8   0.0  0.0      0    16  -  DL   Wed01      0:00.02 - [soaiod1]
                root      9   0.0  0.0      0    16  -  DL   Wed01      0:00.02 - [soaiod2]
                root     10   0.0  0.0      0    16  -  DL   Wed01      0:00.00 - [audit]
                root     13   0.0  0.0      0    64  -  DL   Wed01      0:00.00 - [ng_queue]
                root     14   0.0  0.0      0    48  -  DL   Wed01      0:00.03 - [geom]
                root     15   0.0  0.0      0    16  -  DL   Wed01      0:00.00 - [sequencer 00]
                root     16   0.0  0.0      0    16  -  DL   Wed01      0:00.02 - [soaiod3]
                root     17   0.0  0.0      0    16  -  DL   Wed01      0:00.01 - [soaiod4]
                root     18   0.0  0.0      0    16  -  DL   Wed01      0:00.00 - [sctp_iterator]
                root     19   0.0  0.0      0    16  -  DL   Wed01      0:26.85 - [pf purge]
                root     20   0.0  0.0      0    16  -  DL   Wed01      0:40.00 - [rand_harvestq]
                root     21   0.0  0.0      0    48  -  DL   Wed01      0:02.03 - [pagedaemon]
                root     22   0.0  0.0      0    16  -  DL   Wed01      0:00.00 - [vmdaemon]
                root     23   0.0  0.0      0    16  -  DNL  Wed01      0:00.00 - [pagezero]
                root     24   0.0  0.0      0    32  -  DL   Wed01      0:02.41 - [bufdaemon]
                root     25   0.0  0.0      0    16  -  DL   Wed01      0:00.54 - [bufspacedaemon]
                root     26   0.0  0.0      0    16  -  DL   Wed01      0:04.46 - [syncer]
                root     27   0.0  0.0      0    16  -  DL   Wed01      0:00.60 - [vnlru]
                root     66   0.0  0.0      0    16  -  DL   Wed01      0:00.09 - [md0]
                

                No obvious culprits that I can see; just the kernel, and specifically hvevent0, being high.

                The only obvious difference between the two pfSense VMs is that the primary one runs on a dedicated HPE DL360 Gen8 which, even with the latest firmware and BIOS, has Current PTI status: Enabled, by means of 'auto detect'. The second node runs on a Gen10 with loads of other VMs, but that one has PTI disabled. That shouldn't make too much difference, especially as it's the Hyper-V services that are under load, rather than some actual BSD OS related stuff.
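                For the record, the PTI status the dashboard reports can also be read from the shell (a sketch; vm.pmap.pti is the FreeBSD sysctl behind the Meltdown mitigation on amd64):

                # 1 = page table isolation (Meltdown mitigation) enabled, 0 = disabled
                sysctl vm.pmap.pti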

                I'll try moving the passive box to the same host as the primary, and see what happens.

                  rgijsen
                  last edited by

                  I've had no success so far. In the past I had one VM with something of a corrupted configuration (from the hypervisor's perspective) which made it very slow to live-migrate to another host. In the end I removed that VM, created a new VM with the same disk / network config, and that issue was gone.
                  I did the same with this pfSense VM, but alas, no cigar. The next thing I'm going to try is moving this first node to a Gen10 host, like the second node. While pfSense shouldn't be that CPU intensive, who knows, there may be something of a hardware issue, even if virtualized. If THAT doesn't work, I guess I'll reinstall it from scratch and restore the config.
                  It'll take a while though; I can't do a live failover of pfSense unfortunately, as most connections go through haproxy, which doesn't sync states.

                    stephenw10 Netgate Administrator
                    last edited by

                    It could well be something like that. That load appears to be entirely hypervisor generated, so it's unclear what could be causing it.
                    https://www.freebsd.org/cgi/man.cgi?query=hv_vmbus&sektion=4
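                    A sketch of how to see what hangs off that bus (assuming the base-system devinfo tool is present, which it normally is on pfSense):

                    # Dump the device tree; the Hyper-V devices (hn network, storvsc storage, timers) attach under vmbus0
                    devinfo -v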

                    Steve

                      rgijsen
                      last edited by

                      Coming back to this, I have to come clean. Next weekend is my monthly maintenance, and to rule some things out I did a CARP failover to the second node tonight. At first, practically no CPU load on the second node. However, now that there's actually a lot of traffic going on, I see the same phenomenon on the second node as well. I guess I had run top -a on that second node, rather than top -aSH, which shows no kernel use at all. Now on the second node I have haproxy consuming about 9-11% and kernel{hvevent0} about 15%.
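                      In other words, the difference was just the top invocation (a sketch of the two, using the flags already mentioned above):

                      # Userland processes only - kernel{hvevent0} never shows up here
                      top -a

                      # Add system processes (-S) and a per-thread view (-H), which exposes kernel{hvevent0}
                      top -aSH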

                      Even if I wonder why there's so much kernel usage at a rather low overall load (about 10,000 states, that's peanuts), it seems to be a hypervisor integration issue, or maybe even just regular kernel load. I can't try another hypervisor or even a physical machine, as these pfSense boxes are in production. In my test setup I can't generate that amount of load, which is probably why I never saw the issue there.

                        stephenw10 Netgate Administrator
                        last edited by

                        Ah, OK, so when you fail over you are actually seeing about the same load on the other node?

                        It's probably just where the load from that traffic appears. Unless you're actually seeing an issue there I wouldn't worry too much.

                        Steve

                          rgijsen
                          last edited by

                          I'll leave it as it is. Thanks for the insights!
