2.4.5-p1 HA carp setup on Hyper-V, high hvevent0 CPU usage
-
We have an HA setup with CARP running on Hyper-V, and it has been running like that for years without any real issues. We are on 2.4.5-p1, on Server 2019 (core) with the Hyper-V role, some VLAN trunk ports, and functionally everything works well. I noticed though that the primary pfSense node consumes a lot of CPU in kernel{hvevent0}, top extract:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0K 64K RUN 1 513:56 87.41% [idle{idle: cpu1}]
11 root 155 ki31 0K 64K RUN 3 505:40 86.37% [idle{idle: cpu3}]
11 root 155 ki31 0K 64K CPU2 2 499:32 85.76% [idle{idle: cpu2}]
11 root 155 ki31 0K 64K CPU0 0 489:10 70.66% [idle{idle: cpu0}]
0 root -92 - 0K 800K - 0 73:35 26.15% [kernel{hvevent0}]
32077 www 27 0 22016K 11672K CPU1 1 37:05 12.46% /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cf
0 root -92 - 0K 800K - 2 21:36 8.74% [kernel{hvevent2}]
0 root -92 - 0K 800K - 3 20:04 7.87% [kernel{hvevent3}]
At the time of this copy/paste the CPU is about 26%, but I've seen it over 30% plenty of times as well. The second node (running on another Hyper-V box, of course) does not have that problem, even when it is made active and has work to do. Both hosts running these VMs are on the same patch level, KB4577668 as of now. As far as I'm aware nothing obvious has been changed in the pfSense VMs apart from the operational firewall rule changes we need from time to time. As per https://forum.netgate.com/topic/121407/solved-with-workaround-higher-cpu-load-after-upgrade-to-2-4-on-hyper-v/3 and https://redmine.pfsense.org/issues/6882 I did some checks, but neither VM has a CD drive attached at all, nor do they have SNMP enabled. I tried disabling the Hyper-V integration services from the Hyper-V side, powered down the primary pfSense VM and restarted it, and tried enabling and disabling SNMP just to make sure that wasn't the issue. No cigar.
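For reference, from inside the guest those two checks boil down to something like this (the grep is just the standard SNMP port):
camcontrol devlist          # no cd0 / optical device listed
sockstat -4l | grep :161    # nothing listening on the SNMP port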
How to troubleshoot this further?
-
I would check
vmstat -i
and
ps -auxwwd
to see if anything is showing high there. It may not though. That seems to be load from servicing the hv_vmbus.
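For example, from a shell (option 8 on the console menu); the sort/head bits are just a rough filter to float the busiest entries to the top:
vmstat -i | sort -rnk3 | head        # interrupt sources, highest rate column first
ps -auxwwd | sort -rnk3 | head       # processes, highest %CPU column first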
Steve
-
This post is deleted!
-
Yes, that's exactly what I would try; failover to the secondary and see if the load exists there when it is master.
Steve
-
As stated before, when I do a CARP failover to the second node, or even shut down the primary node, the second node shows practically no load at all. However, with the secondary node being active, the primary still shows that load on the hvevent process.
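For clarity, besides Status > CARP (failover) in the GUI, I'm verifying the CARP state on each node from the shell with a rough check like:
ifconfig | grep carp        # shows MASTER / BACKUP per CARP VIP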
So my next step, I guess, will be to mount the pfSense disk in a new VM with the same NIC config and see if anything is wrong from that perspective. If that fails, I can still try rebuilding the whole VM.
Or would there be any other clues?
-
I had to delete my second post, as it turns out I forgot to mask the IP addresses.
I'll post it again below, but keep in mind that it is actually a response to stephenw10's first reply.
vmstat -i
interrupt                          total       rate
cpu0:timer                      18539648        160
cpu1:timer                      14713803        127
cpu2:timer                      15964083        138
cpu3:timer                      15701489        136
cpu0:hyperv                    478580870       4130
cpu1:hyperv                     99633883        860
cpu2:hyperv                     99907288        862
cpu3:hyperv                     90973736        785
Total                          834014800       7198
ps -auxwwd
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 0 88.8 0.1 0 800 - DLs Wed01 499:07.14 [kernel]
root 11 292.9 0.0 0 64 - RNL Wed01 6830:15.44 - [idle]
root 12 1.0 0.0 0 224 - WL Wed01 8:17.97 - [intr]
root 1 0.0 0.1 5016 848 - ILs Wed01 0:00.01 - /sbin/init --
www 32077 16.0 0.8 22016 11932 - Ss Wed01 124:37.03 |-- /usr/local/sbin/haproxy -f /var/etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D
root 344 0.0 1.7 94900 25024 - Ss Wed01 0:02.65 |-- php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
root 10455 0.0 2.6 99124 39452 - S 23:31 0:41.73 | |-- php-fpm: pool nginx (php-fpm)
root 55168 0.0 2.6 99124 39256 - S 23:34 0:41.27 | |-- php-fpm: pool nginx (php-fpm)
root 98139 0.0 2.6 99124 39256 - S 23:08 0:42.40 | `-- php-fpm: pool nginx (php-fpm)
root 359 0.0 0.2 6756 2640 - INs Wed01 0:00.02 |-- /usr/local/sbin/check_reload_status
root 361 0.0 0.2 6756 2396 - IN Wed01 0:00.00 | `-- check_reload_status: Monitoring daemon of check_reload_status
root 414 0.0 0.3 9160 4984 - Is Wed01 0:00.06 |-- /sbin/devd -q -f /etc/pfSense-devd.conf
root 7792 0.0 0.4 14460 6796 - Is Wed01 0:00.00 |-- /usr/local/sbin/mpd5 -b -d /var/etc/l2tp-vpn -p /var/run/l2tp-vpn.pid -s l2tps l2tps
root 8312 0.0 0.2 6380 2376 - Is Wed01 0:00.18 |-- /usr/sbin/cron -s
root 13350 0.0 0.4 12472 5948 - Ss Wed01 0:08.50 |-- /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.pid
proxy 14438 0.0 0.1 6268 2212 - Ss Wed01 0:02.25 |-- /usr/sbin/ftp-proxy -a <masked ip> -v
root 46084 0.0 0.5 21636 7396 - Is 11:02 0:00.00 |-- nginx: master process /usr/local/sbin/nginx -c /var/etc/nginx-webConfigurator.conf (nginx)
root 46259 0.0 0.6 23684 8360 - S 11:02 0:14.40 | |-- nginx: worker process (nginx)
root 46466 0.0 0.5 23684 8280 - I 11:02 0:03.23 | `-- nginx: worker process (nginx)
root 49364 0.0 0.1 6208 2220 - Is 11:07 0:04.66 |-- /usr/sbin/hv_kvp_daemon
root 49818 0.0 0.1 6196 2004 - Is 11:07 0:00.00 |-- /usr/sbin/hv_vss_daemon
root 51968 0.0 0.2 6976 2612 - IN 00:01 0:00.00 |-- /bin/sh /etc/rc.update_pkg_metadata
root 52083 0.0 0.1 4144 1824 - INC 00:01 0:00.00 | `-- sleep 78926
root 60998 0.0 0.2 8168 3736 - Is Wed01 0:00.00 |-- /usr/local/libexec/ipsec/starter --daemon charon
root 61228 0.0 1.1 55900 17208 - Is Wed01 0:10.75 | `-- /usr/local/libexec/ipsec/charon --use-syslog
root 73334 0.0 0.5 12676 7344 - Is 11:03 0:00.00 |-- /usr/sbin/sshd
root 7545 0.0 0.5 12968 7828 - Ss 09:53 0:00.27 | `-- sshd: <user>@pts/0 (sshd)
root 37777 0.0 0.2 6976 2640 0 Is 09:53 0:00.00 | `-- /bin/sh /etc/rc.initial
root 48222 0.0 0.2 7284 3484 0 S 09:53 0:00.03 | `-- /bin/tcsh
root 2424 0.0 0.2 6824 2672 0 R+ 10:00 0:00.00 | `-- ps -auxwwd
root 77295 0.0 0.2 6908 2344 - Is Wed01 0:15.63 |-- /usr/local/bin/dpinger -S -r 0 -i <gateway> -B <masked ip> -p /var/run/dpinger_<gateway>~<masked ip>~<masked ip>.pid -u /var/run/dpinger_<gateway>~<masked ip>~<masked ip>.sock -C /etc/rc.gateway_alarm -d 1 -s 500 -l 2000 -t 60000 -A 1000 -D 500 -L 20 <masked ip>
root 86712 0.0 0.4 10312 5692 - Ss Wed01 208:14.88 |-- /usr/local/sbin/openvpn --config /var/etc/openvpn/server1.conf
root 88850 0.0 0.2 6968 2808 - Ss Wed01 0:52.38 |-- /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
root 90697 0.0 0.2 6412 2476 - Ss Wed01 0:40.06 |-- /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -l /tmp/haproxy_chroot/var/run/log -P /var/run/syslog.pid -f /etc/syslog.conf
root 43415 0.0 0.2 6976 2488 - Is 10:59 0:00.00 | `-- /bin/sh /usr/local/sbin/sshguard -i /var/run/sshguard.pid
root 43635 0.0 0.1 6196 1876 - I 10:59 0:00.00 | |-- /bin/cat
root 43907 0.0 0.3 12016 5184 - IC 10:59 0:00.00 | |-- /usr/local/libexec/sshg-parser
root 44184 0.0 0.2 6536 2372 - IC 10:59 0:00.05 | |-- /usr/local/libexec/sshg-blocker
root 44469 0.0 0.2 6976 2488 - I 10:59 0:00.00 | `-- /bin/sh /usr/local/sbin/sshguard -i /var/run/sshguard.pid
root 44518 0.0 0.2 6976 2476 - I 10:59 0:00.00 | `-- /bin/sh /usr/local/libexec/sshg-fw-pf
root 91192 0.0 0.1 6192 1896 - Is Wed01 0:00.00 |-- /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/bin/ping_hosts.sh
root 91342 0.0 0.1 6192 1912 - I Wed01 0:00.04 | `-- minicron: helper /usr/local/bin/ping_hosts.sh (minicron)
root 91357 0.0 0.1 6192 1896 - Is Wed01 0:00.00 |-- /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts
root 91586 0.0 0.1 6192 1912 - I Wed01 0:00.00 | `-- minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts (minicron)
root 91798 0.0 0.1 6192 1896 - Is Wed01 0:00.00 |-- /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data
root 92231 0.0 0.1 6192 1912 - I Wed01 0:00.00 | `-- minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data (minicron)
root 95054 0.0 1.9 241920 28616 - Is Wed01 0:13.11 |-- /usr/local/sbin/filterdns -p /var/run/filterdns.pid -i 300 -c /var/etc/filterdns.conf -d 1
root 47601 0.0 0.2 6976 2536 v0- SN 11:02 0:15.75 |-- /bin/sh /var/db/rrd/updaterrd.sh
root 91893 0.0 0.1 4144 1824 - SNC 10:00 0:00.00 | `-- sleep 60
root 69685 0.0 0.1 6316 2032 v0 Is+ 11:26 0:00.00 |-- /usr/libexec/getty Pc ttyv0
root 61754 0.0 0.1 6316 2032 v1 Is+ Wed01 0:00.00 |-- /usr/libexec/getty Pc ttyv1
root 62014 0.0 0.1 6316 2032 v2 Is+ Wed01 0:00.00 |-- /usr/libexec/getty Pc ttyv2
root 62204 0.0 0.1 6316 2032 v3 Is+ Wed01 0:00.00 |-- /usr/libexec/getty Pc ttyv3
root 62552 0.0 0.1 6316 2032 v4 Is+ Wed01 0:00.00 |-- /usr/libexec/getty Pc ttyv4
root 62819 0.0 0.1 6316 2032 v5 Is+ Wed01 0:00.00 |-- /usr/libexec/getty Pc ttyv5
root 62865 0.0 0.1 6316 2032 v6 Is+ Wed01 0:00.00 |-- /usr/libexec/getty Pc ttyv6
root 63124 0.0 0.1 6316 2032 v7 Is+ Wed01 0:00.00 `-- /usr/libexec/getty Pc ttyv7
root 2 0.0 0.0 0 16 - DL Wed01 0:00.00 - [crypto]
root 3 0.0 0.0 0 16 - DL Wed01 0:00.00 - [crypto returns 0]
root 4 0.0 0.0 0 16 - DL Wed01 0:00.00 - [crypto returns 1]
root 5 0.0 0.0 0 16 - DL Wed01 0:00.00 - [crypto returns 2]
root 6 0.0 0.0 0 16 - DL Wed01 0:00.00 - [crypto returns 3]
root 7 0.0 0.0 0 32 - DL Wed01 0:00.00 - [cam]
root 8 0.0 0.0 0 16 - DL Wed01 0:00.02 - [soaiod1]
root 9 0.0 0.0 0 16 - DL Wed01 0:00.02 - [soaiod2]
root 10 0.0 0.0 0 16 - DL Wed01 0:00.00 - [audit]
root 13 0.0 0.0 0 64 - DL Wed01 0:00.00 - [ng_queue]
root 14 0.0 0.0 0 48 - DL Wed01 0:00.03 - [geom]
root 15 0.0 0.0 0 16 - DL Wed01 0:00.00 - [sequencer 00]
root 16 0.0 0.0 0 16 - DL Wed01 0:00.02 - [soaiod3]
root 17 0.0 0.0 0 16 - DL Wed01 0:00.01 - [soaiod4]
root 18 0.0 0.0 0 16 - DL Wed01 0:00.00 - [sctp_iterator]
root 19 0.0 0.0 0 16 - DL Wed01 0:26.85 - [pf purge]
root 20 0.0 0.0 0 16 - DL Wed01 0:40.00 - [rand_harvestq]
root 21 0.0 0.0 0 48 - DL Wed01 0:02.03 - [pagedaemon]
root 22 0.0 0.0 0 16 - DL Wed01 0:00.00 - [vmdaemon]
root 23 0.0 0.0 0 16 - DNL Wed01 0:00.00 - [pagezero]
root 24 0.0 0.0 0 32 - DL Wed01 0:02.41 - [bufdaemon]
root 25 0.0 0.0 0 16 - DL Wed01 0:00.54 - [bufspacedaemon]
root 26 0.0 0.0 0 16 - DL Wed01 0:04.46 - [syncer]
root 27 0.0 0.0 0 16 - DL Wed01 0:00.60 - [vnlru]
root 66 0.0 0.0 0 16 - DL Wed01 0:00.09 - [md0]
No obvious culprits to me, just kernel and specifically hvevent0 being high.
The only obvious difference between the two pfSense VMs is that the primary one runs on a dedicated HPE DL360 Gen8 which, even with the latest firmware and BIOS, has "Current PTI status: Enabled" by means of 'auto detect'. The second node runs on a Gen10 with loads of other VMs, but that one has PTI disabled. That shouldn't make too much of a difference, especially as it's the Hyper-V services that are taking the load, rather than some actual BSD OS related stuff.
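If it matters, the guest-side mitigation state on the two VMs can also be compared directly, assuming these sysctls are exposed in this pfSense/FreeBSD version:
sysctl vm.pmap.pti          # page table isolation inside the guest kernel (1 = enabled)
sysctl hw.ibrs_active       # IBRS mitigation state inside the guest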
I'll try moving the passive box to the same host as the primary, and see what happens.
-
I've had no success so far. In the past I had one VM with something of a corrupted configuration (from the hypervisor's perspective), which made it very slow to live-migrate to another host. In the end I removed that VM, created a new VM with the same disk / network config, and that issue was gone.
I did the same with this pfSense VM, but alas, no cigar. The next thing I'm going to try is moving this first node to a Gen10 host, like the second node. While pfSense shouldn't be that CPU intensive, who knows, there may be some kind of hardware issue, even if virtualized. If THAT doesn't work, I guess I'll reinstall it from scratch and restore the config.
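Pulling the config off beforehand is trivial at least (path as on a stock install, destination just an example, assuming SSH is enabled):
scp root@<primary-node>:/conf/config.xml ./config-primary.xml
Restoring it afterwards would go through Diagnostics > Backup & Restore.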
It'll take a while though; I can't do a live failover of pfSense unfortunately, as most connections go through haproxy, which doesn't sync its state.
-
It could well be something like that. That load appears to be all hypervisor generated; it's unclear what could be causing it.
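A rough way to see whether that load tracks the vmbus interrupt rate is to watch it live for a minute while there is traffic, e.g.:
systat -vmstat 1            # live interrupt rates; the hyperv sources should show up there
or just run vmstat -i twice a few seconds apart and compare the hyperv counters.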
https://www.freebsd.org/cgi/man.cgi?query=hv_vmbus&sektion=4
Steve
-
Coming back to this, I have to come clean. Next weekend is my monthly maintenance window, and to rule some things out I did a CARP failover to the second node tonight. Right after the failover there was practically no CPU load on the second node. However, now that there is actually a lot of traffic going through it, I see the same phenomenon on the second node as well. I suspect I had earlier run top -a on that second node rather than top -aSH, which showed no kernel usage at all. Now, on the second node, haproxy is consuming about 9-11% and kernel{hvevent0} about 15%.
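For reference, the difference that tripped me up, as I understand the flags:
top -a         # full command lines, but system/kernel processes are hidden by default
top -aSH       # -S adds system processes, -H lists threads individually (which is where kernel{hvevent0} shows up)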
Even though I wonder why there is so much kernel usage at a rather low overall load (about 10,000 states, which is peanuts), it seems to be a hypervisor integration issue, or maybe even just regular kernel load. I can't try another hypervisor or even a physical machine, as these pfSense boxes are in production. In my test setup I can't generate that amount of load, which is probably why I never saw the issue there.
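The state count I'm quoting is just what pf reports, e.g.:
pfctl -si | grep -i 'current entries'    # current state table entries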
-
Ah, OK, so when you fail over you are actually seeing about the same load on the other node?
It's probably just where the load for that shows up. Unless you're actually seeing a problem there, I wouldn't worry too much.
Steve
-
I'll leave it as it is. Thanks for the insights!