Netgate Discussion Forum

    pfSense 2.7.2 in Hyper-V freezing with no crash report after reboot

    Virtualization
    • BismarckB
      Bismarck @Techniker_ctr
      last edited by Bismarck

      @Techniker_ctr

      Hi Techniker_ctr,

      Unfortunately yesterday it happened again, but it occurs way less than before.

      To be honest, I'm out of ideas. Do you use the system patches and apply the security fixes?

      This system was running fine for 1 to 1.5 years without any changes. The issue started around January, if I remember right, and only one of three firewalls is affected.

      Host: Windows Server 2022, Drivers are all up to date (2H24), nothing suspicious in the logs.

      I saw your FreeBSD forum thread; I have the same issue. I was considering testing the alternate "sense", but a different hypervisor seems like a better solution.

      Btw, did you try to run it as a Gen 1 VM?

      • T
        Techniker_ctr @Bismarck
        last edited by

        Hi @Bismarck ,

        We tried Gen1 as well as Gen2; it happens on both setups. The recommended security patches are applied on all systems via the patches add-on. The crazy thing is that some of our 2.7.2 pfSenses have been running fine on this version for nearly a year, but most of the firewalls we installed or updated to 2.7 were crashing. Sometimes after a crash and reboot they crash a second time shortly after, and on some occasions even a third time.
        We're also out of ideas, and as management doesn't want us to set up Proxmox hosts just for the ~300 firewalls, we might try some other firewall systems in the near future. We don't know what else we can do for now, as we're unable to replicate it in a way that someone at pfSense/OPNsense or FreeBSD could reproduce.

        • BismarckB
          Bismarck @Techniker_ctr
          last edited by Bismarck

          @Techniker_ctr

          Regarding your installations: are they UFS or ZFS? Could you check top for an ARC entry (e.g., "ARC: 2048B Total, 2048B Header")?

          I have three UFS pfSense instances (one bare metal, two VMs), and only one VM exhibits this issue. It displays an ARC entry in top and has zfs.ko loaded, despite being UFS.

          Unloading zfs.ko temporarily removes the ARC entry, but it reappears shortly after. This behavior also occurred during OPNsense testing.

          I've unloaded and renamed zfs.ko to observe the outcome.
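If anyone wants to run the same check, here is a quick sketch. The FreeBSD commands appear as comments; `has_arc` is just an illustrative helper for filtering a `top` snapshot, not an existing tool:

```shell
# has_arc: read a `top` batch snapshot on stdin; print 1 if an ARC
# summary line is present, else 0.
has_arc() {
  if grep -q '^ARC:'; then echo 1; else echo 0; fi
}

# On the VM itself (FreeBSD):
#   kldstat | grep zfs      # is zfs.ko loaded despite UFS?
#   top -b | has_arc        # does top report an ARC line?
#   kldunload zfs           # unload it (only if nothing uses ZFS)

# The ARC line on my affected UFS VM looks like this; prints 1
printf 'ARC: 2048B Total, 2048B Header\n' | has_arc
```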

          HyperV CPU hvevent goes to 100%
          Should a UFS machine have an ARC entry in top?
          looking for the reason for sudden very high CPU utilization on network/zfs related process

          "However this is not what is observed. What was observed is consistent with busy waiting. The storagenode process was constantly in the “uwait” process state. As if instead of pulling back, moderating bandwidth, sleeping more, storagenode instead loops at a very fast rate on system calls that fail or return a retry error, or just polls on some event or mutex, thereby consuming high CPU while effectively doing nothing (which is also observed: lower performance, lower overall IO, more missed uploads and downloads)."

          Sound familiar?

          /edit
          Quick recap: zfs.ko loads when something in the kernel triggers it, such as ZFS storage access. Is your storage dynamic or fixed? This instance experiences the most reads/writes; just a guess.

          • M
            maitops @Bismarck
            last edited by

            @Bismarck Hi, I'm the OPNsense user from the FreeBSD topic and the OPNsense forum thread.

            I performed a fresh install of OPNsense 25.1.4 using UFS on a fixed-size VHDX. I only imported the configuration that I use in production.

            After about 12 hours during a high-load period in the morning the issue reappeared.

            root@proxy01:~ # top -aHSTn
            last pid: 42090;  load averages:  3.50,  1.88,  0.94  up 0+14:51:36    12:33:11
            252 threads:   13 running, 226 sleeping, 13 waiting
            CPU:  1.5% user,  0.0% nice,  0.4% system,  0.0% interrupt, 98.1% idle
            Mem: 135M Active, 1875M Inact, 1813M Wired, 1100M Buf, 7823M Free
            Swap: 8192M Total, 8192M Free
            
               THR USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
            100004 root        187 ki31     0B   128K RUN      1 872:01 100.00% [idle{idle: cpu1}]
            100009 root        187 ki31     0B   128K CPU6     6 871:57 100.00% [idle{idle: cpu6}]
            100003 root        187 ki31     0B   128K CPU0     0 871:36 100.00% [idle{idle: cpu0}]
            100007 root        187 ki31     0B   128K CPU4     4 871:35 100.00% [idle{idle: cpu4}]
            100005 root        187 ki31     0B   128K CPU2     2 871:33 100.00% [idle{idle: cpu2}]
            100116 root        -64    -     0B  1472K CPU7     7   4:51 100.00% [kernel{hvevent7}]
            100008 root        187 ki31     0B   128K RUN      5 871:22  98.97% [idle{idle: cpu5}]
            100006 root        187 ki31     0B   128K CPU3     3 871:28  96.97% [idle{idle: cpu3}]
            101332 www          21    0   537M    59M kqread   2  11:31   3.96% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
            100902 www          21    0   537M    59M kqread   4  12:29   1.95% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
            101333 www          23    0   537M    59M kqread   5  11:17   1.95% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
            101334 www          21    0   537M    59M CPU1     1  10:57   1.95% /usr/local/sbin/haproxy -q -f /usr/local/etc/haproxy.conf -p /var/run/haproxy.pid{haproxy}
            100010 root        187 ki31     0B   128K RUN      7 867:24   0.00% [idle{idle: cpu7}]
            100805 root         20    0    81M    52M nanslp   5   3:09   0.00% /usr/local/bin/php /usr/local/opnsense/scripts/routes/gateway_watcher.php interface routes alarm
            100102 root        -64    -     0B  1472K -        0   1:14   0.00% [kernel{hvevent0}]
            100104 root        -64    -     0B  1472K -        1   0:59   0.00% [kernel{hvevent1}]
            100114 root        -64    -     0B  1472K -        6   0:56   0.00% [kernel{hvevent6}]
            100106 root        -64    -     0B  1472K -        2   0:56   0.00% [kernel{hvevent2}]
            
            root@proxy01:~ # kldstat | grep zfs
            root@proxy01:~ # top | grep arc
            

            As you can see, hvevent7 is at 100% CPU. ZFS is not loaded, and there's no ARC entry visible in the top output.

            I also have another router configured as a CARP secondary with the same configuration. If the primary goes down, the secondary experiences the same issue. However, on three other routers (also using CARP but with different usage patterns), I’ve never encountered this problem. Those routers have different loads and usage scenarios.

            About a year ago, we were using pfSense and had the exact same issue, which led us to switch to OPNsense. We didn’t encounter the issue again—until the FreeBSD version was updated. If I downgrade to OPNsense 24.1 (based on FreeBSD 13), the problem does not occur.

            • stephenw10S
              stephenw10 Netgate Administrator
              last edited by

              Are you able to confirm if it still happens in FreeBSD 15? So in Plus 24.03 or later or 2.8-beta?

              • BismarckB
                Bismarck @maitops
                last edited by

                @maitops Yeah, it's not ZFS or dynamic disk related. I had two incidents this week, on Tuesday and Thursday afternoons.

                While WAN, LAN, and OpenVPN connectivity remain partially functional, OPT1/OPT2/IPSec become unreachable at that time.

                In my opinion this issue is not just an isolated Hyper-V problem. I've had the same on a bare-metal setup twice over a period of 5 months since upgrading to 2.7(.2); with Hyper-V it happens on a weekly basis.

                Things I've tried this week to mitigate the issue, based on my research into similar problems related to pfSense/OPNsense/FreeBSD 14:

                Host

                • disabled NUMA Spanning
                • set the power plan to High performance
                • disabled guest time sync integration

                VM

                • created a loader.conf.local with following lines:
                  net.isr.bindthreads="1"
                  net.isr.maxthreads="-1"
                  hw.hvtimesync.sample_thresh="-1"
                  hw.hvtimesync.ignore_sync="1"
                • System Tunables
                  778904ad-a655-48b5-8c86-1bb3d933bfa1-image.png
                  Disclaimer: some of those settings are pure guesswork, as desperation kicks in. 😅
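Collected as an actual file, the /boot/loader.conf.local above would look like this. The comments are my own reading of these tunables, so take them with a grain of salt:

```shell
# /boot/loader.conf.local -- boot-time tunables tried above
# (descriptions are my interpretation, not authoritative)

# Bind netisr worker threads to CPUs instead of letting them float
net.isr.bindthreads="1"
# -1 = allow one netisr thread per CPU
net.isr.maxthreads="-1"
# Effectively disable Hyper-V host-to-guest clock sample adjustments
hw.hvtimesync.sample_thresh="-1"
# Ignore the host's time-sync samples entirely
hw.hvtimesync.ignore_sync="1"
```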
                • M
                  maitops @stephenw10
                  last edited by

                  @stephenw10 Sadly I can't test on FreeBSD 15 easily; I no longer have the pfSense config, just the OPNsense one.

                  • M
                    maitops @Bismarck
                    last edited by

                    @Bismarck Do you use the latest pfSense version with FreeBSD 15?
                    Do all the VMs that fail have the HAProxy service running?

                    Couldn't the unavailability of some services be due to the core used by hvevent being flooded? So if you're unlucky, even the GUI is down.
                    If it's not a Hyper-V issue, what causes a core to run at 100%? For me the issue appears on OPNsense mostly weekly: sometimes within 12 hours, sometimes nothing for 15 days. It seems to be linked with high load on HAProxy, because my OPNsense router just serves as an HAProxy host.

                    • BismarckB
                      Bismarck @maitops
                      last edited by Bismarck

                      @maitops I'm running pfSense 2.7.2 (FreeBSD 14) with HAProxy as the only VM. Load wasn't high during the events; one even occurred at 3 AM with zero HAProxy load. So the issue may not be HAProxy itself, but kernel resource exhaustion over time?

                      I also have another router configured as a CARP secondary with the same configuration. If the primary goes down, the secondary experiences the same issue.

                      I found this sentence very interesting. Why is that? Maybe that's a starting point?

                      • M
                        maitops @Bismarck
                        last edited by maitops

                        @Bismarck I will provide more context.

                        I made a cron script that detects if the hvevent issue is triggering and forces the router into CARP maintenance mode, so the secondary is supposed to take over when the hvevent occurs. Once, the cron worked at 3 am, and at 6 am the second router triggered the hvevent issue too. So the second router probably didn't have exhaustion over time; it didn't take much traffic during those 3 hours in the night.
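A minimal sketch of such a cron check, with these assumptions: it parses `top -bHS` batch output in the format posted earlier in this thread, `detect_storm` is my own helper name, and the API call is left as a commented placeholder rather than a documented endpoint:

```shell
#!/bin/sh
# detect_storm: read a `top -bHS` snapshot on stdin and print 1 if any
# kernel{hvevent*} thread is at or above the WCPU threshold, else 0.
detect_storm() {
  awk -v thr="$1" '
    index($0, "kernel{hvevent") {
      wcpu = $(NF - 1)          # second-to-last field, e.g. "100.00%"
      gsub(/%/, "", wcpu)
      if (wcpu + 0 >= thr) found = 1
    }
    END { print found + 0 }
  '
}

# Example with the thread line posted above; prints 1
line='100116 root -64 - 0B 1472K CPU7 7 4:51 100.00% [kernel{hvevent7}]'
printf '%s\n' "$line" | detect_storm 90

# In cron, something like (CARP maintenance URL/key are placeholders):
#   top -bHS | detect_storm 90 | grep -q 1 && \
#     curl -s -u "$API_KEY:$API_SECRET" -X POST "$CARP_MAINT_URL"
```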

                        Btw, CARP maintenance mode can fail to release some VIPs when the hvevent issue occurs. I trigger CARP maintenance mode with the OPNsense web API (it probably works the same on pfSense).

                        The VMs do not run on the same host, but all hosts are Windows Server 2022 Hyper-V on AMD EPYC Genoa CPUs.

                        • BismarckB
                          Bismarck @maitops
                          last edited by

                          @maitops Thanks for the detailed explanation.

                          No hvevent storm here for 6 days and 23.5 hours since my last update, but it probably needs at least 20 days or more to be significant.

                          Theory: Server 2022 Hyper-V power management or network driver changes may be incompatible with some FreeBSD kernel components, causing issues under certain conditions. Windows and Debian guests in Hyper-V Manager display more detailed information (e.g., RAM usage) than FreeBSD 14 guests. It's interesting that the MS Hyper-V FreeBSD guest compatibility list only goes up to FreeBSD 13 and Server 2019, where pfSense runs just fine.

                          • J
                            jacolex @Bismarck
                            last edited by jacolex

                            Hello, I'm struggling with a similar case: pfSense with 6 interfaces, where hosts suddenly lose connection from/to the pfSense gateway, which means disruption of web services. It used to happen once a month, but last week it happened 3 times. Only a restart helps. Today I switched to UFS. If that doesn't resolve the issue, I'll try disabling pfBlocker to achieve minimal resource consumption. I wonder whether pfSense 2.8.0 on FreeBSD 15 would be more stable or worse.

                            • M
                              maitops @Bismarck
                              last edited by

                              @Bismarck Hi,

                              Is the system still running fine?

                              • BismarckB
                                Bismarck @maitops
                                last edited by

                                @maitops

                                Yes, no problems so far.

                                • J
                                  jacolex @maitops
                                  last edited by jacolex

                                  @maitops Yes, for 10 days now. I also disabled hn ALTQ support (no clue if it's necessary). Observing kernel hvevents; no issues. But I'll have to wait 2-3 months to say that it's stable.

                                  • J
                                    jacolex @maitops
                                    last edited by

                                    @maitops Unfortunately, this morning we encountered network outages and the firewall needed to be restarted.

                                    • stephenw10S
                                      stephenw10 Netgate Administrator @Bismarck
                                      last edited by

                                      @Bismarck said in pfSense 2.7.2 in Hyper-V freezing with no crash report after reboot:

                                      No hvevent storm here for 6 days and 23.5 hours since my last update,

                                      So this was setting the power management to 'high power' in Hyper-V? Which presumably disables throttling down the VM in some way.

                                      • BismarckB
                                        Bismarck @stephenw10
                                        last edited by

                                        @stephenw10

                                        Yes, or just disabling the power management/green features of the NIC should be enough; this is how it is right now on my Hyper-V host. There was a message (not an error) in the Windows event logs about switching states or so during the hvevent storm.

                                          • BismarckB
                                            Bismarck
                                            last edited by Bismarck

                                            Quick update:

                                            c941ae86-6ebb-40bb-a5e5-7ce0c05a60ce-image.png

                                             Had to reboot once because of updates, but since then it's been rock solid with no incidents. Enabled all the extra IP lists, Suricata, and so on again.

                                             @maitops Just disable all energy-saving features of the NIC, or select the High performance power profile in Windows for a test.

                                             It must be either the power-state switching or the system tunables from my last update post.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.