24.03 crashing (again)
-
@stephenw10 said in 24.03 crashing (again):
tcp_m_copym
Ah Ok this looks exactly like this: https://redmine.pfsense.org/issues/15457
There is a new HAProxy package available in 24.03 that has that fixed. It now uses HAProxy 2.9.7.
-
I'm on the latest HAProxy package (IIRC there was an update a few days ago).
One thing I found interesting is that after the crash and reboot, two site-to-site OpenVPN connections refused to come up, even after restarting the service. They only came up after a reboot.
-
Check the HAProxy stats page. Make sure you're really running 2.9.7. That panic looks exactly like the one triggered by HAProxy 2.9.1.
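If you want to pull just the version number out of that banner, something like this should work. It's only a rough sketch: the helper name haproxy_version is made up, and it assumes the banner contains the text "HAProxy version" followed by the version number, as the stats page header does.

```shell
# Hypothetical helper: extract the bare version number from an HAProxy banner
# line read on stdin (for example the header text of the stats page).
haproxy_version() {
    sed -n 's/.*HAProxy version \([0-9][0-9.]*\).*/\1/p'
}

# Example with a banner in the form the stats page shows:
echo 'HAProxy version 2.9.7-5742051, released 2024/04/05' | haproxy_version
```

Anything other than 2.9.7 there would mean the package upgrade did not actually replace the running binary.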
-
@stephenw10 Correct, it did not.
-
@stephenw10
Yep, stats page shows: HAProxy version 2.9.7-5742051, released 2024/04/05
-
Hmm, so that latest crash was with HAProxy 2.9.7 installed?
-
Yep, it occurred last night.
-
-
Hmm, same crash again. So it appears that panic is unrelated to the known issue with HAProxy using 100% CPU, which should now be fixed.
-
@stephenw10
Yes, I can confirm. HAProxy 2.9.7 never uses 100% CPU.
But...
I have another pfSense Plus with HAProxy 2.9.7 and very little traffic, almost nothing. That pfSense has never had any problems.
So the problem seems to be related not only to the presence of HAProxy 2.9.7 but also to the traffic through it, or how it is used.
Or...
is there a correlation between HAProxy 2.9.7 and the VM's virtual CPU? In my case both VMs run on Proxmox 8.2.2 (same version of QEMU, identical).
On the version that has NEVER given problems (and is very low traffic):
And on the version that has crashes:
-
Hmm, possibly some new instruction that HAProxy is using (or trying to use)?
If it was that, I'd expect to see it in some crypto operation, but the backtrace doesn't look like that; it's in the network stack.
Is there any difference in the network config of those VMs?
-
Mmm, no.
I just checked, and the configuration of the two network interfaces going to pfSense is completely identical.
Two virtio NICs with the same configuration, running on the same version of QEMU.
-
Hmm, OK the next step here is probably to enable a full kernel core dump and wait for it to happen again.
Do you have SWAP enabled on that VM? How much?
-
OK, if it helps, I'm available to install a new kernel configured for debugging on the production HAProxy server, so we get a core dump when it crashes.
-
Ok the first step is to enable a full core dump. Edit the file /etc/pfSense-ddb.conf and add a new kdb.enter.default script line like:
# $FreeBSD$
#
# This file is read when going to multi-user and its contents piped thru
# ``ddb'' to define debugging scripts.
#
# see ``man 4 ddb'' and ``man 8 ddb'' for details.
#
script lockinfo=show locks; show alllocks; show lockedvnods
script pfs=bt ; show registers ; show pcpu ; run lockinfo ; acttrace ; ps ; alltrace
# kdb.enter.panic panic(9) was called.
# script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset
script kdb.enter.default=bt ; show registers ; dump ; reset
# kdb.enter.witness witness(4) detected a locking error.
script kdb.enter.witness=run lockinfo
So there I commented out the old line and added:
script kdb.enter.default=bt ; show registers ; dump ; reset
Now reboot as that is only read in at boot.
Then check it's present at the CLI with:
[24.08-DEVELOPMENT][root@7100.stevew.lan]/root: sysctl debug.ddb.scripting.scripts
debug.ddb.scripting.scripts: lockinfo=show locks; show alllocks; show lockedvnods
pfs=bt ; show registers ; show pcpu ; run lockinfo ; acttrace ; ps ; alltrace
kdb.enter.default=bt ; show registers ; dump ; reset
kdb.enter.witness=run lockinfo
It will now dump the full vmcore after a panic.
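If you'd rather not eyeball that output, here is a small sketch that checks it for you. The function name ddb_has_full_dump is made up for illustration. It assumes the sysctl output has the form shown above and treats the script as "full dump enabled" only when a standalone dump command appears in kdb.enter.default, so the old "textdump dump" line does not count.

```shell
# Hypothetical helper: read `sysctl debug.ddb.scripting.scripts` output on
# stdin and succeed only if kdb.enter.default contains a standalone "dump".
ddb_has_full_dump() {
    awk '
        {
            key = "kdb.enter.default="
            i = index($0, key)
            if (i == 0) next
            # Split the script body into its ";"-separated ddb commands.
            n = split(substr($0, i + length(key)), cmds, ";")
            for (k = 1; k <= n; k++) {
                c = cmds[k]
                gsub(/^[ \t]+|[ \t]+$/, "", c)   # trim surrounding blanks
                if (c == "dump") found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Run it from sh after the reboot, for example: sysctl debug.ddb.scripting.scripts | ddb_has_full_dump && echo "full dump enabled"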
You can check it by manually triggering a panic with: sysctl debug.kdb.panic=1
At the console you will see something like:
db:0:kdb.enter.default> dump
Dumping 586 out of 8118 MB:..3%..11%..22%..33%..41%..52%..63%..71%..82%..93%
Dump complete
db:0:kdb.enter.default> reset
Uptime: 17m8s
The available SWAP space must be larger than the used RAM though. That 7100 is only using 586MB because it's a test box.
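As a rough sanity check before you rely on it, you can compare the two numbers yourself. This is only a sketch: swap_fits_dump is a made-up helper, both values are passed in by hand in MB, and the 4096 MB swap size in the example is an assumption, not a value from this box. On FreeBSD, swapinfo -m reports the swap size in 1 MB blocks.

```shell
# Hypothetical helper: succeed only if swap (MB) is larger than used RAM (MB).
# Usage: swap_fits_dump USED_MB SWAP_MB
swap_fits_dump() {
    used_mb=$1
    swap_mb=$2
    [ "$swap_mb" -gt "$used_mb" ]
}

# Example: the 586 MB dump size from the console output, against an
# assumed 4096 MB swap partition.
if swap_fits_dump 586 4096; then
    echo "swap is large enough for the dump"
else
    echo "swap is too small for the dump"
fi
```

A busy production VM will be using far more than a test box, so check the numbers on the machine that actually crashes.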
-
For reference:
https://redmine.pfsense.org/issues/15618