Kernel Panic on pfSense+ 24.03-RELEASE
-
Hello,
we recently started seeing kernel panics (Fatal trap 12: page fault while in kernel mode) on our Netgate 1537 instances. We're running an HA pair of them and both show this behaviour. Because we initially suspected a bad memory module on the primary, the "usual primary" is currently in persistent CARP maintenance mode, and the secondary has taken over the CARP IPs and is handling traffic. A bad module does not seem to be the cause, however, as the "usual secondary" is showing the same behaviour.
Both instances have recently been upgraded from 23.05.1. The upgrade failed on one of them, so that instance was re-installed from scratch and then upgraded. It shows the same behaviour as the one that was only upgraded in place.
Both instances show the following as the last bit of information in the "textdump.tar.N":
Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address = 0x1c
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80f246e2
stack pointer = 0x0:0xfffffe0084fa7ae0
frame pointer = 0x0:0xfffffe0084fa7b70
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2 (clock (3))
rdi: 0000000000000000 rsi: 0000000000000000 rdx: fffffe0084fa7cf8
rcx: 0000000000000000 r8: 0000000000000578 r9: 0000000000000000
rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe0084fa7b70
r10: 0000000000001388 r11: 00000000a36f15dd r12: 0000000000000000
r13: 0000000000000578 r14: fffff8003e4a4000 r15: 0000000000000034
trap number = 12
panic: page fault
cpuid = 3
time = 1723109708
KDB: enter: panic
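For anyone reading along, here is a rough sketch of how the key fields of such a panic message could be pulled out with a script, for example to compare several dumps quickly. The `PAGE_SIZE` threshold and the helper names are my own assumptions for illustration. The idea that a fault address as small as 0x1c points to a NULL pointer plus a small structure offset is an interpretation, not something the dump itself states.

```python
import re

# Excerpt of the panic message from the dump above (register dump omitted).
PANIC_TEXT = """
Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address = 0x1c
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80f246e2
current process = 2 (clock (3))
"""

# Assumption for illustration: any address inside the first 4 KiB page is
# treated as "NULL plus a small structure offset".
PAGE_SIZE = 4096


def parse_trap(text):
    """Extract trap number, trap kind, fault address and instruction pointer."""
    fields = {}
    m = re.search(r"Fatal trap (\d+): (.+?) while in kernel mode", text)
    if m:
        fields["trap"] = int(m.group(1))
        fields["kind"] = m.group(2)
    m = re.search(r"fault virtual address\s*=\s*(0x[0-9a-fA-F]+)", text)
    if m:
        fields["fault_addr"] = int(m.group(1), 16)
    m = re.search(
        r"instruction pointer\s*=\s*0x[0-9a-fA-F]+:(0x[0-9a-fA-F]+)", text
    )
    if m:
        fields["rip"] = int(m.group(1), 16)
    return fields


def looks_like_null_deref(fields):
    """True when the faulting address lies within the first page."""
    addr = fields.get("fault_addr")
    return addr is not None and addr < PAGE_SIZE


info = parse_trap(PANIC_TEXT)
print(info)
print("near-NULL access:", looks_like_null_deref(info))
```

The regular expressions do not depend on line breaks, so the same function should also work on a dump that was pasted as a single line.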
We don't run many packages on them. I think the only additional packages are: acme (0.8_1), frr (2.0.2_3), lldpd (0.9.11_2), node_exporter (0.18.1_3) and zabbix-agent64 (1.0.6).
As both instances are showing this behaviour, I'd "rule out" hardware issues. The instances were purchased at the same time, so they're equally old, but in my experience it is unlikely that a defective part such as memory or storage is to blame. With a hardware failure I'd expect the instance that was primary for an extended period to fail first, not both at the same time.
I'd appreciate it if someone could give me a hint on what to look out for or how to diagnose the issue further.
Thank you very much in advance.
Cheers,
Christian
-
An HA setup, and both nodes start to 'crash' with identical crash dumps?
I agree with you, and I'd put my bets on a 'software' issue.
acme: good news; that one is just a rather innocent PHP script and one or two small shell scripts. It runs only once a day; check your cron tasks to see when that is.
lldpd: I don't know what that is. Ditch it?!
node_exporter? Is that a pfSense package? Does it contain binaries? If so, remove it for a while.
zabbix-agent64: can you live without it for some time?
You get my point by now: go bare-bones mode for a while.
If it's the FreeBSD kernel itself that is doing this ... well ...
@cboenning said in Kernel Panic on pfSense+ 24.03-RELEASE:
Both instances have recently been updated from 23.05.1
But why use an old kernel? Don't you want the more recent one? (hint: 24.03)
-
Hi @Gertjan ,
I removed the lldpd package, which is the one I can live without.
Others, however (zabbix-agent and node_exporter in particular), are an integral part of our monitoring infrastructure, which I cannot remove for business reasons.
We are on 24.03-RELEASE (which is when the instances started to misbehave). The mention that we came from 23.05.1 was just a bit of history; before the upgrade the instances worked flawlessly, serving ~250 OpenVPN users and terminating a good number of IPsec site-to-site tunnels (which we use frr for).
-
@cboenning said in Kernel Panic on pfSense+ 24.03-RELEASE:
the mention that we came from 23.05.1
I read it the other way around, sorry about that.
Can you post more details about the crash? Where exactly was it crashing?
-
@cboenning Just an observation: you mentioned you use FRR for IPsec site-to-site tunnels. FYI, there are some major kernel route issues with the FRR package that comes with 24.03:
https://forum.netgate.com/topic/188603/updating-to-pfsense-24-3-breaks-routing-kernel-routes-now-gone/25
Could it be the FRR problem that causes Kernel problems in your setup?
-
@keyser I would not want to rule this out. It's the package I was "most afraid of" upgrading, given it jumped from 7.x to 9.x.
We don't do "anything funky" though. It's just a bunch of BGP sessions we're running with Google Cloud VPN. No OSPF/OSPF6, no RIP; in fact we don't redistribute any routes other than "connected" (i.e. no static, kernel or ospf/ospfv3 routes). I'll go through the post you mentioned to see if there are any similarities here.
-
Do you have the backtrace? Can you upload the full crash report(s)?
-
@stephenw10 I have uploaded the files as a tar Archive.
As a reference:
- pfSense-1 is the "usual primary" (currently in persistent CARP maintenance, thus backup), it produced a bunch of crashes out of which one dump was still available.
- pfSense-2 is the "current primary" (usually backup), I added 4 dumps to the upload.
-
Ok these are all identical crashes, on both nodes. So that's definitely a software issue.
Backtrace:
db:1:pfs> bt
Tracing pid 2 tid 100097 td 0xfffff80001831740
kdb_enter() at kdb_enter+0x33/frame 0xfffffe0084fa28f0
panic() at panic+0x43/frame 0xfffffe0084fa2950
trap_fatal() at trap_fatal+0x40f/frame 0xfffffe0084fa29b0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0084fa2a10
calltrap() at calltrap+0x8/frame 0xfffffe0084fa2a10
--- trap 0xc, rip = 0xffffffff80f246e2, rsp = 0xfffffe0084fa2ae0, rbp = 0xfffffe0084fa2b70 ---
tcp_m_copym() at tcp_m_copym+0x62/frame 0xfffffe0084fa2b70
tcp_default_output() at tcp_default_output+0x1294/frame 0xfffffe0084fa2d60
tcp_timer_rexmt() at tcp_timer_rexmt+0x53c/frame 0xfffffe0084fa2dc0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe0084fa2e00
softclock_call_cc() at softclock_call_cc+0x12e/frame 0xfffffe0084fa2ec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe0084fa2ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe0084fa2f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0084fa2f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
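As a side note on how several dumps can be checked for being "identical": a minimal sketch, assuming the backtrace text has been copied out of each crash report, is to keep only the function names after the first trap marker and compare those lists. The helper names `faulting_path` and `signature` are made up for this example.

```python
import re

# Backtrace from the crash report above.
BACKTRACE = """\
kdb_enter() at kdb_enter+0x33/frame 0xfffffe0084fa28f0
panic() at panic+0x43/frame 0xfffffe0084fa2950
trap_fatal() at trap_fatal+0x40f/frame 0xfffffe0084fa29b0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0084fa2a10
calltrap() at calltrap+0x8/frame 0xfffffe0084fa2a10
--- trap 0xc, rip = 0xffffffff80f246e2, rsp = 0xfffffe0084fa2ae0, rbp = 0xfffffe0084fa2b70 ---
tcp_m_copym() at tcp_m_copym+0x62/frame 0xfffffe0084fa2b70
tcp_default_output() at tcp_default_output+0x1294/frame 0xfffffe0084fa2d60
tcp_timer_rexmt() at tcp_timer_rexmt+0x53c/frame 0xfffffe0084fa2dc0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe0084fa2e00
softclock_call_cc() at softclock_call_cc+0x12e/frame 0xfffffe0084fa2ec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe0084fa2ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe0084fa2f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0084fa2f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
"""

# One frame looks like: name() at name+0xOFFSET/frame 0xADDRESS
FRAME_RE = re.compile(r"(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+")


def faulting_path(backtrace):
    """Function names after the first '--- trap' marker, innermost first."""
    _, marker, after_trap = backtrace.partition("--- trap")
    return FRAME_RE.findall(after_trap if marker else backtrace)


def signature(backtrace):
    """Hashable summary that ignores frame addresses and offsets."""
    return tuple(faulting_path(backtrace))


path = faulting_path(BACKTRACE)
print(" <- ".join(path[:4]))
```

Two dumps whose `signature()` values are equal crashed along the same call path, even if the stack addresses differ between them.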
Previously that was seen with HAProxy: https://redmine.pfsense.org/issues/15457
But you're not running HAProxy.
I do note that in each case it appears an OpenVPN instance is unable to service incoming requests:
<7>sonewconn: pcb 0xfffff8002a4cd400 (local:/var/etc/openvpn/server2/sock): Listen queue overflow: 2 already in queue awaiting acceptance (1 occurrences), euid 0, rgid 0, jail 0
Do you have OpenVPN servers running TCP?
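If you want to check how often that overflow message shows up across the message buffers of several dumps, a small parser along these lines might help. This is only a sketch: the regular expression is written against the single line quoted above, and the function name `parse_overflow` is hypothetical.

```python
import re

# The kernel message quoted above, split across string literals for readability.
LOG_LINE = (
    "<7>sonewconn: pcb 0xfffff8002a4cd400 "
    "(local:/var/etc/openvpn/server2/sock): Listen queue overflow: "
    "2 already in queue awaiting acceptance (1 occurrences), "
    "euid 0, rgid 0, jail 0"
)

OVERFLOW_RE = re.compile(
    r"sonewconn: pcb (?P<pcb>0x[0-9a-f]+) \((?P<socket>[^)]*)\): "
    r"Listen queue overflow: (?P<queued>\d+) already in queue awaiting "
    r"acceptance \((?P<occurrences>\d+) occurrences\)"
)


def parse_overflow(line):
    """Return the socket description and counters, or None if no match."""
    m = OVERFLOW_RE.search(line)
    if m is None:
        return None
    return {
        "pcb": m["pcb"],
        "socket": m["socket"],
        "queued": int(m["queued"]),
        "occurrences": int(m["occurrences"]),
    }


event = parse_overflow(LOG_LINE)
print(event)
```

Grouping the results by the `socket` field would show whether it is always the same OpenVPN instance that falls behind.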
-
@stephenw10 Yes, we're running two OpenVPN servers on TCP. One is a pretty boring 12-client instance, while the other is one of our two primary VPN services. Of those two primary servers, one runs on UDP and the other (this one, "server2") on TCP.
Both usually serve around 120-150 users throughout the day, though I cannot really tell how many users are connected at the point in time when the unit panics.
-
Hmm, but at least one of those servers is UDP only?
-
@stephenw10 yes, we run udp/1194 (server1, 120-150 users), tcp/1194 (server2, 120-150 users) and tcp/1195 (server3, 10-12 users).
-
Hmm, the OpenVPN message is probably just a symptom then. What do you have listening on TCP?
sockstat -P tcp
-
@stephenw10 Full output (with redacted IPs) below, but major listening daemons are:
- nginx (80/18288, from pfSense)
- ssh (12689, from pfSense)
- openvpn (1194)
- isc-dhcpd (520, I switched to kea but it showed the same behaviour so I reverted)
- node_exporter (9100)
- frr (179 /26xx)
- zabbix-agent (10050)
[24.03-RELEASE][admin@pfsense-2.domain]/root: sockstat -P tcp
USER     COMMAND    PID   FD  PROTO LOCAL ADDRESS          FOREIGN ADDRESS
root     sshd       12954 4   tcp4  lan:12689              remote:63592
frr      bgpd       93160 15  tcp6  *:2605                 *:*
frr      bgpd       93160 16  tcp4  *:2605                 *:*
frr      bgpd       93160 20  tcp6  *:179                  *:*
frr      bgpd       93160 21  tcp4  *:179                  *:*
frr      bgpd       93160 23  tcp4  169.254.254.41:48552   169.254.254.42:179
frr      bgpd       93160 24  tcp4  169.254.254.45:61975   169.254.254.46:179
frr      bgpd       93160 25  tcp4  169.254.254.9:45278    169.254.254.10:179
frr      bgpd       93160 26  tcp4  169.254.254.13:9166    169.254.254.14:179
frr      bgpd       93160 27  tcp4  169.254.254.17:8175    169.254.254.18:179
frr      bgpd       93160 28  tcp4  169.254.254.21:5727    169.254.254.22:179
frr      bgpd       93160 29  tcp4  169.254.254.25:40994   169.254.254.26:179
frr      bgpd       93160 30  tcp4  169.254.254.29:6862    169.254.254.30:179
frr      bgpd       93160 31  tcp4  169.254.254.33:50604   169.254.254.34:179
frr      bgpd       93160 32  tcp4  169.254.254.37:1108    169.254.254.38:179
frr      bgpd       93160 33  tcp4  169.254.254.57:51757   169.254.254.58:179
frr      bgpd       93160 34  tcp4  169.254.254.61:20765   169.254.254.62:179
frr      bgpd       93160 35  tcp4  169.254.254.65:6066    169.254.254.66:179
frr      bgpd       93160 36  tcp4  169.254.254.69:52714   169.254.254.70:179
frr      bgpd       93160 37  tcp4  169.254.254.73:39873   169.254.254.74:179
frr      bgpd       93160 38  tcp4  169.254.254.77:55399   169.254.254.78:179
frr      bgpd       93160 39  tcp4  169.254.254.81:48328   169.254.254.82:179
frr      bgpd       93160 40  tcp4  169.254.254.85:45645   169.254.254.86:179
frr      bgpd       93160 41  tcp4  169.254.254.89:8402    169.254.254.90:179
frr      bgpd       93160 42  tcp4  169.254.254.93:11107   169.254.254.94:179
frr      bgpd       93160 43  tcp4  169.254.254.97:27421   169.254.254.98:179
frr      bgpd       93160 44  tcp4  169.254.254.105:28537  169.254.254.106:179
frr      bgpd       93160 45  tcp4  169.254.254.109:29597  169.254.254.110:179
frr      bgpd       93160 46  tcp4  169.254.255.37:48190   169.254.255.38:179
frr      bgpd       93160 47  tcp4  169.254.255.9:54573    169.254.255.10:179
frr      bgpd       93160 48  tcp4  169.254.255.17:4014    169.254.255.18:179
frr      bgpd       93160 49  tcp4  169.254.255.21:45371   169.254.255.22:179
frr      bgpd       93160 50  tcp4  169.254.255.25:15854   169.254.255.26:179
frr      bgpd       93160 51  tcp4  169.254.255.29:43328   169.254.255.30:179
frr      bgpd       93160 52  tcp4  169.254.255.33:8344    169.254.255.34:179
frr      bgpd       93160 53  tcp4  169.254.255.13:57269   169.254.255.14:179
frr      bgpd       93160 54  tcp4  169.254.255.41:32399   169.254.255.42:179
frr      bgpd       93160 55  tcp4  169.254.255.45:62751   169.254.255.46:179
frr      bgpd       93160 56  tcp4  169.254.255.61:49589   169.254.255.62:179
frr      bgpd       93160 57  tcp4  169.254.255.57:30575   169.254.255.58:179
frr      bgpd       93160 58  tcp4  169.254.255.65:17927   169.254.255.66:179
frr      bgpd       93160 59  tcp4  169.254.255.69:52593   169.254.255.70:179
frr      bgpd       93160 60  tcp4  169.254.255.73:26342   169.254.255.74:179
frr      bgpd       93160 61  tcp4  169.254.255.77:23533   169.254.255.78:179
frr      bgpd       93160 62  tcp4  169.254.255.81:6200    169.254.255.82:179
frr      bgpd       93160 63  tcp4  169.254.255.85:1817    169.254.255.86:179
frr      bgpd       93160 64  tcp4  169.254.255.89:55972   169.254.255.90:179
frr      bgpd       93160 65  tcp4  169.254.255.93:40012   169.254.255.94:179
frr      bgpd       93160 66  tcp4  169.254.255.97:20828   169.254.255.98:179
frr      bgpd       93160 67  tcp4  169.254.255.105:5854   169.254.255.106:179
frr      bgpd       93160 68  tcp4  169.254.255.109:37727  169.254.255.110:179
frr      bgpd       93160 69  tcp4  lan:179                lan-bgp-peer:46520
frr      bgpd       93160 70  tcp4  lan:179                lan-bgp-peer:55180
frr      staticd    92209 9   tcp6  *:2616                 *:*
frr      staticd    92209 10  tcp4  *:2616                 *:*
frr      mgmtd      91678 12  tcp6  *:2623                 *:*
frr      mgmtd      91678 13  tcp4  *:2623                 *:*
frr      zebra      90512 20  tcp6  *:2601                 *:*
frr      zebra      90512 21  tcp4  *:2601                 *:*
zabbix   zabbix_age 59718 4   tcp4  *:10050                *:*
zabbix   zabbix_age 59449 4   tcp4  *:10050                *:*
zabbix   zabbix_age 59331 4   tcp4  *:10050                *:*
zabbix   zabbix_age 59154 4   tcp4  *:10050                *:*
zabbix   zabbix_age 58742 4   tcp4  *:10050                *:*
nobody   node_expor 45522 3   tcp4  lan:9100               *:*
nobody   node_expor 45522 7   tcp4  lan:9100               remote:1285
frr      bfdd       42282 17  tcp6  *:2617                 *:*
frr      bfdd       42282 18  tcp4  *:2617                 *:*
root     openvpn    84817 6   tcp4  carp-wan:1195          *:*
root     openvpn    84817 12  tcp4  carp-wan:1195          remote:54208
root     openvpn    84817 13  tcp4  carp-wan:1195          remote:64293
root     openvpn    84817 14  tcp4  carp-wan:1195          remote:55553
root     openvpn    84817 15  tcp4  carp-wan:1195          remote:49879
root     openvpn    84817 16  tcp4  carp-wan:1195          remote:49727
root     openvpn    84817 17  tcp4  carp-wan:1195          remote:59337
root     openvpn    84817 18  tcp4  carp-wan:1195          remote:52367
root     openvpn    84817 19  tcp4  carp-wan:1195          remote:61691
root     openvpn    84817 20  tcp4  carp-wan:1195          remote:49299
root     openvpn    84817 21  tcp4  carp-wan:1195          remote:55497
root     openvpn    84817 22  tcp4  carp-wan:1195          remote:56152
root     openvpn    84817 23  tcp4  carp-wan:1195          remote:60493
root     openvpn    63853 6   tcp4  carp-wan:1194          *:*
root     openvpn    63853 12  tcp4  carp-wan:1194          remote:50937
root     openvpn    63853 13  tcp4  carp-wan:1194          remote:25223
root     openvpn    63853 14  tcp4  carp-wan:1194          remote:34077
root     openvpn    63853 15  tcp4  carp-wan:1194          remote:8229
root     openvpn    63853 16  tcp4  carp-wan:1194          remote:49925
root     openvpn    63853 17  tcp4  carp-wan:1194          remote:59427
root     openvpn    63853 18  tcp4  carp-wan:1194          remote:19497
root     openvpn    63853 19  tcp4  carp-wan:1194          remote:53176
root     openvpn    63853 20  tcp4  carp-wan:1194          remote:53941
root     openvpn    63853 21  tcp4  carp-wan:1194          remote:30092
root     openvpn    63853 22  tcp4  carp-wan:1194          remote:61351
root     openvpn    63853 23  tcp4  carp-wan:1194          remote:59472
root     openvpn    63853 24  tcp4  carp-wan:1194          remote:17457
root     openvpn    63853 25  tcp4  carp-wan:1194          remote:10610
root     openvpn    63853 26  tcp4  carp-wan:1194          remote:55283
root     openvpn    63853 27  tcp4  carp-wan:1194          remote:43863
root     openvpn    63853 28  tcp4  carp-wan:1194          remote:51742
root     openvpn    63853 29  tcp4  carp-wan:1194          remote:50180
root     openvpn    63853 30  tcp4  carp-wan:1194          remote:26228
root     openvpn    63853 31  tcp4  carp-wan:1194          remote:55189
root     openvpn    63853 32  tcp4  carp-wan:1194          remote:51027
root     openvpn    63853 33  tcp4  carp-wan:1194          remote:20917
root     openvpn    63853 34  tcp4  carp-wan:1194          remote:49952
root     openvpn    63853 35  tcp4  carp-wan:1194          remote:63507
root     openvpn    63853 36  tcp4  carp-wan:1194          remote:50676
root     openvpn    63853 37  tcp4  carp-wan:1194          remote:63995
root     openvpn    63853 55  tcp4  carp-wan:1194          remote:50970
root     openvpn    63853 88  tcp4  carp-wan:1194          remote:54025
root     openvpn    63853 129 tcp4  carp-wan:1194          remote:49206
dhcpd    dhcpd      75672 11  tcp4  lan:36325              pfsense-1:519
dhcpd    dhcpd      75672 12  tcp4  lan:520                *:*
root     nginx      67598 5   tcp4  *:18288                *:*
root     nginx      67598 6   tcp6  *:18288                *:*
root     nginx      67598 7   tcp4  *:80                   *:*
root     nginx      67598 9   tcp6  *:80                   *:*
root     nginx      67341 5   tcp4  *:18288                *:*
root     nginx      67341 6   tcp6  *:18288                *:*
root     nginx      67341 7   tcp4  *:80                   *:*
root     nginx      67341 9   tcp6  *:80                   *:*
root     nginx      67000 5   tcp4  *:18288                *:*
root     nginx      67000 6   tcp6  *:18288                *:*
root     nginx      67000 7   tcp4  *:80                   *:*
root     nginx      67000 9   tcp6  *:80                   *:*
root     sshd       98253 3   tcp6  *:12689                *:*
root     sshd       98253 4   tcp4  *:12689                *:*
?        ?          ?     ?   tcp4  carp-wan:1194          remote:57205
[24.03-RELEASE][admin@pfsense-2.domain]/root:
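In case a condensed view is easier to read than the full listing, here is a rough sketch of how such output could be summarised per daemon and local port. It only uses a handful of rows from the output above as sample input, and it assumes a row is a listener when its foreign address is `*:*`. The function name `summarize` is made up.

```python
from collections import Counter

# A few rows taken from the sockstat output above.
SOCKSTAT = """\
USER COMMAND PID FD PROTO LOCAL ADDRESS FOREIGN ADDRESS
root sshd 12954 4 tcp4 lan:12689 remote:63592
frr bgpd 93160 21 tcp4 *:179 *:*
frr bgpd 93160 23 tcp4 169.254.254.41:48552 169.254.254.42:179
root openvpn 84817 6 tcp4 carp-wan:1195 *:*
root openvpn 84817 12 tcp4 carp-wan:1195 remote:54208
root openvpn 63853 6 tcp4 carp-wan:1194 *:*
root openvpn 63853 12 tcp4 carp-wan:1194 remote:50937
? ? ? ? tcp4 carp-wan:1194 remote:57205
"""


def summarize(text):
    """Count listening and connected sockets per (command, local port)."""
    listening = Counter()
    connected = Counter()
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly seven fields; the header line has nine.
        if len(parts) != 7:
            continue
        _user, command, _pid, _fd, _proto, local, foreign = parts
        port = local.rsplit(":", 1)[1]
        if foreign == "*:*":
            listening[(command, port)] += 1
        else:
            connected[(command, port)] += 1
    return listening, connected


listening, connected = summarize(SOCKSTAT)
for (command, port), count in sorted(listening.items()):
    print(f"{command} listens on {port} ({count} socket(s))")
for (command, port), count in sorted(connected.items()):
    print(f"{command} has {count} connection(s) on local port {port}")
```

Rows where sockstat prints `?` for the owner are counted under the command `?`; as far as I understand, those are connections that are no longer attached to a process.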
-
Earlier today I disabled the DHCP Service in pfSense as I can currently live without it.
-
Great thanks. We have some devs engaged on this now, there are a few users hitting it.
-
@stephenw10 Thank you.
Feel free to contact me privately in case you need additional details or I can provide anything.
-
Bug to track it: https://redmine.pfsense.org/issues/15684
-