Multiple issues, firewall freezes and whole network goes down.
-
A short time later I got another crash, this time with a crash report.
Dump header from device: /dev/nda0p2
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 617472
  Blocksize: 512
  Compression: none
  Dumptime: 2024-09-08 20:08:22 +0300
  Hostname: FIREWALL.mydomain.org
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 15.0-CURRENT #0 plus-RELENG_24_03-n256311-e71f834dd81: Fri Apr 19 00:28:14 UTC 2024
    root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-24_03-main/obj/amd64/Y4MAEJ2R/var/j
  Panic String: page fault
  Dump Parity: 44932402
  Bounds: 0
  Dump Status: good
Fatal trap 12: page fault while in kernel mode
cpuid = 6; apic id = 08
fault virtual address = 0x1c
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80f246e2
stack pointer = 0x28:0xfffffe00e1f3bae0
frame pointer = 0x28:0xfffffe00e1f3bb70
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2 (clock (6))
rdi: 0000000000000000 rsi: 0000000000000000 rdx: fffffe00e1f3bcf8
rcx: 0000000000000000 r8: 0000000000000528 r9: 0000000000000000
rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe00e1f3bb70
r10: 000000000000300f r11: 0000000000015069 r12: 0000000000000000
r13: 0000000000000528 r14: fffff8027dfb5000 r15: 0000000000000034
trap number = 12
panic: page fault
cpuid = 6
time = 1725815302
KDB: enter: panic
panic.txt 0600 0 0 12 14667355006 7145 ustar root wheel
page fault
version.txt 0600 0 0 457 14667355006 7635 ustar root wheel
FreeBSD 15.0-CURRENT #0 plus-RELENG_24_03-n256311-e71f834dd81: Fri Apr 19 00:28:14 UTC 2024 root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-24_03-main/obj/amd64/Y4MAEJ2R/var/jenkins/workspace/pfSense-Plus-snapshots-24_03-main/sources/FreeBSD-src-plus-RELENG_24_03/amd64.amd64/sys/pfSense
full crash dump here
I see a lot of "Disabled multicast promiscuous mode" outputs here.
textdump.tar.0
Right now, my ISP is working on the cables in the neighborhood and I am having frequent WAN downtime, but for some reason this is crashing the firewall.
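To line the panic up against entries in the system log, the `time = 1725815302` value from the panic output can be converted from a Unix epoch. This is only a minimal Python sketch: the +0300 offset is taken from the Dumptime line in the dump header, and the function name is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Value copied from the "time = ..." line of the panic output.
PANIC_EPOCH = 1725815302

# Assumption: local offset is +0300, as shown in the dump header's Dumptime.
LOCAL_TZ = timezone(timedelta(hours=3))


def describe_panic_time(epoch: int) -> tuple[str, str]:
    """Return the panic time as (UTC, local) ISO-8601 strings."""
    utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
    local = utc.astimezone(LOCAL_TZ)
    return utc.isoformat(), local.isoformat()


if __name__ == "__main__":
    utc_str, local_str = describe_panic_time(PANIC_EPOCH)
    print("UTC:  ", utc_str)
    print("Local:", local_str)
```

The local result is 20:08:22 +0300 on 2024-09-08, which agrees with the Dumptime in the header, so log entries around that time are the ones to look at.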
-
Ok that crash is this: https://redmine.pfsense.org/issues/15684
Try setting the workaround suggested there: https://redmine.pfsense.org/issues/15684#note-12
The logs show all gateways going down including what looks like an internal gateway?
Sep 8 17:37:07 FIREWALL rc.gateway_alarm[60969]: >>> Gateway alarm: MNG_DHCP (Addr:192.168.2.1 Alarm:1 RTT:24.549ms RTTsd:124.820ms Loss:21%)
Are all those gateways using the same NIC(s)?
-
@stephenw10 I have set the workaround, though I had to add it manually under System Tunables since it was not there by default.
There are 5 gateways with corresponding Interfaces
and the interfaces below
-
Ok so 4 of those gateways are all using igb1 but the MNG gateway uses igb0. So you would not expect to see all 5 throwing packet loss unless they go through the same switch maybe?
-
@stephenw10 yep, igb1 goes to the modem port and igb0 goes to a different switch. MNG is a management network on a separate switch, with its own DHCP server, and it is not connected to the internet. It has all the IPMI and critical management connections. The purpose is to provide an environment where, even if pfSense crashes, the management interface stays up so I can reach the pfSense UI (if possible) and IPMI.
-
Hmm, what hardware is this?
Not much can cause two NICs to stop passing traffic like that. Especially igb NICs.
-
@stephenw10 it is a Supermicro SuperServer 5019D-4C-FN8TP with 32GB ECC RAM and an add-on card, AOC-S25G-I2S-O (PCIe, SFP28, 25 Gbps).
-
@stephenw10 to make it clear, the firewall just freezes. Even when connecting directly to the console, no inputs are registered by the firewall. Until a reboot, it is just stuck at something.
-
Hmm, so all 4 of those ports are on-board.
Does it not respond even to ctl+t?
-
@stephenw10 no, it does not respond to anything. I did not try ctrl + t but ctrl + c, ctrl + alt + del, enter, space, backspace, nothing works
-
@Laxarus
We have the same hardware but not the 25 Gbps card. Please check over the IPMI interface for PCIe or similar errors; we had a faulty Broadcom card some months ago.
-
Sometimes ctl+t is the only thing that will produce a response.
-
@stephenw10 will try ctrl + t if the same thing happens again (hoping not). I will try to troubleshoot with the WAN when I go back (right now I only have remote access).
There is only one constant in all these situations: when WAN goes down, there is a big chance of the firewall crashing or freezing.
And the two bugs that you have mentioned are contributing to this somehow when WAN goes down. Hopefully, the next release of pfSense will take care of these bugs.
Thanks for bearing with me until now, I really appreciate it.
@slu thanks for the suggestion. I have checked the maintenance and health logs on the IPMI but there is nothing noteworthy there. It all seems normal.
-
So, I had the same issue again this morning and I still have no idea why this is happening. @stephenw10 I have tried ctrl + t and there was no response to that either.
Any advice on debugging this is very much appreciated.
Full log here, the freeze happened around Sep 16 07:00
system.log.0
-
You need to tune the OVPN_S2S_VPNV4 gateway. It's throwing alarms repeatedly. It's clearly a pretty bad route because the alarms are legitimate for the default settings. However, reloading the firewall each time it fires is not helping anything. You might just disable the monitoring or the monitoring action on that gateway.
But that shouldn't cause it to stop responding. The actual failure appears to happen here:
Sep 16 07:18:45 FIREWALL rc.gateway_alarm[63113]: >>> Gateway alarm: VPNAC_WG (Addr:10.11.0.1 Alarm:1 RTT:91.226ms RTTsd:79.944ms Loss:21%)
Sep 16 07:18:45 FIREWALL check_reload_status[635]: updating dyndns VPNAC_WG
Sep 16 07:18:45 FIREWALL check_reload_status[635]: Restarting IPsec tunnels
Sep 16 07:18:45 FIREWALL check_reload_status[635]: Restarting OpenVPN tunnels/interfaces
Sep 16 07:18:45 FIREWALL check_reload_status[635]: Reloading filter
Sep 16 07:18:45 FIREWALL rc.gateway_alarm[65772]: >>> Gateway alarm: WAN_PPPOE (Addr:10.98.238.224 Alarm:1 RTT:5.947ms RTTsd:11.776ms Loss:21%)
Sep 16 07:18:45 FIREWALL check_reload_status[635]: updating dyndns WAN_PPPOE
Sep 16 07:18:45 FIREWALL check_reload_status[635]: Restarting IPsec tunnels
Sep 16 07:18:45 FIREWALL check_reload_status[635]: Restarting OpenVPN tunnels/interfaces
Sep 16 07:18:45 FIREWALL check_reload_status[635]: Reloading filter
Sep 16 07:18:46 FIREWALL php-fpm[20435]: /rc.openvpn: The command '/sbin/route -n6 get 'default' 2>/dev/null | /usr/bin/egrep 'flags: <.*PROTO.*>'' returned exit code '1', the output was ''
Sep 16 07:18:46 FIREWALL php-fpm[20435]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use VPNAC_WG.
Sep 16 07:18:46 FIREWALL php-fpm[20435]: /rc.openvpn: The command '/sbin/route -n6 get 'default' 2>/dev/null | /usr/bin/egrep 'flags: <.*PROTO.*>'' returned exit code '1', the output was ''
Sep 16 07:18:46 FIREWALL php-fpm[20435]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use WAN_PPPOE.
Sep 16 07:18:46 FIREWALL php-fpm[51827]: /rc.dyndns.update: phpDynDNS (@.mydomain.org): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
Sep 16 07:18:50 FIREWALL ppp[53627]: [wan_link0] LCP: no reply to 1 echo request(s)
Sep 16 07:19:00 FIREWALL ppp[53627]: [wan_link0] LCP: no reply to 2 echo request(s)
Sep 16 07:19:05 FIREWALL rc.gateway_alarm[23895]: >>> Gateway alarm: MNG_DHCP (Addr:192.168.2.1 Alarm:1 RTT:4.611ms RTTsd:15.937ms Loss:22%)
Where all gateways start to indicate failures and the pppoe goes down. Effectively no traffic is passing from that point.
But there are no lower level errors, the NICs do not show loss of link for example.
The firewall is still logging and running scripts, so it doesn't appear to be down. At least until the end of that log.
When did you try to connect? How did you connect?
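To see at a glance which gateways alarm and how often across a full system.log, the rc.gateway_alarm lines can be tallied with a short script. What follows is a rough Python sketch, not anything pfSense ships: the regular expression is written only against the line format seen in the excerpts in this thread, the function names are made up for illustration, and it assumes that `Alarm:1` marks an alarm being raised.

```python
import re
from collections import Counter

# Pattern for rc.gateway_alarm lines, based only on the excerpts in this thread.
ALARM_RE = re.compile(
    r"Gateway alarm: (?P<name>\S+) \(Addr:(?P<addr>\S+) Alarm:(?P<alarm>\d+) "
    r"RTT:(?P<rtt>[\d.]+)ms RTTsd:(?P<rttsd>[\d.]+)ms Loss:(?P<loss>\d+)%\)"
)


def parse_alarms(log_text):
    """Return one dict per gateway alarm line found in log_text."""
    alarms = []
    for m in ALARM_RE.finditer(log_text):
        alarms.append({
            "name": m["name"],
            "addr": m["addr"],
            "alarm": int(m["alarm"]),
            "rtt_ms": float(m["rtt"]),
            "rttsd_ms": float(m["rttsd"]),
            "loss_pct": int(m["loss"]),
        })
    return alarms


def count_by_gateway(alarms):
    """Count alarms per gateway name (assumption: Alarm:1 means raised)."""
    return Counter(a["name"] for a in alarms if a["alarm"] == 1)


# Three alarm lines copied from the log excerpt in this thread.
SAMPLE = """\
Sep 16 07:18:45 FIREWALL rc.gateway_alarm[63113]: >>> Gateway alarm: VPNAC_WG (Addr:10.11.0.1 Alarm:1 RTT:91.226ms RTTsd:79.944ms Loss:21%)
Sep 16 07:18:45 FIREWALL rc.gateway_alarm[65772]: >>> Gateway alarm: WAN_PPPOE (Addr:10.98.238.224 Alarm:1 RTT:5.947ms RTTsd:11.776ms Loss:21%)
Sep 16 07:19:05 FIREWALL rc.gateway_alarm[23895]: >>> Gateway alarm: MNG_DHCP (Addr:192.168.2.1 Alarm:1 RTT:4.611ms RTTsd:15.937ms Loss:22%)
"""

if __name__ == "__main__":
    for name, n in count_by_gateway(parse_alarms(SAMPLE)).most_common():
        print(name, n)
```

Feeding it the whole log instead of `SAMPLE` would show whether MNG_DHCP, which sits on a different NIC and switch, really alarms together with the WAN-side gateways or only occasionally.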
-
@stephenw10 I have further tweaked the ovpn gateway.
Sep 16 07:18:46 FIREWALL php-fpm[20435]: /rc.openvpn: The command '/sbin/route -n6 get 'default' 2>/dev/null | /usr/bin/egrep 'flags: <.*PROTO.*>'' returned exit code '1', the output was ''
Sep 16 07:18:46 FIREWALL php-fpm[20435]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use VPNAC_WG.
Sep 16 07:18:46 FIREWALL php-fpm[20435]: /rc.openvpn: The command '/sbin/route -n6 get 'default' 2>/dev/null | /usr/bin/egrep 'flags: <.*PROTO.*>'' returned exit code '1', the output was ''
Sep 16 07:18:46 FIREWALL php-fpm[20435]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use WAN_PPPOE.
Here, why is the firewall trying to look up an IPv6 route? The tunnel is IPv4 only.
I had to power reset just to get everything back around 10:46.
64a4a208-8a92-48ff-8f9e-736090de796f-system3.zip
-
Just to be clear though, you are unable to see any response from the firewall even on the local physical console?
It's extremely unusual to see it still logging and running scripts at that time but unresponsive at the console. Like I don't think I've ever seen that.
-
@stephenw10 said in Multiple issues, firewall freezes and whole network goes down.:
Like I don't think I've ever seen that
The attached capture shows the unresponsive console.
9979fb96-2531-417e-a50a-dd6d321f8e90-pfsense freeze.zip
-
Ok that looks like the IPMI console? And I assume that usually works as expected?
Is it configured for video as the primary console?
Are you able to test using a physical console?
-
@stephenw10 Yep, normally, it works without a problem. I am not sure how it is configured since there are no options to change the behavior.
No, I cannot access the physical console since it is at a remote site.