XG-7100 kernel panic boot loop



  • While running a WAN bandwidth test, I suddenly lost all WAN connectivity but could still reach the web UI. After rebooting, the box panics in a loop when booting multi-user, though single-user mode starts fine. I captured the boot logs and panic dumps from the serial console here: https://pastebin.com/xpQzJUHR

    I'm unsure how to read this, and am hoping for suggestions on what might be wrong.


  • Netgate

    How did you reboot it?

    In single-user mode run /sbin/fsck -y / about four times then reboot and see if that fixes it.



  • @derelict Thanks, that resolved the kernel panic. I could not find a reboot option in the GUI, and I had forgotten to save the pfSense manual to a local disk to check whether it was listed.

    I was hoping a single press of the power button would do a clean ACPI shutdown, but when it still hadn't shut down after a while, I did an 8-second forced shutdown from the power button.


  • Netgate

    Diagnostics > Reboot



  • Ah, there it is! Interestingly, I can replicate the WAN lockup. Every time I run Comcast's speed test (speedtest.xfinity.com) the WAN goes down, and the only fix I've found so far is a reboot.


  • Netgate

    Anything in the logs? Can you expand on what "goes down" actually means?



  • Shortly after the speed test finishes, if I jump to Status > Gateways I can see packet loss climb up to 100%, and Status moves to offline for both WAN_DHCP and WAN_DHCPV6.

    In the gateway logs:

    Aug 5 18:51:16	dpinger		WAN_DHCP <IP>: Alarm latency 70660us stddev 120945us loss 21%
    
    Aug 5 18:51:16	dpinger		WAN_DHCP6 <IPV6>%lagg0.4090: Alarm latency 75036us stddev 121650us loss 21%
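
For anyone who wants to track these alarms over several test runs, here is a minimal Python sketch that pulls the numbers out of dpinger alarm lines. It assumes the lines look exactly like the two quoted above; the regular expression and the `parse_alarm` helper are illustrative names, not part of pfSense or dpinger.

```python
import re

# Matches alarm lines in the format quoted above (assumed, not an official spec).
ALARM_RE = re.compile(
    r"dpinger\s+(?P<gateway>\S+)\s+(?P<target>\S+):\s+Alarm\s+"
    r"latency\s+(?P<latency_us>\d+)us\s+"
    r"stddev\s+(?P<stddev_us>\d+)us\s+"
    r"loss\s+(?P<loss_pct>\d+)%"
)


def parse_alarm(line):
    """Return a dict of alarm fields, or None if the line is not an alarm."""
    match = ALARM_RE.search(line)
    if match is None:
        return None
    return {
        "gateway": match["gateway"],
        "target": match["target"],
        # dpinger reports microseconds; convert to milliseconds for readability.
        "latency_ms": int(match["latency_us"]) / 1000,
        "stddev_ms": int(match["stddev_us"]) / 1000,
        "loss_pct": int(match["loss_pct"]),
    }


if __name__ == "__main__":
    sample = ("Aug 5 18:51:16\tdpinger\t\tWAN_DHCP <IP>: "
              "Alarm latency 70660us stddev 120945us loss 21%")
    print(parse_alarm(sample))
```

Feeding it the saved gateway log line by line and keeping the non-None results gives a quick view of how loss climbs after each speed test.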
    

  • Netgate

    OK the next thing I would do is run a packet capture and determine if the pings are going out the interface or not.

    Diagnostics > Packet Capture

    Interface: WAN
    Protocol: ICMP
    Host Address: Whatever the gateway monitoring address is
    Count: 10000 (or so)

    Then run your speed test again.

    Then, after it fails, stop and look at the capture. My guess is you will see the packets leaving and there being no response, which means something upstream is dying, not the XG-7100.

    Feel free to download the pcap file and I will send you a place to upload it so I can look at it and provide interpretation if you like.
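
If you would rather check the downloaded capture yourself, the request/reply comparison can be sketched in Python using only the standard library. This is a rough sketch under several assumptions: the file is a classic microsecond-resolution pcap (not pcapng), the link type is Ethernet, and only IPv4 ICMP is of interest. The function name `unanswered_echo_requests` is made up for illustration.

```python
import struct


def unanswered_echo_requests(pcap_bytes):
    """Return (id, seq) pairs of ICMP echo requests that have no echo reply."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian host
    else:
        raise ValueError("not a classic microsecond pcap file")

    # The link type is the last 4 bytes of the 24-byte global header.
    linktype = struct.unpack(endian + "I", pcap_bytes[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled in this sketch")

    requests, replies = [], set()
    offset = 24
    while offset + 16 <= len(pcap_bytes):
        # Per-packet record header: ts_sec, ts_usec, incl_len, orig_len.
        _sec, _usec, incl_len, _orig = struct.unpack(
            endian + "IIII", pcap_bytes[offset:offset + 16])
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len

        if len(frame) < 14:
            continue
        ethertype = struct.unpack("!H", frame[12:14])[0]
        l3 = 14
        if ethertype == 0x8100 and len(frame) >= 18:
            # Skip a single 802.1Q VLAN tag if one is present.
            ethertype = struct.unpack("!H", frame[16:18])[0]
            l3 = 18
        if ethertype != 0x0800:
            continue  # not IPv4

        ip = frame[l3:]
        if len(ip) < 20 or ip[9] != 1:
            continue  # too short, or IP protocol is not ICMP
        header_len = (ip[0] & 0x0F) * 4
        icmp = ip[header_len:]
        if len(icmp) < 8:
            continue

        icmp_type, _code, _csum, ident, seq = struct.unpack("!BBHHH", icmp[:8])
        if icmp_type == 8:      # echo request
            requests.append((ident, seq))
        elif icmp_type == 0:    # echo reply
            replies.add((ident, seq))

    return [req for req in requests if req not in replies]
```

Reading the file with `open(path, "rb").read()` and passing the bytes in returns the requests that never got an answer. A non-empty result would point upstream; an empty result with requests that simply stop would point back at the firewall.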



  • I ran the packet capture test a few times to make sure I wasn't cutting the traces short or making some other mistake. The captures do not show any ICMP requests without a reply. It is a cleanly ordered request/reply sequence until the requests simply stop, and the gateway status shows 'Offline'.

    Other data points:

    • Rebooting the cable modem does not restore the WAN connection, but rebooting the 7100 does.
    • The WAN usually, but not always, goes offline shortly after the upload portion of the bandwidth test. Upload speeds are ~30Mbps. Dropping the LAN port link speed to 100Mbps to throttle download speeds doesn't change the results.

  • Netgate

    What ports are you using for WAN and LAN? The default ETH1 and ETH2?



  • ETH1 for WAN, ETH4 for LAN



  • I was able to reproduce the lockups with other high-bandwidth downloads. Usually the web UI also stops responding, and I need to reboot the system from the serial console.

    I shut off Suricata on the LAN interface (which was configured for Inline mode), and so far I haven't seen any crashes, but I need more time and testing to confirm this resolves my issue.



  • The WAN interface goes down because Suricata is crashing. If I have it in Inline mode on either the LAN or the WAN interface, any stream that maxes out my WAN bandwidth for ~30MB or more crashes Suricata.

    Stopping and restarting the process manually restores the WAN gateway.

    Do others have any experience running Suricata in Inline mode on the 7100? I know there are warnings about Inline mode, and I do not know what to expect with Denverton hardware.