pfSense dropping WAN until box is restarted



  • I've been having an ongoing problem for a few months now. Every week or so (occurring at random, not on a schedule), my pfSense box loses its WAN connection and is unable to reconnect. The web UI slows to the point of unusability. The only resolution so far is to physically power-cycle the router. Sometimes two restarts are necessary, as the WAN does not always come back up after the first restart and exhibits the same problem.

    The majority of the time, the WAN goes down overnight between 3am and 4am, and I wake up in the morning to find I have no internet and a message log filled with hundreds of 'PF was wedged/busy and has been reset' entries.

    Occasionally, the WAN goes down while I'm awake and using the network, so I can get into the router and grab the logs before it slows to nothing.

    I've attached [sanitised] logs from the most recent event.

    Top output shows pfctl hogging the CPU. I can kill this process but another pfctl with a different PID pops up immediately and carries on consuming cycles.

    My hardware:

    Zotac CI323 w/ 8GB RAM - 2x Realtek gigabit adapters. re0 is WAN (PPPoE), re1 is a VLAN trunk for my various subnets.
    Huawei EchoLife HG612 VDSL2 modem
    Netgear GS716T switch

    I'm also running an OpenVPN server and an IPsec server (both only occasionally used), plus a 6in4 tunnel to Hurricane Electric.

    ISP link is 37 Mbps down, 2 Mbps up.

    When the router is failing to connect, I can unplug it from the modem, plug in my laptop, and dial up the PPPoE connection with no issues. When I plug my router back in, it still fails to connect, so I don't think the modem is at fault.

    Things I've tried so far:

    System > Advanced > Networking > Disable Hardware Checksum Offload
    Disabling PowerD
    Changing Snort for Suricata (Issue seemed similar to https://forum.pfsense.org/index.php?topic=88768.0)

    I'd be happy to hear anything else I could try, or further diagnostic steps.
    pfsense_system.txt
    pfsense_ppp.txt
    pfsense_top.txt
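
To check whether the 'PF was wedged/busy and has been reset' entries really cluster in the 3am-4am window, something like the following could tally them by hour from an exported log such as pfsense_system.txt. This is only a rough sketch: it assumes each line starts with a syslog-style timestamp like "Oct 3 21:03:40", and the sample lines (including the process name `example-proc`) are made up for illustration, not taken from the attached logs.

```python
import re
from collections import Counter

# Message text as quoted in the post; matched as a plain substring.
WEDGED = "PF was wedged/busy and has been reset"

# Assumption: lines begin with a syslog-style stamp, e.g. "Oct 3 21:03:40".
STAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}")


def count_wedged_by_hour(lines):
    """Return a Counter mapping hour of day (0-23) to number of wedged messages."""
    counts = Counter()
    for line in lines:
        if WEDGED not in line:
            continue
        match = STAMP.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    "Oct 3 03:12:01\texample-proc\t\tPF was wedged/busy and has been reset.",
    "Oct 3 03:12:09\texample-proc\t\tPF was wedged/busy and has been reset.",
    "Oct 3 21:03:40\tppp\t\tprocess started",
]

for hour, n in sorted(count_wedged_by_hour(sample).items()):
    print(f"{hour:02d}:00  {n}")
```

To run it against a real export, pass an open file instead of `sample`, for example `count_wedged_by_hour(open("pfsense_system.txt"))`. If the counts are spread across the day rather than concentrated overnight, that would point away from a scheduled job as the trigger.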



  • Hello,

    From your logs:

    Oct 3 21:03:40	ppp		process 75997 started, version 5.8 (root@pfSense_v2_3_0_amd64-pfSense_v2_3_0-job-14 22:52 6-Apr-2016)
    

    You seem to be running pfSense 2.3.0 while the latest available version is 2.3.2.
    I'd recommend doing an update and it might fix the issue you are seeing.



  • According to the Web UI I'm on pfSense 2.3.2.

    SSH also lists:

    *** Welcome to pfSense 2.3.2-RELEASE (amd64 full-install) on hostname ***
    
    [2.3.2-RELEASE]
    




  • Maybe you should test the 2.3.3 snapshots and see if the problem has been fixed. Remember to back up your config to a safe place so you can restore it later.

    https://snapshots.pfsense.org/



  • I am now running on 2.3.2-RELEASE-p1.

    The drop-outs have been continuing, about every 2-3 days now and sometimes multiple times per day. I'll have further logs to upload later; I can't do it right now as I'm at work, away from the router at home.

    What I have discovered, while trying to migrate the PPPoE connection from re0 to re1, is that physically removing and then reconnecting the ethernet cable on re0 will fairly reliably cause the crash - PPPoE starts failing to dial out and the pfctl process goes crazy on CPU usage.

    What's the best way of determining if this is a software/driver issue, or a hardware issue?