WAN connection drops after 15 min high load [SOLVED]



  • Setup:
    2.1-RELEASE (i386) on Intel D945gsejt with Intel dual-nic PCI-card (fxp - Intel 82558 Pro/100).
    WAN on fxp0, LAN on fxp1, OPT1 on re0 (RealTek 8168/8111)
    100/100 Mbit connection (DHCP)

    Problem:
    My trusty pfSense FW has started losing its connection to the internet whenever I download at full speed for 10+ minutes.
    WAN is still reported as up in pfSense, but no traffic gets through. LAN is fine.
    Restart seems to be the only way to bring back the connection.

    Tested:
    Limiting my download to 50 Mbit = I can go forever without incident.
    Running WAN on the built-in NIC (re0) = same result.
    Bypassing the FW by connecting a PC straight to the ISP = no problem. Ran a 96 Mbit download for over an hour without incident.
    Updating pfSense (was running 2.0 RC) = no change.

    It used to work fine, and I can't pinpoint any change made around the time I first saw the problem.

    Any ideas?
    I'm just about to throw it out the window.  :o


  • Netgate Administrator

    Nothing in the system logs?

    Steve



  • This shows up when it crashes:

    Mar 4 00:31:16 check_reload_status: Reloading filter
    Mar 4 00:31:16 check_reload_status: Restarting OpenVPN tunnels/interfaces
    Mar 4 00:31:16 check_reload_status: Restarting ipsec tunnels
    Mar 4 00:31:16 check_reload_status: updating dyndns WAN_DHCP


  • Netgate Administrator

    That is more of a symptom than a cause. There are no apinger entries?

    What do the RRD quality graphs look like?

    You could try a 2.1.1 snapshot, I believe they have a number of fixes for apinger issues.

    Steve



  • Nothing before this in the log, and nothing after until the reboot.

    Will try 2.1.1.

    Never heard of RRD graphs before, so I just need to figure out what and where they are first  ;)



  • RRD Quality looks fine in general, but it's not hard to see when the WAN drops out.
    See attached.

    No packet loss except when the WAN is down.



  • Netgate Administrator

    Yet that doesn't show 100%. Could be the rounding that RRD does to fit the data.
    You might try disabling apinger altogether. Go to System: Routing: Gateways:  Edit the WAN gateway, advanced section, disable gateway monitoring.
    If that solves it you can instead tune apinger to better fit your line conditions.

    Steve
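
    On the rounding point above: a minimal sketch of how averaging can hide a short outage, assuming the graph consolidates fine-grained samples into coarser buckets by averaging. The sample values and the 5-sample bucket size are made up for illustration, not taken from the attached graph.

```python
def consolidate(samples, bucket_size):
    """Average consecutive samples into buckets of bucket_size samples each."""
    buckets = []
    for i in range(0, len(samples), bucket_size):
        chunk = samples[i:i + bucket_size]
        buckets.append(sum(chunk) / len(chunk))
    return buckets


# Hypothetical per-minute packet loss (%): a 2-minute total outage
# inside an otherwise clean 10-minute window.
loss_per_minute = [0, 0, 0, 100, 100, 0, 0, 0, 0, 0]

# Consolidated into 5-minute buckets, the outage no longer shows as 100%.
loss_per_5min = consolidate(loss_per_minute, 5)
```

    Here the first 5-minute bucket comes out at 40% loss even though the link was completely down for two of those minutes, which is one way a graph can fall short of 100% during a real outage.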



  • Well that at least made a difference.
    Unfortunately not the difference i needed.

    After disabling apinger the line crashed after 1-2 min at 100 Mbit DL.
    I tried enabling it again, but now I only get 2 pings through after a reboot before it crashes again.

    Gateway log is now showing:
    Mar 4 01:17:40 apinger: Exiting on signal 15.
    Mar 4 01:17:21 apinger: ALARM: WAN_DHCP(95.109.99.1) *** down ***
    Mar 4 01:17:11 apinger: Starting Alarm Pinger, apinger(24614)
    Mar 4 01:17:10 apinger: Exiting on signal 15.
    Mar 4 01:15:41 apinger: ALARM: WAN_DHCP(95.109.99.1) *** down ***
    Mar 4 01:15:18 apinger: Starting Alarm Pinger, apinger(33575)
    Mar 4 01:11:43 apinger: SIGHUP received, reloading configuration.
    Mar 4 01:11:40 apinger: SIGHUP received, reloading configuration.
    Mar 4 01:11:39 apinger: Starting Alarm Pinger, apinger(24298)
    Mar 4 01:11:05 apinger: No usable targets found, exiting
    Mar 4 01:11:05 apinger: Starting Alarm Pinger, apinger(38845)
    Mar 4 01:10:28 apinger: Exiting on signal 15.
    Mar 4 01:10:23 apinger: ALARM: WAN_DHCP(95.109.99.1) *** down ***
    Mar 4 01:10:13 apinger: Starting Alarm Pinger, apinger(6083)

    Had to bypass the FW now to go online.
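
    For anyone wanting to read a gateway log like the one above more systematically, here is a rough sketch that measures how long after each apinger start the gateway was flagged as down. It assumes the line format shown in this post, newest entry first, with all entries on the same day; the function names are made up for this example.

```python
import re

# Log excerpt copied from the post above (newest entry first).
GATEWAY_LOG = """\
Mar 4 01:17:40 apinger: Exiting on signal 15.
Mar 4 01:17:21 apinger: ALARM: WAN_DHCP(95.109.99.1) *** down ***
Mar 4 01:17:11 apinger: Starting Alarm Pinger, apinger(24614)
Mar 4 01:17:10 apinger: Exiting on signal 15.
Mar 4 01:15:41 apinger: ALARM: WAN_DHCP(95.109.99.1) *** down ***
Mar 4 01:15:18 apinger: Starting Alarm Pinger, apinger(33575)
Mar 4 01:11:43 apinger: SIGHUP received, reloading configuration.
Mar 4 01:11:40 apinger: SIGHUP received, reloading configuration.
Mar 4 01:11:39 apinger: Starting Alarm Pinger, apinger(24298)
Mar 4 01:11:05 apinger: No usable targets found, exiting
Mar 4 01:11:05 apinger: Starting Alarm Pinger, apinger(38845)
Mar 4 01:10:28 apinger: Exiting on signal 15.
Mar 4 01:10:23 apinger: ALARM: WAN_DHCP(95.109.99.1) *** down ***
Mar 4 01:10:13 apinger: Starting Alarm Pinger, apinger(6083)
"""

LINE_RE = re.compile(
    r"^(\w{3}) +(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) apinger: (.*)$"
)


def parse_events(log_text):
    """Return (seconds_since_midnight, message) tuples, oldest first."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not an apinger line
        hours, minutes, seconds = (int(match.group(i)) for i in (3, 4, 5))
        events.append((hours * 3600 + minutes * 60 + seconds, match.group(6)))
    events.reverse()  # the log lists the newest entry first
    return events


def seconds_until_down(events):
    """Delay between each apinger start and the down alarm that follows it."""
    delays = []
    started_at = None
    for timestamp, message in events:
        if message.startswith("Starting Alarm Pinger"):
            started_at = timestamp
        elif "*** down ***" in message and started_at is not None:
            delays.append(timestamp - started_at)
            started_at = None
    return delays


delays = seconds_until_down(parse_events(GATEWAY_LOG))
```

    Run against the excerpt, `delays` holds one entry per start that was followed by a down alarm, so it is easy to see that the gateway was being marked down within seconds of apinger coming up.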


  • Netgate Administrator

    Hmm. You could try setting an alternative monitor IP, something publicly available like 8.8.8.8.

    Steve
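
    To compare the two candidate monitor targets before changing the setting, something like the sketch below could be run from a machine behind the firewall. It is only an illustration: it shells out to the system `ping` with `-c`, and assumes the summary line contains the usual "N% packet loss" text. The helper names are made up for this example.

```python
import re
import subprocess

LOSS_RE = re.compile(r"([\d.]+)% packet loss")


def parse_loss(ping_output):
    """Return the loss percentage from ping's summary line, or None."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def measure_loss(host, count=5):
    """Run the system ping against host and return the reported loss."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return parse_loss(result.stdout)


# Example usage (needs network access, so it is left commented out):
# for target in ("95.109.99.1", "8.8.8.8"):
#     print(target, measure_loss(target))
```

    If the ISP gateway shows loss under load while 8.8.8.8 does not, that would suggest the gateway simply deprioritises ICMP and the alternative monitor IP is the better choice.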



  • Will try that tonight.
    Also prepped a LiveCD on USB to test with.



  • Running a fresh install with 2.1.1 embedded.
    This seems to have fixed the issue.

    Unfortunately I cannot provide much information on what the actual problem was, so for anyone else experiencing this I can only recommend trying a fresh install of 2.1.1.

    Appreciate the help Steve!



  • I did read somewhere that this was to do with Multi-WAN stuff happening even on single WAN setups. Maybe that was corrected. Would that make sense?

