WAN interfaces flapping with multiWAN



  • I'm suddenly getting a lot of these types of errors in the system logs:

    Aug 16 13:40:10 slbd[366]: ICMP poll failed for 63.138.38.129, marking service DOWN
    Aug 16 13:40:10 slbd[366]: Service WAN1FailsToWAN2 changed status, reloading filter policy
    Aug 16 13:40:15 slbd[366]: ICMP poll succeeded for 63.138.38.129, marking service UP
    Aug 16 13:40:15 slbd[366]: Service WAN1FailsToWAN2 changed status, reloading filter policy
    Aug 16 13:41:23 slbd[366]: ICMP poll failed for 63.138.38.129, marking service DOWN
    Aug 16 13:41:23 slbd[366]: Service LoadBalance changed status, reloading filter policy
    Aug 16 13:41:28 slbd[366]: ICMP poll succeeded for 63.138.38.129, marking service UP
    Aug 16 13:41:29 slbd[366]: Service LoadBalance changed status, reloading filter policy

    It is happening on both WAN interfaces and occurring several times an hour for each one.  As you can see, it corrects itself after about 5 seconds.  One WAN is a point to point T1 (that's the IP of the ISP's DNS server above).  The other is an SDSL link and I'm pinging the internal LAN address of the DSL router so it can't be a problem with the external link.  Neither WAN router thinks its external interface is losing connectivity so I'm a bit stumped.  Anyone else experiencing this?
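To put a number on "about 5 seconds", the slbd lines above can be paired up by monitor IP. This is a minimal Python sketch, assuming the syslog format shown in this post. The function name `outage_durations` is made up for illustration and the year 2008 is an assumption (syslog lines carry no year); none of this is part of pfSense.

```python
import re
from datetime import datetime

# Sample lines copied from the log excerpt in this post.
LOG = """\
Aug 16 13:40:10 slbd[366]: ICMP poll failed for 63.138.38.129, marking service DOWN
Aug 16 13:40:10 slbd[366]: Service WAN1FailsToWAN2 changed status, reloading filter policy
Aug 16 13:40:15 slbd[366]: ICMP poll succeeded for 63.138.38.129, marking service UP
Aug 16 13:40:15 slbd[366]: Service WAN1FailsToWAN2 changed status, reloading filter policy
Aug 16 13:41:23 slbd[366]: ICMP poll failed for 63.138.38.129, marking service DOWN
Aug 16 13:41:23 slbd[366]: Service LoadBalance changed status, reloading filter policy
Aug 16 13:41:28 slbd[366]: ICMP poll succeeded for 63.138.38.129, marking service UP
Aug 16 13:41:29 slbd[366]: Service LoadBalance changed status, reloading filter policy
"""

# Only the "ICMP poll" lines matter; the "Service ... changed status" lines are skipped.
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) slbd\[\d+\]: "
    r"ICMP poll (?:failed|succeeded) for ([\d.]+), marking service (DOWN|UP)$"
)

def outage_durations(log_text, year=2008):
    """Return a list of (monitor_ip, seconds_down), one per DOWN -> UP pair."""
    down_since = {}
    outages = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, ip, state = match.groups()
        # Prepend the assumed year so the timestamp parses unambiguously.
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if state == "DOWN":
            down_since[ip] = when
        elif ip in down_since:
            outages.append((ip, (when - down_since.pop(ip)).total_seconds()))
    return outages

for ip, seconds in outage_durations(LOG):
    print(f"{ip} was marked down for {seconds:.0f}s")
```

Run against a longer log, the same pairing would show whether every outage lasts the same few seconds, which would hint at a fixed poll interval rather than a real loss of connectivity.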

    I'm using the following snapshot:

    1.2.1-TESTING-SNAPSHOT
    built on Sat Jul 19 07:13:48 EDT 2008

    Here are the stats on the WAN interfaces:

    WAN interface (dc1)
    Status up
    MAC address 00:a0:cc:63:91:84
    IP address (redacted) 
    Subnet mask 255.255.255.248
    Gateway (redacted)
    ISP DNS servers 208.67.222.222
    208.67.220.220
    Media 100baseTX <full-duplex>
    In/out packets 44458963/36944351 (1022.08 MB/107.57 MB)
    In/out errors 0/0
    Collisions 0

    LAN interface (re0)
    Status up
    MAC address 00:18:f8:0b:21:35
    IP address 10.0.0.2 
    Subnet mask 255.255.255.0
    Media 1000baseTX <full-duplex>
    In/out packets 67442505/77750290 (915.10 MB/590.85 MB)
    In/out errors 0/0
    Collisions 0

    OPT1 interface (dc0)
    Status up
    MAC address 00:a0:cc:63:6b:55
    IP address 10.0.1.10 
    Subnet mask 255.255.255.0
    Gateway 10.0.1.1
    Media 100baseTX <full-duplex>
    In/out packets 36487760/33143675 (238.86 MB/1.12 GB)
    In/out errors 0/0
    Collisions 0



  • I'm afraid that there is something amiss with one of the WAN connections. Once a connection starts flapping it is generally a sign of packet loss or extremely high latency.

    In 1.2 we use fping for gateway detection; it has retries and a backoff algorithm, so the highest latency before failing is 1.5 seconds, IIRC.

    In 1.3 we use something else, but it will notify you of such issues regardless.



  • I don't know how the multi-WAN stuff is implemented.

    What if the WAN links were saturated (in one or both directions) and there was no queueing OR inappropriate queueing such that either (or both) the ICMP request and response had a "long" delay in actually getting on the wire because they were stuck behind other traffic?

    If your ping over the T1 is going to the ISP's router, is that system configured to give you a timely response? What if it's busy (it might be busy for reasons that have nothing to do with your traffic)?



    We have a process (slbd in 1.2, apinger in 1.3) that monitors the monitor IP addresses, which should be behind the gateways. We add a static route for each monitor IP so we are sure we are using the right interface.

    When a gateway status changes we trigger a filter reload in 1.2 and 1.3.

    If a link is saturated in 1.2, where the latency is over 2 seconds and 3 attempts to ping have failed we mark it down. In 1.3 such a state would be marked with "Latency".

    If a link has packet loss in 1.2, it will most often stay up without too many issues, but you will see occasional up and down events. That is to be expected, really, because the connection isn't very good. In 1.3 we mark a connection with packet loss as "Loss".

    You will need to make sure that the configured monitor IP is marginally correct. That is why you can configure a monitor IP different from the local gateway which might be connected over ethernet.



  • Hmmm…just for grins, I've just changed it so that the pings go to the lan interface on the other WAN router (the T1 link).   So now both links are pinging the internal LAN interface of the relevant router.  It hasn't had any effect on the behavior so I'm inclined to think it is an issue with pfsense, not a bona fide problem with the WAN links themselves.  If anyone has any troubleshooting suggestions, I'm certainly open to trying them.  The link flapping seems to have nothing to do with load as it is now a Saturday afternoon and there is nobody at the location and the links are still bouncing up and down every few minutes.

    Cheers,



  • That would imply that the issue is with the network local to you.

    It might be worth checking into link duplex settings. If you have a managed ethernet switch you have it a bit easier.

    Since you mention it is a T1, it's not by any chance a Cisco 1600 with a 10BaseT half-duplex connection, is it?

    Or, if it's newer, something that has a port forced to 100 full duplex where someone forgot to set the switch port to the same?



  • @databeestje:

    That would imply that the issue is with the network local to you.

    It might be worth checking into link duplex settings. If you have a managed ethernet switch you have it a bit easier.

    Since you mention it is a T1, it's not by any chance a Cisco 1600 with a 10BaseT half-duplex connection, is it?

    Or, if it's newer, something that has a port forced to 100 full duplex where someone forgot to set the switch port to the same?

    It's a managed switch (Catalyst 2960 w/24 10/100/1000 ports) and the T1 router is a Cisco 1841.  The DSL router is some Netopia something or other that I don't recall.  The switch is set to auto-negotiate the links and it currently thinks that both routers have negotiated 100mbit full-duplex connections.  It's been 10+ years since I used FreeBSD on any regular basis.  How can I force the nics on the pfsense system to 100mbit/full-duplex?

    Also, how would I make the firewall less "twitchy" about taking an interface down?  I'd like to experiment with allowing it to be a little more tolerant of temporary long-ish pings to the address being tested.

    Cheers,



  • Check the switch to make sure you don't have any errors on that end, the firewall's end is clean. If the switch is as well, don't touch your speed/duplex. That's best to leave alone unless you have a problem that cannot be resolved in any other fashion, and it doesn't appear you have a problem.

    I know a number of people are using load balancing on 1.2.1, including myself, and haven't seen flapping, so I'm inclined to think you really are seeing loss. The best way to determine that is to run a couple of tcpdumps from an SSH session:
    tcpdump -ni fxp0 -w /tmp/wan.pcap host 1.2.3.4

    replacing fxp0 with your real WAN interface, and 1.2.3.4 with the monitor IP for that interface. Repeat for your second WAN: switch fxp0 to that interface, change the filename, and use its monitor IP. After it flaps again, press Ctrl-C in each SSH session to stop the captures and download the files from Diagnostics -> Command. If you don't know what to look for in the capture files, post the pcap files somewhere and add a URL here, or email them to me (cmb at pfsense dot org).



  • I'm seeing plenty of unnecessary DOWN/UP in the logs too.

    I did the
    tcpdump -ni fxp0 -w /tmp/wan.pcap host 1.2.3.4

    and checked the pcap file at the point where a down occurred.

    I looked at a single instance but it may give some clues. The down occurred where a second ping was sent out from pfsense 0.5s after the first, but before the reply from the first had returned.

    So the packet sequence goes

    request0, reply0        (UP)
    request1, request2, reply1, reply2    (logs show Down 5s after request1)
    request3, reply3      (back UP again)

    So looks like the problem may be the timing of sending out the second ping. It's happening too soon. Maybe the system expects a reply within 500ms? The remote server I'm pinging against normally replies sub 50ms but obviously not in this instance.
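The pattern described above (a second request going out before the first reply has come back) can be picked out of a capture mechanically. This is a rough Python sketch, assuming the ICMP echo packets have already been exported from the pcap as (timestamp, kind, sequence) tuples. The sample timestamps and the 0.5 s threshold are invented to mirror the description; they are not taken from the real capture.

```python
def late_replies(packets, threshold=0.5):
    """Return (seq, rtt) for every reply slower than `threshold` seconds."""
    sent = {}
    late = []
    for timestamp, kind, seq in packets:
        if kind == "request":
            sent[seq] = timestamp
        elif kind == "reply" and seq in sent:
            rtt = timestamp - sent.pop(seq)
            if rtt > threshold:
                late.append((seq, round(rtt, 3)))
    return late

# Hypothetical trace shaped like the sequence described in the post:
# request1's reply is slow, so request2 goes out before reply1 arrives.
trace = [
    (0.000, "request", 0), (0.040, "reply", 0),
    (5.000, "request", 1), (5.500, "request", 2),
    (5.620, "reply", 1), (5.650, "reply", 2),
    (10.000, "request", 3), (10.045, "reply", 3),
]

print(late_replies(trace))
```

If the flagged replies always line up with the DOWN events in the log, that would support the idea that a slow reply, not a lost one, is what trips the monitor.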

    The pinged IP in this instance is present just once in the list of monitor IPs in the slbd config.

    Hope this makes sense.

    Nick.



  • @NickC:

    I'm seeing plenty of unnecessary DOWN/UP in the logs too.

    I did the
    tcpdump -ni fxp0 -w /tmp/wan.pcap host 1.2.3.4

    and checked the pcap file at the point where a down occurred.

    I looked at a single instance but it may give some clues. The down occurred where a second ping was sent out from pfsense 0.5s after the first, but before the reply from the first had returned.

    So the packet sequence goes

    request0, reply0         (UP)
    request1, request2, reply1, reply2     (logs show Down 5s after request1)
    request3, reply3       (back UP again)

    So looks like the problem may be the timing of sending out the second ping. It's happening too soon. Maybe the system expects a reply within 500ms? The remote server I'm pinging against normally replies sub 50ms but obviously not in this instance.

    The pinged IP in this instance is present just once in the list of monitor IPs in the slbd config.

    Hope this makes sense.

    I think you're on to something here.  Would it be possible to tune this ping timing to make it less twitchy?

    To answer cmb, neither pfsense nor the switch are seeing any dropped packets.  So that seems to rule out the interfaces renegotiating the link speed theory.

    Edit:  I've been looking through the code and it appears that /usr/local/bin/ping_hosts.sh contains all the voodoo for bouncing the links up and down.  I'm not much of a shell scripter so I'm at a loss on how to tune this to make it less sensitive to these out of order ping responses (if that is indeed the source of the problem).  Suggestions anyone?

    Cheers,



  • Bump.  This is still a problem for me.  I'm running MRTG on the WAN interface of both routers and they are NOT losing their connectivity.

    Again, can someone point me in the right direction to somehow blunt this hair trigger for marking interfaces down in pfsense?

    Thanks!



    We use the slbd process, which then launches an fping command to ping the monitor IP.

    fping ignores the duplicate reply to the 1st request. It will thus retry, if the 2nd succeeds it returns return code 0.

    On anything else it will return 1 or 2.
    And that would cause the state to change.



  • @databeestje:

    We use the slbd process, which then launches an fping command to ping the monitor IP.

    fping ignores the duplicate reply to the 1st request. It will thus retry, if the 2nd succeeds it returns return code 0.

    On anything else it will return 1 or 2.
    And that would cause the state to change.

    Which script have you updated to use fping?  It's simply a matter of replacing "ping" with "fping" in the slbd script?  This problem is seriously biting my office in the ass to the point that I've been asked to find another solution if I can't come up with a suitable fix pronto.  It would be a shame if that happened because things are working well otherwise.

    Cheers,



  • My thoughts when reading this post.

    500ms is a long time for a monitor ip to answer.
    I really only trust intel nic's.
    If possible i would have setup a 2nd pc with pfSense and 1 client pinging both monitor ip's.



  • @Perry:

    My thoughts when reading this post.

    500ms is a long time for a monitor ip to answer.
    I really only trust intel nic's.
    If possible i would have setup a 2nd pc with pfSense and 1 client pinging both monitor ip's.

    I completely agree with you.  However, I've replaced the interface with an intel card and the problem remains with the old fxp driver.  So I really don't think it is the physical interface or ethernet driver that is the problem.  I've also had our vendor replace the DSL router and the problem remains.  I guess I'll try the fping thing and hope for the best.

    Cheers,



  • To be perfectly clear on this, slbd is a binary, not a script. Furthermore it already uses fping. It has done so since the 1.2 release and it's still the same in 1.2.1.

    In 1.3 it is handled with apinger, which is something else entirely.



  • @databeestje:

    To be perfectly clear on this, slbd is a binary, not a script. Furthermore it already uses fping. It has done so since the 1.2 release and it's still the same in 1.2.1.

    In 1.3 it is handled with apinger, which is something else entirely.

    Bummer.  I suppose I have reached a dead end then.  Pfsense has been fantastic for us with the exception of this problem.  I guess I will have to go back to researching a commercial solution for link aggregation.

    Thanks for your information.

    Best,



  • It would also be interesting to see a trace or tcpdump (including timestamps) taken at one of the routers around the time of a link down event. Does the trace show the same ordering?

    Can you get a few traces at both the pfSense interface and the router interface at "link down" events? Do these show a similar pattern? Does ordering match at both ends? (For example, if one trace shows the sequence Tx Request 1, TX Request 2, Rx Response 1, Rx Response 2 does the other end show Rx Request 1, Rx Request 2, Tx Response 1, Tx Response 2?) Are the intervals between requests consistent OR do the requests sometimes appear almost "back to back"?

    What does a trace of a regular ping over this link look like? Does either end (or both ends) show delayed response from time to time?

    I can think of a number of different possible scenarios that might cause the behaviour described. And these scenarios might never be noticed in "normal" operation because they are masked by (for example) normal TCP dataloss recovery. Perhaps a device driver bug might cause a received frame to go unnoticed until the next received frame or the next transmit completion. Perhaps there are queueing/scheduling delays such that responses aren't timely. Perhaps fping's clock has a drift such that it sometimes sends requests too close together and so doesn't allow sufficient time for the responses to come back.

    I'm not familiar with the internals of fping and its man page doesn't provide much detail, but its test methodology doesn't seem particularly robust. To decide a link is down on the basis of a timeout on a single response seems far too aggressive. Networks sometimes lose packets and sometimes re-order packets.
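One common way to make a monitor less aggressive is to require several consecutive failed polls before declaring a link down. The Python sketch below illustrates that idea only. The class name `LinkMonitor` and the `down_after` parameter are hypothetical, and this is not how slbd or fping is implemented.

```python
class LinkMonitor:
    """Mark a link DOWN only after `down_after` consecutive failed polls."""

    def __init__(self, down_after=3):
        self.down_after = down_after
        self.failures = 0
        self.up = True

    def record(self, poll_ok):
        """Feed one poll result; return True while the link is considered up."""
        if poll_ok:
            # Any successful poll clears the failure streak and restores the link.
            self.failures = 0
            self.up = True
        else:
            self.failures += 1
            if self.failures >= self.down_after:
                self.up = False
        return self.up

monitor = LinkMonitor(down_after=3)
polls = [True, False, True, False, False, False, True]
states = [monitor.record(ok) for ok in polls]
print(states)
```

With `down_after=3`, the isolated failure in the second poll does not change the link state; only the run of three failures does. The trade-off is slower detection of a real outage.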



  • @wallabybob:

    I'm not familiar with the internals of fping and its man page doesn't provide much detail, but its test methodology doesn't seem particularly robust. To decide a link is down on the basis of a timeout on a single response seems far too aggressive. Networks sometimes lose packets and sometimes re-order packets.

    I agree 100%.  I've asked if there is a way to make pfsense more tolerant so that it doesn't have a "hair trigger," but nobody responded with any suggestions.  If there was a way to reconfigure things so the user could increase the threshold of pain before yanking an interface down, that would be ideal.

    Cheers,



  • I don't know if you plan to take up my other suggestions.  They may well be hard work and a bit of a stretch.

    If I understand your problem reports correctly, your pfSense box is seeing significantly delayed responses on supposedly idle (or very nearly idle) single-hop links. Regardless of the fping issue, it's quite possible something else in your network is not working optimally. Depending on what that is, tweaking the fping timeout may not help very much.



  • @wallabybob:

    I don't know if you plan to take up my other suggestions.  They may well be hard work and a bit of a stretch.

    If I understand your problem reports correctly, your pfSense box is seeing significantly delayed responses on supposedly idle (or very nearly idle) single-hop links. Regardless of the fping issue, it's quite possible something else in your network is not working optimally. Depending on what that is, tweaking the fping timeout may not help very much.

    I can't imagine what else could be wrong.  These are direct links from one device to another.  There is no switch/hub/anything in between.  And I've already tried swapping in known good cabling (only about 1m of cat5e cable), changed the ethernet NIC on the pfsense box, etc.  I don't know what else to do at this point.

    Cheers,



  • I've been through all the posts on this topic.

    Familyguy: Are you seeing what NickC reported? Do your traces show a similar pattern to NickC's trace? (I had just assumed that, but I can't now see anywhere that you have said you took traces and saw the same sort of pattern NickC reported.)

    What's this MRTG you were running on both routers? What are the routers? Do they run the same software? Do the routers have some sort of utility for generating a tcpdump-like trace? (You may have to connect something on another router port so that the trace traffic does not go over the interface being traced.)

    You began your original post "All of a sudden …" For how long had it been working without reporting these errors? Can anyone remember anything that happened to the pfSense box or any of the routers at around the time the messages "suddenly" appeared? Someone dropped a box (possibly cracking a PCB trace for example)? Power surge? Marginal power supplies can cause systems to behave erratically. (I recall a PDP11/40 mini computer with a micro processor controlled WAN communications card at the end of the chassis furthest from the power supply. The comms controller behaved erratically, resetting the protocol a few times a minute. One of the voltages to the slot was just under the correct value.)

    You mention using dc interfaces initially, then fxp. Are these interfaces on dual port cards or single port cards? Did you make any changes to the card configuration soon before this started happening? How old is the system and what is the date of the BIOS? What system or motherboard are you using? What is in the PCI slots and are there any PCI slots spare?

    Please provide the dmesg output from the pfsense box.

    Have you tried putting the WAN and OPT1 interfaces into polling mode? At the shell prompt

    ifconfig fxp0 polling
    

    will enable polling mode on interface fxp0. To disable polling mode, at the shell prompt type

    ifconfig fxp0 -polling
    

    Wait at least a couple of minutes before deciding whether or not it makes a difference. If it does make a difference on one interface then try it on the other.

    I have reasons for asking all these questions, but I don't have the time now to explain, other than to say that the information currently available does not explain what is reported. You say you have run out of ideas. I haven't. For now, I'm prepared to put my more than 25 years of networking experience to work on this problem, but I need more to work with. I realise I have asked for a lot, but I probably won't be able to do anything more on this issue until Sunday night (4 days from now). If you give me a fair bit to work with by the time I can get back to this, I will have a greater chance of putting together a reasonable theory of what is happening. If you have to move on or aren't able to get me more information, that is fine.



    Thank you for the generous offer to help.  I'll gather as much info as I can.  The thing that really has me scratching my head is that OPT1 seems to be the "problem child," though the WAN interface periodically flaps too.  I've changed the monitor IP for OPT1 to be OPT1's own IP address and it STILL happens.  That is with both a dc ethernet card AND with the integrated Intel NIC on the motherboard that uses the fxp driver.  So if the pings are failing even when it is monitoring the OPT1 interface itself, I'm just plain confused.

    One more thing I'm going to try is to use a more current snapshot and see if that helps at all.

    Cheers,



    To be very clear about this:

    We replaced ping explicitly by fping because ping was flapping so much. When we used ping it sent a single packet with a 2 second timeout. This failed too often.

    When using fping the default timeout is 400 ms, the backoff factor is 1.5 and the number of retries is 3.
    This means that we wait 400ms, 600ms, 900ms and 1.35 seconds for a ping response.

    fping should only return failure after all retries have failed, which takes at most 3.5 seconds.
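The wait times quoted above follow from multiplying the timeout by the backoff factor after each attempt. Here is a small Python check of that arithmetic, using the figures given in this post (400 ms initial timeout, backoff factor 1.5, 3 retries). The function name `fping_waits` is made up for illustration and does not come from fping itself.

```python
def fping_waits(initial_ms=400, backoff=1.5, retries=3):
    """Per-attempt wait in milliseconds: the first try plus `retries` retries."""
    return [initial_ms * backoff ** attempt for attempt in range(retries + 1)]

waits = fping_waits()
total_ms = sum(waits)

print(waits)     # [400.0, 600.0, 900.0, 1350.0]
print(total_ms)  # 3250.0
```

The four waits add up to 3.25 seconds, which fits within the "at most 3.5 seconds" figure.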

    If 1.2 does not exhibit this problem there might be a different issue at play totally.

    But as said before, this requires a tcpdump of the icmp traffic to be able to debug this.



  • We found an issue with FreeBSD7 which might be causing us grief.

    We have filed a PR and hope to get a response.



  • @databeestje:

    We found an issue with FreeBSD7 which might be causing us grief.

    We have filed a PR and hope to get a response.

    For those of us following this issue, can you give some more information, or at least a reference to the FreeBSD PR?



  • @wallabybob:

    @databeestje:

    We found an issue with FreeBSD7 which might be causing us grief.

    We have filed a PR and hope to get a response.

    For those of us following this issue, can you give some more information, or at least a reference to the FreeBSD PR?

    Yes, I would also find that interesting (but perhaps not directly helpful since I'm not much of a programmer).

    Best,





  • @cmb:

    The PR is http://www.freebsd.org/cgi/query-pr.cgi?pr=127528

    Doesn't appear that they intend to "fix" this and they're saying it's an application level issue.  Where does that leave those of us that are experiencing the problem?  Go back to pre-FreeBSD7 distro?

    Best,



  • Can you all check what hz you are running at?
    The value is shown by sysctl kern.hz; if it is greater than 1000, try setting it to 500 and retry.
    hz 2000 would be interesting too, but we will see.



  • @ermal:

    Can you all check what hz you are running at?
    The value is shown by sysctl kern.hz; if it is greater than 1000, try setting it to 500 and retry.
    hz 2000 would be interesting too, but we will see.

    Huh?  I don't understand what you just said.  What are you suggesting we change and why?

    Best,



  • I'm sure they're all at default hz. Changing that isn't a solution regardless.

    We'll get a resolution to this eventually, if it's an immediate problem for you, you'll have to downgrade to 1.2. This isn't going to be easy or quick to resolve.



  • @cmb:

    I'm sure they're all at default hz. Changing that isn't a solution regardless.

    We'll get a resolution to this eventually, if it's an immediate problem for you, you'll have to downgrade to 1.2. This isn't going to be easy or quick to resolve.

    OK.  I think downgrading looks like the path of least resistance.  The complaining from folks with frequently dropped connections at the office is getting rather shrill.  Looking forward to an eventual fix.

    Best,



  • For what it's worth, I was seeing this too and have also downgraded to 1.2-Release. It's a shame, as I hate going backwards. You need a firewall to be reliable and stable, and it's hard to test a new beta without putting it in 'service'.



  • The latest snapshots have a fix for this. Can you, if possible, test and report whether it behaves correctly now?



  • @ermal:

    The latest snapshots have a fix for this. Can you, if possible, test and report whether it behaves correctly now?

    I'll give it a try next time I'm on site.  What was the nature of the fix?

    Best,



  • slbd used to use fping to determine if a WAN was online. There is some kernel change in FreeBSD 7.0 that causes problems because fping sees replies from pings initiated by other processes.  Usually RRD for quality graph and slbd for monitor IP are both pinging the gateway IPs on your WAN (the fact that two processes are pinging the same thing is something we're eliminating in 1.3, but is too significant a change to pull into a maintenance release).

    Now, slbd runs a shell script (for easy changing and testing, because the process being run is hard coded into the slbd binary) which runs FreeBSD's ping. It knows which replies are supposed to go where, and should behave properly unlike fping. The ping in FreeBSD 7.0 supports everything we were doing with fping. This should hopefully be resolved now.
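Some background for the explanation above: ICMP has no ports, so a process reading ICMP from a raw socket can be handed echo replies that answer another process's requests. It has to filter them itself, typically by the identifier field in the echo header. This Python sketch shows that filtering step, assuming the conventional choice of the low 16 bits of the process ID as the identifier. The helper names are hypothetical, and this is not the slbd, fping, or ping source.

```python
import struct

ICMP_ECHO_REPLY = 0  # ICMP type for an echo reply

def build_echo_reply(ident, seq, payload=b""):
    """Build an ICMP echo reply (checksum left at 0 to keep the sketch short)."""
    return struct.pack("!BBHHH", ICMP_ECHO_REPLY, 0, 0, ident, seq) + payload

def is_our_reply(icmp_packet, our_ident):
    """True only for an echo reply that carries our identifier."""
    if len(icmp_packet) < 8:
        return False
    icmp_type, code, _checksum, ident, _seq = struct.unpack("!BBHHH", icmp_packet[:8])
    return icmp_type == ICMP_ECHO_REPLY and code == 0 and ident == our_ident

# Two pingers on the same host, e.g. a monitor and a quality grapher.
# In a real pinger the identifier would be something like os.getpid() & 0xFFFF.
monitor_ident = 1234 & 0xFFFF
grapher_ident = 5678 & 0xFFFF

own = is_our_reply(build_echo_reply(monitor_ident, 1), monitor_ident)
foreign = is_our_reply(build_echo_reply(grapher_ident, 1), monitor_ident)
print(own, foreign)
```

A pinger that skips this check, or that matches replies only by source address, could count another process's reply against its own outstanding request, which is the kind of cross-talk the post describes.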



  • Confirm flapping stopped. Thanks for the fix.

    Nick.



  • Can you test that it behaves properly if you disconnect one of the WANs, in either failover or load balance?
    This would help pushing the 1.2.1 release.



  • I'm running multiple failover (not balance) multi-WAN on a CARP cluster.
    Watching syslog messages as they come through I unplugged the phone line so the ping would fail but leave interfaces up.

    It took 30s for the message to come through:
    "ICMP poll failed…marking service DOWN"

    Plugged back in and "marking service as UP"

    I don't know how long it took before, but I think it was a little more responsive than this. If you think I'm just seeing a delay in the syslog pathway, I can time it more carefully using the logs.

    Nick.

