WAN interfaces flapping with multiWAN

databeestje

We have a process (slbd in 1.2, apinger in 1.3) that pings the monitor IP addresses, which should be behind the gateways. We add a static route for each monitor IP so we are sure the pings leave through the right interface.

When a gateway status changes we trigger a filter reload in 1.2 and 1.3.

If a link is saturated in 1.2, where the latency is over 2 seconds and 3 attempts to ping have failed we mark it down. In 1.3 such a state would be marked with "Latency".

If a link has packet loss in 1.2 it will most often stay up without too many issues, but you will see occasional up and down events, which is to be expected because the connection isn't very good. In 1.3 we mark a connection with packet loss as "Loss".

You will need to make sure that the configured monitor IP is marginally correct. That is why you can configure a monitor IP different from the local gateway which might be connected over ethernet.
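As a rough illustration of the 1.2 rule described above (latency over 2 seconds and 3 failed ping attempts means "down"), here is a minimal shell sketch. The function name `gateway_state_12` and its two arguments are hypothetical and are not taken from slbd; the sketch only encodes a literal reading of the rule as stated in this post.

```shell
# Hypothetical helper, not part of pfSense: literal reading of the 1.2 rule.
# $1 = last measured latency in milliseconds
# $2 = number of consecutive failed ping attempts
gateway_state_12() {
    latency_ms=$1
    failed_pings=$2
    # Mark the gateway down only when BOTH conditions hold:
    # latency above 2 seconds and at least 3 failed attempts.
    if [ "$latency_ms" -gt 2000 ] && [ "$failed_pings" -ge 3 ]; then
        echo down
    else
        echo up
    fi
}
```

For example, `gateway_state_12 2500 3` prints `down`, while `gateway_state_12 2500 1` prints `up`.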

familyguy

Hmmm…just for grins, I've just changed it so that the pings go to the lan interface on the other WAN router (the T1 link). So now both links are pinging the internal LAN interface of the relevant router. It hasn't had any effect on the behavior so I'm inclined to think it is an issue with pfsense, not a bona fide problem with the WAN links themselves. If anyone has any troubleshooting suggestions, I'm certainly open to trying them. The link flapping seems to have nothing to do with load as it is now a Saturday afternoon and there is nobody at the location and the links are still bouncing up and down every few minutes.

Cheers,

databeestje

That would imply that the issue is with the network local to you.

It might be worth checking into link duplex settings. If you have a managed ethernet switch you have it a bit easier.

Since you mention it is a T1, it's not by any chance a Cisco 1600 with a 10BASE-T half-duplex connection, is it?

Or, if it's newer, something with a port forced to 100 full duplex where someone forgot to set the switch port to match?

familyguy

@databeestje:

That would imply that the issue is with the network local to you.

It might be worth checking into link duplex settings. If you have a managed ethernet switch you have it a bit easier.

Since you mention it is a T1, it's not by any chance a Cisco 1600 with a 10BASE-T half-duplex connection, is it?

Or, if it's newer, something with a port forced to 100 full duplex where someone forgot to set the switch port to match?

It's a managed switch (Catalyst 2960 w/24 10/100/1000 ports) and the T1 router is a Cisco 1841. The DSL router is some Netopia something or other that I don't recall. The switch is set to auto-negotiate the links and it currently thinks that both routers have negotiated 100mbit full-duplex connections. It's been 10+ years since I used FreeBSD on any regular basis. How can I force the nics on the pfsense system to 100mbit/full-duplex?

Also, how would I make the firewall less "twitchy" about taking an interface down? I'd like to experiment with allowing it to be a little more tolerant of temporary long-ish pings to the address being tested.

Cheers,

cmb

Check the switch to make sure you don't have any errors on that end; the firewall's end is clean. If the switch is clean as well, don't touch your speed/duplex. That's best left alone unless you have a problem that cannot be resolved in any other fashion, and it doesn't appear you have one.

I know a number of people are using load balancing, including myself on 1.2.1, and haven't seen flapping, so I'm inclined to think you really are seeing loss. The best way to determine that is to run a couple of tcpdumps from an SSH session:
tcpdump -ni fxp0 -w /tmp/wan.pcap host 1.2.3.4

replacing fxp0 with your real WAN interface, and 1.2.3.4 with the monitor IP for that interface. Repeat switching fxp0 for your second WAN, change the filename, and use its monitor IP. After it flaps again, ctrl-c to stop the SSH sessions and download the files from Diagnostics -> Command. If you don't know what to look for in the capture files, post the pcap files somewhere and add a URL here, or email them to me (cmb at pfsense dot org).

NickC

I'm seeing plenty of unnecessary DOWN/UP in the logs too.

I did the
tcpdump -ni fxp0 -w /tmp/wan.pcap host 1.2.3.4

and checked the pcap file at the point where a down occurred.

I looked at a single instance but it may give some clues. The down occurred where a second ping was sent out from pfsense 0.5s after the first, but before the reply from the first had returned.

So the packet sequence goes

request0, reply0 (UP)
request1, request2, reply1, reply2 (logs show Down 5s after request1)
request3, reply3 (back UP again)

So looks like the problem may be the timing of sending out the second ping. It's happening too soon. Maybe the system expects a reply within 500ms? The remote server I'm pinging against normally replies sub 50ms but obviously not in this instance.

The pinged IP in this instance is present just once in the list of monitor IPs in the slbd config.

Hope this makes sense.

Nick.
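The pattern described above (a reply that arrives after the next request has already gone out) can be picked out of a capture mechanically. The following is a minimal sketch, assuming the capture has first been reduced to one line per packet in the form `<seconds> <request|reply> <sequence>`; that input format, the function name `late_replies`, and the 500 ms limit are assumptions for illustration only.

```shell
# Hypothetical helper: list echo replies that took longer than a limit.
# $1    = limit in milliseconds
# stdin = lines of "<seconds> <request|reply> <seq>" (assumed format)
late_replies() {
    limit_ms=$1
    awk -v limit="$limit_ms" '
        # Remember when each request was sent, keyed by sequence number.
        $2 == "request" { sent[$3] = $1 }
        # On a reply, compute the round trip and report it if too slow.
        $2 == "reply" && ($3 in sent) {
            rtt = ($1 - sent[$3]) * 1000
            if (rtt > limit)
                printf "seq %s rtt %.0f ms\n", $3, rtt
        }
    '
}
```

Feeding it a sequence shaped like the one in this post (request1, request2 half a second later, then reply1 and reply2) would report only the first of the two replies as late, which matches the point at which the logs showed the link going down.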

familyguy

@NickC:

I'm seeing plenty of unnecessary DOWN/UP in the logs too.

I did the
tcpdump -ni fxp0 -w /tmp/wan.pcap host 1.2.3.4

and checked the pcap file at the point where a down occurred.

I looked at a single instance but it may give some clues. The down occurred where a second ping was sent out from pfsense 0.5s after the first, but before the reply from the first had returned.

So the packet sequence goes

request0, reply0 (UP)
request1, request2, reply1, reply2 (logs show Down 5s after request1)
request3, reply3 (back UP again)

So looks like the problem may be the timing of sending out the second ping. It's happening too soon. Maybe the system expects a reply within 500ms? The remote server I'm pinging against normally replies sub 50ms but obviously not in this instance.

The pinged IP in this instance is present just once in the list of monitor IPs in the slbd config.

Hope this makes sense.

I think you're on to something here. Would it be possible to tune this ping timing to make it less twitchy?

To answer cmb, neither pfsense nor the switch are seeing any dropped packets. So that seems to rule out the interfaces renegotiating the link speed theory.

Edit: I've been looking through the code and it appears that /usr/local/bin/ping_hosts.sh contains all the voodoo for bouncing the links up and down. I'm not much of a shell scripter so I'm at a loss on how to tune this to make it less sensitive to these out of order ping responses (if that is indeed the source of the problem). Suggestions anyone?

Cheers,

familyguy

Bump. This is still a problem for me. I'm running MRTG on the WAN interface of both routers and they are NOT losing their connectivity.

Again, can someone point me in the right direction to somehow blunt this hair trigger for marking interfaces down in pfsense?

Thanks!

databeestje

We use the slbd process, which then launches an fping command to ping the monitor IP.

fping ignores the duplicate reply to the 1st request. It will thus retry; if the 2nd attempt succeeds, it returns exit code 0.

On anything else it will return 1 or 2.
And that would cause the state to change.
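The exit-code behaviour described above can be sketched in shell as follows. This is not how slbd is implemented (it is a binary and its internals are not shown in this thread); `check_monitor` is a hypothetical wrapper that runs whatever probe command it is given and treats exit code 0 as "up" and anything else as "down".

```shell
# Hypothetical wrapper: run a probe command and map its exit code to a state.
# Usage on a real system might look like (address is a placeholder):
#   check_monitor fping -q -r 1 192.0.2.1
check_monitor() {
    # "$@" is the probe command and its arguments; output is discarded
    # because only the exit code matters here.
    if "$@" >/dev/null 2>&1; then
        echo up      # exit code 0: the monitor IP answered
    else
        echo down    # exit code 1, 2 or anything else non-zero
    fi
}
```

Because the wrapper only looks at the exit code, a single missed probe is enough to report "down", which is the sensitivity being discussed in this thread.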

familyguy

@databeestje:

We use the slbd process, which then launches an fping command to ping the monitor IP.

fping ignores the duplicate reply to the 1st request. It will thus retry; if the 2nd attempt succeeds, it returns exit code 0.

On anything else it will return 1 or 2.
And that would cause the state to change.

Which script have you updated to use fping? Is it simply a matter of replacing "ping" with "fping" in the slbd script? This problem is seriously biting my office in the ass to the point that I've been asked to find another solution if I can't come up with a suitable fix pronto. It would be a shame if that happened because things are working well otherwise.

Cheers,

Perry

My thoughts when reading this post.

500ms is a long time for a monitor IP to answer.
I really only trust Intel NICs.
If possible, I would set up a 2nd PC with pfSense and 1 client pinging both monitor IPs.

familyguy

@Perry:

My thoughts when reading this post.

500ms is a long time for a monitor IP to answer.
I really only trust Intel NICs.
If possible, I would set up a 2nd PC with pfSense and 1 client pinging both monitor IPs.

I completely agree with you. However, I've replaced the interface with an intel card and the problem remains with the old fxp driver. So I really don't think it is the physical interface or ethernet driver that is the problem. I've also had our vendor replace the DSL router and the problem remains. I guess I'll try the fping thing and hope for the best.

Cheers,

databeestje

To be perfectly clear on this, slbd is a binary, not a script. Furthermore it already uses fping. It has done so since the 1.2 release and it's still the same in 1.2.1.

In 1.3 it is handled by apinger, which is something else entirely.

familyguy

@databeestje:

To be perfectly clear on this, slbd is a binary, not a script. Furthermore it already uses fping. It has done so since the 1.2 release and it's still the same in 1.2.1.

In 1.3 it is handled by apinger, which is something else entirely.

Bummer. I suppose I have reached a dead end then. Pfsense has been fantastic for us with the exception of this problem. I guess I will have to go back to researching a commercial solution for link aggregation.

Thanks for your information.

Best,

wallabybob

It would also be interesting to see a trace or tcpdump (including timestamps) taken at one of the routers around the time of a link down event. Does the trace show the same ordering?

Can you get a few traces at both the pfSense interface and the router interface at "link down" events? Do these show a similar pattern? Does ordering match at both ends? (For example, if one trace shows the sequence Tx Request 1, TX Request 2, Rx Response 1, Rx Response 2 does the other end show Rx Request 1, Rx Request 2, Tx Response 1, Tx Response 2?) Are the intervals between requests consistent OR do the requests sometimes appear almost "back to back"?
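To check whether requests sometimes go out almost back to back, the gaps between successive requests can be computed from a trace. The sketch below assumes the trace has been reduced to lines of the form `<seconds> <request|reply> <sequence>`; that format, the function name `short_gaps`, and the threshold are illustrative assumptions rather than anything pfSense provides.

```shell
# Hypothetical helper: flag requests sent too soon after the previous one.
# $1    = minimum expected gap between requests, in milliseconds
# stdin = lines of "<seconds> <request|reply> <seq>" (assumed format)
short_gaps() {
    min_ms=$1
    awk -v min="$min_ms" '
        $2 == "request" {
            gap = ($1 - prev) * 1000
            # Skip the very first request: it has no predecessor.
            if (seen && gap < min)
                printf "seq %s sent %.0f ms after previous request\n", $3, gap
            prev = $1
            seen = 1
        }
    '
}
```

Running the same helper over traces taken at both ends of the link would show whether the short gaps appear at the sender already or only at the receiver.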

What does a trace of a regular ping over this link look like? Does either end (or both ends) show delayed response from time to time?

I can think of a number of different possible scenarios that might cause the behaviour described. And these scenarios might never be noticed in "normal" operation because they are masked by (for example) normal TCP dataloss recovery. Perhaps a device driver bug might cause a received frame to go unnoticed until the next received frame or the next transmit completion. Perhaps there are queueing/scheduling delays such that responses aren't timely. Perhaps fping's clock has a drift such that it sometimes sends requests too close together and so doesn't allow sufficient time for the responses to come back.

I'm not familiar with the internals of fping and its man page doesn't provide much detail, but its test methodology doesn't seem particularly robust. To decide a link is down on the basis of a timeout on a single response seems far too aggressive. Networks sometimes lose packets and sometimes re-order packets.
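One common way to make a monitor less aggressive is to require several consecutive failures before declaring a link down. The sketch below illustrates that idea only; it is not something slbd or fping is known to offer, and the function name `debounce_state` and the "ok"/"fail" input format are hypothetical.

```shell
# Hypothetical helper: turn a stream of probe results into state changes,
# declaring "down" only after N consecutive failures.
# $1    = number of consecutive failures required
# stdin = one probe result per line, either "ok" or "fail" (assumed format)
debounce_state() {
    threshold=$1
    state=up
    fails=0
    while read -r result; do
        if [ "$result" = ok ]; then
            fails=0
            # A single good reply brings the link straight back up.
            if [ "$state" = down ]; then
                state=up
                echo up
            fi
        else
            fails=$((fails + 1))
            # Only change state once enough failures have accumulated.
            if [ "$state" = up ] && [ "$fails" -ge "$threshold" ]; then
                state=down
                echo down
            fi
        fi
    done
}
```

With a threshold of 3, an isolated lost or late reply produces no state change at all, whereas a threshold of 1 reproduces the flapping behaviour reported here.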

familyguy

@wallabybob:

I'm not familiar with the internals of fping and its man page doesn't provide much detail, but its test methodology doesn't seem particularly robust. To decide a link is down on the basis of a timeout on a single response seems far too aggressive. Networks sometimes lose packets and sometimes re-order packets.

I agree 100%. I've asked if there is a way to make pfsense more tolerant so that it doesn't have a "hair trigger," but nobody responded with any suggestions. If there was a way to reconfigure things so the user could increase the threshold of pain before yanking an interface down, that would be ideal.

Cheers,

wallabybob

I don't know if you plan to take up my other suggestions. They may well be hard work and a bit of a stretch.

If I understand your problem reports correctly, your pfSense box is seeing significantly delayed responses on supposedly idle (or very near idle) single hop links. Regardless of the fping issue, it's quite possible something else in your network is not working optimally. Depending on what that is, tweaking the fping timeout may not help very much.

familyguy

@wallabybob:

I don't know if you plan to take up my other suggestions. They may well be hard work and a bit of a stretch.

If I understand your problem reports correctly, your pfSense box is seeing significantly delayed responses on supposedly idle (or very near idle) single hop links. Regardless of the fping issue, it's quite possible something else in your network is not working optimally. Depending on what that is, tweaking the fping timeout may not help very much.

I can't imagine what else could be wrong. These are direct links from one device to another. There is no switch/hub/anything in between. And I've already tried swapping in known good cabling (only about 1m of cat5e cable), changed the ethernet NIC on the pfsense box, etc. I don't know what else to do at this point.

Cheers,

wallabybob

I've been through all the posts on this topic.

Familyguy: Are you seeing what NickC reported? Do your traces show a similar pattern to NickC's trace? (I had just assumed that, but I now can't see anywhere where you have said you have taken traces and seen the same sort of pattern NickC reported.)

What's this MRTG you were running on both routers? What are the routers? Do they run the same software? Do the routers have some sort of utility for generating a tcpdump-like trace? (You may have to connect something on another router port so that the trace traffic does not go over the interface being traced.)

You began your original post "All of a sudden …" For how long had it been working without reporting these errors? Can anyone remember anything that happened to the pfSense box or any of the routers at around the time the messages "suddenly" appeared? Someone dropped a box (possibly cracking a PCB trace for example)? Power surge? Marginal power supplies can cause systems to behave erratically. (I recall a PDP11/40 mini computer with a micro processor controlled WAN communications card at the end of the chassis furthest from the power supply. The comms controller behaved erratically, resetting the protocol a few times a minute. One of the voltages to the slot was just under the correct value.)

You mention using dc interfaces initially, then fxp. Are these interfaces on dual port cards or single port cards? Did you make any changes to the card configuration soon before this started happening? How old is the system and what is the date of the BIOS? What system or motherboard are you using? What is in the PCI slots and are there any PCI slots spare?

Please provide the dmesg output from the pfsense box.

Have you tried switching the WAN and OPT2 interfaces to polling mode? At the shell prompt

ifconfig fxp0 polling

will enable polling mode on interface fxp0. To disable polling mode, at the shell prompt type

ifconfig fxp0 -polling

Wait at least a couple of minutes before deciding whether or not it makes a difference. If it does make a difference on one interface then try it on the other.

I have reasons for asking all these questions, but I don't have the time now to explain other than to say that the information currently available does not explain what is reported. You say you have run out of ideas. I haven't. For now, I'm prepared to give my more than 25 years of networking experience to work on this problem, but I need more to work with. I realise I have asked for a lot, but I probably won't be able to do anything more on this issue until Sunday night (4 days from now). If you give me a fair bit to work with by the time I can get back to this, I will have a greater chance of putting together a reasonable theory of what is happening. If you have to move on or aren't able to get me more information, that is fine.

familyguy

Thank you for the generous offer to help. I'll gather as much info as I can. The thing that really has me scratching my head is that OPT1 seems to be the "problem child," though the WAN interface periodically flaps too. I've changed the monitor IP for OPT1 to be OPT1's own IP address and it STILL happens. That is with both a dc ethernet card AND with the integrated Intel NIC on the motherboard that uses the fxp driver. So if the pings are failing even when it is monitoring the OPT1 interface itself, I'm just plain confused.

One more thing I'm going to try is to use a more current snapshot and see if that helps at all.

Cheers,