WAN interfaces flapping with multiWAN

familyguy

I'm suddenly getting a lot of these types of errors in the system logs:

Aug 16 13:40:10 slbd[366]: ICMP poll failed for 63.138.38.129, marking service DOWN
Aug 16 13:40:10 slbd[366]: Service WAN1FailsToWAN2 changed status, reloading filter policy
Aug 16 13:40:15 slbd[366]: ICMP poll succeeded for 63.138.38.129, marking service UP
Aug 16 13:40:15 slbd[366]: Service WAN1FailsToWAN2 changed status, reloading filter policy
Aug 16 13:41:23 slbd[366]: ICMP poll failed for 63.138.38.129, marking service DOWN
Aug 16 13:41:23 slbd[366]: Service LoadBalance changed status, reloading filter policy
Aug 16 13:41:28 slbd[366]: ICMP poll succeeded for 63.138.38.129, marking service UP
Aug 16 13:41:29 slbd[366]: Service LoadBalance changed status, reloading filter policy

It is happening on both WAN interfaces and occurring several times an hour for each one. As you can see, it corrects itself after about 5 seconds. One WAN is a point to point T1 (that's the IP of the ISP's DNS server above). The other is an SDSL link and I'm pinging the internal LAN address of the DSL router so it can't be a problem with the external link. Neither WAN router thinks its external interface is losing connectivity so I'm a bit stumped. Anyone else experiencing this?
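For anyone who wants to put numbers on how often and how long each monitor IP is marked down, here is a minimal Python sketch that pairs the DOWN and UP lines from the system log. It assumes the slbd log format shown above; the function name `outage_durations` and the `year` parameter are made up for illustration (syslog timestamps carry no year).

```python
import re
from datetime import datetime

# Matches slbd lines such as:
# "Aug 16 13:40:10 slbd[366]: ICMP poll failed for 63.138.38.129, marking service DOWN"
POLL_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+slbd\[\d+\]:\s+"
    r"ICMP poll (?:failed|succeeded) for ([\d.]+), marking service (DOWN|UP)"
)


def outage_durations(lines, year=2008):
    """Return a list of (monitor_ip, down_time, seconds_down) tuples."""
    down_since = {}  # monitor IP -> time it was first marked DOWN
    outages = []
    for line in lines:
        match = POLL_RE.match(line.strip())
        if not match:
            continue  # skip "changed status, reloading filter policy" etc.
        stamp, ip, state = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if state == "DOWN":
            down_since.setdefault(ip, when)
        elif ip in down_since:
            start = down_since.pop(ip)
            outages.append((ip, start, (when - start).total_seconds()))
    return outages


if __name__ == "__main__":
    sample = [
        "Aug 16 13:40:10 slbd[366]: ICMP poll failed for 63.138.38.129, marking service DOWN",
        "Aug 16 13:40:15 slbd[366]: ICMP poll succeeded for 63.138.38.129, marking service UP",
    ]
    for ip, start, seconds in outage_durations(sample):
        print(ip, start.time(), seconds)
```

Fed the full system log, the same function would show whether the outages cluster at particular times of day or always last the same few seconds.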

I'm using the following snapshot:

1.2.1-TESTING-SNAPSHOT
built on Sat Jul 19 07:13:48 EDT 2008

Here are the stats on the WAN interfaces:

WAN interface (dc1)
Status up
MAC address 00:a0:cc:63:91:84
IP address (redacted)
Subnet mask 255.255.255.248
Gateway (redacted)
ISP DNS servers 208.67.222.222
208.67.220.220
Media 100baseTX <full-duplex>
In/out packets 44458963/36944351 (1022.08 MB/107.57 MB)
In/out errors 0/0
Collisions 0

LAN interface (re0)
Status up
MAC address 00:18:f8:0b:21:35
IP address 10.0.0.2
Subnet mask 255.255.255.0
Media 1000baseTX <full-duplex>
In/out packets 67442505/77750290 (915.10 MB/590.85 MB)
In/out errors 0/0
Collisions 0

OPT1 interface (dc0)
Status up
MAC address 00:a0:cc:63:6b:55
IP address 10.0.1.10
Subnet mask 255.255.255.0
Gateway 10.0.1.1
Media 100baseTX <full-duplex>
In/out packets 36487760/33143675 (238.86 MB/1.12 GB)
In/out errors 0/0
Collisions 0

databeestje

I'm afraid that there is something amiss with one of the WAN connections. Once a connection starts flapping it is generally a sign of packet loss or extremely high latency.

In 1.2 we use fping for gateway detection; it has retries and a backoff algorithm, so the highest latency before failing is 1.5 seconds, IIRC.

In 1.3 we use something else, but it will notify you of such issues regardless.
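To put rough numbers on what "retries and a backoff algorithm" means in practice, here is a minimal Python sketch. It assumes the defaults listed in the fping man page (500 ms initial timeout, backoff factor 1.5, 3 retries); which values slbd actually passes to fping is not stated in this thread, so the figures are illustrative only. The function names are hypothetical.

```python
def retry_schedule(initial_timeout_ms=500.0, backoff=1.5, retries=3):
    """Per-attempt wait in ms: the first try plus `retries` retries,
    each wait multiplied by the backoff factor."""
    waits = []
    wait = float(initial_timeout_ms)
    for _ in range(retries + 1):
        waits.append(wait)
        wait *= backoff
    return waits


def worst_case_ms(initial_timeout_ms=500.0, backoff=1.5, retries=3):
    """Total time spent waiting before the target is declared unreachable."""
    return sum(retry_schedule(initial_timeout_ms, backoff, retries))


if __name__ == "__main__":
    print(retry_schedule())  # [500.0, 750.0, 1125.0, 1687.5]
    print(worst_case_ms())   # 4062.5
```

Under these assumed defaults, a reply that takes longer than 500 ms already misses the first window, even though the target is not declared unreachable until every retry has timed out.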

wallabybob

I don't know how the multi-WAN stuff is implemented.

What if the WAN links were saturated (in one or both directions) and there was no queueing OR inappropriate queueing such that either (or both) the ICMP request and response had a "long" delay in actually getting on the wire because they were stuck behind other traffic?

If your ping over the T1 is going to the ISP's router, is that system configured to give you a timely response? What if it's busy (it might be busy for reasons that have nothing to do with your traffic)?

databeestje

We have a process (slbd in 1.2, apinger in 1.3) that monitors the monitor IP addresses, which should be behind the gateways. We add a static route for each monitor IP so we are sure we are using the right interface.

When a gateway status changes we trigger a filter reload in 1.2 and 1.3.

In 1.2, if a link is saturated to the point where the latency is over 2 seconds and 3 attempts to ping have failed, we mark it down. In 1.3 such a state would be marked with "Latency".

If a link has packet loss in 1.2, it will most often stay up without too many issues, but you will see occasional up and down events, which is to be expected really, because the connection isn't very good. In 1.3 we mark a connection with packet loss as "Loss".

You will need to make sure that the configured monitor IP is marginally correct. That is why you can configure a monitor IP different from the local gateway which might be connected over ethernet.

familyguy

Hmmm… just for grins, I've changed it so that the pings go to the LAN interface on the other WAN router (the T1 link), so both links are now pinging the internal LAN interface of the relevant router. It hasn't had any effect on the behavior, so I'm inclined to think it is an issue with pfSense, not a bona fide problem with the WAN links themselves. If anyone has any troubleshooting suggestions, I'm certainly open to trying them. The link flapping seems to have nothing to do with load: it is now a Saturday afternoon, nobody is at the location, and the links are still bouncing up and down every few minutes.

Cheers,

databeestje

That would imply that the issue is with the network local to you.

It might be worth checking into link duplex settings. If you have a managed ethernet switch you have it a bit easier.

Since you mention it is a T1, it's not by any chance a Cisco 1600 with a 10BASE-T half-duplex connection, is it?

Or, if it's newer, something that has a port forced to 100 full duplex where someone forgot to set the switch port to the same?

familyguy

@databeestje:

That would imply that the issue is with the network local to you.

It might be worth checking into link duplex settings. If you have a managed ethernet switch you have it a bit easier.

Since you mention it is a T1, it's not by any chance a Cisco 1600 with a 10BASE-T half-duplex connection, is it?

Or, if it's newer, something that has a port forced to 100 full duplex where someone forgot to set the switch port to the same?

It's a managed switch (Catalyst 2960 w/24 10/100/1000 ports) and the T1 router is a Cisco 1841. The DSL router is some Netopia something or other that I don't recall. The switch is set to auto-negotiate the links and it currently thinks that both routers have negotiated 100 Mbit full-duplex connections. It's been 10+ years since I used FreeBSD on any regular basis. How can I force the NICs on the pfSense system to 100 Mbit/full-duplex?

Also, how would I make the firewall less "twitchy" about taking an interface down? I'd like to experiment with allowing it to be a little more tolerant of temporary long-ish pings to the address being tested.

Cheers,

cmb

Check the switch to make sure you don't have any errors on that end; the firewall's end is clean. If the switch is as well, don't touch your speed/duplex. That's best left alone unless you have a problem that cannot be resolved in any other fashion, and it doesn't appear you have a problem.

I know a number of people, including myself, are using load balancing on 1.2.1 and haven't seen flapping, so I'm inclined to think you really are seeing loss. The best way to determine that is to run a couple of tcpdumps from an SSH session:
tcpdump -ni fxp0 -w /tmp/wan.pcap host 1.2.3.4

replacing fxp0 with your real WAN interface, and 1.2.3.4 with the monitor IP for that interface. Repeat for your second WAN, swapping in its interface name, a different filename, and its monitor IP. After it flaps again, press Ctrl-C in each SSH session to stop the capture and download the files from Diagnostics -> Command. If you don't know what to look for in the capture files, post the pcap files somewhere and add a URL here, or email them to me (cmb at pfsense dot org).

NickC

I'm seeing plenty of unnecessary DOWN/UP in the logs too.

I did the
tcpdump -ni fxp0 -w /tmp/wan.pcap host 1.2.3.4

and checked the pcap file at the point where a down occurred.

I looked at a single instance but it may give some clues. The down occurred where a second ping was sent out from pfsense 0.5s after the first, but before the reply from the first had returned.

So the packet sequence goes

request0, reply0 (UP)
request1, request2, reply1, reply2 (logs show Down 5s after request1)
request3, reply3 (back UP again)

So looks like the problem may be the timing of sending out the second ping. It's happening too soon. Maybe the system expects a reply within 500ms? The remote server I'm pinging against normally replies sub 50ms but obviously not in this instance.
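To check that hypothesis across more than one instance, a minimal Python sketch like the one below can flag every echo request whose reply took longer than a chosen threshold. It assumes the capture has already been reduced to (time in seconds, "request" or "reply", ICMP sequence number) tuples. The timestamps in the example are invented to mirror the sequence above, and the 0.5 s threshold is only the suspected value, not a confirmed slbd/fping setting.

```python
def round_trip_times(events):
    """Map ICMP sequence number -> round-trip time in seconds.

    `events` is an iterable of (time_s, kind, seq) tuples where kind is
    "request" or "reply". Only the first request and first reply per
    sequence number are used.
    """
    sent = {}
    rtts = {}
    for when, kind, seq in sorted(events):
        if kind == "request":
            sent.setdefault(seq, when)
        elif kind == "reply" and seq in sent and seq not in rtts:
            rtts[seq] = when - sent[seq]
    return rtts


def late_replies(events, timeout_s=0.5):
    """Sequence numbers whose reply arrived after `timeout_s` seconds."""
    rtts = round_trip_times(events)
    return sorted(seq for seq, rtt in rtts.items() if rtt > timeout_s)


if __name__ == "__main__":
    # Hypothetical timestamps: request2 leaves 0.5 s after request1,
    # before reply1 has come back.
    capture = [
        (0.00, "request", 0), (0.03, "reply", 0),
        (5.00, "request", 1), (5.50, "request", 2),
        (5.62, "reply", 1), (5.65, "reply", 2),
        (10.00, "request", 3), (10.03, "reply", 3),
    ]
    print(late_replies(capture))  # [1]
```

If every DOWN event in the log lines up with a sequence number reported here, slow replies rather than lost packets would be the likelier trigger.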

The pinged IP in this instance is present just once in the list of monitor IPs in the slbd config.

Hope this makes sense.

Nick.

familyguy

@NickC:

I'm seeing plenty of unnecessary DOWN/UP in the logs too.

I did the
tcpdump -ni fxp0 -w /tmp/wan.pcap host 1.2.3.4

and checked the pcap file at the point where a down occurred.

I looked at a single instance but it may give some clues. The down occurred where a second ping was sent out from pfsense 0.5s after the first, but before the reply from the first had returned.

So the packet sequence goes

request0, reply0 (UP)
request1, request2, reply1, reply2 (logs show Down 5s after request1)
request3, reply3 (back UP again)

So looks like the problem may be the timing of sending out the second ping. It's happening too soon. Maybe the system expects a reply within 500ms? The remote server I'm pinging against normally replies sub 50ms but obviously not in this instance.

The pinged IP in this instance is present just once in the list of monitor IPs in the slbd config.

Hope this makes sense.

I think you're on to something here. Would it be possible to tune this ping timing to make it less twitchy?

To answer cmb, neither pfSense nor the switch is seeing any dropped packets. So that seems to rule out the theory that the interfaces are renegotiating the link speed.

Edit: I've been looking through the code and it appears that /usr/local/bin/ping_hosts.sh contains all the voodoo for bouncing the links up and down. I'm not much of a shell scripter so I'm at a loss on how to tune this to make it less sensitive to these out of order ping responses (if that is indeed the source of the problem). Suggestions anyone?

Cheers,

familyguy

Bump. This is still a problem for me. I'm running MRTG on the WAN interface of both routers and they are NOT losing their connectivity.

Again, can someone point me in the right direction to somehow blunt this hair trigger for marking interfaces down in pfsense?

Thanks!

databeestje

We use the slbd process, which then launches an fping command to ping the monitor IP.

fping ignores the duplicate reply to the 1st request. It will thus retry; if the 2nd attempt succeeds, it returns return code 0.

On anything else it will return 1 or 2.
And that would cause the state to change.

familyguy

@databeestje:

We use the slbd process, which then launches an fping command to ping the monitor IP.

fping ignores the duplicate reply to the 1st request. It will thus retry; if the 2nd attempt succeeds, it returns return code 0.

On anything else it will return 1 or 2.
And that would cause the state to change.

Which script have you updated to use fping? Is it simply a matter of replacing "ping" with "fping" in the slbd script? This problem is seriously biting my office in the ass to the point that I've been asked to find another solution if I can't come up with a suitable fix pronto. It would be a shame if that happened because things are working well otherwise.

Cheers,

Perry

My thoughts when reading this post.

500 ms is a long time for a monitor IP to answer.
I really only trust Intel NICs.
If possible, I would set up a 2nd PC with pfSense and 1 client pinging both monitor IPs.

familyguy

@Perry:

My thoughts when reading this post.

500 ms is a long time for a monitor IP to answer.
I really only trust Intel NICs.
If possible, I would set up a 2nd PC with pfSense and 1 client pinging both monitor IPs.

I completely agree with you. However, I've replaced the interface with an intel card and the problem remains with the old fxp driver. So I really don't think it is the physical interface or ethernet driver that is the problem. I've also had our vendor replace the DSL router and the problem remains. I guess I'll try the fping thing and hope for the best.

Cheers,

databeestje

To be perfectly clear on this, slbd is a binary, not a script. Furthermore it already uses fping. It has done so since the 1.2 release and it's still the same in 1.2.1.

In 1.3 it is handled with apinger, which is something else entirely.

familyguy

@databeestje:

To be perfectly clear on this, slbd is a binary, not a script. Furthermore it already uses fping. It has done so since the 1.2 release and it's still the same in 1.2.1.

In 1.3 it is handled with apinger, which is something else entirely.

Bummer. I suppose I have reached a dead end then. pfSense has been fantastic for us with the exception of this problem. I guess I will have to go back to researching a commercial solution for link aggregation.

Thanks for your information.

Best,

wallabybob

It would also be interesting to see a trace or tcpdump (including timestamps) taken at one of the routers around the time of a link down event. Does the trace show the same ordering?

Can you get a few traces at both the pfSense interface and the router interface at "link down" events? Do these show a similar pattern? Does ordering match at both ends? (For example, if one trace shows the sequence Tx Request 1, TX Request 2, Rx Response 1, Rx Response 2 does the other end show Rx Request 1, Rx Request 2, Tx Response 1, Tx Response 2?) Are the intervals between requests consistent OR do the requests sometimes appear almost "back to back"?

What does a trace of a regular ping over this link look like? Does either end (or both ends) show delayed response from time to time?

I can think of a number of different possible scenarios that might cause the behaviour described. And these scenarios might never be noticed in "normal" operation because they are masked by (for example) normal TCP dataloss recovery. Perhaps a device driver bug might cause a received frame to go unnoticed until the next received frame or the next transmit completion. Perhaps there are queueing/scheduling delays such that responses aren't timely. Perhaps fping's clock has a drift such that it sometimes sends requests too close together and so doesn't allow sufficient time for the responses to come back.

I'm not familiar with the internals of fping and its man page doesn't provide much detail, but its test methodology doesn't seem particularly robust. To decide a link is down on the basis of a timeout on a single response seems far too aggressive. Networks sometimes lose packets and sometimes re-order packets.
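As an illustration of a less aggressive policy, here is a minimal Python sketch of a monitor that only changes state after several consecutive polls disagree with the current state. This is not how slbd or fping work and is not an existing pfSense option; the class name `LinkMonitor` and the `down_after`/`up_after` thresholds are hypothetical.

```python
class LinkMonitor:
    """Track link state with hysteresis.

    The link is marked DOWN only after `down_after` consecutive failed
    polls, and marked UP again only after `up_after` consecutive
    successful polls. A single lost or late reply changes nothing.
    """

    def __init__(self, down_after=3, up_after=2):
        self.down_after = down_after
        self.up_after = up_after
        self.state = "UP"
        self._streak = 0  # consecutive polls that contradict self.state

    def record(self, poll_ok):
        """Feed one poll result (True = reply received) and return the state."""
        contradicts = poll_ok != (self.state == "UP")
        self._streak = self._streak + 1 if contradicts else 0
        threshold = self.down_after if self.state == "UP" else self.up_after
        if self._streak >= threshold:
            self.state = "DOWN" if self.state == "UP" else "UP"
            self._streak = 0
        return self.state


if __name__ == "__main__":
    monitor = LinkMonitor()
    # One missed poll between good ones: the link stays UP.
    print([monitor.record(ok) for ok in (True, False, True)])
    # Three misses in a row: the link goes DOWN on the third.
    print([monitor.record(ok) for ok in (False, False, False)])
```

The trade-off is slower failover: with a 5 second poll interval and `down_after=3`, a genuinely dead link would take roughly 15 seconds to be marked down.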

familyguy

@wallabybob:

I'm not familiar with the internals of fping and its man page doesn't provide much detail, but its test methodology doesn't seem particularly robust. To decide a link is down on the basis of a timeout on a single response seems far too aggressive. Networks sometimes lose packets and sometimes re-order packets.

I agree 100%. I've asked if there is a way to make pfsense more tolerant so that it doesn't have a "hair trigger," but nobody responded with any suggestions. If there was a way to reconfigure things so the user could increase the threshold of pain before yanking an interface down, that would be ideal.

Cheers,

wallabybob

I don't know if you plan to take up my other suggestions. They may well be hard work and a bit of a stretch.

If I understand your problem reports correctly, your pfSense box is seeing significantly delayed responses on supposedly idle (or very nearly idle) single-hop links. Regardless of the fping issue, it's quite possible something else in your network is not working optimally. Depending on what that is, tweaking the fping timeout may not help very much.