WAN interfaces flapping with multiWAN

databeestje · Sep 12, 2008, 7:54 PM

We use the slbd process, which launches an fping command to ping the monitor IP.

fping ignores the duplicate reply to the 1st request. It will thus retry; if the 2nd request succeeds, it returns exit code 0.

On anything else it will return 1 or 2.
And that would cause the state to change.

familyguy · Sep 12, 2008, 10:43 PM

@databeestje:

We use the slbd process, which launches an fping command to ping the monitor IP.

fping ignores the duplicate reply to the 1st request. It will thus retry; if the 2nd request succeeds, it returns exit code 0.

On anything else it will return 1 or 2.
And that would cause the state to change.

Which script have you updated to use fping? Is it simply a matter of replacing "ping" with "fping" in the slbd script? This problem is seriously biting my office in the ass to the point that I've been asked to find another solution if I can't come up with a suitable fix pronto. It would be a shame if that happened because things are working well otherwise.

Cheers,

Perry · Sep 13, 2008, 12:30 AM

My thoughts when reading this post.

500 ms is a long time for a monitor IP to answer.
I really only trust Intel NICs.
If possible, I would set up a 2nd PC with pfSense and 1 client pinging both monitor IPs.

familyguy · Sep 14, 2008, 9:16 PM

@Perry:

My thoughts when reading this post.

500 ms is a long time for a monitor IP to answer.
I really only trust Intel NICs.
If possible, I would set up a 2nd PC with pfSense and 1 client pinging both monitor IPs.

I completely agree with you. However, I've replaced the interface with an Intel card and the problem remains with the old fxp driver. So I really don't think the physical interface or the Ethernet driver is the problem. I've also had our vendor replace the DSL router and the problem remains. I guess I'll try the fping thing and hope for the best.

Cheers,

databeestje · Sep 15, 2008, 8:14 PM

To be perfectly clear on this, slbd is a binary, not a script. Furthermore it already uses fping. It has done so since the 1.2 release and it's still the same in 1.2.1.

In 1.3 it is handled with apinger, which is something else entirely.

familyguy · Sep 16, 2008, 2:00 AM

@databeestje:

To be perfectly clear on this, slbd is a binary, not a script. Furthermore it already uses fping. It has done so since the 1.2 release and it's still the same in 1.2.1.

In 1.3 it is handled with apinger, which is something else entirely.

Bummer. I suppose I have reached a dead end then. pfSense has been fantastic for us with the exception of this problem. I guess I will have to go back to researching a commercial solution for link aggregation.

Thanks for your information.

Best,

wallabybob · Sep 16, 2008, 1:56 PM

It would also be interesting to see a trace or tcpdump (including timestamps) taken at one of the routers around the time of a link down event. Does the trace show the same ordering?

Can you get a few traces at both the pfSense interface and the router interface at "link down" events? Do these show a similar pattern? Does ordering match at both ends? (For example, if one trace shows the sequence Tx Request 1, TX Request 2, Rx Response 1, Rx Response 2 does the other end show Rx Request 1, Rx Request 2, Tx Response 1, Tx Response 2?) Are the intervals between requests consistent OR do the requests sometimes appear almost "back to back"?

What does a trace of a regular ping over this link look like? Does either end (or both ends) show delayed response from time to time?

I can think of a number of different possible scenarios that might cause the behaviour described. And these scenarios might never be noticed in "normal" operation because they are masked by (for example) normal TCP data loss recovery. Perhaps a device driver bug might cause a received frame to go unnoticed until the next received frame or the next transmit completion. Perhaps there are queueing/scheduling delays such that responses aren't timely. Perhaps fping's clock drifts such that it sometimes sends requests too close together and so doesn't allow sufficient time for the responses to come back.

I'm not familiar with the internals of fping and its man page doesn't provide much detail, but its test methodology doesn't seem particularly robust. To decide a link is down on the basis of a timeout on a single response seems far too aggressive. Networks sometimes lose packets and sometimes re-order packets.
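One way to picture a less aggressive test is a monitor that only declares the link down after several probes in a row have failed. The following is a minimal sketch of that idea, not how slbd or fping actually work; the class name `LinkMonitor` and the parameters `down_after` and `up_after` are hypothetical.

```python
class LinkMonitor:
    """Hypothetical monitor: mark a link down only after repeated failures."""

    def __init__(self, down_after: int = 3, up_after: int = 1) -> None:
        self.down_after = down_after  # consecutive failures needed to go down
        self.up_after = up_after      # consecutive successes needed to come back up
        self.state = "up"
        self._fail = 0
        self._ok = 0

    def record(self, probe_ok: bool) -> str:
        """Feed in one probe result and return the resulting link state."""
        if probe_ok:
            self._ok += 1
            self._fail = 0
            if self.state == "down" and self._ok >= self.up_after:
                self.state = "up"
        else:
            self._fail += 1
            self._ok = 0
            if self.state == "up" and self._fail >= self.down_after:
                self.state = "down"
        return self.state


monitor = LinkMonitor(down_after=3)
probes = (True, False, True, False, False, False, True)
states = [monitor.record(ok) for ok in probes]
print(states)
```

With `down_after=3`, a single lost or re-ordered reply leaves the state unchanged; only the run of three failures flips it to down, and the next success brings it back up.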

familyguy · Sep 16, 2008, 5:55 PM

@wallabybob:

I'm not familiar with the internals of fping and its man page doesn't provide much detail, but its test methodology doesn't seem particularly robust. To decide a link is down on the basis of a timeout on a single response seems far too aggressive. Networks sometimes lose packets and sometimes re-order packets.

I agree 100%. I've asked if there is a way to make pfSense more tolerant so that it doesn't have a "hair trigger," but nobody responded with any suggestions. If there were a way to reconfigure things so the user could increase the threshold of pain before yanking an interface down, that would be ideal.

Cheers,

wallabybob · Sep 16, 2008, 10:45 PM

I don't know if you plan to take up my other suggestions. They may well be hard work and a bit of a stretch.

If I understand your problem reports correctly, your pfSense box is seeing significantly delayed responses on supposedly idle (or very nearly idle) single-hop links. Regardless of the fping issue, it's quite possible something else in your network is not working optimally. Depending on what that is, tweaking the fping timeout may not help very much.

familyguy · Sep 17, 2008, 1:25 AM

@wallabybob:

I don't know if you plan to take up my other suggestions. They may well be hard work and a bit of a stretch.

If I understand your problem reports correctly, your pfSense box is seeing significantly delayed responses on supposedly idle (or very nearly idle) single-hop links. Regardless of the fping issue, it's quite possible something else in your network is not working optimally. Depending on what that is, tweaking the fping timeout may not help very much.

I can't imagine what else could be wrong. These are direct links from one device to another. There is no switch/hub/anything in between. And I've already tried swapping in known good cabling (only about 1 m of Cat5e cable), changed the Ethernet NIC on the pfSense box, etc. I don't know what else to do at this point.

Cheers,

wallabybob · Sep 17, 2008, 2:07 PM

I've been through all the posts on this topic.

Familyguy: Are you seeing what NickC reported? Do your traces show a similar pattern to NickC's trace? (I had just assumed that, but I now can't see anywhere where you have said you have taken traces and seen the same sort of pattern NickC reported.)

What's this MRTG you were running on both routers? What are the routers? Do they run the same software? Do the routers have some sort of utility for generating a tcpdump-like trace? (You may have to connect something on another router port so that the trace traffic does not go over the interface being traced.)

You began your original post "All of a sudden …" For how long had it been working without reporting these errors? Can anyone remember anything that happened to the pfSense box or any of the routers at around the time the messages "suddenly" appeared? Someone dropped a box (possibly cracking a PCB trace, for example)? Power surge? Marginal power supplies can cause systems to behave erratically. (I recall a PDP11/40 minicomputer with a microprocessor-controlled WAN communications card at the end of the chassis furthest from the power supply. The comms controller behaved erratically, resetting the protocol a few times a minute. One of the voltages to the slot was just under the correct value.)

You mention using dc interfaces initially, then fxp. Are these interfaces on dual port cards or single port cards? Did you make any changes to the card configuration soon before this started happening? How old is the system and what is the date of the BIOS? What system or motherboard are you using? What is in the PCI slots and are there any PCI slots spare?

Please provide the dmesg output from the pfsense box.

Have you tried switching the WAN and OPT2 interfaces to polling mode? At the shell prompt,

ifconfig fxp0 polling

will enable polling mode on interface fxp0. To disable polling mode, at the shell prompt type

ifconfig fxp0 -polling

Wait at least a couple of minutes before deciding whether or not it makes a difference. If it does make a difference on one interface, then try it on the other.

I have reasons for asking all these questions, but I don't have the time now to explain, other than to say that the information currently available does not explain what is reported. You say you have run out of ideas. I haven't. For now, I'm prepared to put my more than 25 years of networking experience to work on this problem, but I need more to work with. I realise I have asked for a lot, but I probably won't be able to do anything more on this issue until Sunday night (4 days from now). If you give me a fair bit to work with by the time I can get back to this, I will have a greater chance of putting together a reasonable theory of what is happening. If you have to move on or aren't able to get me more information, that is fine.

familyguy · Sep 17, 2008, 7:45 PM

Thank you for the generous offer to help. I'll gather as much info as I can. The thing that really has me scratching my head is that OPT1 seems to be the "problem child," though the WAN interface periodically flaps too. I've changed the monitor IP for OPT1 to be OPT1's own IP address and it STILL happens. That is with both a dc Ethernet card AND with the integrated Intel NIC on the motherboard that uses the fxp driver. So if the pings are failing even when it is monitoring the OPT1 interface itself, I'm just plain confused.

One more thing I'm going to try is to use a more current snapshot and see if that helps at all.

Cheers,

databeestje · Sep 17, 2008, 8:36 PM

To be very clear about this:

We replaced ping explicitly by fping because ping was flapping so much. When we used ping it sent a single packet with a 2 second timeout. This failed too often.

When using fping, the default timeout is 400 ms, the backoff factor is 1.5 and the number of retries is 3.
This means that we wait 400 ms, 600 ms, 900 ms and 1.35 seconds for a ping response.

fping should only return failure after all retries have failed, so it takes at most 3.5 seconds before failure is detected.

If 1.2 does not exhibit this problem, there might be an entirely different issue at play.
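The wait times quoted above follow directly from multiplying the timeout by the backoff factor after each attempt. A short sketch of that arithmetic, using the 400 ms, 1.5 and 3 figures from this post; the function name `fping_wait_schedule` is made up for illustration and is not part of fping:

```python
def fping_wait_schedule(timeout_ms: float = 400.0,
                        backoff: float = 1.5,
                        retries: int = 3) -> list[float]:
    """Return the per-attempt wait times in ms: one initial try plus the retries."""
    waits = []
    wait = timeout_ms
    for _ in range(retries + 1):
        waits.append(wait)
        wait *= backoff  # each retry waits longer than the previous attempt
    return waits


schedule = fping_wait_schedule()
print(schedule)       # [400.0, 600.0, 900.0, 1350.0]
print(sum(schedule))  # 3250.0
```

The four waits add up to 3250 ms, which is within the "at most 3.5 seconds" stated above.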

But as said before, this requires a tcpdump of the icmp traffic to be able to debug this.

databeestje · Sep 21, 2008, 9:06 PM

We found an issue with FreeBSD 7 which might be causing us grief.

We have filed a PR and hope to get a response.

wallabybob · Sep 21, 2008, 9:10 PM

@databeestje:

We found an issue with FreeBSD 7 which might be causing us grief.

We have filed a PR and hope to get a response.

For those of us following this issue, can you give some more information, or at least a reference to the FreeBSD PR?

familyguy · Sep 22, 2008, 5:28 PM

@wallabybob:

@databeestje:

We found an issue with FreeBSD 7 which might be causing us grief.

We have filed a PR and hope to get a response.

For those of us following this issue, can you give some more information, or at least a reference to the FreeBSD PR?

Yes, I would also find that interesting (but perhaps not directly helpful since I'm not much of a programmer).

Best,

cmb · Oct 2, 2008, 8:23 PM

The PR is http://www.freebsd.org/cgi/query-pr.cgi?pr=127528

familyguy · Oct 9, 2008, 8:35 PM

@cmb:

The PR is http://www.freebsd.org/cgi/query-pr.cgi?pr=127528

Doesn't appear that they intend to "fix" this and they're saying it's an application level issue. Where does that leave those of us that are experiencing the problem? Go back to pre-FreeBSD7 distro?

Best,

eri-- · Oct 9, 2008, 9:28 PM

Can you all check what HZ you are running at?
It is reported by sysctl kern.hz. If it is greater than 1000, try setting it to 500 and retry.
HZ 2000 would also be interesting to test, but we will see.
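For context, kern.hz is the kernel timer frequency in ticks per second, so it sets how finely the kernel can schedule timeouts. The sketch below only shows the tick length for the three values mentioned here; whether that granularity has anything to do with the flapping is not established in this thread. The helper name `tick_ms` is made up for illustration.

```python
def tick_ms(hz: int) -> float:
    """Length of one kernel timer tick in milliseconds for a given kern.hz."""
    return 1000.0 / hz


for hz in (500, 1000, 2000):
    print(f"kern.hz={hz}: one tick = {tick_ms(hz)} ms")
```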

familyguy · Oct 9, 2008, 10:00 PM

@ermal:

Can you all check what HZ you are running at?
It is reported by sysctl kern.hz. If it is greater than 1000, try setting it to 500 and retry.
HZ 2000 would also be interesting to test, but we will see.

Huh? I don't understand what you just said. What are you suggesting we change and why?

Best,