Captive default route

PurpleOfPants

I have a LAN using a Vigor router as the default route at 192.168.0.1, which works fine. I've added another WAN connection and connect to that via pfSense at 192.168.0.11, which also works fine. I'm just testing pfSense prior to switching over to it completely, so it's only the default route for one or two pieces of kit.

In preparation for dual-WAN load balancing, I set up a load balancing pool for the WAN connection to pfSense. That appears to work OK (i.e. do nothing, because it's the only connection) - the monitoring tracks the actual state of the WAN and data goes in and out.

The problem is that enabling load balancing has broken the LAN! Kit that has its default route as .1 can now no longer access the Vigor. If I change their default route to .11 then they successfully use the pfSense. If I delete the load balancing pool then the kit can again use .1 as a default route.

I'm foxed. The pfSense is not routing the .1 traffic - it should be whizzing past completely oblivious to the pfSense being there, and does until the load balancing pool is created. How can the pfSense affect this traffic?

Just to complicate matters, but perhaps provide a clue, the affected kit is connected to a hub shared with the pfSense. I also have some kit which is connected via a switch, and this kit is unaffected. That is, it doesn't show the problem of the .1 default route being snaffled. Whatever is up, the switch appears to block it where a hub doesn't.

This is using the current RC2, by the way.

hoba

First of all update to the most recent version.

However, I don't think that is your issue. It sounds like you have the pfSense and the Vigor in parallel in your network and you are trying to get load balancing working with that setup somehow. That's not the way it's supposed to work. You need separate interfaces on the pfSense for each WAN, with a unique gateway specified for each of these interfaces. Only these gateways can be used in your pool. An additional gateway on the LAN can't be utilized. You are simply breaking things by doing this.

PurpleOfPants

They are in parallel, yes, but there is no load balancing per se in use. The reason for starting up a load balancing pool is just to see how one sets it up before whipping out the Vigor and trusting my life to it :)

But I think the essential point is that the Vigor and pfSense co-exist perfectly happily normally. It's only when the load balancing is configured (with just the one pool) that things go wrong, and it's affecting traffic that shouldn't be going anywhere near the pfSense.

Is there some advertising of routes that Windows will hear and use to override its default gateway setting? The Vigor has RIP capability, but this is completely disabled (and its static routes don't show the pfSense on any route). So maybe there's some protocol that the load balancing module starts up?

hoba

The load balancer uses pf's route-to, so it will send traffic directly to the upstream gateway of the next pool member in line. You might need a pass rule before your load balancer rule to exclude something from the balancing (a rule that uses the default gateway, so the traffic is handled by the system routing table instead of just being forwarded to the upstream gateway). I would suggest upgrading: we might have fixed some conditions causing this by adding hidden rules that do that out of the box after RC2 was released (about a month ago already). Please upgrade this box first.

PurpleOfPants

I've upgraded to the 26th September revision. On that, the problem occurred straight away, so I downgraded to RC2 again and then found that the problem also occurs on that all the time, but nothing notices because the pf actually routes, so web browsing works.

it will send traffic directly to the upstream gateway

But the problem is that it's grabbing traffic that isn't meant to go through it. I've investigated a bit closer and what's happening is that a tracert should start with .1 (because that's the configured default route) but actually starts .11 (the pf). More interestingly, the ARP cache (on Windows) doesn't show the pf at all at this point - not surprising because the Windows PC doesn't know the pf is there. However, if I change the default route to .11 (so Windows does know about the pf) then the ARP cache does show the pf.

It looks to me like the pf is hijacking the traffic meant for .1 and then routing it as normal. Responses wind up back at the Windows PC, which mostly doesn't realise which route they've come by, and no-one's the wiser. For kit the other side of a switch, the pf never sees that traffic so can't screw around with it. I don't know if this is what's actually happening, but it would fit the symptoms.

PurpleOfPants

Some more detail. With the Windows PC default route set to .1 a tracert to .1 shows this:

Tracing route to router.home.net [192.168.0.1]
over a maximum of 30 hops:

  1   <10 ms   <10 ms   <10 ms  pfsense.home.net [192.168.0.11]
  2   <10 ms   <10 ms   <10 ms  router.home.net [192.168.0.1]

With no change to the Windows PC, if I unplug the pf the tracert correctly shows this:

Tracing route to router.home.net [192.168.0.1]
over a maximum of 30 hops:

  1   <10 ms   <10 ms   <10 ms  router.home.net [192.168.0.1]

The pf has to be overriding the Windows default route somehow.

PurpleOfPants

And here, I think, is confirmation. When the default route on Windows is set to .1 a browser access is flagged in the firewall log:

The rule that triggered this action is:

@40 block drop in log quick all label "Default block all just to be sure."

With the default route set to .11 (the pf) there is no firewall log entry. So it looks like the firewall is attacking this traffic, even though it's not going through the pf. How would one go about zapping this rule - it doesn't appear in the firewall setup anywhere?

hoba

There is no way the pfSense can trigger what you see at your windows box. Sorry. There must be something very bad happening in your network unrelated to pfSense. ???

PurpleOfPants

I've caught it with Ethereal now. The attachments are Ethereal caps with these stations:

3com_51:* = 00:04:76:* = 192.168.0.102 = Windows
Dratek:* = 00:50:7f:* = 192.168.0.1 = Vigor
Shuttle* = 00:30:1b:* = 192.168.0.11 = pfSense

If you check ping.cap you can see from the Ethernet II section that the Windows machine is sending the ping to the Vigor at 192.168.0.1. A reply from the off-site destination comes back via the Vigor too. But after that a reply also comes back from pf!
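For anyone following along without Ethereal, the "Ethernet II section" is just the first 14 bytes of each frame: destination MAC, source MAC, then the EtherType. A minimal Python sketch of reading it is below. The frame here is made up for illustration: only the vendor prefixes (00:50:7f and 00:04:76) come from the station list above, and the remaining octets and the payload are placeholders, not values from the captures.

```python
import struct


def parse_ethernet_header(frame: bytes):
    """Return (dst_mac, src_mac, ethertype) from an Ethernet II frame."""
    if len(frame) < 14:
        raise ValueError("frame too short for an Ethernet II header")
    # 6 bytes destination MAC, 6 bytes source MAC, 2 bytes EtherType (big-endian)
    dst_raw, src_raw, ethertype = struct.unpack("!6s6sH", frame[:14])

    def fmt(raw: bytes) -> str:
        return ":".join(f"{b:02x}" for b in raw)

    return fmt(dst_raw), fmt(src_raw), ethertype


# Hypothetical frame: vendor prefixes from the thread, last three octets invented.
VIGOR_MAC = bytes.fromhex("00507f000001")
WINDOWS_MAC = bytes.fromhex("000476000002")
frame = VIGOR_MAC + WINDOWS_MAC + struct.pack("!H", 0x0800) + b"\x00" * 20

dst, src, ethertype = parse_ethernet_header(frame)
print("destination:", dst)
print("source:     ", src)
print("ethertype:  ", hex(ethertype))  # 0x0800 is IPv4
```

The point of looking at this layer is that the destination MAC says which station on the wire the frame was addressed to, independently of the IP destination carried inside it.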

It looks very much to me like pf is checking the destination IP address of stuff whizzing past to see if they need routing, whereas it should be checking the Ethernet II details. What other explanation could there be for routing that ping?
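To spell out the check being described: a station would normally only pass a frame up its network stack if the frame's destination MAC is its own address or the broadcast address, unless the interface is in promiscuous mode. The sketch below is a simplified model of that rule in Python, not a description of what pfSense does internally. The function name and the full MAC addresses are hypothetical, and multicast is ignored for brevity.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


def should_process(frame_dst_mac: str, own_mac: str, promiscuous: bool = False) -> bool:
    """Simplified layer-2 filter: accept frames addressed to us or to broadcast.

    A promiscuous interface accepts everything it sees on the wire.
    Multicast handling is deliberately left out of this sketch.
    """
    dst = frame_dst_mac.lower()
    return promiscuous or dst == own_mac.lower() or dst == BROADCAST


# Hypothetical addresses: only the vendor prefixes appear in the thread.
PFSENSE_MAC = "00:30:1b:00:00:03"
VIGOR_MAC = "00:50:7f:00:00:01"

# A frame addressed to the Vigor should be ignored by the pfSense box...
print(should_process(VIGOR_MAC, PFSENSE_MAC))                    # False
# ...unless its interface is accepting everything it sees.
print(should_process(VIGOR_MAC, PFSENSE_MAC, promiscuous=True))  # True
```

If the box behaved like the second call, it would then go on to look at the IP destination of traffic that was never addressed to it, which would match the symptoms.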

Next, the http.cap shows a browser access through the Vigor at .1 to Google. The reply comes back via the Vigor and then pf kills the link by sending a RST. This is the firewall, I presume, not liking the traffic that has an off-site source but didn't arrive via pf.

It seems to me that no-one has noticed this before because you'd usually have the pf as the default route. Else, if you do have more than one router, you'll be using switches (since it's hard to find hubs nowadays, with a switch costing near enough nothing). A switch hides the non-pf traffic from pf, so it doesn't see it and therefore doesn't think to route or interfere with it.
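The hub/switch difference can be modelled in a few lines of Python. This is a toy model under simplifying assumptions: a single device with three ports named after the stations, hypothetical MAC addresses, and a switch that floods only when it has not yet learned the destination. The real wiring here has both a hub and a switch, so this only shows the principle.

```python
class Hub:
    """Repeats every frame out of every port except the one it arrived on."""

    def __init__(self, ports):
        self.ports = set(ports)

    def deliver(self, in_port, src_mac, dst_mac):
        return self.ports - {in_port}


class Switch:
    """Learns which port each source MAC lives on and forwards selectively."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}

    def deliver(self, in_port, src_mac, dst_mac):
        self.mac_table[src_mac] = in_port      # learn where the sender is
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            return self.ports - {in_port}      # unknown destination: flood
        return {out_port} - {in_port}


# Hypothetical MACs; only the vendor prefixes appear in the thread.
WINDOWS, VIGOR = "00:04:76:00:00:02", "00:50:7f:00:00:01"
ports = ["windows", "vigor", "pfsense"]

hub_out = Hub(ports).deliver("windows", WINDOWS, VIGOR)

switch = Switch(ports)
switch.deliver("vigor", VIGOR, WINDOWS)        # the switch learns the Vigor's port
switch_out = switch.deliver("windows", WINDOWS, VIGOR)

print("hub delivers to:   ", sorted(hub_out))
print("switch delivers to:", sorted(switch_out))
```

On the hub, the pfSense port receives a copy of the Windows-to-Vigor frame; on the switch, once the Vigor's port has been learned, it does not.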

Damn. Won't let me attach caps. Grab 'em here:

ping capture
HTTP session

hoba

You must have arp/IP conflicts of some kind or you have misconfigured something wrong in a way that is beyond my knowledge. Sorry, that doesn't make any sense at all. :-\

PurpleOfPants

You must have arp/IP conflicts of some kind

I'm pretty sure not.

you have misconfigured something wrong

But I could easily imagine that to be so ;)