Bridged pfSense stops passing traffic

slagr

Hi all,

We have 2 pfSense boxes (2.0-R, 2.0.1) installed on 2 Dell PowerEdge 860 servers, each with 3 NICs. 2 of them are integrated Broadcom (5721J), identified as:
Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x004101,
and one is Intel, identified as:
Intel(R) PRO/1000 Network Connection 7.2.3

All 3 interfaces are bridged. Both setups were inherited, so I cannot change much here for now.
The 5721 chip appears to be supported by the FreeBSD Broadcom driver.

Once in a while, pfSense stops passing all traffic, with no reason logged.
I've read http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards, and have added the following to my loader.conf.local:
kern.ipc.nmbclusters="262144"
hw.bce.tso_enable=0
hw.pci.enable_msix=0

But that hasn't helped much.

mbuf usage looks OK to me:
2110/3140/5250 mbufs in use (current/cache/total)
2078/1968/4046/262144 mbuf clusters in use (current/cache/total/max)

and I cannot see any failures via vmstat -z.
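
For reference, the cluster figures above work out to well under 1% of the configured maximum. A minimal sketch of that check, assuming the "mbuf clusters in use" line keeps the current/cache/total/max layout shown above (the function name cluster_usage is only illustrative):

```python
import re

def cluster_usage(line):
    """Parse an 'mbuf clusters in use' line and return (current, max, percent used)."""
    m = re.match(r"\s*(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use", line)
    if m is None:
        raise ValueError("not an mbuf clusters line: %r" % line)
    # Field order as printed: current/cache/total/max
    current, cache, total, maximum = (int(g) for g in m.groups())
    return current, maximum, 100.0 * current / maximum

# Line copied from the output quoted in this post.
line = "2078/1968/4046/262144 mbuf clusters in use (current/cache/total/max)"
current, maximum, pct = cluster_usage(line)
print("%d of %d clusters in use (%.2f%%)" % (current, maximum, pct))
```

With usage this low, mbuf cluster exhaustion looks unlikely to be the cause, at least at the moment the numbers were taken.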

All we can do is switch traffic to the second instance and reboot the dead server.
I suspect the bridge is the problem, because I can still SSH in via the pfsync (Intel) interface.

I have 2 questions here:
What else can I troubleshoot, and is there any way to restart pfSense networking manually (say, from a script)?
Any help greatly appreciated.
Thanks.

giorgiolago

Do you see any "watchdog timeout - resetting" errors in the dmesg output?

slagr

@giorgiolago:

Do you see any "watchdog timeout - resetting" errors in the dmesg output?

As far as I remember, there were no watchdog timeout errors.

stephenw10

The Broadcom 5721 chip looks to be handled by the bge(4) driver not bce(4).
So your loader tunables are wrong. Should use:

hw.bge.tso_enable=0

Steve

slagr

@stephenw10:

The Broadcom 5721 chip looks to be handled by the bge(4) driver not bce(4).
So your loader tunables are wrong. Should use:
hw.bge.tso_enable=0
Steve

Thanks, Steve.
That was a copy & paste error. My bad.
My local config looks correct though:


kern.ipc.nmbclusters="262144"
hw.bge.tso_enable=0
hw.pci.enable_msix=0

stephenw10

Ah, OK. Easily done.

Is there some reason you're running old versions of pfSense?

Since you're also running Intel NICs you could try the additional tuning parameter suggested:

hw.em.num_queues=1

I'm running Intel NICs and have never needed any tuning though. :-\

When it stops passing traffic you still have access to the box?

Steve

slagr

@stephenw10:

Ah, OK. Easily done.

Is there some reason you're running old versions of pfSense?

I'm conservative about any upgrade, and as I cannot test the upgrade process, I don't want to risk it.
All I can say is that I got the same issues with 2.0-RC3, 2.0-R and 2.0.1.
So I'm not sure upgrading to 2.0.2, 2.0.3 or 2.1 would help.
I didn't find any related info on Redmine about improvements that would apply to us. But I might be wrong.
@stephenw10:

Since you're also running Intel NICs you could try the additional tuning parameter suggested:
hw.em.num_queues=1
I'm running Intel NICs and have never needed any tuning though. :-\

When it stops passing traffic you still have access to the box?

Steve

The Intel NIC is used for the pfsync interface.
Unfortunately the Broadcom NICs are integrated, and I cannot replace them with Intel ones.
I cannot log in from outside (WAN), but I can log in from inside the network (LAN).

stephenw10

I agree, it should be working fine under older versions. I just wondered in case you were running something custom.

Broadcom NICs are normally quite reliable (second only to Intel). Has the setup been working and just recently started to be unreliable?

Please define what you mean by 'stops all traffic'. You say you can log in from the LAN and you previously said you can SSH in via the pfsync interface. Can you normally connect from the WAN then?

Steve

slagr

@stephenw10:

I agree, it should be working fine under older versions. I just wondered in case you were running something custom.

Broadcom NICs are normally quite reliable (second only to Intel). Has the setup been working and just recently started to be unreliable?

Please define what you mean by 'stops all traffic'.

Incorrect wording on my part. By 'stops all traffic' I meant that I could not access anything behind the pfSense box. We have a few /24 networks behind it, with no NAT involved.

@stephenw10:

You say you can log in from the LAN and you previously said you can SSH in via the pfsync interface. Can you normally connect from the WAN then?
Steve

That's the point. I can connect via the LAN and pfsync interfaces, but not via WAN.
Another thing I forgot to mention (sorry!) is that I can see the WAN interface "Errors In" counter increasing slowly. The error counters on all other interfaces are 0.

stephenw10

So the errors appear on WAN whilst it's still operating normally?

You didn't say if this behavior has just recently started but I assume it has.

Do the boxes share a common upstream switch? Could that be failing?

Steve

slagr

@stephenw10:

So the errors appear on WAN whilst it's still operating normally?

That's correct, the errors appear while the box is still operational.
@stephenw10:

You didn't say if this behavior has just recently started but I assume it has.

Do the boxes share a common upstream switch? Could that be failing?

Steve

Yes, they share a common switch. Don't seem to have issues with it so far tho.
Do you think that could be a (BayStack) switch (port) issue? When we switch ports and enable the spare pfSense, the second one starts working just fine.
We have a ~ 1/6 mbps, sometimes up to 10 outgoing, but very rare.

slagr

@slagr:

Yes, they share a common switch. Don't seem to have issues with it so far tho.
Do you think that could be a (BayStack) switch (port) issue? When we switch ports and enable the spare pfSense, the second one starts working just fine.
We have a ~ 1/6 mbps, sometimes up to 10 outgoing, but very rare.

Could that happen because of:

"Jun 3 12:42:15 fw apinger: ALARM: WANGW(x.x.x.1) *** WANGWdown ***" ?

My apinger.conf has :


alarm down "WANGWdown" {
    time 120s
}

target "x.x.x.1" {
    description "WANGW"
    srcip "x.x.x.253"
    interval 10s
    alarms override "WANGWloss","WANGWdelay","WANGWdown";
    rrd file "/var/db/rrd/WANGW-quality.rrd"
}

Thanks.

stephenw10

Errm…. I don't understand what you're asking. :-\

Steve

slagr

@slagr:

Yes, they share a common switch. Don't seem to have issues with it so far tho.
Do you think that could be a (BayStack) switch (port) issue? When we switch ports and enable the spare pfSense, the second one starts working just fine.
We have a ~ 1/6 mbps, sometimes up to 10 outgoing, but very rare.

Well, the problem was (I think) that both em0 and bge0 shared the same IRQ.
We went ahead and removed em0 (as we were unable to assign a dedicated IRQ to em0 in the BIOS).
Now we are waiting for the next "outage".
OTOH, CARP is now broken, as em0 was used as a dedicated CARP NIC.
So, as I'd like to get CARP back for the time being, my question is whether I would break anything (from many thousands of miles away) by assigning both LAN interfaces (3) IPs from the same network we use behind the LAN interface, and protecting CARP on the firewall. We used to have this configuration: bridge: WAN (bge0) - management IP, LAN (bge1) - no IP, OPT1 (em0) - CARP. em0 has now dropped out of that picture.
Thanks.
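
For anyone wanting to check for the same IRQ sharing, here is a minimal sketch that scans `vmstat -i` style output for IRQ lines listing more than one device. The sample text is made up for illustration (it is not output from these boxes), and the function name shared_irqs is hypothetical:

```python
def shared_irqs(vmstat_output):
    """Return {irq_name: [devices]} for IRQ lines that list more than one device."""
    shared = {}
    for line in vmstat_output.splitlines():
        # Only lines such as "irq16: em0 bge0   1203456   42" are of interest.
        if not line.startswith("irq") or ":" not in line:
            continue
        irq, rest = line.split(":", 1)
        # Device names are the non-numeric tokens; total and rate are numeric.
        devices = [tok for tok in rest.split() if not tok.isdigit()]
        if len(devices) > 1:
            shared[irq] = devices
    return shared

# Hypothetical sample, laid out like `vmstat -i` output.
SAMPLE = """\
interrupt                          total       rate
irq1: atkbd0                          18          0
irq16: em0 bge0                  1203456         42
irq17: bge1                       987654         35
cpu0: timer                     56789012       2000
Total                           58980140       2077
"""

print(shared_irqs(SAMPLE))
```

On a live box the input would come from running `vmstat -i` and passing its output to the function.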

stephenw10

Having a shared IRQ should not prevent the NICs from working. Having disabled MSI-X for all PCI devices is likely to have more of an effect (I would have thought), but even so it shouldn't stop all traffic.
I am unsure of your network configuration from your description, and I have only experimental experience with a CARP setup, so I can't really tell you what would happen. Since you will be thousands of miles away, getting it wrong would be very bad, so I would have to advise waiting for another opinion. :-\
In the meantime, giving us a network diagram would help greatly.

Steve

slagr

@stephenw10:

Having a shared IRQ should not prevent the NICs from working. Having disabled MSI-X for all PCI devices is likely to have more of an effect (I would have thought), but even so it shouldn't stop all traffic.
I am unsure of your network configuration from your description, and I have only experimental experience with a CARP setup, so I can't really tell you what would happen. Since you will be thousands of miles away, getting it wrong would be very bad, so I would have to advise waiting for another opinion. :-\
In the meantime, giving us a network diagram would help greatly.

Steve

Thanks, Steve.
You're right that having a shared IRQ should not lead to such results. After further reading and googling, I think the next step is to exclude the CARP interface from the bridge. As I stated, this is an old, inherited setup, and all 3 NICs are members of the bridge.