Creeping packetloss/IP frag/checksumming errors with multiple tagged interfaces

rbd

hey guys… long time 1.2.x user and 1.2 is working out great for us in production. We have a 2.0 beta box up for testing (which is an old HP DL360 G3) hanging off of a Cisco switch. We have a single physical interface connected to the switch and are using VLAN tagging/trunking with 5 tagged interfaces from that (LAN, WAN, and 3 DMZ subnets) ...that was all set up fine.

So, we restart the box. Everything starts out fine, with virtually no packet loss on the interfaces. I'm keeping a simple straight ping going to the box across one of the DMZ interfaces. Then, as time goes by, I start to see "Request timed out" messages creep in, first one or two here or there. I had the ping going all night, and when I returned I was getting stats like:

Ping statistics for 10.14.230.15:
Packets: Sent = 14532, Received = 10670, Lost = 3862 (26% loss),
Approximate round trip times in milli-seconds:
Minimum = 59ms, Maximum = 3787ms, Average = 64ms

Ouch! At the same time, the same box is pinging to our production 1.2.x firewall with 2 packets out of 20,000 lost. Now on the 2.0-beta box, browsing to the web admin interface is almost impossible, with frequent timeouts (it was very snappy when the box first booted up).
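To put the two loss figures side by side, here is a minimal Python sketch that parses a Windows-style ping summary line and computes the exact loss rate. The helper name `parse_ping_summary` is made up for illustration, and the regex assumes the summary format shown above.

```python
import re

def parse_ping_summary(text):
    """Extract sent/received/lost counts from a Windows-style ping summary."""
    m = re.search(r"Sent = (\d+), Received = (\d+), Lost = (\d+)", text)
    if m is None:
        raise ValueError("no ping summary found")
    sent, received, lost = (int(g) for g in m.groups())
    # Guard against division by zero for an empty run.
    loss_pct = 100.0 * lost / sent if sent else 0.0
    return {"sent": sent, "received": received, "lost": lost, "loss_pct": loss_pct}

# Summary line from the 2.0 beta box (copied from the post above).
summary = "Packets: Sent = 14532, Received = 10670, Lost = 3862 (26% loss),"
stats = parse_ping_summary(summary)

# Production 1.2.x box: 2 packets lost out of 20,000, as quoted above.
production_loss_pct = 100.0 * 2 / 20000

print(f"2.0 beta box:   {stats['loss_pct']:.2f}% loss")
print(f"1.2.x firewall: {production_loss_pct:.2f}% loss")
```

Running the same parse on summaries saved at intervals would show how quickly the loss creeps up after a reboot.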

The stats for the interface I'm pinging against look like:
• SITEOZ interface (bge0_vlan400)
• Status up
• MAC address 00:11:85:d4:2e:7f
• IP address 10.14.230.15
• Subnet mask 255.255.255.0
• Media 100baseTX <full-duplex>
• In/out packets 24614/24542 (1.47 MB/3.08 MB)
• In/out packets (pass) 24542/27132 (1.46 MB/3.08 MB)
• In/out packets (block) 72/0 (14 KB/0 bytes)
• In/out errors 0/0
• Collisions 0

So it seems the fw THINKS it's receiving and responding to packets fine. All other interfaces show 0 errors and collisions as well, and the Cisco switch also shows 0 errors on the interface. We have other things hanging on the switch and haven't had these kinds of problems with them.
What we've tried:
• Change out the cable
• Use a totally different box (a similar DL360 G3)...with the 2.0 code, same problem
• Disable TCP checksum offloading. Seemed to help, at least at first
• Disable pf scrubbing
• Disable firewalling on the same interface
(We rebooted the box after making any of these changes, just to be sure.)

Taking a packet capture off of any interface shows a number of truncated-ip and bad checksum messages:
13:42:34.751214 IP (tos 0x10, ttl 255, id 45827, offset 0, flags [DF], proto VRRP (112), length 56, bad cksum 0 (->f811)!)
10.14.230.15 > 224.0.0.18: VRRPv2, Advertisement, vrid 3, prio 0, authtype none, intvl 1s, length 36, addrs(7): 59.94.171.30,19.58.56.126,200.217.201.105,71.102.78.29,156.48.194.207,176.186.156.234,49.8.58.126
13:42:34.918545 IP truncated-ip - 9160 bytes missing! (tos 0x0, ttl 255, id 24372, offset 512, flags [none], proto VRRP (112), length 9216, bad cksum 6934 (->852c)!)
209.43.6.165 > 224.0.0.18: carp
13:42:35.119057 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 209.43.6.176 tell 209.43.6.161, length 48
13:42:35.590717 IP truncated-ip - 9156 bytes missing! (tos 0x0, ttl 255, id 4982, offset 512, flags [none], proto VRRP (112), length 9216, bad cksum b4f2 (->d0ea)!)
209.43.6.165 > 224.0.0.18: carp
13:42:35.613680 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.15.5.254 tell 10.15.5.254, length 50
13:42:35.753217 IP (tos 0x10, ttl 255, id 17492, offset 0, flags [DF], proto VRRP (112), length 56, bad cksum 0 (->66c1)!)
10.14.230.15 > 224.0.0.18: VRRPv2, Advertisement, vrid 3, prio 0, authtype none, intvl 1s, length 36, addrs(7): 59.94.171.30,19.58.56.127,103.136.127.189,197.28.141.49,163.173.187.111,232.184.2.72,122.111.195.123
13:42:36.629175 IP (tos 0x0, ttl 64, id 56759, offset 0, flags [DF], proto TCP (6), length 40, bad cksum 0 (->4ec8)!)
10.14.230.15.80 > 10.14.20.37.14893: Flags [R], cksum 0x8906 (correct), seq 1174378509, win 0, length 0
13:42:36.629559 45:78:70:69:72:65 > 00:01:01:56:02:00, ethertype Unknown (0x733a), length 70:
0x0000: 2030 0d0a 4c61 7374 2d4d 6f64 6966 6965 .0..Last-Modifie
0x0010: 643a 2054 7565 2c20 3136 2046 6562 2032 d:.Tue,.16.Feb.2
0x0020: 3031 3020 3133 3a34 323a 3238 2047 4d54 010.13:42:28.GMT
0x0030: 0d0a 4361 6368 652d ..Cache-
13:42:36.755225 IP (tos 0x10, ttl 255, id 62510, offset 0, flags [DF], proto VRRP (112), length 56, bad cksum 0 (->b6e6)!)
10.14.230.15 > 224.0.0.18: VRRPv2, Advertisement, vrid 3, prio 0, authtype none, intvl 1s, length 36, addrs(7): 59.94.171.30,19.58.56.128,190.62.241.192,249.176.191.224,251.12.14.163,46.97.68.49,58.244.110.141
13:42:37.613698 STP 802.1d, Config, Flags [none], bridge-id 8190.00:07:eb:d5:dc:80.8012, length 42
message-age 0.00s, max-age 20.00s, hello-time 2.00s, forwarding-delay 15.00s
root-id 8190.00:07:eb:d5:dc:80, root-pathcost 0
13:42:37.757227 IP (tos 0x10, ttl 255, id 15904, offset 0, flags [DF], proto VRRP (112), length 56, bad cksum 0 (->6cf5)!)
10.14.230.15 > 224.0.0.18: VRRPv2, Advertisement, vrid 3, prio 0, authtype none, intvl 1s, length 36, addrs(7): 59.94.171.30,19.58.56.129,168.210.59.71,71.96.81.104,200.233.234.76,187.75.1.11,187.93.85.88
13:42:37.987824 00:01:00:00:00:00 SNA > 00:01:00:08:00:00 Null Information, send seq 0, rcv seq 0, Flags [Poll], length 48
13:42:38.759239 IP (tos 0x10, ttl 255, id 44017, offset 0, flags [DF], proto VRRP (112), length 56, bad cksum 0 (->ff23)!)
10.14.230.15 > 224.0.0.18: VRRPv2, Advertisement, vrid 3, prio 0, authtype none, intvl 1s, length 36, addrs(7): 59.94.171.30,19.58.56.130,10.227.58.16,63.201.219.107,142.223.118.51,116.21.218.96,56.182.12.38
13:42:39.287277 IP truncated-ip - 8668 bytes missing! (tos 0x0, ttl 255, id 3883, offset 512, flags [none], proto VRRP (112), length 9216, bad cksum b93d (->d535)!)
209.43.6.165 > 224.0.0.18: carp
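A note on reading the "bad cksum 0 (->f811)!" lines: the value after the arrow is the IPv4 header checksum tcpdump expected. The Python sketch below rebuilds the header of the first VRRP advertisement from the fields tcpdump printed and computes the RFC 1071 checksum. It assumes a plain 20-byte header with no IP options; the function names are illustrative only. A captured value of 0 on packets the box itself is sending is commonly seen when the capture happens before the NIC fills in an offloaded checksum, so those particular lines may be a capture artifact rather than the fault. That does not explain the truncated-ip frames from 209.43.6.165.

```python
import socket
import struct

def ipv4_header_checksum(header: bytes) -> int:
    """RFC 1071 checksum over the header; the checksum field must be zero."""
    if len(header) % 2:
        header += b"\x00"
    # Sum the header as big-endian 16-bit words.
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_header(tos, length, ident, flags_frag, ttl, proto, src, dst):
    """Pack a 20-byte IPv4 header (assumed: no options) with checksum = 0."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, tos, length, ident, flags_frag, ttl, proto, 0,
        socket.inet_aton(src), socket.inet_aton(dst),
    )

# Fields from the 13:42:34.751214 line: tos 0x10, length 56, id 45827,
# flags [DF] (0x4000), ttl 255, proto 112.
hdr = build_header(0x10, 56, 45827, 0x4000, 255, 112,
                   "10.14.230.15", "224.0.0.18")
print(f"{ipv4_header_checksum(hdr):04x}")  # f811, the value tcpdump expected
```

The same calculation on the 13:42:35.753217 packet (id 17492) gives 66c1, again matching the capture.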

I'm a bit at a loss here as to what it is. Any ideas?

Thanks,

Robby

overand

I notice VRRP and/or CARP stuff floating across the wire here - can you verify that's actually in place - and if so, is it on the pfSense box itself?

If that's in use, can you set up a testbed with that not-in-use, to verify you don't have - for example - another VRRP or CARP host 'stealing' the IP you're pinging from/to?

rbd

We have to look into the CARP issues. We recently added a second firewall to test some of that out, but we were experiencing this problem before we ever added the second firewall and any CARP addresses (we will take it back down to a single box to ferret out this issue).

We "solved" this problem by totally disabling packet filtering (under advanced settings). If we do that, the firewall works fine, no problems at all. Of course, it's no longer a firewall and this situation won't work, but it seems to exonerate the switch/network…what in the packet filtering code could be causing this?

Robby

cmb

Can you try a different NIC chipset? Would be good to know if it's driver-specific, or specific to something else in your setup.

overand

Some info and thoughts - I've used 1.2.3-RELEASE on the same hardware - a Proliant DL360 G3 - with VLAN tagging, and have had very good success. Unfortunately, I don't have a spare G3 to test out 2.0 on for comparison - yet.

If you set this box up as dumb/untagged, bge0-WAN,bge1-LAN, etc, (with filtering enabled) - does everything work as expected? That might give an idea where to peek for issues.

Also worth peeking - does the 'ifconfig' output for the bge interface show the options you'd expect? RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM
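If it helps, that check can be scripted. Below is a rough Python sketch that pulls the `options=` list out of saved `ifconfig` output and reports which of the options above are missing. The `SAMPLE` text is a made-up example of what FreeBSD typically prints, not output from the box in question, and `parse_ifconfig_options` is a hypothetical helper.

```python
import re

EXPECTED = {"RXCSUM", "TXCSUM", "VLAN_MTU", "VLAN_HWTAGGING", "VLAN_HWCSUM"}

def parse_ifconfig_options(output):
    """Return (bitmask, set of option names) from an ifconfig options= line."""
    m = re.search(r"options=([0-9a-fA-F]+)<([^>]*)>", output)
    if m is None:
        return 0, set()
    names = {name for name in m.group(2).split(",") if name}
    return int(m.group(1), 16), names

# Hypothetical sample output, for illustration only.
SAMPLE = (
    "bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\toptions=9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>\n"
)

mask, found = parse_ifconfig_options(SAMPLE)
missing = EXPECTED - found
print(f"options mask: {mask:#x}")
print("missing:", ", ".join(sorted(missing)) or "none")
```

Comparing the result before and after toggling the checksum offload setting would confirm whether the change actually took effect on the interface.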

mclendening

I'm working with Robby on this and found that if we reboot and test without opening the WebGUI, we do not have any throughput issues. As soon as we open the WebGUI, throughput degrades severely. We have reinstalled since his earlier post (2.0 Beta with vanilla config, no CARP, wide-open rules, VLAN trunking on interface bge0 - Broadcom BCM5703X).

michael clendening

rbd

An update: We're probably going to be switching to some DL360 G4s we have laying around, and using a 4 port Intel LAN card (we'll just use a single port on that probably and tag all the interfaces like before, but it's a different chipset, and a well supported one at that).

Thanks for the help, we'll keep this post updated.

Robby

mclendening

[Solved] Installed a new Intel PWLA8391GT PCI NIC ($33.00). Re-installed the Feb 16 Beta 2.0 build and configured it for VLANs. Now getting 70 Mbps sustained from Server -> Cisco 3750 (100 meg) -> Cisco 3550 (100 meg) -> pfSense (100 meg) OPT1 VLAN100 on 3550 -> Server on OPT2 VLAN200 on 3550

Did not need to add any tweaks to advanced settings and here is the Cisco config for those interested:

Cisco 3550 (Old ass switch)
interface FastEthernet0/1
switchport trunk encapsulation dot1q
switchport trunk allowed vlan 1,100,200,300,400,1001-1005
switchport mode trunk
switchport nonegotiate
spanning-tree portfast trunk
spanning-tree bpdufilter enable

This switch config gives the cleanest and fastest re-convergence time.

Peace,
Michael Clendening