Odd network behavior after upgrading from 1.2 to 1.2.3



  • Hello All

    A few days ago we upgraded our pfSense routers from 1.2.RELEASE to 1.2.3-RELEASE and since then we've experienced some very odd network problems.

    The setup:
    Two pfSense routers (Dell PE2900 with Intel PRO cards) with WAN, LAN and 4 OPT interfaces. One dedicated interface with a crossover cable for CARP-sync. The hardware has performed flawlessly with 1.2-RELEASE.

    pfSense is our only router/firewall/loadbalancer and handles all connectivity to the internet (Fiber based). Switches are Dell or D-Link GigaBit switches and have never been known to cause problems.

    Some of our servers are stand-alone servers with public IP addresses, while others have private IP addresses and are accessed through the round-robin load balancer in pfSense or through NAT and port forwarding, depending on the task. Servers are a mix of physical and virtual (KVM and Xen) servers, primarily RHEL4/5 and Fedora with an odd Windows 2000 server thrown in for good measure.

    The symptoms:
    Connections are dropped out of the blue. SSH and PPTP connections die completely, HTTP/S and FTP connections need to reconnect, and IMAP/S and POP3/S connections go stale and hang around on the mailserver until they finally time out, which causes a lot of problems with "Mailbox is locked by POP server".

    General settings (unchanged from 1.2-RELEASE):
    Load Balancing: "Use sticky connections" is active
    Static route filtering: "Bypass firewall rules for traffic on the same interface" is active
    FTP RFC 959 data port violation workaround: Active
    Clear DF bit instead of dropping: Active
    Firewall Optimization Options: "Conservative"
    Firewall Maximum States: 2000000 (usually hovers around 100000 states)
    Disable Hardware Checksum Offloading: Active (just in case)

    What has been tried already:
    Clear DF bit instead of dropping: on/off - no difference.
    Firewall Optimization Options: Tried all four settings - no difference.
    Firewall Maximum States: Reduced to 200000 - no difference.
    Disable Firewall Scrub: on/off - no difference (hey, you never know…)
    Disable Hardware Checksum Offloading: on/off - no difference.
    Added logging of all block actions in the firewall to make sure that we were not blocking packets without knowing about it - nothing suspicious found.
    Logged everything blocked by the default rule - this showed lots of legitimate packets with the PSH or FIN flag being blocked, but blocked FIN packets at least are known to be normal.

    I would really like to be able to keep the latest version of pfSense on our routers (hardware utilization is vastly improved with much better control over interrupts) but right now I simply cannot see my way out of this. Any last minute ideas before I revert to 1.2-RELEASE?

    EDIT: Also, can anyone confirm whether a configuration backup from 1.2.3-RELEASE will import without problems into 1.2-RELEASE?

    Best regards,
    Anders C. Madsen



  • It's a comprehensive problem report, but have you been able to look at a system while it's in the problem state to see if you can figure out what is going on "inside"? For example, is the receive error count on any of the NICs climbing? Is there an interrupt storm? Have any of the links gone offline? Have you done any analysis or tracing of a mail connection that is misbehaving?



  • You're right, those things should have been included in my first post:

    No network errors as far as pfSense can tell (49 errors on LAN, 11 on OPT1, none on the other NICs over the last 5 days), and running tcpdump confirms this - as far as I can tell, everything looks OK.

    However, I'm seeing a bunch of these lines in the filter logs:
    067843 rule 398/0(match): block in on em2: 86.183.106.194.64620 > 87.61.125.155.6847:  tcp 40 [bad hdr length 4 - too short, < 20]
    360661 rule 552/0(match): block out on em2: 195.41.114.72.47760 > 140.211.166.21.80:  tcp 28 [bad hdr length 4 - too short, < 20]
    108691 rule 551/0(match): block in on em2: 87.59.139.102.1649 > 195.41.114.39.110:  tcp 24 [bad hdr length 4 - too short, < 20]
    075182 rule 552/0(match): block out on em2: 195.41.114.72.47759 > 140.211.166.21.80:  tcp 28 [bad hdr length 4 - too short, < 20]
    135795 rule 552/0(match): block out on em2: 195.41.114.74.80 > 93.165.145.97.57937:  tcp 1476 [bad hdr length 4 - too short, < 20]
    036371 rule 551/0(match): block in on em2: 86.111.68.109.61768 > 195.41.114.39.143:  tcp 69 [bad hdr length 0 - too short, < 20]

    They seem to be related to all different kinds of ports and addresses and only pop up in large bursts now and then. The WAN interface is by far the worst offender, but I've seen at least two other interfaces logging these lines sporadically. The length varies, but 4 and 16 are the most common. The errors persist for perhaps a minute and then go away again.
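
    To put numbers on "4 and 16 are the most common", the filter log can be tallied by interface, direction and reported header length. The following is only a minimal sketch, not something that was run on the routers in this thread: the function name is made up for illustration, and the regular expression is written against the log lines quoted above, so other tcpdump versions may need a different pattern.

```python
import re
from collections import Counter

# Pattern derived from the "bad hdr length" lines quoted in this thread
# (assumption: the log format is exactly as shown there).
BAD_HDR = re.compile(
    r"block (?P<dir>in|out) on (?P<iface>\S+):.*\[bad hdr length (?P<len>\d+)"
)

def tally_bad_header_lengths(lines):
    """Count (interface, direction, reported header length) triples."""
    counts = Counter()
    for line in lines:
        m = BAD_HDR.search(line)
        if m:
            counts[(m["iface"], m["dir"], int(m["len"]))] += 1
    return counts

# Three of the lines from the post above, copied verbatim.
sample = [
    "067843 rule 398/0(match): block in on em2: 86.183.106.194.64620 > 87.61.125.155.6847:  tcp 40 [bad hdr length 4 - too short, < 20]",
    "360661 rule 552/0(match): block out on em2: 195.41.114.72.47760 > 140.211.166.21.80:  tcp 28 [bad hdr length 4 - too short, < 20]",
    "036371 rule 551/0(match): block in on em2: 86.111.68.109.61768 > 195.41.114.39.143:  tcp 69 [bad hdr length 0 - too short, < 20]",
]

for key, n in sorted(tally_bad_header_lengths(sample).items()):
    print(key, n)
```

    Feeding a longer capture through the same function would show whether the bursts really are concentrated on the WAN interface or spread across several NICs.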

    No interrupt storm, neither on pfSense nor on the servers behind it - if anything, the number of interrupts in the pfSense router has decreased to about half the level prior to the upgrade.
    All links are online and stable.

    I've done no formal analysis, but I see a lot of packets from port 143 being blocked by the default deny rule (rule 552) and I don't understand whether this really is the way it is supposed to be. The output below is a direct capture from the filter logs as output in an SSH session on the router:

    003859 rule 552/0(match): block out on em2: 195.41.114.39.143 > 87.49.212.2.64031: [|tcp]
    008997 rule 552/0(match): block out on em2: 195.41.114.39.143 > 109.57.188.80.52722: [|tcp]
    000251 rule 552/0(match): block out on em2: 195.41.114.39.143 > 87.49.212.2.63837: [|tcp]
    002003 rule 552/0(match): block out on em2: 195.41.114.39.143 > 95.166.15.228.29538: [|tcp]
    021980 rule 552/0(match): block out on em2: 195.41.114.39.143 > 87.49.212.2.63889: [|tcp]
    021990 rule 552/0(match): block out on em2: 195.41.114.39.143 > 87.49.212.2.63669: [|tcp]
    000752 rule 552/0(match): block out on em2: 195.41.114.39.25 > 209.85.161.173.48539: [|tcp]
    022232 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49916: [|tcp]
    019881 rule 552/0(match): block out on em2: 195.41.114.39.143 > 95.166.15.228.29684: [|tcp]
    004979 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49662: [|tcp]
    001253 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49678: [|tcp]
    000994 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.50019: [|tcp]
    002001 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.50935: [|tcp]
    003121 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49707: [|tcp]
    002880 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.50938: [|tcp]
    000869 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49948: [|tcp]
    000030 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49890: [|tcp]
    003842 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49770: [|tcp]
    001123 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49720: [|tcp]
    004879 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49946: [|tcp]
    003243 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49580: [|tcp]
    001003 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49677: [|tcp]
    002743 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49660: [|tcp]
    000124 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49945: [|tcp]
    003876 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49947: [|tcp]
    000040 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49915: [|tcp]
    013957 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49771: [|tcp]
    008370 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49582: [|tcp]
    002754 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49587: [|tcp]
    011230 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49674: [|tcp]
    000878 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49631: [|tcp]
    007750 rule 552/0(match): block out on em2: 195.41.114.39.143 > 87.49.212.2.63853: [|tcp]
    003279 rule 552/0(match): block out on em2: 195.41.114.39.143 > 95.166.15.228.29337: [|tcp]
    007957 rule 552/0(match): block out on em2: 195.41.114.39.143 > 89.184.152.147.49630: [|tcp]
    000031 rule 552/0(match): block out on em2: 195.41.114.39.110 > 83.95.101.171.42774: [|tcp]

    It's quite extreme, and I suspect that this is the main reason for my problems with hanging IMAP connections, but I don't know whether this is the actual cause or only a symptom caused by something outside the router.

    Best regards,
    Anders C. Madsen



  • The header length errors: the reported length seems to generally differ from a correct length (>= 20) by a single bit. I'd explore to see if there is a memory error or power supply error causing an intermittent memory error. That the header length error occurs intermittently and in bursts is particularly nasty. Perhaps running memtest86 (or memtest86+) for an extended period might give some clues. Perhaps reseating the memory cards might help. Does your memory have ECC and is it enabled?
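
    The single-bit observation can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are made up, and it relies on the fact that a TCP header length is the 4-bit data offset times 4, so valid lengths run from 20 to 60 bytes in steps of 4. It shows how many bits separate a reported length from the nearest valid one and says nothing about what causes the corruption.

```python
# Valid TCP header lengths: data offset of 5..15 32-bit words -> 20..60 bytes.
VALID_HEADER_LENGTHS = range(20, 61, 4)

def bit_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def nearest_valid(length):
    """Return (valid_length, bits_differing) for the closest valid header length.

    Ties are resolved in favour of the smaller valid length.
    """
    return min(
        ((v, bit_distance(length, v)) for v in VALID_HEADER_LENGTHS),
        key=lambda pair: pair[1],
    )

# Lengths reported in the log excerpts earlier in the thread.
for reported in (0, 4, 16):
    valid, flips = nearest_valid(reported)
    print(f"reported {reported:2d} -> nearest valid {valid} ({flips} bit(s) differ)")
```

    For the values seen so far, 4 and 16 each differ from 20 in one bit, while 0 differs from 20 in two bits but from 32 in only one.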

    The block out on em2: Is 195.41.114.39.143 your mail server? Are these packets also blocked because they are badly formed (e.g. incorrect header length)? If not, can you suggest why they are blocked? Since firewall rules allegedly are applied only on reception, I'm left to suspect the problem is in the packet formatting rather than the packet matching a firewall rule.



  • A memory error is not very likely since a) it is indeed ECC RAM and b) the problem persists even if I do a failover to the secondary router. Granted, both physical servers may have a memory error but I don't really believe in it… :)

    I've had a window running with the filter output for a longer period and I'm seeing header lengths of 0, 4, 8, 12 and 16. I've noticed that the problem starts with a header length of 0, then progresses to 4, 8, 12 and eventually 16 before the problem clears again.

    195.41.114.39.143 is our mailserver and I'm unable to see why they are blocked unless someone can tell me what is in the default rule in pfSense. This is the raw output from the firewall for some of the blocked packets:

    Jun 15 14:30:39 pf: 004489 rule 561/0(match): block out on em2: (tos 0x0, ttl 63, id 54638, offset 0, flags [none], proto TCP (6), length 93) 195.41.114.39.143 > 83.92.176.48.52980: P, cksum 0x11bd (correct), 1723590538:1723590591(53) ack 1602018465 win 1728
    Jun 15 14:30:39 pf: 000122 rule 561/0(match): block out on em2: (tos 0x0, ttl 63, id 3707, offset 0, flags [none], proto TCP (6), length 40) 195.41.114.39.143 > 83.92.176.48.52980: F, cksum 0x1619 (correct), 53:53(0) ack 1 win 1728
    Jun 15 14:30:39 pf: 001928 rule 561/0(match): block out on em2: (tos 0x0, ttl 63, id 17419, offset 0, flags [none], proto TCP (6), length 93) 195.41.114.39.143 > 83.92.176.48.52980: FP, cksum 0x11bc (correct), 0:53(53) ack 1 win 1728
    Jun 15 14:30:40 pf: 194432 rule 561/0(match): block out on em2: (tos 0x0, ttl 63, id 42404, offset 0, flags [none], proto TCP (6), length 93) 195.41.114.39.143 > 83.92.176.48.52980: FP, cksum 0x11bc (correct), 0:53(53) ack 1 win 1728
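
    For anyone wanting to summarize verbose output like this, here is a rough sketch that groups blocked packets by connection and counts the TCP flag strings seen for each. The function name is hypothetical and the regular expression is written against the lines quoted above only, so treat it as an assumption about the format rather than a general parser.

```python
import re
from collections import Counter, defaultdict

# Regex written against the verbose lines quoted above; other tcpdump
# versions or options may format these fields differently (assumption).
VERBOSE = re.compile(
    r"block (?P<dir>in|out) on (?P<iface>\w+): "
    r"\(.*?length (?P<ip_len>\d+)\) "
    r"(?P<src>\S+) > (?P<dst>\S+): (?P<flags>[A-Z.]+),"
)

def flags_per_connection(lines):
    """Map (src, dst) to a Counter of TCP flag strings seen in blocked packets."""
    result = defaultdict(Counter)
    for line in lines:
        m = VERBOSE.search(line)
        if m:
            result[(m["src"], m["dst"])][m["flags"]] += 1
    return dict(result)

# Three of the lines from the post above, copied verbatim.
sample = [
    "Jun 15 14:30:39 pf: 004489 rule 561/0(match): block out on em2: (tos 0x0, ttl 63, id 54638, offset 0, flags [none], proto TCP (6), length 93) 195.41.114.39.143 > 83.92.176.48.52980: P, cksum 0x11bd (correct), 1723590538:1723590591(53) ack 1602018465 win 1728",
    "Jun 15 14:30:39 pf: 000122 rule 561/0(match): block out on em2: (tos 0x0, ttl 63, id 3707, offset 0, flags [none], proto TCP (6), length 40) 195.41.114.39.143 > 83.92.176.48.52980: F, cksum 0x1619 (correct), 53:53(0) ack 1 win 1728",
    "Jun 15 14:30:39 pf: 001928 rule 561/0(match): block out on em2: (tos 0x0, ttl 63, id 17419, offset 0, flags [none], proto TCP (6), length 93) 195.41.114.39.143 > 83.92.176.48.52980: FP, cksum 0x11bc (correct), 0:53(53) ack 1 win 1728",
]

for (src, dst), flags in flags_per_connection(sample).items():
    print(src, ">", dst, dict(flags))
```

    In the sample, every blocked packet belongs to the same connection and carries PSH and/or FIN with an ACK number, i.e. these are mid-stream or closing packets rather than new connection attempts.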

    Thanks for your help so far, BTW, it is very much appreciated.

    Best regards,
    Anders C. Madsen



  • @madsenandersc:

    A memory error is not very likely since a) it is indeed ECC RAM

    And ECC is enabled in the BIOS? (On a couple of my home systems installing ECC memory is not sufficient to enable ECC, it has to be specifically enabled in the BIOS. But those systems are not DELL systems so they are probably not good predictors for the behaviour of DELL systems.)

    Is em2 a common component of the reports? How is em2 different from your other interfaces (e.g. different chipset?, different bus type?, different bus?) Please provide the output from the shell command pciconf -l

    Thanks for your help so far, BTW, it is very much appreciated.

    Doesn't seem I've helped much yet, but thanks for the appreciation.



  • @wallabybob:

    And ECC is enabled in the BIOS? (On a couple of my home systems installing ECC memory is not sufficient to enable ECC, it has to be specifically enabled in the BIOS. But those systems are not DELL systems so they are probably not good predictors for the behaviour of DELL systems.)

    The server will not even boot with non-ECC RAM so that is a given. :)

    Is em2 a common component of the reports? How is em2 different from your other interfaces (e.g. different chipset?, different bus type?, different bus?) Please provide the output from the shell command pciconf -l

    Well, em2 is present in many of the reports, but since it is WAN, the majority of the traffic is passing through it. Interfaces em0 through em6 are identical (Intel PRO/1000 cards).

    pciconf -l

    hostb0@pci0:0:0:0: class=0x060000 card=0x80868086 chip=0x25d88086 rev=0x92 hdr=0x00
    pcib1@pci0:0:2:0: class=0x060400 card=0x00000000 chip=0x25e28086 rev=0x92 hdr=0x01
    pcib6@pci0:0:3:0: class=0x060400 card=0x00000000 chip=0x25e38086 rev=0x92 hdr=0x01
    pcib7@pci0:0:4:0: class=0x060400 card=0x00000000 chip=0x25e48086 rev=0x92 hdr=0x01
    pcib9@pci0:0:5:0: class=0x060400 card=0x00000000 chip=0x25e58086 rev=0x92 hdr=0x01
    pcib10@pci0:0:6:0: class=0x060400 card=0x00000000 chip=0x25f98086 rev=0x92 hdr=0x01
    pcib11@pci0:0:7:0: class=0x060400 card=0x00000000 chip=0x25e78086 rev=0x92 hdr=0x01
    hostb1@pci0:0:16:0: class=0x060000 card=0x01b81028 chip=0x25f08086 rev=0x92 hdr=0x00
    hostb2@pci0:0:16:1: class=0x060000 card=0x01b81028 chip=0x25f08086 rev=0x92 hdr=0x00
    hostb3@pci0:0:16:2: class=0x060000 card=0x01b81028 chip=0x25f08086 rev=0x92 hdr=0x00
    hostb4@pci0:0:17:0: class=0x060000 card=0x80868086 chip=0x25f18086 rev=0x92 hdr=0x00
    hostb5@pci0:0:19:0: class=0x060000 card=0x80868086 chip=0x25f38086 rev=0x92 hdr=0x00
    hostb6@pci0:0:21:0: class=0x060000 card=0x80868086 chip=0x25f58086 rev=0x92 hdr=0x00
    hostb7@pci0:0:22:0: class=0x060000 card=0x80868086 chip=0x25f68086 rev=0x92 hdr=0x00
    pcib12@pci0:0:28:0: class=0x060400 card=0x01b81028 chip=0x26908086 rev=0x09 hdr=0x01
    uhci0@pci0:0:29:0: class=0x0c0300 card=0x01b81028 chip=0x26888086 rev=0x09 hdr=0x00
    uhci1@pci0:0:29:1: class=0x0c0300 card=0x01b81028 chip=0x26898086 rev=0x09 hdr=0x00
    uhci2@pci0:0:29:2: class=0x0c0300 card=0x01b81028 chip=0x268a8086 rev=0x09 hdr=0x00
    uhci3@pci0:0:29:3: class=0x0c0300 card=0x01b81028 chip=0x268b8086 rev=0x09 hdr=0x00
    ehci0@pci0:0:29:7: class=0x0c0320 card=0x01b81028 chip=0x268c8086 rev=0x09 hdr=0x00
    pcib14@pci0:0:30:0: class=0x060401 card=0x00000000 chip=0x244e8086 rev=0xd9 hdr=0x01
    isab0@pci0:0:31:0: class=0x060100 card=0x00000000 chip=0x26708086 rev=0x09 hdr=0x00
    atapci0@pci0:0:31:1: class=0x01018a card=0x01b81028 chip=0x269e8086 rev=0x09 hdr=0x00
    pcib2@pci0:4:0:0: class=0x060400 card=0x00000000 chip=0x35008086 rev=0x01 hdr=0x01
    pcib5@pci0:4:0:3: class=0x060400 card=0x00000000 chip=0x350c8086 rev=0x01 hdr=0x01
    pcib3@pci0:5:0:0: class=0x060400 card=0x00000000 chip=0x35108086 rev=0x01 hdr=0x01
    pcib4@pci0:5:1:0: class=0x060400 card=0x00000000 chip=0x35148086 rev=0x01 hdr=0x01
    em0@pci0:7:0:0: class=0x020000 card=0x135e8086 chip=0x105e8086 rev=0x06 hdr=0x00
    em1@pci0:7:0:1: class=0x020000 card=0x135e8086 chip=0x105e8086 rev=0x06 hdr=0x00
    em2@pci0:8:1:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
    em3@pci0:8:2:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
    em4@pci0:9:0:0: class=0x020000 card=0x135e8086 chip=0x105e8086 rev=0x06 hdr=0x00
    em5@pci0:9:0:1: class=0x020000 card=0x135e8086 chip=0x105e8086 rev=0x06 hdr=0x00
    pcib8@pci0:10:0:0: class=0x060400 card=0x00000000 chip=0x032c8086 rev=0x09 hdr=0x01
    mpt0@pci0:11:8:0: class=0x010000 card=0x1f091028 chip=0x00541000 rev=0x01 hdr=0x00
    pcib13@pci0:2:0:0: class=0x060400 card=0x00000000 chip=0x01031166 rev=0xc3 hdr=0x01
    bce0@pci0:3:0:0: class=0x020000 card=0x01b81028 chip=0x164c14e4 rev=0x12 hdr=0x00
    vgapci0@pci0:14:13:0: class=0x030000 card=0x01b81028 chip=0x515e1002 rev=0x02 hdr=0x00

    Doesn't seem I've helped much yet, but thanks for the appreciation.

    Sometimes just asking the right questions is a big part of finding the solution. :)

    At least I have a feeling of going forward right now - that is a nice change from the last five days.

    Best regards,
    Anders C. Madsen



  • @madsenandersc:

    Interfaces em0 through em6 are identical (Intel PRO/1000 cards).

    There are quite a number of different chipsets used in PRO/1000 cards, with different bus interfaces and different physical interfaces. Not all "em" NICs are the same.

    Note that em2 and em3 have different card numbers and chip numbers from em0, em1, em4 and em5. Can you move WAN to another interface and does that make a difference?



  • If my decoding is correct em0, em1, em4 and em5 have the 82571 chip while em2 and em3 have the 82541 chip. From data on the Intel web site it appears the 82541 was first introduced in Q3 2003 and the 82571 in Q3 2005. The 82541 has one port per chip, the 82571 two ports per chip. I think there are significant enough differences to warrant trying another interface to see if you get a different result.
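
    The decoding can be reproduced mechanically. In the pciconf -l output above, the chip= value appears to carry the device ID in its upper 16 bits and the vendor ID in its lower 16 bits (0x8086 being Intel). The sketch below groups the em interfaces by that pair; the function names are made up for illustration, and mapping device IDs to chip names such as 82571 or 82541 is deliberately left out.

```python
import re
from collections import defaultdict

# Pattern written against the pciconf -l lines quoted in this thread.
LINE = re.compile(
    r"^(?P<dev>\w+)@\S+\s+.*?card=0x(?P<card>[0-9a-f]{8}) chip=0x(?P<chip>[0-9a-f]{8})"
)

def split_id(word):
    """Split a 32-bit pciconf ID into (device, vendor) 16-bit halves."""
    value = int(word, 16)
    return value >> 16, value & 0xFFFF

def group_by_chip(lines, prefix="em"):
    """Group device names starting with `prefix` by (vendor ID, device ID)."""
    groups = defaultdict(list)
    for line in lines:
        m = LINE.match(line.strip())
        if m and m["dev"].startswith(prefix):
            device, vendor = split_id(m["chip"])
            groups[(vendor, device)].append(m["dev"])
    return dict(groups)

# Network interface lines from the pciconf -l output above, copied verbatim.
sample = [
    "em0@pci0:7:0:0: class=0x020000 card=0x135e8086 chip=0x105e8086 rev=0x06 hdr=0x00",
    "em1@pci0:7:0:1: class=0x020000 card=0x135e8086 chip=0x105e8086 rev=0x06 hdr=0x00",
    "em2@pci0:8:1:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00",
    "em3@pci0:8:2:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00",
    "em4@pci0:9:0:0: class=0x020000 card=0x135e8086 chip=0x105e8086 rev=0x06 hdr=0x00",
    "em5@pci0:9:0:1: class=0x020000 card=0x135e8086 chip=0x105e8086 rev=0x06 hdr=0x00",
    "bce0@pci0:3:0:0: class=0x020000 card=0x01b81028 chip=0x164c14e4 rev=0x12 hdr=0x00",
]

for (vendor, device), devs in group_by_chip(sample).items():
    print(f"vendor 0x{vendor:04x} device 0x{device:04x}: {', '.join(devs)}")
```

    This puts em0, em1, em4 and em5 in one group (device 0x105e) and em2 and em3 in another (device 0x107c), matching the split described above.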



  • @wallabybob:

    If my decoding is correct em0, em1, em4 and em5 have the 82571 chip while em2 and em3 have the 82541 chip. From data on the Intel web site it appears the 82541 was first introduced in Q3 2003 and the 82571 in Q3 2005. The 82541 has one port per chip, the 82571 two ports per chip. I think there are significant enough differences to warrant trying another interface to see if you get a different result.

    Your decoding is probably correct - at least as far as the number of ports goes, so the rest is also very likely true.

    I tried the following:

    • Moved WAN to em0 - no difference.
    • Moved WAN back to em2 to keep the number of variables to a minimum.
    • Did a completely fresh install of the firmware, using version 1.2.2-RELEASE in case the problem was in the BSD 7.2 drivers or software.
    • Restored configuration from backup and fired it up. Same problem, although it took about two hours before it surfaced - after that things were just as weird as they used to be with 1.2.3-RELEASE.
    • Did a completely fresh install of the firmware using version 1.2-RELEASE and restored the configuration from backup.
    • Fired it up and watched all the red lights disappear one by one. Mail and webservers are back to normal, SSH and VPN is stable, packets are no longer being dropped by the default rule.

    I have absolutely no clue as to why, but for some reason our Dell PowerEdge 2900 with two single-port Intel PRO/1000 cards and two dual-port Intel PRO/1000 cards is incompatible with pfSense above version 1.2-RELEASE. It may be that we have configured something in a non-standard way, or it may be a problem with the firmware in the NICs or the motherboard BIOS or something else, but there is no doubt that it is the case. We will consider what to do from here; although we've been very happy with pfSense, it is probably not a good idea to run software that we know we cannot upgrade in the future, so alternatives will have to be considered.

    Wallabybob, thanks a million for your help - regardless of the outcome I'm very grateful for your time.

    Best regards,
    Anders C. Madsen



  • @madsenandersc:

    it is probably not a good idea to run software that we know we cannot upgrade in the future

    That is probably a premature judgement if you haven't yet tried pfSense 2.0 BETA.

    Wallabybob, thanks a million for your help - regardless of the outcome I'm very grateful for your time.

    Thanks. It's a puzzling problem. Anecdotal evidence suggests quite a number of people are using Intel PRO/1000 NICs on pfSense without seeing this problem.



  • @wallabybob:

    @madsenandersc:

    it is probably not a good idea to run software that we know we cannot upgrade in the future

    That is probably a premature judgement if you haven't yet tried pfSense 2.0 BETA.

    You're right - but to be honest my fear is that we're looking at some kind of incompatibility with FreeBSD 7.x which I assume is the foundation for pfSense 2.0 (haven't checked yet). Right now I guess I'm just so relieved to have our systems stable again that I don't want to touch a single thing on the routers. Ever. :)

    Wallabybob, thanks a million for your help - regardless of the outcome I'm very grateful for your time.

    Thanks. It's a puzzling problem. Anecdotal evidence suggests quite a number of people are using Intel PRO/1000 NICs on pfSense without seeing this problem.

    I know, and that was actually why we chose those cards in the first place: In general, Intel PRO/1000 is perceived to be about as thoroughly tested as it comes. Frankly I have a hard time believing that they are the cause of all the problems but on the other hand I can't see where else to look. SMP? Dell BIOS/MB? A ton of those out there as well, most likely humming along nicely too.

    Best regards,
    Anders C. Madsen


  • @madsenandersc:

    You're right - but to be honest my fear is that we're looking at some kind of incompatibility with FreeBSD 7.x which I assume is the foundation for pfSense 2.0 (haven't checked yet). Right now I guess I'm just so relieved to have our systems stable again that I don't want to touch a single thing on the routers. Ever. :)

    2.0 is based on what will be FreeBSD 8.1. A major difference from 7.x in many regards.



  • @jimp:

    2.0 is based on what will be FreeBSD 8.1. A major difference from 7.x in many regards.

    Ah - that was good news indeed. OK, we'll give it a whirl once it's been released and see how it goes.

    Best regards,
    Anders C. Madsen

