PfSync send errors



  • I recently upgraded a pair of CARP-enabled pfSense instances from 2.0-RC3 to 2.0.1.  I upgraded the Backup first and then the Master.  That all seemed to go well and work as expected.  However, I'm now seeing pfsync send errors on the Master:
    [2.0.1-RELEASE][admin@host.name.removed]/root(25): netstat -s -p pfsync
    pfsync:
            1492961 packets received (IPv4)
            0 packets received (IPv6)
                    0 packets discarded for bad interface
                    0 packets discarded for bad ttl
                    0 packets shorter than header
                    0 packets discarded for bad version
                    0 packets discarded for bad HMAC
                    0 packets discarded for bad action
                    0 packets discarded for short packet
                    0 states discarded for bad values
                    0 stale states
                    101217 failed state lookup/inserts
            482055153 packets sent (IPv4)
            0 packets sent (IPv6)
                    0 send failed due to mbuf memory error
                    4499736 send error

    It looked like perhaps the CARP settings on the Master had lost the user name that CARP uses to sync the configuration.  I put the user name back in the config; the errors went away briefly and then came back.

    I'm not sure how to troubleshoot this problem.  Any suggestions are appreciated.

    Thanks,
    Steve
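
To put the counters above in proportion, it can help to express the send errors as a fraction of packets sent. The following is a minimal sketch, not part of the original post: the function names are hypothetical, and the sample text is a subset of the `netstat -s -p pfsync` output quoted above. Whether failed sends are also counted under "packets sent" is not stated in the thread, so treat the result as a rough indicator only.

```python
import re

# Subset of the `netstat -s -p pfsync` output quoted in the post above.
SAMPLE = """\
pfsync:
        1492961 packets received (IPv4)
        0 packets received (IPv6)
                101217 failed state lookup/inserts
        482055153 packets sent (IPv4)
        0 packets sent (IPv6)
                0 send failed due to mbuf memory error
                4499736 send error
"""


def parse_pfsync_stats(text):
    """Map each counter description to its integer value."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\s+(.+?)\s*$", line)
        if match:
            stats[match.group(2)] = int(match.group(1))
    return stats


def send_error_fraction(stats):
    """Return send errors divided by packets sent (IPv4 + IPv6)."""
    sent = stats.get("packets sent (IPv4)", 0) + stats.get("packets sent (IPv6)", 0)
    errors = stats.get("send error", 0)
    return errors / sent if sent else 0.0


if __name__ == "__main__":
    fraction = send_error_fraction(parse_pfsync_stats(SAMPLE))
    print(f"send errors as a share of packets sent: {fraction:.2%}")
```

With the numbers from the post this comes out a little under 1% of packets sent, which is consistent with the states still appearing to sync.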


  • Rebel Alliance Developer Netgate

    The username/password are for config sync, not state sync.

    On the state sync, on the primary, put in the IP of the sync interface on the secondary, and on the secondary, put in the IP of the sync interface on the primary. (and make sure state sync is enabled on both)



  • Thank you for the response.  I now have them configured for unicast sync via the interface IPs of the CARP interfaces.  Unfortunately, this did not solve the problem.  Interestingly, whether I'm using multicast or unicast sync, both instances appear to have a similar number of states, as if the sync is working (or at least mostly).  Prior to 2.0.1 and this thread, I had always used multicast sync without problems.  I'm only seeing the issue on the Master; the Backup does not report any send errors, though I have not tried a failover to see if the problem follows the Master (this is a production environment and I cannot experiment too much).  I did a packet capture when it was set up for multicast and I saw some multicast packets, though a capture won't show what did not get sent.  Honestly, other than my monitoring system noticing and reporting the send errors, things seem to be working.  Any other ideas?

    Thanks,
    Steve


  • Rebel Alliance Developer Netgate

    I'm not sure really. I checked a couple CARP clusters I had handy and the most errors I saw on that line were 1.

    If it's actually syncing the states, it's probably fine. If it's getting send errors, it could also be a problem with the NIC or cable.



  • Okay, thanks for the info.

    The nics and cable (just a crossover cable between two nics in this case) were not touched, the machines were not even power cycled, just warm reboots.  The interface stats are not reporting errors or collisions.  Seems like the physical layer is working right.

    I'll try to find a good time to reboot the Master to see if that helps.  I'll also keep digging and report back here if I figure it out.

    Regards,
    Steve



  • I think I've made a little progress.  Not sure how relevant it is, but I was reading this article on pfsync:
    http://www.undeadly.org/cgi?action=article&sid=20090301211402

    Just for fun I decided to change the MTU on the CARP interfaces to 9000 from the default of 1500.  Using ifconfig I tried to change the MTU of the pfsync0 interface to 9000 as well, but it only seems to actually change to 1940 (which is up from the default of 1460).  That change seems to have dropped the send errors by 40% to 50%.  We have a pretty busy network with 300K+ states at any given time, so it seems like it could be more efficient to use larger packets with our rate of change.

    Does anyone know how I can change the MTU of the pfsync interface to something larger than 1940?  And how can I get that setting in pfSense to persist across reboots and upgrades?

    Thanks,
    Steve
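
Because the netstat counters are cumulative, a claim like "dropped the send errors by 40% to 50%" is easiest to check by comparing error rates over two sampling windows, one before and one after the change. The sketch below is an illustration only: the `Sample` type, the function names, and every number in the usage example are hypothetical and are not measurements from this thread.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    seconds: float     # time the sample was taken, in seconds
    send_errors: int   # cumulative "send error" counter at that time


def errors_per_second(first, second):
    """Rate of new send errors between two chronological samples."""
    elapsed = second.seconds - first.seconds
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    delta = second.send_errors - first.send_errors
    if delta < 0:
        raise ValueError("counter went backwards (reboot or counter reset?)")
    return delta / elapsed


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("cannot compute a relative change from zero")
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # Hypothetical 10-minute windows before and after an MTU change.
    before = errors_per_second(Sample(0, 4_000_000), Sample(600, 4_060_000))
    after = errors_per_second(Sample(0, 4_500_000), Sample(600, 4_533_000))
    print(f"before: {before:.1f}/s, after: {after:.1f}/s, "
          f"change: {percent_change(before, after):.0f}%")
```

Sampling over windows of the same length and at comparable times of day keeps traffic swings from being mistaken for an effect of the change.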



  • You can change the MTU on the interface's page (Interface > <name of pfsync interface>).



  • Unfortunately, pfsync0 is its own interface and it does not appear in webadmin.  Webadmin shows my 4 physical interfaces.  Here is what ifconfig shows:
    [2.0.1-RELEASE][admin@host.name.removed]/root(1): ifconfig
    igb0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500

    igb1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM>
            ether 00:1b:21:85:04:15
            inet 10.11.1.2 netmask 0xffffff00 broadcast 10.11.1.255
            inet6 fe80::21b:21ff:fe85:415%igb1 prefixlen 64 scopeid 0x2
            nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
            media: Ethernet autoselect (1000baseT <full-duplex>)
            status: active
    bce0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
            options=c00bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,LINKSTATE>
            ether 78:2b:cb:08:a1:41
            inet 10.10.2.2 netmask 0xffffff00 broadcast 10.10.2.255
            inet6 fe80::7a2b:cbff:fe08:a141%bce0 prefixlen 64 scopeid 0x3
            inet 10.10.254.4 netmask 0xffffff00 broadcast 10.10.254.255
            nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
            media: Ethernet autoselect (1000baseT <full-duplex>)
            status: active
    bce1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
            options=c00bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,LINKSTATE>
            ether 78:2b:cb:08:a1:42
            inet6 fe80::7a2b:cbff:fe08:a142%bce1 prefixlen 64 scopeid 0x4
            inet 10.10.3.2 netmask 0xffffff00 broadcast 10.10.3.255
            nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
            media: Ethernet autoselect (1000baseT <full-duplex>)
            status: active
    pflog0: flags=100<PROMISC> metric 0 mtu 33664
    pfsync0: flags=41<UP,RUNNING> metric 0 mtu 1940
            pfsync: syncdev: bce1 syncpeer: 10.10.3.3 maxupd: 128 syncok: 1
    enc0: flags=0<> metric 0 mtu 1536
    lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
            options=3<RXCSUM,TXCSUM>
            inet 127.0.0.1 netmask 0xff000000
            inet6 ::1 prefixlen 128
            inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8
            nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
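
The pfsync0 stanza in that output carries the settings discussed in this thread: the sync device, the unicast peer, and the MTU the interface actually ended up with. A small sketch for pulling those fields out, for example to confirm after a reboot that they are still what was configured, might look like the following. The function name is hypothetical, and the sample is the pfsync0 stanza from the output above.

```python
import re

# The pfsync0 stanza from the ifconfig output above.
IFCONFIG_PFSYNC = """\
pfsync0: flags=41<UP,RUNNING> metric 0 mtu 1940
        pfsync: syncdev: bce1 syncpeer: 10.10.3.3 maxupd: 128 syncok: 1
"""


def parse_pfsync_interface(text):
    """Extract mtu, syncdev, syncpeer, maxupd and syncok from ifconfig text."""
    info = {}
    mtu = re.search(r"^\s*pfsync\d+:.*\bmtu (\d+)", text, re.MULTILINE)
    if mtu:
        info["mtu"] = int(mtu.group(1))
    # Each setting appears as "key: value" on the "pfsync:" detail line.
    for key, value in re.findall(r"\b(syncdev|syncpeer|maxupd|syncok): (\S+)", text):
        info[key] = int(value) if value.isdigit() else value
    return info


if __name__ == "__main__":
    for key, value in parse_pfsync_interface(IFCONFIG_PFSYNC).items():
        print(f"{key}: {value}")
```

Here the peer 10.10.3.3 sits on the same /24 as bce1 (10.10.3.2), which matches the unicast setup described earlier in the thread.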



  • Oh, I thought you were referring to the Ethernet interface where you're sending the pfsync traffic. The pfsync interface itself doesn't need to be touched.



  • Well, I'm getting unexplained errors.  Increasing the MTU on the CARP and pfsync0 interfaces helped.  If I could go a little further with the MTU on the pfsync0 interface I think it might solve my problem.  I cannot seem to push the pfsync interface beyond 1940.



  • I got a chance to power cycle the Master today.  That did not help.  Since this problem started occurring after upgrading to 2.0.1, I'm tempted to open a bug report.  The issue seems to relate to the number of states we are running.  We had been set up (by default, I think) for 388K states.  As we were running as many as 350K states, I changed the systems to support 800K states, and that seems to have made the problem a little worse.  I cannot see a way to configure my way out of this issue, and I believe the hardware and physical layer are working properly (I can't find any problems there).  Any other thoughts from the community are appreciated.

