PfSync send errors
-
The username/password are for config sync, not state sync.
For state sync, on the primary, put in the IP of the sync interface on the secondary, and on the secondary, put in the IP of the sync interface on the primary (and make sure state sync is enabled on both).
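If it helps to see what that amounts to underneath, pfSense ends up driving pfsync through ifconfig on FreeBSD. Just a sketch with made-up interface names and addresses - the real values are whatever your sync interface and peer are:

# On the primary: sync over em2, peer is the secondary's sync IP (example values)
ifconfig pfsync0 syncdev em2 syncpeer 10.10.3.3 up

# On the secondary: point back at the primary's sync IP
ifconfig pfsync0 syncdev em2 syncpeer 10.10.3.2 up

# Verify what pfsync0 is actually using
ifconfig pfsync0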
-
Thank you for the response. I now have them configured for unicast sync via the interface IPs of the CARP interfaces. Unfortunately, this did not solve the problem. Interestingly, whether I'm using multicast or unicast sync, both instances appear to have a similar number of states, so the sync seems to be working (or at least mostly). Prior to 2.0.1 and this thread, I had always used multicast sync without problems. I'm only seeing the issue on the Master; the Backup does not report any send errors, though I have not tried failing over to see if the problem follows the Master (this is a production environment and I cannot experiment too much). I did a packet capture when it was set up for multicast and saw some multicast packets, but that won't show what did not get sent. Honestly, other than my monitoring system noticing and reporting the send errors, things seem to be working. Any other ideas?
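For what it's worth, the quick checks I'm using to compare the two boxes look roughly like this (run from the shell on each node; bce1 here is just my sync interface, yours will differ):

# Compare state counts between primary and backup
pfctl -si | grep -i 'current entries'

# Watch pfsync traffic on the sync interface (pfsync is IP protocol 240)
tcpdump -ni bce1 ip proto 240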
Thanks,
Steve -
I'm not sure, really. I checked a couple of CARP clusters I had handy, and the most errors I saw on that line was 1.
If it's actually syncing the states, it's probably fine. If it's getting send errors, it could also be a problem with the NIC or cable.
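If you want to rule out the link itself, the usual interface counters are visible from the shell - something along these lines (interface name is just an example):

# Per-interface input/output errors and collisions
netstat -i

# Or just the sync NIC, with byte counters included
netstat -bI bce1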
-
Okay, thanks for the info.
The nics and cable (just a crossover cable between two nics in this case) were not touched, the machines were not even power cycled, just warm reboots. The interface stats are not reporting errors or collisions. Seems like the physical layer is working right.
I'll try to find a good time to reboot the Master to see if that helps. I'll also keep digging and report back here if I figure it out.
Regards,
Steve -
I think I've made a little progress. Not sure how relevant it is, but I was reading this article on pfsync:
http://www.undeadly.org/cgi?action=article&sid=20090301211402
Just for fun, I decided to change the MTU on the CARP interfaces to 9000 from the default of 1500. Using ifconfig, I tried to change the MTU of the pfsync0 interface to 9000 as well - it only seems to actually change to 1940 (which is up from the default of 1460). That change seems to have dropped the send errors by 40% to 50%. We have a pretty busy network with 300K+ states at any given time, so with our rate of change it seems like it could be more efficient to use larger packets.
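For reference, the commands I used were along these lines (from the shell; this is just what I tried, not an official procedure, and the interface name is specific to my box):

# Raise the MTU on the physical sync interface (bce1 in my case)
ifconfig bce1 mtu 9000

# Try to raise pfsync0 as well - it tops out well below what I ask for
ifconfig pfsync0 mtu 9000
ifconfig pfsync0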
Does anyone know how I can change the MTU of the pfsync interface to something larger than 1940? And how, with pfSense, can I get that setting to persist between reboots and upgrades?
Thanks,
Steve -
You can change the MTU on the interface's page (Interfaces > <name of pfsync interface>).
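Whatever you set there is saved in the config, so it should survive reboots and upgrades. If you want to confirm it was written out, something like this from the shell will show it (assuming the standard pfSense config location):

# The interfaces page MTU ends up in the XML config
grep -n '<mtu>' /conf/config.xml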
-
Unfortunately, pfsync0 is its own interface and it does not appear in webadmin. Webadmin shows my 4 physical interfaces. Here is what ifconfig shows:
[2.0.1-RELEASE][admin@host.name.removed]/root(1): ifconfig
igb0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
igb1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM>
        ether 00:1b:21:85:04:15
        inet 10.11.1.2 netmask 0xffffff00 broadcast 10.11.1.255
        inet6 fe80::21b:21ff:fe85:415%igb1 prefixlen 64 scopeid 0x2
        nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bce0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c00bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,LINKSTATE>
        ether 78:2b:cb:08:a1:41
        inet 10.10.2.2 netmask 0xffffff00 broadcast 10.10.2.255
        inet6 fe80::7a2b:cbff:fe08:a141%bce0 prefixlen 64 scopeid 0x3
        inet 10.10.254.4 netmask 0xffffff00 broadcast 10.10.254.255
        nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
bce1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=c00bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,LINKSTATE>
        ether 78:2b:cb:08:a1:42
        inet6 fe80::7a2b:cbff:fe08:a142%bce1 prefixlen 64 scopeid 0x4
        inet 10.10.3.2 netmask 0xffffff00 broadcast 10.10.3.255
        nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
pflog0: flags=100<PROMISC> metric 0 mtu 33664
pfsync0: flags=41<UP,RUNNING> metric 0 mtu 1940
        pfsync: syncdev: bce1 syncpeer: 10.10.3.3 maxupd: 128 syncok: 1
enc0: flags=0<> metric 0 mtu 1536
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=3<RXCSUM,TXCSUM>
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x8
        nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
-
Oh, I thought you were referring to the Ethernet interface where you're sending the pfsync traffic. The pfsync interface itself doesn't need to be touched.
-
Well, I'm getting unexplained errors, and increasing the MTU on the CARP and pfsync0 interfaces helped. If I could push the MTU on the pfsync0 interface a little further, I think it might solve my problem, but I cannot seem to get the pfsync interface beyond 1940.
-
I got a chance to power cycle the Master today. That did not help. Since this problem started occurring after upgrading to 2.0.1, I'm tempted to open a bug report. The issue seems to relate to the number of states we are running. We had been set up (by default, I think) for 388K states. As we were running as many as 350K states, I changed the systems to support 800K states - that seems to have made the problem a little worse. I cannot see a way to configure my way out of this issue, and I believe the hardware and physical layer are working properly (I can't find any problems there). Any other thoughts from the community are appreciated.
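For anyone comparing notes, the state numbers I'm quoting come from pfctl - roughly these commands, run from the shell on each node:

# Current state count and other pf counters
pfctl -si

# Configured limits, including the 'states' hard limit
pfctl -sm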