pfSense CARP VIPs switch between MASTER and BACKUP during XMLRPC sync
-
I'm seeing some weird behavior with CARP master and backup timeouts during XMLRPC sync. We were running 2.5 with pfBlockerNG and things worked great for a long time. I'm not sure what changed, but some of the IP lists (FireHOL and others) grew very large.
The pfBlockerNG cron job runs every minute, but because of the limits it only really updates once an hour. During this update an XMLRPC sync occurs and some of the 10-15 CARP IPs (usually about 5-7) time out. It lasts for about 1 second, and then they go back to backup mode. We also see ICMP ping times rise above 100 ms during the sync. There is no rhyme or reason to which of the CARP IPs time out for that second; it appears to be random each time.
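To line the ping spikes up with the sync window, the output of a long-running ping can be filtered for slow replies. This is only a rough sketch: it assumes the usual `icmp_seq=... time=... ms` reply format, and the function name, the sample addresses, and the 100 ms default threshold are my own choices for illustration.

```python
import re

# Assumption: replies look like
# "64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=143.7 ms"
RTT_RE = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+) ms")

def slow_replies(ping_output, threshold_ms=100.0):
    """Return (icmp_seq, rtt_ms) for every reply slower than threshold_ms."""
    slow = []
    for line in ping_output.splitlines():
        match = RTT_RE.search(line)
        if match and float(match.group(2)) > threshold_ms:
            slow.append((int(match.group(1)), float(match.group(2))))
    return slow

# Made-up sample output, only to show the shape of the input.
sample = """64 bytes from 192.0.2.1: icmp_seq=0 ttl=64 time=0.412 ms
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=143.7 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.398 ms"""
print(slow_replies(sample))
```

Comparing the sequence numbers it reports against the time the sync starts should show whether the latency spike and the CARP timeout happen together.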
I also noticed that during one of the timeouts the pfBlockerNG log showed an update where the only thing it did was remove 6 IPs and add 2. With row-level locking in the newer pfSense, I don't think the large tables are causing the problem.
What we did to try to resolve it: we upgraded to 2.6 with the latest pfBlockerNG. The timeouts still occur, and are actually much worse and more reproducible. We then tried a new switch (both switches support VRRP), and the timeouts still occur. We do have VLANs and a Chelsio 10G NIC, not that it should matter?
The pfBlockerNG sync members are always disabled, i.e. "enabled" is not checked for the member in the list.
I unchecked all of the XMLRPC sync items on the main pfSense: no CARP timeouts occur.
I went through and rechecked the top half: CARP timeouts occur.
Reduced the XMLRPC sync list to about the top 10: CARP timeouts still occur.
Reduced the XMLRPC sync list to about the top 5: CARP timeouts still occur.
Reduced the XMLRPC sync list to the top 1 (just users): testing this now.
Is there some way to slow down XMLRPC sync? Is there another bug going on?
I'd like to keep the routers in sync automatically, but I am starting to exhaust my ability to resolve this, so I'm hoping you guys can get me headed in the right direction.
-
This config has been upgraded over the years from 1.2.3 all the way to 2.6. I'm wondering whether any of these tunables should be removed or adjusted, since they might be impacting CARP and XMLRPC sync?
System Tunables
NOTE: The options on this page are intended for use by advanced users only.
Tunable Name | Description | Value
debug.pfftpproxy | Disable the pf ftp proxy handler. | 0
vfs.read_max | Increase UFS read-ahead speeds to match current state of hard drives and NCQ. More information here: http://ivoras.sharanet.org/blog/tree/2010-11-19.ufs-read-ahead.html | default (32)
net.inet.ip.portrange.first | Set the ephemeral port range to be lower. | default (1024)
net.inet.tcp.blackhole | Drop packets to closed TCP ports without returning a RST | default (2)
net.inet.udp.blackhole | Do not send ICMP port unreachable messages for closed UDP ports | default (1)
net.inet.ip.random_id | Randomize the ID field in IP packets (default is 0: sequential IP IDs) | default (1)
net.inet.tcp.drop_synfin | Drop SYN-FIN packets (breaks RFC1379, but nobody uses it anyway) | default (1)
net.inet.ip.redirect | Enable sending IPv4 redirects | default (1)
net.inet6.ip6.redirect | Enable sending IPv6 redirects | default (1)
net.inet6.ip6.use_tempaddr | Enable privacy settings for IPv6 (RFC 4941) | default (0)
net.inet6.ip6.prefer_tempaddr | Prefer privacy addresses and use them over the normal addresses | default (0)
net.inet.tcp.syncookies | Generate SYN cookies for outbound SYN-ACK packets | default (1)
net.inet.tcp.recvspace | Maximum incoming/outgoing TCP datagram size (receive) | default (65228)
net.inet.tcp.sendspace | Maximum incoming/outgoing TCP datagram size (send) | default (65228)
net.inet.tcp.delayed_ack | Do not delay ACK to try and piggyback it onto a data packet | default (0)
net.inet.udp.maxdgram | Maximum outgoing UDP datagram size | default (57344)
net.link.bridge.pfil_onlyip | Handling of non-IP packets which are not passed to pfil (see if_bridge(4)) | default (0)
net.link.bridge.pfil_member | Set to 0 to disable filtering on the incoming and outgoing member interfaces. | default (1)
net.link.bridge.pfil_bridge | Set to 1 to enable filtering on the bridge interface | default (0)
net.link.tap.user_open | Allow unprivileged access to tap(4) device nodes | default (1)
kern.randompid | Randomize PID's (see src/sys/kern/kern_fork.c: sysctl_kern_randompid()) | default (347)
net.inet.ip.intr_queue_maxlen | Maximum size of the IP input queue | default (1000)
hw.syscons.kbd_reboot | Disable CTRL+ALT+Delete reboot from keyboard. | default (0)
net.inet.tcp.inflight.enable | Enable TCP Inflight mode | default ()
net.inet.tcp.log_debug | Enable TCP extended debugging | default (0)
net.inet.icmp.icmplim | Set ICMP Limits | default (0)
net.inet.tcp.tso | TCP Offload Engine | default (1)
net.inet.udp.checksum | UDP Checksums | default (1)
kern.ipc.maxsockbuf | Maximum socket buffer size | default (4262144)
kern.ipc.maxsockbuf | Maximum socket buffer size - set by FRR package | 16777216
kern.timecounter.hardware | default is TSC-low ntpd needs this in 2.5.1- softlowd needs hpet | HPET
net.link.vlan.mtag_pcp | Retain VLAN PCP information as packets are passed up the stack | 1
net.inet.ip.process_options | Enable IP options processing ([LS]SRR, RR, TS) | 0 (0)
kern.random.harvest.mask | Entropy harvesting mask | 351
net.route.netisr_maxqlen | maximum routing socket dispatch queue length | 1024
net.inet.icmp.reply_from_interface | ICMP reply from incoming interface for non-local packets | 1
net.inet6.ip6.rfc6204w3 | Accept the default router list from ICMPv6 RA messages even when packet forwarding is enabled | 1
net.key.preferred_oldsa | | 0
net.inet.carp.senderr_demotion_factor | Send error demotion factor adjustment | 0 (0)
net.pfsync.carp_demotion_factor | pfsync's CARP demotion factor adjustment | 0 (0)
net.raw.recvspace | Default raw socket receive space | 65536
net.raw.sendspace | Default raw socket send space | 65536
net.inet.raw.recvspace | Maximum space for incoming raw IP datagrams | 131072
net.inet.raw.maxdgram | Maximum outgoing raw IP datagram size | 131072
kern.corefile | Process corefile name format string | /root/%N.core
-
I was able to track down a bit of a solution.
We had disabled hardware offloads. They are now turned back on, which makes XMLRPC sync much quicker and lowers load and CPU usage.
Also, we have two WANs, and on each WAN we had two OpenVPN servers listening for different purposes. 7-8 years ago we were told that it is best to have each VPN server listen on localhost and then NAT port forward each external port, so that each WAN can reach the same server. It appears that if we do this now, each time an XMLRPC sync occurs it causes pfctl and the reload scripts to thrash and loop 3 or more times.
We see this occurring over and over with localhost:
php-fpm[6973]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use VPN
The solution for now is to listen on a single CARP IP. This means we are not able to OpenVPN in on the backup WAN, but at least VPN works on both the master and backup servers, just not on the backup WAN.
All XMLRPC sync items are re-enabled and there have been no CARP timeouts so far.
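For anyone who wants to check whether they are hitting the same reload loop, a rough way is to count how often the rc.openvpn message quoted above repeats in quick succession in the system log. The sketch below assumes each line starts with a BSD-style syslog timestamp such as "Mar  3 14:00:05"; the function name, the 60 second window, and the sample hostname are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Message text taken from the log line quoted in this thread.
RELOAD_MARKER = ("/rc.openvpn: OpenVPN: One or more OpenVPN tunnel "
                 "endpoints may have changed its IP")

# Assumption: lines begin with a BSD-style syslog timestamp, e.g. "Mar  3 14:00:05".
TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2})")

def reload_bursts(lines, window_seconds=60):
    """Group reload messages that follow the previous one within window_seconds."""
    bursts = []
    last = None
    for line in lines:
        if RELOAD_MARKER not in line:
            continue
        match = TIMESTAMP_RE.match(line)
        if not match:
            continue
        # Syslog timestamps carry no year, so a fixed one is prepended for parsing.
        stamp = " ".join(match.group(1).split())
        ts = datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")
        if last is not None and ts - last <= timedelta(seconds=window_seconds):
            bursts[-1].append(ts)
        else:
            bursts.append([ts])
        last = ts
    return bursts

# Made-up sample lines; "fw1" is a placeholder hostname.
tail = " fw1 php-fpm[6973]: " + RELOAD_MARKER + ". Reloading endpoints that may use VPN"
sample_lines = [
    "Mar  3 14:00:05" + tail,
    "Mar  3 14:00:09" + tail,
    "Mar  3 14:00:14" + tail,
    "Mar  3 14:00:20 fw1 some unrelated log line",
    "Mar  3 15:00:06" + tail,
]
print([len(burst) for burst in reload_bursts(sample_lines)])
```

A burst of 3 or more messages around each sync would match the thrashing described above, while a single message per sync would not.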