pfSense clears states. Help needed
-
Hello. I'm facing a very strange issue; any help is much appreciated.
There are two pfSense routers (version 2.3.2-RELEASE-p1, though I first hit this issue on 2.2.5/2.2.6) in HA mode. Sometimes one of the routers starts dropping traffic by resetting firewall states. Most of the time it happens on the MASTER node, but a few days ago it happened on the BACKUP node as well.
It looks like this: I SSH into the BACKUP node and the connection stalls after a few (up to 10-15) seconds. tcpdump on the BACKUP node console shows incoming TCP packets to port 22, but no replies. I checked the state table (pfctl -ss) and found that the entry for my SSH connection was gone.
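To pin down exactly when the entry disappears, the manual check can be turned into a small polling loop. This is only a minimal sketch: the function names has_state and watch_state and the address in the example are hypothetical placeholders, and it assumes it is run as root on the firewall so that pfctl -ss can read the state table.

```shell
#!/bin/sh
# Sketch: log the moment a given connection vanishes from the pf state table.
# Function names and the example address are hypothetical.

# has_state PATTERN: succeed if any state-table line on stdin contains PATTERN
# (fixed-string match, so dots in IP addresses are not treated as wildcards).
has_state() {
    grep -qF -- "$1"
}

# watch_state PATTERN [INTERVAL]: poll `pfctl -ss` every INTERVAL seconds
# (default 1) until no line contains PATTERN, then print a timestamp.
# Note: if pfctl itself fails (e.g. not root), this also reports "gone".
watch_state() {
    pattern=$1
    interval=${2:-1}
    while pfctl -ss | has_state "$pattern"; do
        sleep "$interval"
    done
    echo "$(date '+%Y-%m-%d %H:%M:%S') state matching '$pattern' is gone"
}

# Example (client address and source port are placeholders):
#   watch_state "192.168.210.50:51234"
```

The printed timestamp can then be lined up against the system logs and against a tcpdump capture taken at the same time.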
After some research I found that the issue goes away when I switch the CARP interface off (clear the "Enable" checkbox in the web interface). Re-enabling the interface brings the issue back. Temporarily disabling CARP (on the BACKUP node) instead of switching the interface off doesn't help.
It also helps if I disable the firewall on the BACKUP node with the "pfctl -d" command. After this the states are still getting cleared on the MASTER node but remain on the BACKUP.
At this moment I'm trying to find out which part of the system clears the states on the MASTER node. I tried stopping service daemons, checked the Schedule States (it's ticked) and Killing States (it's cleared) checkboxes, ran "pfctl -xl" (nothing appears in the logs while the states are being cleared), and reconfigured HA from scratch using LAGG instead of plain Ethernet for the SYNC interface. Nothing helps. From time to time the issue disappears, but then it comes back again.
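One more rough signal that could be collected here is the rate of state removals reported by pfctl -si. The sketch below is an assumption-laden helper, not a pfSense tool: the function names removals_total and watch_removals are made up, it must run as root, and normal state expiry also counts as a removal, so only an unusual jump is interesting.

```shell
#!/bin/sh
# Sketch: print how many states pf removed during each polling interval.
# Function names are hypothetical.

# removals_total: read `pfctl -si` output on stdin and print the cumulative
# "removals" counter (first matching line, i.e. the State Table section).
removals_total() {
    awk '$1 == "removals" { print $2; exit }'
}

# watch_removals [INTERVAL]: every INTERVAL seconds (default 5) print a
# timestamp and the number of removals since the previous sample.
watch_removals() {
    interval=${1:-5}
    prev=$(pfctl -si | removals_total)
    while sleep "$interval"; do
        cur=$(pfctl -si | removals_total)
        echo "$(date '+%H:%M:%S') removals in last ${interval}s: $((cur - prev))"
        prev=$cur
    done
}

# Example: watch_removals 5
```

Running this on both nodes at the same time would show whether removals on one node coincide with removals on the other.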
I thought about trying dtrace, but it appears to be unavailable:
dtrace: failed to initialize dtrace: DTrace device not available on system
Here is the list of running processes on the MASTER node:
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 11 190.0 0.0 0 16 - RL 12Oct16 35124:28.52 [idle]
root 12 9.2 0.0 0 176 - WL 12Oct16 905:49.41 [intr]
root 34323 1.0 0.5 12636 5388 - Ss Tue08PM 3:29.85 /usr/local/sbin/openvpn --config /var/etc/openvpn/server2.conf
root 87968 0.4 3.1 85624 31468 - S 4:47PM 0:00.22 php-fpm: pool nginx (php-fpm)
root 2886 0.2 0.5 12104 4764 - Ss Wed05PM 46:43.68 /usr/local/sbin/miniupnpd -f /var/etc/miniupnpd.conf -P /var/run/miniupnpd.pid
root 15 0.1 0.0 0 8 - DL 12Oct16 62:55.86 [rand_harvestq]
root 0 0.0 0.0 0 88 - DLs 12Oct16 0:03.24 [kernel]
root 1 0.0 0.1 9060 688 - ILs 12Oct16 0:00.08 /sbin/init --
root 2 0.0 0.0 0 8 - DL 12Oct16 0:00.00 [crypto]
root 3 0.0 0.0 0 8 - DL 12Oct16 0:00.00 [crypto returns]
root 4 0.0 0.0 0 16 - DL 12Oct16 1:22.12 [cam]
root 5 0.0 0.0 0 8 - DL 12Oct16 0:05.17 [fdc0]
root 6 0.0 0.0 0 8 - DL 12Oct16 12:37.83 [pf purge]
root 7 0.0 0.0 0 8 - DL 12Oct16 0:00.00 [sctp_iterator]
root 8 0.0 0.0 0 16 - DL 12Oct16 0:16.04 [pagedaemon]
root 9 0.0 0.0 0 8 - DL 12Oct16 0:00.00 [vmdaemon]
root 10 0.0 0.0 0 8 - DL 12Oct16 0:00.00 [audit]
root 13 0.0 0.0 0 16 - DL 12Oct16 0:00.00 [ng_queue]
root 14 0.0 0.0 0 24 - DL 12Oct16 0:00.14 [geom]
root 16 0.0 0.0 0 200 - DL 12Oct16 0:25.31 [usb]
root 17 0.0 0.0 0 8 - DL 12Oct16 0:54.01 [acpi_thermal]
root 18 0.0 0.0 0 8 - DL 12Oct16 0:00.84 [acpi_cooling0]
root 19 0.0 0.0 0 8 - DL 12Oct16 0:01.56 [idlepoll]
root 20 0.0 0.0 0 8 - DL 12Oct16 0:00.02 [pagezero]
root 21 0.0 0.0 0 8 - DL 12Oct16 0:07.42 [bufdaemon]
root 22 0.0 0.0 0 8 - DL 12Oct16 10:48.01 [syncer]
root 23 0.0 0.0 0 8 - DL 12Oct16 0:05.99 [vnlru]
root 58 0.0 0.0 0 8 - DL 12Oct16 0:14.21 [md0]
root 614 0.0 2.5 81528 25456 - Ss 12Oct16 1:56.85 php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
root 658 0.0 0.4 9404 4304 - Is 12Oct16 0:00.56 /sbin/devd -q
root 1001 0.0 0.2 10108 1892 - Ss Tue07PM 0:00.93 /usr/sbin/cron -s
root 1717 0.0 0.2 10172 1888 - Is Wed05PM 0:00.00 /usr/local/sbin/upsmon
uucp 1982 0.0 0.2 10172 1904 - S Wed05PM 0:18.83 /usr/local/sbin/upsmon
root 7146 0.0 0.7 17644 7136 - Ss 12:34PM 0:01.56 sshd: root@pts/0 (sshd)
root 11992 0.0 0.7 15032 6800 - Is 12Oct16 0:00.02 /usr/sbin/sshd
root 12058 0.0 0.2 14328 1836 - Is 12Oct16 0:00.02 /usr/local/sbin/sshlockout_pf 15
root 12702 0.0 0.2 10148 1896 - Ss 13Oct16 2:32.17 /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run/syslog.pid -f /var/etc/syslog.conf
root 14566 0.0 0.6 12636 5648 - Ss Tue08PM 0:30.55 /usr/local/sbin/openvpn --config /var/etc/openvpn/server1.conf
root 21958 0.0 0.2 10236 1988 - Ss 12Oct16 1:05.44 /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
root 24578 0.0 0.2 10424 2072 - Is 12Oct16 0:00.44 /usr/local/sbin/xinetd -syslog daemon -f /var/etc/xinetd.conf -pidfile /var/run/xinetd.pid
root 34879 0.0 0.2 10232 1784 - Is 13Oct16 0:00.02 /usr/local/sbin/sshlockout_pf 15
root 39239 0.0 0.5 23924 5380 - Is 12Oct16 0:00.00 nginx: master process /usr/local/sbin/nginx -c /var/etc/nginx-webConfigurator.conf (nginx)
root 39358 0.0 0.6 23924 6316 - S 12Oct16 2:52.30 nginx: worker process (nginx)
root 39634 0.0 0.6 23924 6292 - S 12Oct16 2:49.24 nginx: worker process (nginx)
root 40986 0.0 0.2 9948 1572 - Is 12Oct16 0:00.00 /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/bin/ping_hosts.sh
root 41233 0.0 0.2 9948 1584 - I 12Oct16 0:00.52 minicron: helper /usr/local/bin/ping_hosts.sh (minicron)
root 41512 0.0 0.2 9948 1572 - Is 12Oct16 0:00.00 /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts
root 41775 0.0 0.2 9948 1584 - I 12Oct16 0:00.04 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.expireaccounts (minicron)
root 41902 0.0 0.2 9948 1572 - Is 12Oct16 0:00.00 /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data
root 42380 0.0 0.2 9948 1584 - I 12Oct16 0:00.00 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.update_alias_url_data (minicron)
dhcpd 42782 0.0 1.2 20440 11944 - Ss 13Oct16 4:39.87 /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.pid re0_vlan1
nobody 44187 0.0 0.4 11340 3948 - S Thu01PM 4:42.85 /usr/local/sbin/dnsmasq --all-servers -C /dev/null --rebind-localhost-ok --stop-dns-rebind --listen-address=192.168.210.65 --listen-address=192.168.210.1 --listen-address=172.26.1.1 --listen-address=172.27.1.1 --listen-address=127.0.0.1 --bind-interfaces --edns-packet-max=4096 --rebind-domain-ok=/mydom.com/ --dns-forward-max=5000 --cache-size=10000 --local-ttl=1
root 46326 0.0 1.0 16800 10208 - Ss Thu01PM 0:20.83 /usr/sbin/bsnmpd -c /var/etc/snmpd.conf -p /var/run/snmpd.pid
root 48965 0.0 0.2 10460 2136 - IN Thu01PM 3:59.43 /bin/sh /var/db/rrd/updaterrd.sh
root 52722 0.0 0.6 28292 6400 - S<s 12Oct16 20:10.13 /usr/local/bin/ipcad -rds
root 55485 0.0 0.3 11484 2948 - Is 12Oct16 0:00.37 /usr/local/libexec/ipsec/starter --daemon charon
root 55694 0.0 1.4 49008 13984 - Is 12Oct16 2:53.88 /usr/local/libexec/ipsec/charon --use-syslog
root 56771 0.0 0.2 10232 1784 - Is 12Oct16 0:00.02 /usr/local/sbin/sshlockout_pf 15
root 65258 0.0 0.2 5856 1536 - IN 4:47PM 0:00.00 sleep 60
root 85053 0.0 1.7 17108 17144 - Ss 14Oct16 1:09.52 /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.pid
root 91845 0.0 0.2 10388 2060 - INs Thu02PM 0:00.02 /usr/local/sbin/check_reload_status
root 92033 0.0 0.2 10388 1988 - IN Thu02PM 0:00.00 check_reload_status: Monitoring daemon of check_reload_status
root 95968 0.0 0.2 10632 1856 - Ss Fri05PM 0:59.31 /usr/local/bin/dpinger -S -r 0 -i TransparentProxy -B 172.26.1.242 -p /var/run/dpinger_TransparentProxy~172.26.1.242~172.26.1.50.pid -u /var/run/dpinger_TransparentProxy~172.26.1.242~172.26.1.50.sock -C /etc/rc.gateway_alarm -d 1 -s 500 -l 2000 -t 60000 -A 1000 -D 500 -L 20 172.26.1.50
root 96214 0.0 0.2 18824 1992 - Ss Fri05PM 0:44.34 /usr/local/bin/dpinger -S -r 0 -i Mat -B 10.11.225.171 -p /var/run/dpinger_Mat~10.11.225.171~10.11.225.169.pid -u /var/run/dpinger_Mat~10.11.225.171~10.11.225.169.sock -C /etc/rc.gateway_alarm -d 1 -s 1000 -l 2000 -t 5000 -A 1000 -D 500 -L 20 80.70.225.169
root 96600 0.0 0.2 14728 1928 - Ss Fri05PM 1:17.26 /usr/local/bin/dpinger -S -r 0 -i Inter -B 12.13.143.12 -p /var/run/dpinger_Inter~12.13.143.12~12.13.143.9.pid -u /var/run/dpinger_Inter~12.13.143.12~12.13.143.9.sock -C /etc/rc.gateway_alarm -d 1 -s 500 -l 2000 -t 60000 -A 1000 -D 500 -L 40 5.17.143.9
root 56668 0.0 0.2 10060 1676 v0 Is+ 12Oct16 0:00.00 /usr/libexec/getty Pc ttyv0
root 7449 0.0 0.2 10460 2192 0 Is 12:34PM 0:00.01 -sh (sh)
root 7479 0.0 0.2 10460 2088 0 I 12:34PM 0:00.00 /bin/sh /etc/rc.initial
root 8469 0.0 0.3 10820 2960 0 S 12:34PM 0:00.17 /bin/tcsh
root 99127 0.0 0.2 10204 1888 0 R+ 4:48PM 0:00.00 ps axuwww
Crontab:
1,31 0-5 * * * root /usr/bin/nice -n20 adjkerntz -a
1 3 1 * * root /usr/bin/nice -n20 /etc/rc.update_bogons.sh
*/60 * * * * root /usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 sshlockout
1 1 * * * root /usr/bin/nice -n20 /etc/rc.dyndns.update
*/60 * * * * root /usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 virusprot
30 12 * * * root /usr/bin/nice -n20 /etc/rc.update_urltables
*/60 * * * * root /usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 webConfiguratorlockout
What else should I look at? Please help!
-
I think by default pfSense clears states when the upstream gateway goes down. If your master thinks the gateway has gone down, even if it's only because its link went down, it may nuke all of the states, which may then propagate to the failover node?
Entirely guessing; I've never used HA or read up on how to use it.
-
Harvy66, thanks for your reply. I didn't mention it, but my gateways are pretty stable, so that's definitely not the cause. Also, in the example I provided, the state was cleared for an SSH connection made from the local LAN to the BACKUP node only. No other states were affected.