pfSense periodically drops or misroutes packets
-
In an attempt to rule out issues with the pf state code, I created a set of stateless rules for SSH traffic in and out of the firewall via the LAN and OpenVPN interfaces. I verified that establishing an ssh connection did not create an entry in the state table. However, at some point the firewall still stops forwarding packets out the LAN interface.
Packets on VPN interface:
16:47:44.612931 IP spock.cora.nwra.com.ssh > vpn6.cora.nwra.com.34640: Flags [.], ack 33826, win 581, options [nop,nop,TS val 530726102 ecr 100216044], length 0 16:47:49.619397 IP spock.cora.nwra.com.ssh > vpn6.cora.nwra.com.34640: Flags [P.], seq 38630:38682, ack 33826, win 581, options [nop,nop,TS val 530731109 ecr 100216044], length 52 16:47:49.732568 IP vpn6.cora.nwra.com.34640 > spock.cora.nwra.com.ssh: Flags [P.], seq 33826:33878, ack 38682, win 982, options [nop,nop,TS val 100221164 ecr 530731109], length 52 16:47:49.935345 IP vpn6.cora.nwra.com.34640 > spock.cora.nwra.com.ssh: Flags [P.], seq 33826:33878, ack 38682, win 982, options [nop,nop,TS val 100221367 ecr 530731109], length 52 16:47:49.988355 IP spock.cora.nwra.com.ssh > vpn6.cora.nwra.com.34640: Flags [P.], seq 38630:38682, ack 33826, win 581, options [nop,nop,TS val 530731478 ecr 100216044], length 52 16:47:49.989170 IP vpn6.cora.nwra.com.34640 > spock.cora.nwra.com.ssh: Flags [.], ack 38682, win 982, options [nop,nop,TS val 100221421 ecr 530731478,nop,nop,sack 1 {38630:38682}], length 0 16:47:50.341554 IP vpn6.cora.nwra.com.34640 > spock.cora.nwra.com.ssh: Flags [P.], seq 33826:33878, ack 38682, win 982, options [nop,nop,TS val 100221773 ecr 530731478], length 52
Packets on LAN interface:
16:47:44.612683 IP vpn6.cora.nwra.com.34640 > spock.cora.nwra.com.ssh: Flags [P.], seq 33774:33826, ack 38630, win 982, options [nop,nop,TS val 100216044 ecr 530726091], length 52 16:47:44.612902 IP spock.cora.nwra.com.ssh > vpn6.cora.nwra.com.34640: Flags [.], ack 33826, win 581, options [nop,nop,TS val 530726102 ecr 100216044], length 0 16:47:49.619367 IP spock.cora.nwra.com.ssh > vpn6.cora.nwra.com.34640: Flags [P.], seq 38630:38682, ack 33826, win 581, options [nop,nop,TS val 530731109 ecr 100216044], length 52 16:47:49.988324 IP spock.cora.nwra.com.ssh > vpn6.cora.nwra.com.34640: Flags [P.], seq 38630:38682, ack 33826, win 581, options [nop,nop,TS val 530731478 ecr 100216044], length 52 16:47:50.490315 IP spock.cora.nwra.com.ssh > vpn6.cora.nwra.com.34640: Flags [P.], seq 38630:38682, ack 33826, win 581, options [nop,nop,TS val 530731980 ecr 100216044], length 52
The TCP sequence packet 33826:33878 from vpn6 never makes it out the LAN interface.
-
I've also tried disabling PF scrubbing to no effect.
-
"pfctl -x loud" did not produce any useful output either during the time when this happens.
-
Perhaps spoke too soon about the loud setting - it seems that got reset at some point. I'm now in a state where replies to an outbound TCP connection (LAN -> WAN) to port 6697 are consistently not being passed back through the LAN interface. See the following kernel message, but they don't mean anything to me:
Dec 12 15:40:35 inferno.cora.nwra.com kernel: pf: BAD state: TCP out wire: 130.185.232.126:6697 208.187.183.84:9693 stack: 130.185.232.126 Dec 12 15:40:35 inferno.cora.nwra.com kernel: pf: State failure on: 1 | 5 Dec 12 15:40:43 inferno.cora.nwra.com kernel: pf: BAD state: TCP out wire: 130.185.232.126:6697 208.187.183.84:9693 stack: 130.185.232.126:6697 10.10.41.27:39994 [lo=452368117 high=452397173 win=29200 modulator=0 wscale=7] [lo=2051634067 high=2051663379 win=227 modulator=0 wscale=7] 4:2 SA seq=2067272792 (2067272792) ack=452368117 len=0 ackskew=0 pkts=7:1 dir=in,rev Dec 12 15:40:43 inferno.cora.nwra.com kernel: pf: State failure on: 1 | 5 Dec 12 15:40:52 inferno.cora.nwra.com kernel: pf: OK ICMP 3:1 208.187.183.84 -> 52.21.76.50 state: TCP out wire: 52.21.76.50:443 208.187.183.84:26369 stack: 52.21.76.50:443 10.10.41.27 Dec 12 15:40:53 inferno.cora.nwra.com kernel: pf: BAD state: TCP out wire: 52.21.76.50:443 208.187.183.84:26369 stack: 52.21.76.50:443 10.10.41.27:52038 [lo=2657418702 high=2657445582 win=29200 modulator=0 wscale=7] [lo=80869102 high=80898414 win=105 modulator=0 Dec 12 15:40:54 inferno.cora.nwra.com kernel: pf: State failure on: 2 | 6 Dec 12 15:40:59 inferno.cora.nwra.com kernel: pf: BAD state: TCP out wire: 130.185.232.126:6697 208.187.183.84:9693 stack: 130.185.232.126:6697 10.10.41.27:39994 [lo=452368117 high=452397173 win=29200 modulator=0 wscale=7] [lo=2051634067 high=2051663379 win=227 modulator=0 wscale=7] 4:2 SA seq=2067272792 (2067272792) ack=452368117 len=0 ackskew=0 pkts=7:1 dir=in,rev Dec 12 15:40:59 inferno.cora.nwra.com kernel: pf: State failure on: 1 | 5 Dec 12 15:41:02 inferno.cora.nwra.com kernel: pf_map_addr: selected address 10.10.41.2 Dec 12 15:41:12 inferno.cora.nwra.com kernel: pf: dropping packet with ip options Dec 12 15:41:15 inferno.cora.nwra.com kernel: pf: OK ICMP 3:1 208.187.183.84 -> 130.185.232.126 state: TCP out wire: 130.185.232.126:6697 208.187.183.84:4543 stack: 130.185.232.126:6697 10.10.41.27:39994 [lo=452368116 high=452397173 win=229 modulator=0 wscale=7] [lo=4197729835 high=4197759147 win=227 modulator=0 wscale=7] 4:2 seq=4197729834
# pfctl -s state | grep 6697 #
which is odd - I would have expected to see a NAT state. I've made the LAN rule stateless in an attempt to debug this.
-
State table was a timing issue - with no packets making into the LAN interface the connection doesn't get established and so the state times out. But it does appear briefly:
igb1 tcp 208.187.183.84:30452 (10.10.41.27:56532) -> 38.229.70.22:6697 ESTABLISHED:SYN_SENT
-
And another weirdness - enabled logging on the LAN stateless rule for 6697:
pass out log quick on igb0 inet proto tcp from any port = ircs-u to <vulcan7> no state label "USER_RULE: Test stateless rule"
and it supposedly matches some packets (though not as many that come in):
Dec 12 16:23:03 inferno.cora.nwra.com filterlog: 62,,,1544481968,igb0,match,pass,out,4,0x68,,54,0,0,DF,6,tcp,60,38.229.70.22,10.10.41.27,6697,59600,0,SA,497406726,3635760457,28960,,mss;sackOK;TS;nop;wscale
But it does not appear on a tcpdump of igb0:
16:22:31.114219 IP vulcan7.cora.nwra.com.59600 > card.freenode.net.ircs-u: Flags [S], seq 3635760456, win 29200, options [mss 1460,sackOK,TS val 95690528 ecr 0,nop,wscale 7], length 0 16:23:03.210311 IP vulcan7.cora.nwra.com.59600 > card.freenode.net.ircs-u: Flags [S], seq 3635760456, win 29200, options [mss 1460,sackOK,TS val 95722624 ecr 0,nop,wscale 7], length 0 16:25:31.589412 IP vulcan7.cora.nwra.com.38846 > wolfe.freenode.net.ircs-u: Flags [S], seq 1451269209, win 29200, options [mss 1460,sackOK,TS val 95871003 ecr 0,nop,wscale 7], length 0
-
Whatever is going on, it appears to be happening to multiple connections simultaneously. I'm getting reports from multiple users of connection hangs at or about the same time.
Also, I cannot reproduce this issue on our 2.4.3 firewalls - which I'm not upgrading until I can sort this out.
: netstat -s -s tcp: 124959 packets sent 55620 data packets (8254657 bytes) 19 data packets (8604 bytes) retransmitted 65556 ack-only packets (0 delayed) 3766 control packets 115960 packets received 56267 acks (for 8253673 bytes) 3 duplicate acks 63381 packets (20226905 bytes) received in-sequence 3 completely duplicate packets (0 bytes) 1 window update packet 33 packets received after close 1761 connection requests 300 connection accepts 2 ignored RSTs in the windows 2060 connections established (including accepts) 1927 times used RTT from hostcache 1927 times used RTT variance from hostcache 2062 connections closed (including 225 drops) 1036 connections updated cached RTT on close 1036 connections updated cached RTT variance on close 2 connections updated cached ssthresh on close 56202 segments updated rtt (of 55024 attempts) 35 retransmit timeouts 1 connection dropped by rexmit timeout 18 keepalive timeouts 16 keepalive probes sent 2 connections dropped by keepalive 43475 correct ACK header predictions 57803 correct data packet header predictions 391 syncache entries added 34 retransmitted 1 dupsyn 300 completed 80 reset 10 stale 1 unreach 391 cookies sent 29 hostcache entries added 13 SACK options (SACK blocks) received 4 packets with ECN ECT(0) bit set 5 successful ECN handshakes TCP connection count by state: 2 connections in CLOSED state 9 connections in LISTEN state 9 connections in ESTABLISHED state 1 connection in TIME_WAIT state udp: 16122957 datagrams received 14469 with no checksum 8915 dropped due to no socket 147302 broadcast/multicast datagrams undelivered 592 dropped due to full socket buffers 15966148 delivered 17339474 datagrams output sctp: Packet drop statistics: Timeouts: ip: 275198040 total packets received 16589745 packets for this host 125160728 packets forwarded (46509457 packets fast forwarded) 23425 packets not forwardable 17837972 packets sent from this host 5623 output packets dropped due to no bufs, etc. 13 output packets discarded due to no route 1134 output datagrams fragmented 2845 fragments created icmp: 22945 calls to icmp_error Output histogram: echo reply: 175789 destination unreachable: 22649 time exceeded: 296 Input histogram: echo reply: 174637 destination unreachable: 397 echo: 175789 175789 message responses generated ICMP address mask responses are disabled igmp: ipsec: ah: esp: ipcomp: pim: carp: pfsync: arp: 7591 ARP requests sent 24083 ARP replies sent 407985 ARP requests received 4219 ARP replies received 412204 ARP packets received 28325 total packets dropped due to no ARP entry 1220 ARP entrys timed out ip6: 12717 total packets received 29 packets sent from this host 15 output packets discarded due to no route Input histogram: hop by hop: 1344 UDP: 10618 ICMP6: 755 Mbuf statistics: 12 one mbuf 12705 one ext mbuf 0 two or more ext mbuf 2 failures of source address selection source addresses on a non-outgoing I/F 2 addresses scope=0xf Source addresses selection rule applied: 2 same address icmp6: Output histogram: neighbor solicitation: 7 MLDv2 listener report: 20 Histogram of error messages to be generated: ipsec6: rip6:
Status: Enabled for 1 days 00:32:48 Debug: Urgent Interface Stats for igb0 IPv4 IPv6 Bytes In 14744688471 1294985 Bytes Out 111339870599 0 Packets In Passed 59113235 0 Blocked 5687 7975 Packets Out Passed 94128979 0 Blocked 54 0 State Table Total Rate current entries 7437 searches 372470510 4215.0/s inserts 7112154 80.5/s removals 7104713 80.4/s Counters match 7440531 84.2/s bad-offset 0 0.0/s fragment 12 0.0/s short 0 0.0/s normalize 1 0.0/s memory 0 0.0/s bad-timestamp 0 0.0/s congestion 0 0.0/s ip-option 1287 0.0/s proto-cksum 0 0.0/s state-mismatch 637 0.0/s state-insert 0 0.0/s state-limit 0 0.0/s src-limit 0 0.0/s synproxy 0 0.0/s map-failed 0 0.0/s
Not sure what other statistics to look at.
-
igb0 interface info
In/out packets 59242918/94295700 (13.75 GiB/103.78 GiB) In/out packets (pass) 59242918/94295700 (13.75 GiB/103.78 GiB) In/out packets (block) 13713/54 (1.84 MiB/0 B) In/out errors 0/0 Collisions 0
-
Okay, the ibg0 interfaces stats may be relevant. Checked again this morning and the number of out packets blocked has increased:
In/out packets (block) 20948/146 (2.87 MiB/2 KiB)
And it is slowly increasing. Now the thing is - I don't see any ibg0 outbound blocked packets logged in the filter log. All rules on igb0 that would block should be logged - and the counts for those rules in the rule status are 0. Now to try to correlate that block number with when I see the dropped packets.
-
Alas, no. Those block counts do not appear to increase when I see this issue. I'm at a loss to explain what is happening to these packets.
-
Any resolution on this topic?
I'm currently investigating a similar issue across multiple firewalls running 2.4.5. -
Not really, though I don't think I've seen this issue in quite a while.