Bizarre packet loss issue on WAN when using NAT 1:1
Okay, so it's definitely pfSense from what I can tell. I saw another thread with the same problem: https://forum.pfsense.org/index.php?topic=797.0
In my case though, the VIP subnet mask is correct at /29 - but here's the kicker: I can set my VIP subnet mask to whatever I want, such as /30, /31, even /32! It still works, and serious packet loss still occurs with no traffic on the VIP.
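For anyone following along, here is a minimal sketch of what each of those masks means in terms of address count. The VIP below is a made-up address from the documentation range, not our real one, and this only shows the arithmetic; it doesn't explain why the VIP keeps working with any of them.

```python
import ipaddress

# Hypothetical VIP from the RFC 5737 documentation range (assumption, not the real address).
VIP = "203.0.113.10"


def covered_addresses(vip: str, prefix_len: int) -> int:
    """Return how many addresses the network containing `vip` spans at this prefix length."""
    # strict=False lets us pass a host address and get back its enclosing network.
    network = ipaddress.ip_network(f"{vip}/{prefix_len}", strict=False)
    return network.num_addresses


for prefix_len in (29, 30, 31, 32):
    network = ipaddress.ip_network(f"{VIP}/{prefix_len}", strict=False)
    print(f"/{prefix_len}: {network} spans {covered_addresses(VIP, prefix_len)} address(es)")
```

So the /29 spans 8 addresses and the /32 spans only the VIP itself, yet the behaviour on our box was the same either way.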
I'm beginning to believe it's a problem with a pfSense update, so I may have to downgrade our SG-4860-1U somehow through a USB reinstallation on the hardware, since resetting to Factory Defaults clearly doesn't resolve the problem. If a downgrade doesn't resolve it, then I'll have to open a support call with pfSense support.
End Of Update
I think we have a situation indicating a bad firewall, switch, modem or congested gateway or something. I'm a software developer, but my secondary function is ad-hoc IT, so I know more in some areas and less in others than actual network gurus.
We're a small business running a new pfSense SG-4860-1U since roughly April 2016. We have 60Mb down, 6Mb up with our ISP, and with our Traffic Shaping rules our packet loss is always at 0.0% and RTTsd is low. It's been this way for months. Right before I left for a two-week travelling vacation in early October, we had a series of power outages and our HP ProCurve uplink port eventually just froze up solid on link lights. I had to change the port as a short-term solution, then later hard-boot the system to get it back. This may or may not be relevant, but when I got back a day ago and looked at pfSense, our packet loss was a steady 10%~35% while our RTT and RTTsd were pretty low at 10ms and 1.8~8ms, respectively. Staff say working remote is impossible... obviously.
So what I did:
1. Took a backup of our configuration.
2. Replaced the cable from our pfSense to HP ProCurve uplink.
3. Did tcpdumps on our WAN and LAN ports to wireshark using commands:
plink.exe -P 622 -ssh -pw [pw] firstname.lastname@example.org tcpdump -n -nn -s 0 -U -i igb1 -w - port not 622| "C:\Program Files\Wireshark\dumpcap.exe" -i - -b filesize:65535 -b files:100 -w "C:\users[name]\Documents\Wireshark\igb1 wan\capture.pcapng"
plink.exe -P 622 -ssh -pw [pw] email@example.com tcpdump -n -nn -s 0 -U -i igb0 -w - port not 622| "C:\Program Files\Wireshark\dumpcap.exe" -i - -b filesize:65535 -b files:100 -w "C:\users[name]\Documents\Wireshark\igb0 lan\capture.pcapng"
4. Looked at Conversations, traffic analysis, etc. Found no excessive packets, bandwidth use extremely low. On the LAN side there's a lot of ACK Resets, Out-Of-Sequence, etc, black-indicators everywhere. WAN has much less but they're there too. Discovered that around 48% of our WAN traffic is ARP broadcasts (average of 157 per second at around 10 KB/s) but I see online that this is normal apparently.
5. Checked firewall system log and noticed constant IPSec VPN Filter Reloads each minute due to something similar to: Endpoint IP changed notification. Disabling the IPSec tunnel seemed to do the trick.
6. Problem started again. Tried disabling traffic shaper. Problem gets slightly worse.
7. Changed WAN port to OPT1. Problem resolves for like 5 minutes then restarts.
8. Did a speed test and began pinging 184.108.40.206 from pfSense - lost roughly 25% of pings and speed test was up and down, all over the place.
9. Connected laptop directly to cable modem and did a speed test - went perfectly, could ping 220.127.116.11 hundreds of times, etc.
10. I reset the firewall and bingo - all is well! Or so I thought. Restoring backup causes extreme packet loss again, even after-hours with barely any bandwidth being used.
11. Begin restoring pieces of configuration backup in order: IPSec, NAT, Firewall Rules… firewall rules being restored causes packet loss to be extreme again. So I find out that when I disable our 1:1 NAT for a Virtual IP going to our SimpleHelp Ubuntu server that packet loss goes down to 0.0%.
12. I reset the firewall to defaults, set up just our WAN Gateway and a 1:1 NAT to our SimpleHelp server, and immediately got packet loss.
13. I reset the firewall to defaults again and packet loss is a steady 2%. It's midnight and I'm cranky so I restore our backup config, disable our 1:1 NAT, packet loss averages 7.5% today without our 1:1. Again, RTTsd is low. None of this behavior is normal for us.
14. I wireshark our WAN and LAN and enable our NAT 1:1 to see if maybe something sticks out. RTTsd leads RTT by about 1ms, packet loss hits 22%. O.K., how does our NAT 1:1 to a server that's barely used at all cause massive packet loss when it doesn't use even 1% of our bandwidth?
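A quick sanity check on the ARP numbers from step 4, as a rough sketch. It takes "around 10 KB/s" as 10,000 bytes per second (an assumption) and compares it against our 60Mb downstream.

```python
# Figures from step 4 of the list above.
ARP_PACKETS_PER_SEC = 157
ARP_BYTES_PER_SEC = 10 * 1000  # "around 10 KB/s", assumed to mean 10,000 bytes/s
DOWNSTREAM_MBIT = 60           # our 60Mb down


def avg_frame_size(bytes_per_sec: float, packets_per_sec: float) -> float:
    """Average bytes per frame for a given byte rate and packet rate."""
    return bytes_per_sec / packets_per_sec


def link_share(bytes_per_sec: float, link_mbit: float) -> float:
    """Fraction of a link (given in Mbit/s) consumed by a byte rate."""
    return (bytes_per_sec * 8) / (link_mbit * 1_000_000)


size = avg_frame_size(ARP_BYTES_PER_SEC, ARP_PACKETS_PER_SEC)
share = link_share(ARP_BYTES_PER_SEC, DOWNSTREAM_MBIT)
print(f"average ARP frame: {size:.1f} bytes")
print(f"share of downstream: {share * 100:.2f}%")
```

That works out to roughly 64 bytes per frame, which is about the minimum Ethernet frame size and what you'd expect from padded ARP frames. It's also a tiny fraction of the downstream link, so the ARP broadcasts can be a large share of the packet count without coming anywhere near saturating the line.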
Continued Wireshark analysis:
Filtering WAN traffic to our SimpleHelp server with ip.src == VirtualIP || ip.dst == VirtualIP, I see it takes up 0.2% of our WAN traffic, but it's filled with: ICMP 282 Destination Unreachable/QUIC 254 Payload Encrypted/ICMP/QUIC/forever and everywhere ... at first glance, it kinda looks like we're being DDoS'd from a Canadian Broadband address.
Filtered LAN traffic for ip.src == internal 1:1 IP || ip.dst == internal 1:1 IP
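For reference, the two filters above do the same thing as Wireshark's shorthand ip.addr == X, which matches a packet if either the source or the destination is X. Below is a small sketch of that logic and of how a "share of traffic" figure can be worked out by bytes. The Packet type, the addresses and the sizes are all made up for illustration; none of it comes from our capture.

```python
from typing import List, NamedTuple


class Packet(NamedTuple):
    """Hypothetical, simplified packet record (not a real capture format)."""
    src: str
    dst: str
    length: int  # bytes


def matches_host(pkt: Packet, host: str) -> bool:
    """Same idea as `ip.src == host || ip.dst == host`."""
    return pkt.src == host or pkt.dst == host


def host_share(packets: List[Packet], host: str) -> float:
    """Fraction of total bytes that involve `host` as source or destination."""
    total = sum(p.length for p in packets)
    if total == 0:
        return 0.0
    matched = sum(p.length for p in packets if matches_host(p, host))
    return matched / total


# Made-up sample using documentation addresses.
VIP = "203.0.113.10"
sample = [
    Packet("198.51.100.7", VIP, 100),
    Packet(VIP, "198.51.100.7", 100),
    Packet("192.0.2.1", "192.0.2.2", 800),
]
print(f"{host_share(sample, VIP) * 100:.1f}% of bytes involve the VIP")
```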
I investigate further and the IP address sending continuous UDP packets (QUIC) is one of our remote laptops running a SimpleHelp service. It's not even really using much bandwidth with its packets anyways - like B/s, not KB or MB. However, I disable the service remotely on the laptop anyways and the packet loss decreases to around 3~8%.
So with NAT 1:1 our packet loss increases. Without it, it's great - most of the time. In 300 seconds our 1:1 Virtual IP might get 4 requests from the usual sources such as SSH attempts from Germany, Ukraine, etc, SYN from China, Brazil, a few Spurious Retransmissions from them but that's it. Practically 0% bandwidth used by our Virtual IP.
In my tests if I just start an HTTP GET to our SimpleHelp server with 1:1 enabled from a remote site our WANGW hits losses of greater than 22% (from 5 GETs in a row). If I stop refreshing the page all is well, it trickles back down to about 5% now. If I disable 1:1 NAT to the server, WANGW goes back to 0.0% and intermittently rises every now and then.
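To be clear about what the loss, RTT and RTTsd figures mean, here is a minimal sketch of how they can be derived from a series of ping probes. This is not how pfSense's gateway monitor actually computes them (I haven't checked its code), and the sample values are invented. The point is just that loss counts missing replies and has nothing to do with how much bandwidth is in use.

```python
import statistics
from typing import List, Optional, Tuple


def summarize(samples: List[Optional[float]]) -> Tuple[float, Optional[float], Optional[float]]:
    """Summarize ping probes.

    `samples` holds one entry per probe: the RTT in ms, or None if no reply came back.
    Returns (loss percent, mean RTT, RTT standard deviation).
    """
    if not samples:
        return 0.0, None, None
    replies = [s for s in samples if s is not None]
    loss_pct = 100.0 * (len(samples) - len(replies)) / len(samples)
    if not replies:
        return loss_pct, None, None
    # Population standard deviation over the replies that did arrive.
    return loss_pct, statistics.mean(replies), statistics.pstdev(replies)


# Invented example: 10 probes, 2 of them lost.
probes = [10.0, 12.0, None, 8.0, 10.0, None, 10.0, 12.0, 8.0, 10.0]
loss, rtt, rtt_sd = summarize(probes)
print(f"loss={loss:.1f}% rtt={rtt:.1f}ms rttsd={rtt_sd:.2f}ms")
```

Note that the lost probes don't affect RTT or RTTsd at all, which matches what we see: heavy loss alongside low, stable latency.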
So here's the question: How does a NAT 1:1 that's using barely any bandwidth at all, hurt our WANGW after months of perfect operation? There's gotta be something wrong with hardware somewhere or over at our ISP. I'm guessing either pfSense SG-4860-1U or our uplink port on our old HP ProCurve but what I fail to understand is how something behind a firewall, that's not saturating bandwidth, could possibly affect the pinger to our WANGW.
Some additional information:
Our ISP leased us 3 statics from a group of 5, handled by our gateway. Two of them we can get a lease for; the other one is in use by another one of their customers using the same gateway. They know this and we pay for all 3 anyways because they don't care. I've spoken to them over the phone about half a dozen times and they won't kick that customer off the gateway or off the IP they're using but we pay for. I kind of wonder what would happen if I changed my WANGW to the IP that the 1:1 is using. I wonder if there's somehow a lease conflict going on.
Problem solved. It turns out our ISP upgraded the equipment that interfaces with our static IPs to a Cisco Multihome chassis. They were able to queue a command to their systems that allows multiple IPs on a single MAC.
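If you want to check whether you're in the same situation, the thing to look for is one MAC answering for more than one public IP, which is what a firewall's WAN port does when it also holds a 1:1 NAT VIP. Here's a rough sketch that groups (IP, MAC) pairs, for example copied by hand from an ARP table. The addresses are documentation values, not ours, and the function names are just for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def group_by_mac(observations: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each MAC address (lowercased) to the set of IPs seen using it."""
    ips_by_mac: Dict[str, Set[str]] = defaultdict(set)
    for ip, mac in observations:
        ips_by_mac[mac.lower()].add(ip)
    return dict(ips_by_mac)


def macs_with_multiple_ips(observations: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return only the MACs that answer for more than one IP."""
    return {
        mac: sorted(ips)
        for mac, ips in group_by_mac(observations).items()
        if len(ips) > 1
    }


# Made-up (IP, MAC) pairs: the first two stand in for a WAN address and a VIP
# sharing the firewall's WAN MAC; the third is some other device.
seen = [
    ("203.0.113.9", "00:00:5E:00:53:01"),
    ("203.0.113.10", "00:00:5e:00:53:01"),
    ("203.0.113.11", "00:00:5e:00:53:02"),
]
for mac, ips in macs_with_multiple_ips(seen).items():
    print(f"{mac} answers for: {', '.join(ips)}")
```

If the upstream equipment only accepts one IP per MAC, a setup like the first two entries is the kind of thing I'd expect to trip it, though that's my reading of what our ISP told us rather than anything I verified on their side.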
I'll share some information I learnt along the way in case anyone else has this problem:
a) you can try to lease the IP with CARP. This didn't work for me, but theoretically it should use a new MAC address for the connections.
b) I was able to create a fake virtual adapter using ngctl and was about to create custom routes for this to work. I didn't complete this testing but ngctl looks pretty powerful, I'd like to play with it more.