Machine behind 1:1 NAT keeps losing connection until the Virtual IP is reset

mcrawford

We have several servers set up with static IPs on the internal network. Each of them is configured with a 1:1 NAT mapping the internal static IP to an external static IP from our ISP. The Virtual IPs were configured as Proxy ARP and everything was working fine until the morning of November 17th. After much trial and error and hours on the phone with Spectrum, I was able to get the servers working again by changing the Virtual IPs' type to IP Alias. Everything seemed fine until we noticed the MeshCentral server had stopped talking to systems outside the internal network.

The MeshCentral server is running on Ubuntu Server 18.04 and shares firewall rules (1 WAN, 1 LAN) with other Linux servers running the same version of Ubuntu.

All of these servers have the same config with only the static IP varying. The other servers continue to function with no issues.

After more trial and error I noticed that changing the Virtual IP type to Proxy ARP, applying the changes, then changing it back to IP Alias and applying again would bring it back online for 10-45 minutes. However, it always loses its ability to talk to the outside world after some time.

I have Spectrum checking the static IP on their end just in case.

Does anyone have any idea what the source of the issue might be and how to solve it permanently?

viragomann

@mcrawford
Is there anything related in the system log?
Look especially for ARP entries. I can imagine there may be conflicts showing up there.

Are the 1:1 NAT rules still in place, including for the affected device?
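To make the ARP check concrete, here is a minimal Python sketch for scanning exported system log lines for ARP anomalies. It assumes FreeBSD-style kernel messages ("moved from ... to ..." and "is using my IP address"); the exact wording on your system may differ, so verify the patterns against your own logs first. The function name `find_arp_events`, the addresses, and the sample lines are made up purely for illustration.

```python
import re

# Patterns modelled on FreeBSD-style kernel ARP messages.
# ASSUMPTION: the message wording matches your system; check real logs first.
ARP_PATTERNS = {
    "moved": re.compile(
        r"arp: (?P<ip>\d+\.\d+\.\d+\.\d+) moved from (?P<old>[0-9a-f:]{17}) "
        r"to (?P<new>[0-9a-f:]{17}) on (?P<iface>\w+)",
        re.IGNORECASE,
    ),
    "conflict": re.compile(
        r"arp: (?P<mac>[0-9a-f:]{17}) is using my IP address "
        r"(?P<ip>\d+\.\d+\.\d+\.\d+) on (?P<iface>\w+)",
        re.IGNORECASE,
    ),
}


def find_arp_events(lines, ip=None):
    """Return (kind, fields) tuples for ARP messages, optionally filtered by IP."""
    events = []
    for line in lines:
        for kind, pattern in ARP_PATTERNS.items():
            match = pattern.search(line)
            if match and (ip is None or match.group("ip") == ip):
                events.append((kind, match.groupdict()))
    return events


# Hypothetical sample lines (documentation-range IP, invented MACs).
sample = [
    "kernel: arp: 203.0.113.10 moved from 00:11:22:33:44:55 to 66:77:88:99:aa:bb on igb0",
    "kernel: arp: 66:77:88:99:aa:bb is using my IP address 203.0.113.10 on igb0!",
    "dhcpd: DHCPACK on 192.168.1.50",
]

for kind, fields in find_arp_events(sample, ip="203.0.113.10"):
    print(kind, fields)
```

If nothing matches around the time of an outage, as turned out to be the case later in this thread, the conflict may be happening upstream of the firewall where its logs cannot see it.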

mcrawford

@viragomann thanks for replying.

I haven't been able to find anything in the system logs relating to the device when it falls offline. I've checked the General logs, Gateway Logs, Routing Logs, Firewall logs, and DHCP logs at the times it drops off. Neither the external IP, internal IP, nor the MAC address of the server show up anywhere in the logs around the timeframe that the server falls offline.

The NAT and firewall rules remain the same for all of the servers and only this single server is having this issue.

The only thing I'm changing to get the server back online is the virtual IP. I think swapping the type back and forth is just causing a reset somewhere. The server always pops right back online after applying changes to set the type back to IP Alias.

I've also switched back and forth on giving the server's internal static IP a static ARP table entry, in case the issue is related to the ARP table entry expiring. The server still fell offline when it was set to static ARP.

I swapped it back to static ARP just now after making another check of the logs.

viragomann

@mcrawford
To rule out outside issues, take the affected public IP and try to ping it from outside when outbound traffic from the device fails.

Before that, ensure that ICMP is not forwarded and that there is a firewall rule in place which allows it to "This firewall".

If you don't get responses, run a packet capture on pfSense and check if the packets even arrive on WAN as expected.
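Since the outage shows up at unpredictable times, the outside ping test can be left running from a machine on another network. The following is a rough Python sketch of such a watcher, not a finished tool. It assumes a Linux host with the iputils `ping` command (where `-W` is a timeout in seconds; other platforms interpret the flag differently), and the names `ping_once` and `watch` are invented for this example.

```python
import subprocess
import time


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo gets a reply.

    ASSUMPTION: Linux iputils `ping`, where -c is the count and -W the
    reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(host, probe=ping_once, checks=10, interval_s=60,
          clock=time.time, sleep=time.sleep):
    """Probe `host` repeatedly; return (timestamp, state) per up/down change."""
    transitions = []
    last_state = None
    for i in range(checks):
        is_up = probe(host)
        if is_up != last_state:
            # Record only changes, so the result is a short outage timeline.
            transitions.append((clock(), "up" if is_up else "down"))
            last_state = is_up
        if i < checks - 1:
            sleep(interval_s)
    return transitions


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-range placeholder for the public IP.
    for stamp, state in watch("203.0.113.10", checks=3, interval_s=5):
        print(time.strftime("%H:%M:%S", time.localtime(stamp)), state)
```

The recorded "down" timestamps tell you which time window to look at in a packet capture on WAN.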

mcrawford

@viragomann thank you for that suggestion.

I'll try that the next time the system falls offline after we've closed for the day, when it can stay offline for a few minutes to allow more in-depth testing.

mcrawford

Finally found the source of the issue. Something is wrong on the ISP's end with that static IP, and it messes up the ARP for that IP. Changing the type back and forth was resetting the ARP record, but the problem on the ISP end would inevitably corrupt the record again.
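For anyone wondering why toggling the Virtual IP type helped at all: one plausible explanation (an assumption on my part, not something confirmed in this thread) is that re-adding the address as an IP Alias makes the firewall announce it with a gratuitous ARP, which briefly overwrites whatever wrong mapping upstream devices hold. The Python sketch below only builds the bytes of such a frame to show what it contains; it does not send anything. The MAC and IP are placeholders, and `gratuitous_arp_frame` is a name made up for this example.

```python
import socket
import struct


def gratuitous_arp_frame(mac, ip):
    """Build a broadcast Ethernet frame carrying a gratuitous ARP request.

    In a gratuitous ARP the sender and target protocol addresses are the
    same IP, so every listener updates its IP-to-MAC mapping for that IP.
    """
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    eth_header = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,              # hardware type: Ethernet
        0x0800,         # protocol type: IPv4
        6,              # hardware address length
        4,              # protocol address length
        1,              # opcode: request
        mac_bytes,      # sender hardware address
        ip_bytes,       # sender protocol address
        b"\x00" * 6,    # target hardware address (unused here)
        ip_bytes,       # target protocol address == sender's IP
    )
    return eth_header + arp_payload


# Placeholder values: locally administered MAC, documentation-range IP.
frame = gratuitous_arp_frame("02:00:00:00:00:01", "203.0.113.10")
print(len(frame), frame.hex())
```

If some other device keeps answering for the same IP, any such announcement only wins until that device speaks again, which would fit the 10-45 minute pattern described earlier.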

My boss called in a favor from one of his friends who is a network admin with much more experience (20+ years) than I have. After he confirmed that everything was configured exactly the way it needed to be and we went through every troubleshooting step we could think of, we finally just swapped out the external static IP for a new one and updated the domain records to point to the new IP. Problem solved.

Thanks for your time and help with trying to sort this out @viragomann, I very much appreciate it. Just wanted to let you know how everything turned out.

viragomann

@mcrawford
Thanks for coming back with details about the issue.

mcrawford

Of course. Wouldn't feel right just leaving it hanging after you tried to help out.

The ISP just got back to me a short while ago. One of the support techs had set up a reflector on the IP while troubleshooting and forgot to disable it.

If anyone runs into an issue like this in the future and finds this thread, have the ISP check to make sure a tech hasn't messed with stuff like that.