Changing NAT Translation Pool Options locks up server

aceadmin

Hey all. First-time poster, so sorry if it's a poor discussion! We had an issue today on a physical Supermicro pfSense box. We're running version 2.2.4, FreeBSD 10.1-RELEASE-p15.

When we changed the Outbound NAT Translation Option for our LAN subnets from 'Round Robin' to 'Round Robin with Sticky', the pfSense box immediately locked up. I could no longer administer the box, ping it, etc. The interfaces stayed up and were active, but traffic stopped flowing through the box. Devices in the LAN subnets went down, and we couldn't administer anything in those subnets. We could not log in on a local Ethernet port we have set up for pfSense management.

We also tried getting into the box with the local VGA/USB console, but that was acting extremely odd. When I would plug in a USB keyboard, the session would indicate it saw the keyboard inserted. However, nothing I typed would show up until I actually disconnected the keyboard. It's as if it was queuing up everything I typed, and then dumping it once I disconnected the USB keyboard. Might not be relevant, but I figure it is worth mentioning.

I looked at the logs on the server, but didn't see anything that stuck out. See below for the relevant logs from after I made the change:

Dec 10 07:59:18 Firewall-Name check_reload_status: Syncing firewall
Dec 10 07:59:26 Firewall-Name check_reload_status: Syncing firewall
Dec 10 07:59:31 Firewall-Name check_reload_status: Syncing firewall
Dec 10 07:59:33 Firewall-Name check_reload_status: Reloading filter
Dec 10 08:00:07 Firewall-Name check_reload_status: updating dyndns WAN_652GW
Dec 10 08:00:07 Firewall-Name check_reload_status: Restarting ipsec tunnels
Dec 10 08:00:07 Firewall-Name check_reload_status: Restarting OpenVPN tunnels/interfaces
Dec 10 08:00:07 Firewall-Name check_reload_status: Reloading filter
Dec 10 08:00:07 Firewall-Name check_reload_status: updating dyndns BVI1500
Dec 10 08:00:07 Firewall-Name check_reload_status: Restarting ipsec tunnels
Dec 10 08:00:07 Firewall-Name check_reload_status: Restarting OpenVPN tunnels/interfaces
Dec 10 08:00:07 Firewall-Name check_reload_status: Reloading filter
Dec 10 08:02:35 Firewall-Name kernel: igb3: link state changed to UP
Dec 10 08:02:35 Firewall-Name check_reload_status: Linkup starting igb3
Dec 10 08:02:37 Firewall-Name php-fpm[14472]: /rc.linkup: Hotplug event detected for LOCAL_MGMT(opt2) but ignoring since interface is configured with static IP (192.168.1.1 )
Dec 10 08:02:37 Firewall-Name check_reload_status: rc.newwanip starting igb3
Dec 10 08:02:38 Firewall-Name php-fpm[14472]: /rc.newwanip: rc.newwanip: Info: starting on igb3.
Dec 10 08:02:38 Firewall-Name php-fpm[14472]: /rc.newwanip: rc.newwanip: on (IP address: 192.168.1.1) (interface: LOCAL_MGMT[opt2]) (real interface: igb3).
Dec 10 08:02:38 Firewall-Name check_reload_status: Reloading filter

We cut to a spare box with identical config, just to get things back up and running. We've since rebooted the troubled box, and everything seems to be working correctly. We've tried reproducing the issue, but haven't been able to at this point.

Looking to see if anyone has seen this before, or has any advice. Possibly a bug?

Let me know if you need additional info or screenshots. Thanks!

Josh

cmb

sticky has had no shortage of problems historically, and isn't very widely used, but I'm not aware of any such issues in current versions. If you find a means of replicating that, please report back.

aceadmin

Thanks for the reply CMB.

Have some new information to add to this. Here's our setup: We have 2 identical Supermicro servers, one in production, one as a backup. No CARP, as we have another issue/bug we are troubleshooting there as well. Currently, the production box is set to "Round Robin" for NAT Translation, and the backup box is set to "Round Robin with Sticky", and both are running fine.

We changed the production box from Round Robin to Round Robin with Sticky. The server was fine for about 30 seconds, and then locked up the exact same way we saw before. All interfaces stayed up, and everything looked fine, but the box could not be administered and no traffic was actually passing. We cut over to the backup (which had been running fine with Sticky while idle), and we saw the exact same thing. Everything worked fine for about 30 seconds, and then poof, the box locked up. We had to reboot the server, disconnect all traffic-bearing interfaces (so it wouldn't immediately lock up again), and revert the config to get things back up and running.

So, it seems that changing the config isn't a problem in itself; the lockup only happens once traffic starts using the new NAT translation option. Has anyone seen anything like this before? Is this a software bug, or does it seem more like a hardware incompatibility?

For those who are curious: we are trying the Sticky option because of possible issues with client devices that have multiple sessions which NAT to multiple public IP addresses. When we statically set those clients to a single NAT IP address, those problems clear up. So, we were hoping that the Sticky option for NAT might alleviate these issues at scale.
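To illustrate what we're hoping Sticky will do for us, here is a rough Python sketch of the difference between the two pool options. This is only a toy model, not how pf implements it: the class names, the helper function, and the 198.51.100.x pool addresses are all made up for the example, and it ignores the expiry of sticky entries.

```python
from itertools import cycle

# Hypothetical public NAT pool (documentation addresses, not our real ones).
POOL = ["198.51.100.1", "198.51.100.2", "198.51.100.3"]


class RoundRobinPool:
    """Toy model: every new connection takes the next pool address."""

    def __init__(self, addresses):
        self._next = cycle(addresses)

    def translate(self, source_ip):
        # The source address is ignored; connections simply rotate.
        return next(self._next)


class StickyRoundRobinPool:
    """Toy model: a source keeps the pool address it was first given."""

    def __init__(self, addresses):
        self._next = cycle(addresses)
        self._sticky = {}  # source IP -> pool address

    def translate(self, source_ip):
        # Only a source we have not seen yet advances the rotation.
        if source_ip not in self._sticky:
            self._sticky[source_ip] = next(self._next)
        return self._sticky[source_ip]


def addresses_seen(pool, source_ip, connections):
    """Return the set of public addresses one client ends up using."""
    return {pool.translate(source_ip) for _ in range(connections)}


if __name__ == "__main__":
    print("round robin:", sorted(addresses_seen(RoundRobinPool(POOL), "10.0.0.5", 6)))
    print("sticky:     ", sorted(addresses_seen(StickyRoundRobinPool(POOL), "10.0.0.5", 6)))
```

With plain Round Robin, a single client's sessions end up spread across the whole pool, which is the situation that seems to upset our client devices. With Sticky, each client should stay on one public address.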

Thanks!

Josh