All NAT routing stops until reboot
-
Hello guys, I have been having serious problems with all NAT routing since I added a second VPN gateway.
I am running 2.4.4 (with ZFS) on an HP T620 Plus thin client with 4GB of RAM and an HP Intel PRO 1000 PT four port card. I have a single WAN connection. I also run pfblocker and ntopng.
I have manual NAT settings and I am running two VPN gateways (two different VPN providers) plus two VPN site to site tunnels to other locations.
Since I added the second VPN gateway I immediately observed a problem. I have multiple VLANs and each VLAN has either the VPN1 or the VPN2 gateway set in its Firewall rules. Let's say VLAN1 uses VPN1 and VLAN2 uses VPN2. Occasionally and for no apparent reason (apart from maybe the VPN1 tunnel reconnecting) VLAN1 clients would start using VPN2 to connect to the internet, without any failover or load balancing set. I removed the VPN2 NAT entries for VLAN1 and vice versa and the problem resolved itself.
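For anyone with a similar setup, here is a minimal sketch of the check described above: list the manual NAT entries that let a VLAN leave through a gateway other than the one its firewall rules point at. It does not read a real pfSense configuration; `vlan_gateway`, `nat_entries` and `cross_nat_entries` are hypothetical names, and the data is just the VLAN1/VPN1, VLAN2/VPN2 example from this post.

```python
# Hypothetical model of the setup described in the post (not read from pfSense).
# Which gateway each VLAN's firewall rules point at:
vlan_gateway = {"VLAN1": "VPN1", "VLAN2": "VPN2"}

# Manual outbound NAT entries as (source VLAN, outbound interface) pairs:
nat_entries = [
    ("VLAN1", "VPN1"),
    ("VLAN1", "VPN2"),  # lets VLAN1 traffic be translated out of VPN2
    ("VLAN2", "VPN2"),
    ("VLAN2", "VPN1"),  # lets VLAN2 traffic be translated out of VPN1
]


def cross_nat_entries(vlan_gateway, nat_entries):
    """Return NAT entries whose interface is not the VLAN's assigned gateway."""
    return [
        (vlan, iface)
        for vlan, iface in nat_entries
        if vlan in vlan_gateway and iface != vlan_gateway[vlan]
    ]


print(cross_nat_entries(vlan_gateway, nat_entries))
```

The entries it reports are the ones I removed by hand; without them, traffic that ends up on the wrong tunnel has no NAT rule to translate it.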
I now sometimes get absolutely no NAT on any interface for no apparent reason. The firewall itself can ping e.g. 8.8.8.8 from all of its gateways, but none of the clients can ping or access the internet. Restarting the tunnels, resetting all firewall states and rebooting the clients does not help. The only remedy is a full firewall reboot. Running top over ssh, I saw that this happened after a brief period (maybe 30 secs) of 100% CPU usage by php, but that might be irrelevant.
Logs do not seem to provide anything useful. Any ideas?
-
Do you have Log Settings set so that DEFAULT Deny AND Allow are logged?
If yes, I suspect you are going to have to post your configurations/rules to get help with this.
-
@HansSolo No, I only have Block rules set to be logged.
-
Just happened again. Devices using the native WAN interface as a Gateway stay unaffected.
Logs (System --> General) show ntopng crashing:
May 7 17:38:10 kernel pid 15404 (ntopng), uid 0: exited on signal 11 (core dumped)
May 7 17:38:10 kernel igb2: promiscuous mode disabled
May 7 17:38:10 kernel igb3: promiscuous mode disabled
May 7 17:38:32 ntopng [HTTPserver.cpp:924] ERROR: [HTTP] set_ports_option: cannot bind to 3000s: Address already in use
May 7 17:38:32 ntopng [mongoose.c:4584] ERROR: set_ports_option: cannot bind to 3000s: No error: 0
May 7 17:38:32 ntopng [HTTPserver.cpp:1104] ERROR: Unable to start HTTP server (IPv4) on ports 3000s
May 7 17:38:32 ntopng [HTTPserver.cpp:1110] ERROR: Either port in use or another ntopng instance is running (using the same port)
Logs (System --> Gateways):
May 7 17:37:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr REMOVED bind_addr REMOVED identifier "WAN "
May 7 17:37:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr REMOVED bind_addr REMOVED identifier "VPN1 "
May 7 17:37:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr REMOVED bind_addr REMOVED identifier "SITETOSITE1 "
May 7 17:37:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr REMOVED bind_addr REMOVED identifier "SITETOSITE2 "
May 7 17:37:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr REMOVED bind_addr REMOVED identifier "VPN2 "
Bold italics edited by me.
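To compare the dpinger settings across gateways without reading each line by eye, here is a rough sketch that splits one of the gateway log lines above into key/value pairs. It assumes the line layout is exactly as pasted here (timestamp, the word `dpinger`, then alternating keys and values); `parse_dpinger_line` is a name I made up, not part of dpinger or pfSense.

```python
import re

# A key followed by either a quoted value (the identifier) or a bare token.
PAIR_RE = re.compile(r'(\w+) ("[^"]*"|\S+)')


def parse_dpinger_line(line):
    """Split a pasted dpinger log line into a dict of its parameters."""
    # Everything after " dpinger " is assumed to be key/value pairs.
    _, _, rest = line.partition(" dpinger ")
    if not rest:
        return {}
    params = {}
    for key, value in PAIR_RE.findall(rest):
        # Drop the surrounding quotes and the trailing space in the identifier.
        params[key] = value.strip('"').strip()
    return params


sample = (
    'May 7 17:37:55 dpinger send_interval 500ms loss_interval 2000ms '
    'time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms '
    'latency_alarm 500ms loss_alarm 20% dest_addr REMOVED bind_addr REMOVED '
    'identifier "VPN2 "'
)
print(parse_dpinger_line(sample))
```

All five gateways were restarted with identical thresholds at 17:37:55, which is about 15 seconds before the ntopng crash in the system log.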
Edit: ntopng is the problem. Every time I restart a gateway tunnel, ntopng crashes and NAT stops working.
Here is what the ntopng logs are filled with:
[Mutex.cpp:46] WARNING: pthread_mutex_lock() returned 11 [Resource deadlock avoided][errno=0]
RAM had ~1600M free, so the box was not running out of memory. As I said, CPU was at 100% on one of four cores when this happened.
I uninstalled ntopng for now as it was unusable.
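In case it helps someone check whether crashes line up with tunnel restarts, here is a small sketch that pulls process crash events out of system log lines. It assumes the lines look like the kernel message I pasted earlier ("pid ... exited on signal ..."); `find_crashes` is a hypothetical helper, and the sample lines are copied from my log. Signal 11 is a segmentation fault.

```python
import re

# Matches kernel lines such as:
# May 7 17:38:10 kernel pid 15404 (ntopng), uid 0: exited on signal 11 (core dumped)
CRASH_RE = re.compile(
    r'^(?P<when>\w{3}\s+\d+ \d\d:\d\d:\d\d) kernel pid (?P<pid>\d+) '
    r'\((?P<proc>[^)]+)\), uid \d+: exited on signal (?P<signal>\d+)'
)


def find_crashes(lines):
    """Return (timestamp, process name, signal number) for each crash line."""
    crashes = []
    for line in lines:
        match = CRASH_RE.match(line)
        if match:
            crashes.append((match["when"], match["proc"], int(match["signal"])))
    return crashes


sample = [
    "May 7 17:38:10 kernel pid 15404 (ntopng), uid 0: exited on signal 11 (core dumped)",
    "May 7 17:38:10 kernel igb2: promiscuous mode disabled",
    "May 7 17:38:10 kernel igb3: promiscuous mode disabled",
]
print(find_crashes(sample))
```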
Edit 2: Totally not fixed. It seems to happen when I restart VPN2, though I think not every time. The WAN and VPN1 gateways always register as Down in Status --> Gateways even when they are up. ntopng is not the problem!
VPN2 has a NAT port forward rule with its corresponding Firewall rule. I will try disabling that to see if anything changes, investigate more and report back.
Edit 3: Seems to be fixed by selecting System --> Advanced --> Misc --> Reset states on Gateway down. I also had to set the VPN1 Gateway explicitly as the Gateway in the LAN Firewall Rules, as it would still not work with the Gateway left at default. I would like some input from someone on whether this is correct.