"tun_wg0: loop detected" following WireGuard gateway 100% loss and WAN interface rc.newwanip
-
Hmm, so you see this when the WAN gateway goes down causing the WG tunnel to fail?
-
I do not see WAN gateway going down. When problem occurs WAN gateway have 0% loss when WG gateway at same time 100%. I also tried to disable WAN gateway monitoring, but logs and problem is still the same.
I do not know if its related or not (maybe not), but today some LAN clients were totally offline (could not even ping to pfSense box). All had IP's assigned so that was not the case. I pin down problem that they all was behind same dummy switch. I restarted that switch and everything worked like 2 minutes and then goes down. I removed that switch and everything back to normal. After that my original problem stayed away about 12 hours which is very long. I was almost celebrating, but then see those log messages again.
I also looked firewall logs and there was also spammed new lines. Have not seen any "let out anything from firewall host itself" before. For some reason some LAN traffic seems to originating from WG interface, which I think is strange. There are DNS and HTTPS destination.
Logs:
-
You would only see that if you have
Log packets matched from the default pass rules put in the ruleset
set in the log settings.
With that set you are seeing log entries for all the traffic from WG clients leaving the WAN. Probably, since I can't see the interface or IPs in the connection. -
Thanks. Do not remember to touch on that settings. I did change "Log Rotation Size" recently. I had same as this one https://forum.netgate.com/topic/169923/tons-sshguard-log-entries-and-its-not-enabled
That anyway did not fix/change my original problem.
-
Yup, that wouldn't be related to the loop issue, just confusing the logs.
This entry implies there is no longer a subnet on igb0 for the gateway yto be added in:
Aug 4 09:42:43 kernel arpresolve: can't allocate llinfo for *SNIP* on igb0
I assume igb0 is the WAN?
If you disable 'State Killing on Gateway Failure' does that stop the loop errors?
Steve
-
Actually those "arpresolve" messages have been non-existing for couple of days. Trying now anyway to disable "State Killing on Gateway Failure" option. Let see how it works. Thanks!
-
Loop messages still continues...
Log (removed some lines and code tags, because it was getting marked as spam):
Aug 8 02:47:23 check_reload_status 487 Reloading filter
Aug 8 02:47:23 check_reload_status 487 Restarting OpenVPN tunnels/interfaces
Aug 8 02:47:23 check_reload_status 487 Restarting IPsec tunnels
Aug 8 02:47:23 check_reload_status 487 updating dyndns WG_VPN_GW
Aug 8 02:47:23 rc.gateway_alarm 25152 >>> Gateway alarm: WG_VPN_GW (Addr:1.1.1.1 Alarm:0 RTT:1.316ms RTTsd:.151ms Loss:12%)
Aug 8 02:45:36 kernel tun_wg0: loop detected
Aug 8 02:45:36 kernel tun_wg0: loop detected
Aug 8 02:45:34 check_reload_status 487 Reloading filter
Aug 8 02:45:34 check_reload_status 487 Restarting OpenVPN tunnels/interfaces
Aug 8 02:45:34 check_reload_status 487 Restarting IPsec tunnels
Aug 8 02:45:34 check_reload_status 487 updating dyndns WG_VPN_GW
Aug 8 02:45:34 rc.gateway_alarm 68371 >>> Gateway alarm: WG_VPN_GW (Addr:1.1.1.1 Alarm:1 RTT:1.597ms RTTsd:1.754ms Loss:21%)
Aug 8 02:45:34 kernel tun_wg0: loop detected
Aug 8 02:45:34 kernel tun_wg0: loop detected
Aug 8 02:45:12 check_reload_status 487 Reloading filter
Aug 8 02:45:12 php-fpm 448 /rc.newwanip: rc.newwanip: on (IP address: SNIP) (interface: WAN1[wan]) (real interface: igb0).
Aug 8 02:45:12 php-fpm 448 /rc.newwanip: rc.newwanip: Info: starting on igb0.
Aug 8 02:45:12 kernel tun_wg0: loop detected
Aug 8 02:45:11 check_reload_status 487 rc.newwanip starting igb0
Aug 8 02:45:11 kernel tun_wg0: loop detected
Aug 8 02:45:11 kernel tun_wg0: loop detected
Aug 8 02:37:26 check_reload_status 487 Reloading filter
-
@mixka You're static route is using your WAN DHCP gateway, should be the WG gateway.
-
No. WG gateway is default gateway and it's need route to out via WAN gateway.
-
@mixka Right, it will. But the peer is on the other end of the tunnel, correct? It should use the WG gateway, then you have the WG gateway set as the upstream gateway in the interface or maybe I'm missing something. You have a lot redacted that doesn't need to be.
-
I'm new with WG (this abbreviation always got me, because WG is WatchGuard for me ) so this setup might be wrong in many places. I applied this guide for my use https://youtu.be/wYe7FzZ_0X8 (not using Mullvad).
What might be right way to use WG gateway as default gateway? Peer endpoint is in public domain and interface+gateway private.
-
Yes, the static route for subnets at the remote end of the tunnel must be the Wireguard gateway.
The public peer end point IP would not normally need a static route. The only reason you might need that is if you change the system default route to be the Wireguard gateway. I would not recommend doing that though.
-
@mixka Use this:
https://youtu.be/2oe7rTMFmqc -
Golden video as Christians videos always are. But I can't see relations to my setup there (other than that WG is used to make tunnel). Video anyway explains routes and NAT's nicely.
So, setup is somewhat right, but not recommended. I see, but goal is to use WG gateway as default route, so recommended or not this is the way.
Anything else to test out?
I also found when started to look other logs, that ntpd have log lines just after first "loop detected" message is timestamped. Is it root cause or just following system melt down?
Log:
Aug 8 20:16:14 check_reload_status 487 Reloading filter Aug 8 20:14:26 kernel tun_wg0: loop detected Aug 8 20:14:05 ntpd 1691 Listen normally on 25 igb0 80.221.147.161:123 Aug 8 20:14:02 ntpd 1691 Deleting interface #24 igb0, 80.221.147.161#123, interface stats: received=0, sent=0, dropped=0, active_time=14804 secs Aug 8 20:14:01 kernel tun_wg0: loop detected Aug 8 16:09:28 check_reload_status 487 Reloading filter Aug 8 16:07:42 kernel tun_wg0: loop detected Aug 8 16:07:18 ntpd 1691 Listen normally on 24 igb0 80.221.147.161:123 Aug 8 16:07:16 ntpd 1691 Deleting interface #23 igb0, 80.221.147.161#123, interface stats: received=0, sent=0, dropped=0, active_time=14756 secs Aug 8 16:07:15 kernel tun_wg0: loop detected Aug 8 12:03:30 check_reload_status 487 Reloading filter Aug 8 12:01:44 kernel tun_wg0: loop detected Aug 8 12:01:20 ntpd 1691 Listen normally on 23 igb0 80.221.147.161:123 Aug 8 12:01:18 ntpd 1691 Deleting interface #22 igb0, 80.221.147.161#123, interface stats: received=0, sent=0, dropped=0, active_time=1782 secs Aug 8 12:01:17 kernel tun_wg0: loop detected
-
Mmm, it's unclear where the loop is being created there but I'd have to guess it's because the system default route is switched back to the WAN when the WG gateway goes down and you have something else routing traffic back.
Do you have the default gateway set specifically to the WG gateway or to a group? -
@mixka Did you fix the WG config? That's what the video was for.
From what I can see, you need to remove the upstream gateway on the interface and correct the static route. -
Specifically to WG gateway.
I found out that video really helpful, but not on my case. I do not want to remove upstream gateway, because traffic out of tunnel needs to be NAT'd. Or does NAT rules still operate even upstream gateway is disabled? What does it help in that case? That video did not mention. And what comes to static route, I think there is no things to correct. In video WAN gateway was used and I need to use WG as my gateway.
-
If you have a gateway on the interface automatic outbound NAT will be applied but you can still add manual rules to an interface that has no gateway.
You need a gateway there to use as the default route or to add policy routing for but it doesn't have to be on the interface itself. However I wouldn't expect it to make any difference.Steve
-
It's been long days trying to solve this. I even deleted all firewall rules and redo only basic setup.
I have for example tried these:
- Disable NTP: No affect
- Disable WG interface upstream gateway: Interface errors out statistics goes down by ~90%
- Disable Negate rules: No affect
- Disable gateways monitoring: After that it did not recover from loop
- Started firewall rules from scrach with basic setup: Maybe loop occurances are lower, or not...
- Put "supersede interface-mtu 0" to WAN interface DHCP option modifiers: Definately made loop occurance lower, or maybe just coincidence
Now I did get 15 hours before problem occured after reboot.
Should I try to play with WAN interface DHCP protocol timing or other options? Connection is optical fiber 100/100M.
I also found this Reddit post https://reddit.com/r/PFSENSE/comments/ln14gu/arpresolve_cant_allocate_llinfo_ip_crippling_wan
Should I try dumb switch before WAN interface and what difference it might doing? I do not think this is tagged traffic anyway.I just did speedtest with different services and my connection speed is low like 8/30M. I have run speedtest often and haven't got problems before. I think only thing what I have changed after last time is that DHCP option modifier.
During speedtest:
I have had traffic shaper, but disabling it won't help any. I do not know if this is related in anyway, but is this right way to reduce buffer bloating? It is (was) working really well for me. https://forum.netgate.com/topic/112527/playing-with-fq_codel-in-2-4/775
And I have this spam for 1000 lines per minute:
(51820 is WG peer endpoint port and 43279 is WG tunnel listening port)Aug 14 14:06:00 WAN1 Default deny rule IPv4 (1000000103) *WG_Endpoint_IP*:51820 *WAN_Interface_IP*:43279 UDP Aug 14 14:06:00 WAN1 Default deny rule IPv4 (1000000103) *WG_Endpoint_IP*:51820 *WAN_Interface_IP*:43279 UDP Aug 14 14:06:00 WAN1 Default deny rule IPv4 (1000000103) *WG_Endpoint_IP*:51820 *WAN_Interface_IP*:43279 UDP
Should I allow this traffic or is there misconfiguration?
I have now used way over 50 hours to trying to solve this. I am even thinking to clear install whole box. I do not know what to do. I have tried everything and problems just pile up.
Loop occurance:
Aug 14 10:18:34 check_reload_status 484 Reloading filter Aug 14 10:18:34 check_reload_status 484 Restarting OpenVPN tunnels/interfaces Aug 14 10:18:34 check_reload_status 484 Restarting IPsec tunnels Aug 14 10:18:34 check_reload_status 484 updating dyndns *WG_GW* Aug 14 10:18:34 rc.gateway_alarm 17746 >>> Gateway alarm: *WG_GW* (Addr:8.8.8.8 Alarm:0 RTT:7.390ms RTTsd:36.380ms Loss:11%) Aug 14 10:18:25 php 14097 [pfBlockerNG] filterlog daemon started Aug 14 10:18:25 tail_pfb 13938 [pfBlockerNG] Firewall Filter Service started Aug 14 10:18:25 lighttpd_pfb 12699 [pfBlockerNG] DNSBL Webserver started Aug 14 10:18:25 php_pfb 10645 [pfBlockerNG] filterlog daemon stopped Aug 14 10:18:25 lighttpd_pfb 10375 [pfBlockerNG] DNSBL Webserver stopped Aug 14 10:18:25 tail_pfb 10461 [pfBlockerNG] Firewall Filter Service stopped Aug 14 10:18:24 php-fpm 30665 /rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] Applied package default settings as necessary. Aug 14 10:18:24 php-fpm 30665 /rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] Installed Unbound ACL group (WireGuard). Aug 14 10:18:24 php-fpm 30665 /rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] De-installed Unbound ACL group (WireGuard). Aug 14 10:18:24 php-fpm 30665 /rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] Installed interface group (WireGuard). Aug 14 10:18:24 php-fpm 30665 /rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] De-installed interface group (WireGuard). Aug 14 10:18:24 check_reload_status 484 Syncing firewall Aug 14 10:18:24 php-fpm 30665 /rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] Installed earlyshellcmd(s). Aug 14 10:18:24 check_reload_status 484 Syncing firewall Aug 14 10:18:24 php-fpm 30665 /rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] De-installed earlyshellcmd(s). Aug 14 10:18:24 php 83866 haproxy: started new pid:87304 Aug 14 10:18:24 php 83866 haproxy: reload old pid:42610 Aug 14 10:18:23 php-fpm 30665 /rc.start_packages: Restarting/Starting all packages. Aug 14 10:18:22 check_reload_status 484 Starting packages Aug 14 10:18:22 php-fpm 10714 /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 0.0.0.0 -> *WAN_Interface_IP* - Restarting packages. Aug 14 10:18:20 php-fpm 10714 /rc.newwanip: Creating rrd update script Aug 14 10:18:20 php-fpm 10714 /rc.newwanip: Resyncing OpenVPN instances for interface WAN1. Aug 14 10:18:15 php-fpm 10714 /rc.newwanip: IP Address has changed, killing all states (ip_change_kill_states is set). Aug 14 10:18:15 check_reload_status 484 Reloading filter Aug 14 10:18:15 check_reload_status 484 Restarting OpenVPN tunnels/interfaces Aug 14 10:18:15 check_reload_status 484 Restarting IPsec tunnels Aug 14 10:18:15 check_reload_status 484 updating dyndns *WG_GW* Aug 14 10:18:15 rc.gateway_alarm 70659 >>> Gateway alarm: *WG_GW* (Addr:8.8.8.8 Alarm:1 RTT:0.000ms RTTsd:0.000ms Loss:100%) Aug 14 10:18:13 php-fpm 10714 /rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through *WG_GW_IP* Aug 14 10:18:13 php-fpm 10714 /rc.newwanip: Removing static route for monitor 8.8.4.4 and adding a new route through *WAN_GW_IP* Aug 14 10:18:11 php-fpm 10714 /rc.newwanip: rc.newwanip: on (IP address: *WAN_Interface_IP*) (interface: WAN1[wan]) (real interface: igb0). Aug 14 10:18:11 php-fpm 10714 /rc.newwanip: rc.newwanip: Info: starting on igb0. Aug 14 10:18:10 check_reload_status 484 rc.newwanip starting igb0 Aug 14 10:18:10 kernel tun_wg0: loop detected Aug 14 10:18:10 kernel tun_wg0: loop detected Aug 14 10:18:10 kernel tun_wg0: loop detected Aug 14 10:18:10 kernel arpresolve: can't allocate llinfo for *WAN_GW_IP* on igb0 Aug 14 10:18:09 kernel tun_wg0: loop detected Aug 14 10:18:09 kernel tun_wg0: loop detected Aug 14 10:18:09 kernel arpresolve: can't allocate llinfo for *WAN_GW_IP* on igb0 Aug 14 10:18:09 kernel tun_wg0: loop detected Aug 14 10:18:09 kernel tun_wg0: loop detected Aug 14 10:18:09 kernel tun_wg0: loop detected Aug 14 10:18:09 kernel tun_wg0: loop detected Aug 14 10:18:09 kernel arpresolve: can't allocate llinfo for *WAN_GW_IP* on igb0 Aug 14 10:18:08 kernel tun_wg0: loop detected Aug 14 10:18:08 kernel arpresolve: can't allocate llinfo for *WAN_GW_IP* on igb0 Aug 14 10:18:34 dpinger 65033 *WG_GW* 8.8.8.8: Clear latency 7390us stddev 36380us loss 11% Aug 14 10:18:15 dpinger 65033 *WG_GW* 8.8.8.8: Alarm latency 0us stddev 0us loss 100% Aug 14 10:18:13 dpinger 65033 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 8.8.8.8 bind_addr *WG_Interface_IP* identifier "*WG_GW*" Aug 14 10:18:13 dpinger 64735 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 8.8.4.4 bind_addr *WAN_Interface_IP* identifier "WAN_DHCP"
-
@mixka said in "tun_wg0: loop detected" following WireGuard gateway 100% loss and WAN interface rc.newwanip:
Should I try dumb switch before WAN interface and what difference it might doing?
That's unlikely to help here because, AFAIK, you are not seeing igb0 actually lose link which is what that would prevent.
It would still hit whatever is causing it to see a new WAN IP triggering the reload of the interface and everything on it.Reviewing the thread though, this seems odd:
@mixka said in "tun_wg0: loop detected" following WireGuard gateway 100% loss and WAN interface rc.newwanip:I pin down problem that they all was behind same dummy switch. I restarted that switch and everything worked like 2 minutes and then goes down. I removed that switch and everything back to normal.
Could it be that switch has some loop detection and was shutting down one or more ports to prevent it?
If so that implies whatever is creating the loop ends up sending traffic to the LAN side incorrectly.Try checking the routing tables when it's working and compare that to the failed state.
Steve