"tun_wg0: loop detected" following WireGuard gateway 100% loss and WAN interface rc.newwanip

stephenw10

Hmm, so you see this when the WAN gateway goes down causing the WG tunnel to fail?

Mixka

I do not see WAN gateway going down. When problem occurs WAN gateway have 0% loss when WG gateway at same time 100%. I also tried to disable WAN gateway monitoring, but logs and problem is still the same.

I do not know if its related or not (maybe not), but today some LAN clients were totally offline (could not even ping to pfSense box). All had IP's assigned so that was not the case. I pin down problem that they all was behind same dummy switch. I restarted that switch and everything worked like 2 minutes and then goes down. I removed that switch and everything back to normal. After that my original problem stayed away about 12 hours which is very long. I was almost celebrating, but then see those log messages again.

I also looked firewall logs and there was also spammed new lines. Have not seen any "let out anything from firewall host itself" before. For some reason some LAN traffic seems to originating from WG interface, which I think is strange. There are DNS and HTTPS destination.

Logs:

stephenw10

You would only see that if you have Log packets matched from the default pass rules put in the ruleset set in the log settings.
With that set you are seeing log entries for all the traffic from WG clients leaving the WAN. Probably, since I can't see the interface or IPs in the connection.

Mixka

@stephenw10

Thanks. Do not remember to touch on that settings. I did change "Log Rotation Size" recently. I had same as this one https://forum.netgate.com/topic/169923/tons-sshguard-log-entries-and-its-not-enabled

That anyway did not fix/change my original problem.

stephenw10

Yup, that wouldn't be related to the loop issue, just confusing the logs.

This entry implies there is no longer a subnet on igb0 for the gateway yto be added in:

Aug 4 09:42:43 	kernel 		arpresolve: can't allocate llinfo for *SNIP* on igb0

I assume igb0 is the WAN?

If you disable 'State Killing on Gateway Failure' does that stop the loop errors?

Steve

Mixka

@stephenw10

Actually those "arpresolve" messages have been non-existing for couple of days. Trying now anyway to disable "State Killing on Gateway Failure" option. Let see how it works. Thanks!

Mixka

@stephenw10

Loop messages still continues...

Log (removed some lines and code tags, because it was getting marked as spam):

Aug 8 02:47:23 check_reload_status 487 Reloading filter
Aug 8 02:47:23 check_reload_status 487 Restarting OpenVPN tunnels/interfaces
Aug 8 02:47:23 check_reload_status 487 Restarting IPsec tunnels
Aug 8 02:47:23 check_reload_status 487 updating dyndns WG_VPN_GW
Aug 8 02:47:23 rc.gateway_alarm 25152 >>> Gateway alarm: WG_VPN_GW (Addr:1.1.1.1 Alarm:0 RTT:1.316ms RTTsd:.151ms Loss:12%)
Aug 8 02:45:36 kernel tun_wg0: loop detected
Aug 8 02:45:36 kernel tun_wg0: loop detected
Aug 8 02:45:34 check_reload_status 487 Reloading filter
Aug 8 02:45:34 check_reload_status 487 Restarting OpenVPN tunnels/interfaces
Aug 8 02:45:34 check_reload_status 487 Restarting IPsec tunnels
Aug 8 02:45:34 check_reload_status 487 updating dyndns WG_VPN_GW
Aug 8 02:45:34 rc.gateway_alarm 68371 >>> Gateway alarm: WG_VPN_GW (Addr:1.1.1.1 Alarm:1 RTT:1.597ms RTTsd:1.754ms Loss:21%)
Aug 8 02:45:34 kernel tun_wg0: loop detected
Aug 8 02:45:34 kernel tun_wg0: loop detected
Aug 8 02:45:12 check_reload_status 487 Reloading filter
Aug 8 02:45:12 php-fpm 448 /rc.newwanip: rc.newwanip: on (IP address: SNIP) (interface: WAN1[wan]) (real interface: igb0).
Aug 8 02:45:12 php-fpm 448 /rc.newwanip: rc.newwanip: Info: starting on igb0.
Aug 8 02:45:12 kernel tun_wg0: loop detected
Aug 8 02:45:11 check_reload_status 487 rc.newwanip starting igb0
Aug 8 02:45:11 kernel tun_wg0: loop detected
Aug 8 02:45:11 kernel tun_wg0: loop detected
Aug 8 02:37:26 check_reload_status 487 Reloading filter

Jarhead

@mixka You're static route is using your WAN DHCP gateway, should be the WG gateway.

Mixka

@jarhead

No. WG gateway is default gateway and it's need route to out via WAN gateway.

Jarhead

@mixka Right, it will. But the peer is on the other end of the tunnel, correct? It should use the WG gateway, then you have the WG gateway set as the upstream gateway in the interface or maybe I'm missing something. You have a lot redacted that doesn't need to be.

Mixka

@jarhead

I'm new with WG (this abbreviation always got me, because WG is WatchGuard for me ) so this setup might be wrong in many places. I applied this guide for my use https://youtu.be/wYe7FzZ_0X8 (not using Mullvad).

What might be right way to use WG gateway as default gateway? Peer endpoint is in public domain and interface+gateway private.

stephenw10

Yes, the static route for subnets at the remote end of the tunnel must be the Wireguard gateway.

The public peer end point IP would not normally need a static route. The only reason you might need that is if you change the system default route to be the Wireguard gateway. I would not recommend doing that though.

Jarhead

@mixka Use this:
https://youtu.be/2oe7rTMFmqc

Mixka

@jarhead

Golden video as Christians videos always are. But I can't see relations to my setup there (other than that WG is used to make tunnel). Video anyway explains routes and NAT's nicely.

@stephenw10

So, setup is somewhat right, but not recommended. I see, but goal is to use WG gateway as default route, so recommended or not this is the way.

Anything else to test out?

I also found when started to look other logs, that ntpd have log lines just after first "loop detected" message is timestamped. Is it root cause or just following system melt down?

Log:

Aug 8 20:16:14 	check_reload_status 	487 	Reloading filter 
Aug 8 20:14:26 	kernel 		tun_wg0: loop detected 
Aug 8 20:14:05 	ntpd 	1691 	Listen normally on 25 igb0 80.221.147.161:123
Aug 8 20:14:02 	ntpd 	1691 	Deleting interface #24 igb0, 80.221.147.161#123, interface stats: received=0, sent=0, dropped=0, active_time=14804 secs
Aug 8 20:14:01 	kernel 		tun_wg0: loop detected 

Aug 8 16:09:28 	check_reload_status 	487 	Reloading filter
Aug 8 16:07:42 	kernel 		tun_wg0: loop detected 
Aug 8 16:07:18 	ntpd 	1691 	Listen normally on 24 igb0 80.221.147.161:123
Aug 8 16:07:16 	ntpd 	1691 	Deleting interface #23 igb0, 80.221.147.161#123, interface stats: received=0, sent=0, dropped=0, active_time=14756 secs
Aug 8 16:07:15 	kernel 		tun_wg0: loop detected

Aug 8 12:03:30 	check_reload_status 	487 	Reloading filter
Aug 8 12:01:44 	kernel 		tun_wg0: loop detected
Aug 8 12:01:20 	ntpd 	1691 	Listen normally on 23 igb0 80.221.147.161:123
Aug 8 12:01:18 	ntpd 	1691 	Deleting interface #22 igb0, 80.221.147.161#123, interface stats: received=0, sent=0, dropped=0, active_time=1782 secs 
Aug 8 12:01:17 	kernel 		tun_wg0: loop detected

stephenw10

Mmm, it's unclear where the loop is being created there but I'd have to guess it's because the system default route is switched back to the WAN when the WG gateway goes down and you have something else routing traffic back.
Do you have the default gateway set specifically to the WG gateway or to a group?

Jarhead

@mixka Did you fix the WG config? That's what the video was for.
From what I can see, you need to remove the upstream gateway on the interface and correct the static route.

Mixka

@stephenw10

Specifically to WG gateway.

@Jarhead

I found out that video really helpful, but not on my case. I do not want to remove upstream gateway, because traffic out of tunnel needs to be NAT'd. Or does NAT rules still operate even upstream gateway is disabled? What does it help in that case? That video did not mention. And what comes to static route, I think there is no things to correct. In video WAN gateway was used and I need to use WG as my gateway.

stephenw10

If you have a gateway on the interface automatic outbound NAT will be applied but you can still add manual rules to an interface that has no gateway.
You need a gateway there to use as the default route or to add policy routing for but it doesn't have to be on the interface itself. However I wouldn't expect it to make any difference.

Steve

Mixka

It's been long days trying to solve this. I even deleted all firewall rules and redo only basic setup.

I have for example tried these:

Disable NTP: No affect
Disable WG interface upstream gateway: Interface errors out statistics goes down by ~90%
Disable Negate rules: No affect
Disable gateways monitoring: After that it did not recover from loop
Started firewall rules from scrach with basic setup: Maybe loop occurances are lower, or not...
Put "supersede interface-mtu 0" to WAN interface DHCP option modifiers: Definately made loop occurance lower, or maybe just coincidence

Now I did get 15 hours before problem occured after reboot.

Should I try to play with WAN interface DHCP protocol timing or other options? Connection is optical fiber 100/100M.

I also found this Reddit post https://reddit.com/r/PFSENSE/comments/ln14gu/arpresolve_cant_allocate_llinfo_ip_crippling_wan
Should I try dumb switch before WAN interface and what difference it might doing? I do not think this is tagged traffic anyway.

I just did speedtest with different services and my connection speed is low like 8/30M. I have run speedtest often and haven't got problems before. I think only thing what I have changed after last time is that DHCP option modifier.

During speedtest:

I have had traffic shaper, but disabling it won't help any. I do not know if this is related in anyway, but is this right way to reduce buffer bloating? It is (was) working really well for me. https://forum.netgate.com/topic/112527/playing-with-fq_codel-in-2-4/775

And I have this spam for 1000 lines per minute:
(51820 is WG peer endpoint port and 43279 is WG tunnel listening port)

Aug 14 14:06:00 	WAN1 	Default deny rule IPv4 (1000000103) 	*WG_Endpoint_IP*:51820		*WAN_Interface_IP*:43279		UDP
Aug 14 14:06:00 	WAN1 	Default deny rule IPv4 (1000000103) 	*WG_Endpoint_IP*:51820		*WAN_Interface_IP*:43279		UDP
Aug 14 14:06:00 	WAN1 	Default deny rule IPv4 (1000000103) 	*WG_Endpoint_IP*:51820		*WAN_Interface_IP*:43279		UDP

Should I allow this traffic or is there misconfiguration?

I have now used way over 50 hours to trying to solve this. I am even thinking to clear install whole box. I do not know what to do. I have tried everything and problems just pile up.

Loop occurance:

Aug 14 10:18:34 	check_reload_status 	484 	Reloading filter
Aug 14 10:18:34 	check_reload_status 	484 	Restarting OpenVPN tunnels/interfaces
Aug 14 10:18:34 	check_reload_status 	484 	Restarting IPsec tunnels
Aug 14 10:18:34 	check_reload_status 	484 	updating dyndns *WG_GW*
Aug 14 10:18:34 	rc.gateway_alarm 	17746 	>>> Gateway alarm: *WG_GW* (Addr:8.8.8.8 Alarm:0 RTT:7.390ms RTTsd:36.380ms Loss:11%)
Aug 14 10:18:25 	php 	14097 	[pfBlockerNG] filterlog daemon started
Aug 14 10:18:25 	tail_pfb 	13938 	[pfBlockerNG] Firewall Filter Service started
Aug 14 10:18:25 	lighttpd_pfb 	12699 	[pfBlockerNG] DNSBL Webserver started
Aug 14 10:18:25 	php_pfb 	10645 	[pfBlockerNG] filterlog daemon stopped
Aug 14 10:18:25 	lighttpd_pfb 	10375 	[pfBlockerNG] DNSBL Webserver stopped
Aug 14 10:18:25 	tail_pfb 	10461 	[pfBlockerNG] Firewall Filter Service stopped
Aug 14 10:18:24 	php-fpm 	30665 	/rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] Applied package default settings as necessary.
Aug 14 10:18:24 	php-fpm 	30665 	/rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] Installed Unbound ACL group (WireGuard).
Aug 14 10:18:24 	php-fpm 	30665 	/rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] De-installed Unbound ACL group (WireGuard).
Aug 14 10:18:24 	php-fpm 	30665 	/rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] Installed interface group (WireGuard).
Aug 14 10:18:24 	php-fpm 	30665 	/rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] De-installed interface group (WireGuard).
Aug 14 10:18:24 	check_reload_status 	484 	Syncing firewall
Aug 14 10:18:24 	php-fpm 	30665 	/rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] Installed earlyshellcmd(s).
Aug 14 10:18:24 	check_reload_status 	484 	Syncing firewall
Aug 14 10:18:24 	php-fpm 	30665 	/rc.start_packages: Configuration Change: (system): [pfSense-pkg-WireGuard] De-installed earlyshellcmd(s).
Aug 14 10:18:24 	php 	83866 	haproxy: started new pid:87304
Aug 14 10:18:24 	php 	83866 	haproxy: reload old pid:42610
Aug 14 10:18:23 	php-fpm 	30665 	/rc.start_packages: Restarting/Starting all packages.
Aug 14 10:18:22 	check_reload_status 	484 	Starting packages
Aug 14 10:18:22 	php-fpm 	10714 	/rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 0.0.0.0 -> *WAN_Interface_IP* - Restarting packages.
Aug 14 10:18:20 	php-fpm 	10714 	/rc.newwanip: Creating rrd update script
Aug 14 10:18:20 	php-fpm 	10714 	/rc.newwanip: Resyncing OpenVPN instances for interface WAN1.
Aug 14 10:18:15 	php-fpm 	10714 	/rc.newwanip: IP Address has changed, killing all states (ip_change_kill_states is set).
Aug 14 10:18:15 	check_reload_status 	484 	Reloading filter
Aug 14 10:18:15 	check_reload_status 	484 	Restarting OpenVPN tunnels/interfaces
Aug 14 10:18:15 	check_reload_status 	484 	Restarting IPsec tunnels
Aug 14 10:18:15 	check_reload_status 	484 	updating dyndns *WG_GW*
Aug 14 10:18:15 	rc.gateway_alarm 	70659 	>>> Gateway alarm: *WG_GW* (Addr:8.8.8.8 Alarm:1 RTT:0.000ms RTTsd:0.000ms Loss:100%)
Aug 14 10:18:13 	php-fpm 	10714 	/rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through *WG_GW_IP*
Aug 14 10:18:13 	php-fpm 	10714 	/rc.newwanip: Removing static route for monitor 8.8.4.4 and adding a new route through *WAN_GW_IP*
Aug 14 10:18:11 	php-fpm 	10714 	/rc.newwanip: rc.newwanip: on (IP address: *WAN_Interface_IP*) (interface: WAN1[wan]) (real interface: igb0).
Aug 14 10:18:11 	php-fpm 	10714 	/rc.newwanip: rc.newwanip: Info: starting on igb0.
Aug 14 10:18:10 	check_reload_status 	484 	rc.newwanip starting igb0
Aug 14 10:18:10 	kernel 		tun_wg0: loop detected
Aug 14 10:18:10 	kernel 		tun_wg0: loop detected
Aug 14 10:18:10 	kernel 		tun_wg0: loop detected
Aug 14 10:18:10 	kernel 		arpresolve: can't allocate llinfo for *WAN_GW_IP* on igb0
Aug 14 10:18:09 	kernel 		tun_wg0: loop detected
Aug 14 10:18:09 	kernel 		tun_wg0: loop detected
Aug 14 10:18:09 	kernel 		arpresolve: can't allocate llinfo for *WAN_GW_IP* on igb0
Aug 14 10:18:09 	kernel 		tun_wg0: loop detected
Aug 14 10:18:09 	kernel 		tun_wg0: loop detected
Aug 14 10:18:09 	kernel 		tun_wg0: loop detected
Aug 14 10:18:09 	kernel 		tun_wg0: loop detected
Aug 14 10:18:09 	kernel 		arpresolve: can't allocate llinfo for *WAN_GW_IP* on igb0
Aug 14 10:18:08 	kernel 		tun_wg0: loop detected
Aug 14 10:18:08 	kernel 		arpresolve: can't allocate llinfo for *WAN_GW_IP* on igb0

Aug 14 10:18:34 	dpinger 	65033 	*WG_GW* 8.8.8.8: Clear latency 7390us stddev 36380us loss 11%
Aug 14 10:18:15 	dpinger 	65033 	*WG_GW* 8.8.8.8: Alarm latency 0us stddev 0us loss 100%
Aug 14 10:18:13 	dpinger 	65033 	send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 8.8.8.8 bind_addr *WG_Interface_IP* identifier "*WG_GW*"
Aug 14 10:18:13 	dpinger 	64735 	send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 8.8.4.4 bind_addr *WAN_Interface_IP* identifier "WAN_DHCP"

stephenw10

@mixka said in "tun_wg0: loop detected" following WireGuard gateway 100% loss and WAN interface rc.newwanip:

Should I try dumb switch before WAN interface and what difference it might doing?

That's unlikely to help here because, AFAIK, you are not seeing igb0 actually lose link which is what that would prevent.
It would still hit whatever is causing it to see a new WAN IP triggering the reload of the interface and everything on it.

Reviewing the thread though, this seems odd:
@mixka said in "tun_wg0: loop detected" following WireGuard gateway 100% loss and WAN interface rc.newwanip:

I pin down problem that they all was behind same dummy switch. I restarted that switch and everything worked like 2 minutes and then goes down. I removed that switch and everything back to normal.

Could it be that switch has some loop detection and was shutting down one or more ports to prevent it?
If so that implies whatever is creating the loop ends up sending traffic to the LAN side incorrectly.

Try checking the routing tables when it's working and compare that to the failed state.

Steve