Netgate 2100 ARP problem after replugging WAN port

somerino

It's set as default. I've a gateway WAN_DHCP and an OpenVPN gateway to somewhere else.
Thanks for this tip, I'll set it to auto, to see if this resolves the issue.

I'll create a PCAP of the 2100 LAN when it's in the failed state.

stephenw10

You need to set the default IPv4 gateway to 'WAN_DHCP', not auto (which is the default setting).

It's usually not an issue for OpenVPN gateways since they go down at the same time as the WAN. But definitely worth making that change anyway.

Steve

somerino

@stephenw10
Oh glad that you just said it...
I've set it to auto and the default was set to a VPN tunnel...

stephenw10

Ah, in that case you may have set that specifically as part of the VPN setup to force all traffic over the VPN?

Usually that's OK. This could be unrelated to whatever's causing the apparent ARP issue.

Steve

mikedob

@somerino OK, network storms sometimes take time to build. The reason I went after the Sonos is that they can bridge one network to a different network over their wireless. It's best to make sure all the Sonos equipment is on the same network and using an SSID you have control of. Also, some Sonos equipment has multiple ports; they are small 2-port switches, and I advise only using one.

I've taken a second look at the packets you posted and see multiple devices asking for the MAC address of a device, but the first part looks like it's looped, with a response from the device in question. Then other devices are asking and getting no response.
Firewall rules?

somerino

@stephenw10
I've run a PCAP on a self-hosted pfsense that had the same error state.
IntelCor_36:fb:39 (Pfsense)
Ubiquiti_d2:1a:a4 (UniFi Switch directly connected to the pfsense)

What I've noticed is that the issue I've mentioned above isn't only related to a device from one vendor.

somerino

@mikedob

I don't have any Sonos in this case. But in my other post about Sonos, I saw on my UniFi switch that a Sonos port was declared as a downlink to my core switch, which is absolute bs. That caused a loop.

stephenw10

Mmm, this starts to look like the switch doing something odd. Some loop/storm prevention setting maybe?
You can see pfSense is responding to every ARP request it sees but it appears the requester is not seeing that response. If that's the directly connected switch it's hard to see how it could be failing to see it....
The only other possibility is the driver/NIC doing something odd to the packet before it physically leaves. But you're seeing that across two different NIC types on two different architectures.

Steve

somerino

@stephenw10

Steve, I've simplified the network and there's only one switch connected to the pfSense. Zero possibility for a loop. Storm control is also disabled.
The pfSense is still spamming ARP requests; it never receives an answer, because the device with the IP 192.168.70.4 is offline.

Is this a normal behaviour?

Screenshot 2022-07-27 155552.png

stephenw10

@somerino said in Netgate 2100 ARP problem after replugging WAN port:

The pfSense is still spamming ARP requests; it never receives an answer, because the device with the IP 192.168.70.4 is offline.

Yes, that's fine. If you have something referencing it in the config and pfSense is trying to send traffic to it, an internal DNS server for example, it will keep ARPing for it until it responds.
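
If you want to confirm that from a capture rather than by eye, a rough sketch like this lists which ARP targets were asked for but never answered. It assumes the capture has already been reduced to (operation, sender IP, target IP) tuples. The function name `unanswered_arp_targets`, the tuple layout and the sample packets (including 192.168.70.1 standing in for pfSense) are made up for illustration; only 192.168.70.4 comes from this thread.

```python
from collections import Counter


def unanswered_arp_targets(packets):
    """Return {target_ip: request_count} for who-has requests with no reply.

    `packets` is an iterable of (op, sender_ip, target_ip) tuples, where op
    is "request" or "reply". This layout is an assumption for the sketch.
    """
    requests = Counter()
    answered = set()
    for op, sender_ip, target_ip in packets:
        if op == "request":
            requests[target_ip] += 1
        elif op == "reply":
            # In an ARP reply, the sender is the host that was asked for.
            answered.add(sender_ip)
    return {ip: count for ip, count in requests.items() if ip not in answered}


# Hypothetical sample data, not taken from the posted capture.
sample = [
    ("request", "192.168.70.1", "192.168.70.4"),
    ("request", "192.168.70.1", "192.168.70.4"),
    ("request", "192.168.70.10", "192.168.70.1"),
    ("reply", "192.168.70.1", "192.168.70.10"),
]
print(unanswered_arp_targets(sample))
```

An offline host that something in the config still references should show up here with a steadily growing count, which is expected rather than a fault.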

Steve

mikedob

@somerino sorry for not understanding that this is a different system than the slow Sonos system you were working with. The original ARP scan and the most recent ones you posted look normal, like stephenw10 has said. My personal system has 2 IoT controllers ARPing for devices I shifted to a different VLAN 3 weeks ago. You probably don't have an ARP problem but may have something else wrong with your Gantner equipment.
Your Gantner equipment failing to work after a loss of connection on the WAN port makes me wonder what kind of cloud, internet or VPN connection needs to be reestablished. I would look at its full normal communication, all packets, not just ARP.
For example, I'm currently researching the packet process for the DIAL protocol for multiscreen casting. It seems to be a different flavor of SSDP, UPnP and mDNS that Avahi and pimd don't resolve.

somerino

@stephenw10

What I've noticed is that after plugging the WAN port back in, the OpenVPN service works, but it's in a half-functioning state:
Sorry for the blurry picture; the status says: "Waiting for response from peer"

But still it's shown online:

Most of the network still works in this state except my access readers (Gantner). Only by manually restarting the VPN tunnel does the problem get resolved, and then I can see traffic on the VPN status page.

I'm using the Peer To Peer (SSL/TLS) server mode and I really haven't experienced such an issue before upgrading the netgate 6100 (other end of the connection) to 22.05

somerino

@mikedob

Hei mike, I think you're right, it's not an ARP issue.
I've captured the traffic between the server and the gantner device and it literally only sends a phantom byte and closes the connection:

Even weirder, sometimes I can only see the tcp handshake without closing the connection.
The problem is so frustrating. I've mentioned below that it might be VPN related. But I can't figure out why everything else works through the VPN except those Gantner readers.
I've found a workaround by restarting the VPN manually. Then comes the next problem: I have to do it every day, because for some other silly reason, the VPN tunnel automatically restarts a few times at night.

Jul 27 22:12:42 WHQ-FW01 check_reload_status[414]: Reloading filter
Jul 27 22:12:43 WHQ-FW01 php-fpm[51938]: /rc.openvpn: Gateway, NONE AVAILABLE
Jul 27 22:12:43 WHQ-FW01 php-fpm[51938]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use BUELACH_VPNV4.
Jul 27 22:26:00 WHQ-FW01 sshguard[87099]: Exiting on signal.
Jul 27 22:26:00 WHQ-FW01 sshguard[39533]: Now monitoring attacks.
Jul 27 22:35:46 WHQ-FW01 rc.gateway_alarm[38160]: >>> Gateway alarm: BUELACH_VPNV4 (Addr:10.0.0.2 Alarm:1 RTT:32.705ms RTTsd:2.141ms Loss:22%)
Jul 27 22:35:46 WHQ-FW01 check_reload_status[414]: updating dyndns BUELACH_VPNV4
Jul 27 22:35:46 WHQ-FW01 check_reload_status[414]: Restarting IPsec tunnels
Jul 27 22:35:46 WHQ-FW01 check_reload_status[414]: Restarting OpenVPN tunnels/interfaces
Jul 27 22:35:46 WHQ-FW01 check_reload_status[414]: Reloading filter
Jul 27 22:35:47 WHQ-FW01 php-fpm[51938]: /rc.openvpn: Gateway, NONE AVAILABLE
Jul 27 22:35:47 WHQ-FW01 php-fpm[51938]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use BUELACH_VPNV4.
Jul 27 22:38:04 WHQ-FW01 rc.gateway_alarm[33071]: >>> Gateway alarm: BUELACH_VPNV4 (Addr:10.0.0.2 Alarm:0 RTT:23.979ms RTTsd:1.969ms Loss:5%)
Jul 27 22:38:04 WHQ-FW01 check_reload_status[414]: updating dyndns BUELACH_VPNV4
Jul 27 22:38:04 WHQ-FW01 check_reload_status[414]: Restarting IPsec tunnels
Jul 27 22:38:04 WHQ-FW01 check_reload_status[414]: Restarting OpenVPN tunnels/interfaces
Jul 27 22:38:04 WHQ-FW01 check_reload_status[414]: Reloading filter
Jul 27 22:38:06 WHQ-FW01 php-fpm[375]: /rc.openvpn: Gateway, NONE AVAILABLE
Jul 27 22:38:06 WHQ-FW01 php-fpm[375]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use BUELACH_VPNV4.
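
To see at a glance when the gateway alarm fired and cleared, a small sketch like this pulls the `rc.gateway_alarm` entries out of a log in the format above. The regex is written against these lines only, and reading `Alarm:1` as "raised" and `Alarm:0` as "cleared" is my interpretation of the log, not something confirmed here. The names `ALARM_RE`, `gateway_alarms` and `sample_log` are just for illustration.

```python
import re

# Pattern for lines like:
# "Jul 27 22:35:46 HOST rc.gateway_alarm[PID]: >>> Gateway alarm: NAME (Addr:... Loss:22%)"
ALARM_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*?Gateway alarm: (?P<gw>\S+) "
    r"\(Addr:(?P<addr>\S+) Alarm:(?P<alarm>\d) "
    r"RTT:(?P<rtt>[\d.]+)ms RTTsd:(?P<rttsd>[\d.]+)ms Loss:(?P<loss>\d+)%\)"
)


def gateway_alarms(lines):
    """Return one dict per gateway alarm line; other lines are skipped."""
    events = []
    for line in lines:
        m = ALARM_RE.search(line)
        if m:
            events.append({
                "time": m["ts"],
                "gateway": m["gw"],
                # Assumption: 1 means the alarm was raised, 0 means cleared.
                "state": "raised" if m["alarm"] == "1" else "cleared",
                "loss_pct": int(m["loss"]),
            })
    return events


# Three lines copied from the log above.
sample_log = [
    "Jul 27 22:35:46 WHQ-FW01 rc.gateway_alarm[38160]: >>> Gateway alarm: BUELACH_VPNV4 (Addr:10.0.0.2 Alarm:1 RTT:32.705ms RTTsd:2.141ms Loss:22%)",
    "Jul 27 22:35:46 WHQ-FW01 check_reload_status[414]: Restarting OpenVPN tunnels/interfaces",
    "Jul 27 22:38:04 WHQ-FW01 rc.gateway_alarm[33071]: >>> Gateway alarm: BUELACH_VPNV4 (Addr:10.0.0.2 Alarm:0 RTT:23.979ms RTTsd:1.969ms Loss:5%)",
]
for event in gateway_alarms(sample_log):
    print(event["time"], event["gateway"], event["state"], f'{event["loss_pct"]}%')
```

Run against the full log, this should make it easier to line up each alarm with the "Restarting OpenVPN tunnels/interfaces" entries that follow it.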

mikedob

@somerino at this point I am unable to assist further; my knowledge of VPN connections is limited. You may want to repost this as an OpenVPN connection problem or change the thread title to get others to assist. Maybe stephenw10 can work with this new information and help.

stephenw10

Do you have DCO enabled at the HQ end? Try disabling it again if so. There is a known issue some people are hitting with DCO where only the server side IPs are available and nothing beyond.

Steve

somerino

@stephenw10
I haven't activated this feature yet.
I just found out, on a side note, that my ISP had maintenance work over the last couple of days. So this explains why the VPN tunnels had to reset.

Still I can't find an answer, what the difference is between an automatic reset and manual reset of the tunnel.

somerino

@mikedob

This thread is already getting out of hand. It started with ARP went to Spanning-Tree, to TCP protocol, VPN and I hope it will find an end soon.

Thank you for your help so far. By the way, the Sonos problem isn't solved yet. I'll come back to it after this :D

stephenw10

Hmm, do you have client logs leading up to the 'stuck in pending' state?

somerino

@stephenw10 not more than this. This was the last time I've seen something related to the mentioned VPN tunnel in the logs.
I think it's the same error state as I mentioned in the original problem. By replugging the WAN port, I tear down the VPN tunnel unexpectedly for the functioning part of the tunnel.
After a while both sides try to reestablish the connection

Client Side

Jul 27 22:35:47 	php-fpm 	345 	/rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use HQ_TO_BUELACH_VPNV4.
Jul 27 22:38:07 	rc.gateway_alarm 	1141 	>>> Gateway alarm: HQ_TO_BUELACH_VPNV4 (Addr:10.0.0.1 Alarm:0 RTT:24.269ms RTTsd:1.975ms Loss:5%)
Jul 27 22:38:07 	check_reload_status 		updating dyndns HQ_TO_BUELACH_VPNV4
Jul 27 22:38:07 	check_reload_status 		Restarting ipsec tunnels
Jul 27 22:38:07 	check_reload_status 		Restarting OpenVPN tunnels/interfaces
Jul 27 22:38:07 	check_reload_status 		Reloading filter
Jul 27 22:38:08 	php-fpm 	99524 	/rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
Jul 27 22:38:08 	php-fpm 	99524 	/rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use HQ_TO_BUELACH_VPNV4

Server Side

Jul 27 03:11:44 WHQ-FW01 php-fpm[41160]: /rc.openvpn: Gateway, NONE AVAILABLE
Jul 27 03:11:44 WHQ-FW01 php-fpm[41160]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use BUELACH_VPNV4.
Jul 27 03:14:00 WHQ-FW01 sshguard[21891]: Exiting on signal.
Jul 27 03:14:00 WHQ-FW01 sshguard[55980]: Now monitoring attacks.
Jul 27 03:14:04 WHQ-FW01 rc.gateway_alarm[97993]: >>> Gateway alarm: BUELACH_VPNV4 (Addr:10.0.0.2 Alarm:0 RTT:26.625ms RTTsd:22.486ms Loss:5%)
Jul 27 03:14:04 WHQ-FW01 check_reload_status[414]: updating dyndns BUELACH_VPNV4
Jul 27 03:14:04 WHQ-FW01 check_reload_status[414]: Restarting IPsec tunnels
Jul 27 03:14:04 WHQ-FW01 check_reload_status[414]: Restarting OpenVPN tunnels/interfaces
Jul 27 03:14:04 WHQ-FW01 check_reload_status[414]: Reloading filter

stephenw10

There's only one client on that tunnel I assume?

With the server side assigned as an interface like that, both sides reset when the link goes down because of the gateway failures. You might try disabling the gateway monitoring on the server side so it doesn't reset twice. There could be a timing issue.

Steve