XG-7100 unreachable from *some* local LAN nodes

tkotl

I have an XG-7100 whose LAN interface is unreachable from most, but not all, nodes on the local subnet. Until I figure this out, I do not have the WAN interface connected and I can't commission the device.

All other devices can ping every other device just fine. While my Windows machine can't connect to the XG-7100 directly, it can ssh in by first accessing my Ubuntu machine.

Devices that are impacted can neither ping nor ssh nor http into the XG-7100. tcpdump at either end shows ICMP Echo Requests (or TCP SYNs, depending on the test) going out but never arriving at the target device, whether testing from the XG-7100 or from another device. ARPs, however, show up just fine, and everybody's ARP table is complete and correct. The same behaviors are observed using IPv4 and IPv6. The MAC table on the switch is correct. I've found nothing useful in the logs.

Devices that are reachable:
x.x.x.1 (ISP router (the one I want to replace))
x.x.x.2 (Ubuntu machine)
x.x.x.32 (Unifi US-8-150W switch)
x.x.x.96 (Unifi UAP-AC-PRO)
x.x.x.97 (Unifi UAP-AC-PRO)

Devices that are not reachable:
x.x.x.3 (QNAP NAS)
x.x.x.33 (Unifi US-8 Switch)
x.x.x.98 (Unifi UAP-MESH)
x.x.x.99 (Unifi UAP-MESH)
...nor any of the Raspberry Pis on DHCP, nor my Windows laptop.

XG-7100 is connected to the US-8-150W, as is the QNAP, the UAP-AC-PROs, one of the UAP-MESHs and the Pis.

I have flushed the ARP tables on everything, including the switches. I have disabled firewalling entirely on the XG-7100 (pfctl -d). I have attached the QNAP directly to the onboard Marvell switch on the XG-7100, with no change in behavior. I have swapped out all the cables. I have replaced the US-8-150W with a dumb switch. I have tested using IPv4 and IPv6. I've validated netmask configurations on all devices.

XG-7100 has been factory reset and is as close to factory normal as possible.

The next hypothesis I intend to test is that there's something going on with VLAN tagging such that packets being emitted are getting dropped as corrupted by some IP stacks but not others. I'm going to set up port mirroring on the switch to see if packets getting transmitted are reaching the switch and getting relayed.
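For anyone wanting to check the VLAN tagging idea against mirrored traffic, here is a minimal sketch of how a raw Ethernet frame can be inspected for an 802.1Q tag. It assumes you already have the frame bytes (for example, exported from a capture on the mirror port); the function name parse_vlan is just something made up for illustration, not part of any tool mentioned here.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag


def parse_vlan(frame: bytes):
    """Return (vlan_id, ethertype) for a raw Ethernet frame.

    vlan_id is None when the frame carries no 802.1Q tag.
    parse_vlan is a hypothetical helper, written only for this example.
    """
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 0-5: destination MAC, 6-11: source MAC, 12-13: EtherType or TPID.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != TPID_8021Q:
        return None, ethertype
    if len(frame) < 18:
        raise ValueError("truncated 802.1Q tag")
    # TCI is 3 bits priority, 1 bit DEI, 12 bits VLAN ID;
    # the encapsulated EtherType follows it.
    tci, inner_ethertype = struct.unpack_from("!HH", frame, 14)
    return tci & 0x0FFF, inner_ethertype
```

A frame that leaves a LAN port still tagged with an internal VLAN such as 4091 would come back as (4091, ...) here, while a normal untagged frame returns (None, ...).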

Any suggestions or recommendations?

stephenw10

You are connected to the XG-7100 via its switched ports? The default WAN and LAN there?

The WAN is still set to DHCP and not connected, so it can't be conflicting?

You have disabled the DHCP server on the XG-7100 LAN? If not, do you have two DHCP servers on the same segment? Which one are clients using?

Steve

tkotl

Yes, connected on the regular ethernet points, eth3-7, using the default configuration putting them all on lagg0.4091 (LAN). WAN is eth1 via lagg0.4090, and is not connected because a firewall that can't talk to a majority of nodes on the network is not helpful.

The DHCP server on the LAN interface is definitely disabled, and there's just one DHCP server. All testing has been done using IPs rather than names, and all devices are reachable from all nodes... except the XG-7100.

Additional:

I have connected a device directly to the onboard switch on the XG-7100, and it is still unreachable.
pcaps of a ping FROM the XG-7100 to that device show ICMP Echo Requests leaving the XG-7100, arriving on the device, and an Echo Reply being generated... but that Echo Reply never shows in the pcap on the XG-7100.
pcaps of a ping TO the XG-7100 show the ICMP Echo Request leaving the attached device, but it never appears in the pcap on the XG-7100.
Similar behavior for TCP.
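
In case it helps anyone doing the same two-ended comparison, this is a rough sketch of how the text output of tcpdump from each end can be diffed to list the echo packets that one side saw and the other did not. It assumes the usual tcpdump ICMP line format ("ICMP echo request, id N, seq N"); the helper names and the sample lines in the usage note are made up for illustration, not taken from my captures.

```python
import re

# Matches the ICMP echo part of a tcpdump text line (assumed format).
ECHO_RE = re.compile(r"ICMP echo (request|reply), id (\d+), seq (\d+)")


def echo_keys(lines):
    """Extract (kind, id, seq) tuples from tcpdump text output lines."""
    keys = []
    for line in lines:
        match = ECHO_RE.search(line)
        if match:
            keys.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return keys


def missing_at(seen_here, seen_there):
    """Return echo packets present in seen_here but absent from seen_there."""
    there = set(echo_keys(seen_there))
    return [key for key in echo_keys(seen_here) if key not in there]


# Illustrative lines only (documentation addresses, invented id/seq).
device_side = [
    "IP 192.0.2.1 > 192.0.2.3: ICMP echo request, id 7, seq 1, length 64",
    "IP 192.0.2.3 > 192.0.2.1: ICMP echo reply, id 7, seq 1, length 64",
]
firewall_side = [
    "IP 192.0.2.1 > 192.0.2.3: ICMP echo request, id 7, seq 1, length 64",
]
lost = missing_at(device_side, firewall_side)
```

With the sample lines above, lost contains only the echo reply, which is the same pattern as the ping FROM the XG-7100: the reply is generated on the device but never shows up in the capture on the firewall.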

Netgate Support confirms tcpdump happens prior to the packet filtering (pf) on the XG-7100, and I have disabled the firewalling anyway (pfctl -d).

Current hypothesis: !(&!&?

tkotl

I have successfully resolved the issue.

I pulled the XG-7100 out of the rack and moved it to the bench for testing, got everything set up and... couldn't replicate the fault. Everything worked fine. I left it all weekend and it's working 100% correctly, so I restored the proper configuration and put it back into production. All good.

All I can come up with is that factory resetting it, rebooting it and re-imaging it were not enough to clear some weird low-level hardware transient, but a hard power cycle was.

In other words, in all my debugging I failed to do that all-important first step: I didn't turn it off and turn it back on again. :-( I trusted that a reboot would be sufficient, but it clearly was not.

Leaving this here to get indexed in case anybody else runs into this. Remember to turn it off and turn it back on again!

stephenw10

Hmm, I would not expect to ever require a power cycle for the switched ports.
The only time I have ever seen that is if a bad SFP module is used. It's possible the ix0/1 ports can require a power cycle to clear their state.
Anyway glad you found it.

Steve