internet access lost on wired (not wireless) devices after power outage

stephenw10

It's not a WAN side problem if the wifi clients can still get out.

How are the switch and Orbi device connected to ESXi?

Can LAN side wired clients connect to each other?

Do they even show as linked to the switch?

About the only thing I could see a downstream switch changing is that it would allow a link to come up if auto-MDI/X is disabled on the main switch. The main switch's ports would also have to be set permanently to MDI, such that you can only connect other switches and not clients directly. That seems very unlikely though.

Steve

charleyp1

@stephenw10
Yeah, I agree it's not a WAN side problem.

I have a server with 4 NICs. One is dedicated to WAN, one is connected to the switch, and one is connected to the Orbi. Everything is VLAN 0. There is a vSwitch for the WAN and another for the VMs.

I have a Freenas server in the environment as well and all wired and wireless devices have access to file shares and can connect to each other.

The switch all shows green link lights and pfSense shows green for WAN and LAN interfaces.

Yeah, I didn't think inserting a switch between the WAN port and the FIOS ONT would do anything but I'm stumped at this point.

The only thing I can think of is that something got hosed in pfSense. Or maybe ESX, though I can't imagine what it would be there. The fact that I also have a single VM that won't stay connected to the vSwitch/network is rather suspicious though.

I sure do wish I had a backup of pfSense prior to the power outage... I don't really want to have to rebuild from scratch.

stephenw10

So the two LAN NICs are passed through to the pfSense VM? Or do they also have vswitches?

The fact pfSense shows as linked in the VM can be deceptive because it will always be linked to the vswitch even if the external link fails. Unless you have specifically configured the vswitch to pass that state.

The confusing part for me is that inserting the switch for one client between it and the main switch allows it to connect. That makes very little sense.

However you say all wired devices can access the FreeNAS server and that's a VM in the same ESXi host?

That implies it must be a pfSense filtering or routing issue.

Are you applying policy routing to the wireless clients by any chance?
That might give them a default route when nothing else does if pfSense has a bad default gateway.
Check Diag > Routes. Make sure there is a default route and that it's correct.
If you have more than one gateway make sure the correct one is set as default in System > Routing > Gateways. If it's still set to auto try setting it to the WAN specifically to prevent it moving to an incorrect gateway.
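For anyone who would rather check this from a shell than from Diag > Routes, here is a minimal Python sketch that picks the default route rows out of `netstat -rn` style text. It assumes the FreeBSD column layout (Destination, Gateway, Flags, Netif), and the sample table, gateway address and interface names are made up for illustration, not taken from this thread.

```python
def find_default_routes(netstat_output: str) -> list:
    """Return (gateway, interface) pairs for every row whose destination is 'default'."""
    routes = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed FreeBSD-style row: Destination Gateway Flags Netif [Expire]
        if len(fields) >= 4 and fields[0] == "default":
            routes.append((fields[1], fields[3]))
    return routes


# Hypothetical sample output; addresses and interface names are placeholders.
SAMPLE = """\
Internet:
Destination        Gateway            Flags     Netif Expire
default            203.0.113.1        UGS         em0
127.0.0.1          link#3             UH          lo0
192.168.1.0/24     link#2             U           em1
"""

if __name__ == "__main__":
    found = find_default_routes(SAMPLE)
    if not found:
        print("No default route found")
    for gateway, interface in found:
        print(f"default via {gateway} on {interface}")
```

An empty result, or a gateway on an interface other than WAN, would be the thing to look into.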

Steve

charleyp1

@stephenw10

Ah, sorry. Yes, there is a vSwitch on the LAN side for the VMs (which includes pfSense).

I see, yes that would be deceptive. I did not configure it to pass that state, as far as I recall.

Yes, this is confusing to me as well. So, specifically, my switch is in a utility room in the basement. My desktop is in my office, also in the basement. The house has in-wall ethernet wiring. The path from switch to desktop is Switch--->In-wall wire--->wall plate--->Switch (small 5 port, unmanaged)--->Desktop (and a couple of other devices). All of those devices reach the internet. Why the devices that funnel down to one port on the main switch reach the internet, while no other port-connected device will is a true mystery to me.

Yes, all wired devices can access the LAN without issue. They just can't reach the internet. The FreeNAS server is a separate physical box, also connected to the main switch. And now that I think about it, I just did a ping test from the FreeNAS server to an outside IP (the ISP's gateway address) and it pinged successfully. So, really, there are 2 ports on which internet access is working. This lends further weight to your idea that pfSense is filtering or it's a routing issue. The only thing that gives me pause there is that all of this was working fine before the power outage. After, things got squirrelly.

While I like to think I'm a decent data storage guy, I'm not that strong with networking. :-) I'm not sure exactly what you mean by policy routing. I did not configure anything special for the wireless clients. I only set the wireless router (Orbi) into AP mode. I'm looking at Diag > Routes and see the default route points to the ISP gateway address. That doesn't actually seem quite right to me. Shouldn't it point to the local gateway?

My System>Routing>Gateway setup looks like this:
Name: WAN_DHCP
Interface: WAN
Gateway: 173.xx.xx.1 (ISP gateway, I presume)
Monitor IP: 8.8.8.8
Description: Interface WAN_DHCP Gateway

And then below that, Default Gateway IPv4 is set to Automatic. I can change it to specify WAN_DHCP. Will test that now.

Thanks for the suggestion!

stephenw10

Ok, so pfSense only has two interfaces: WAN and LAN. And all the wired and wireless devices are in the LAN subnet including other VMs in ESXi?

Is the Orbi connected to the main switch too? It read like that was on a separate NIC.

If it isn't then it couldn't have policy routing on it so it's very unlikely it would be a bad route.

I would try to ping one of the other VMs from one of the LAN clients that cannot reach the internet. That has to pass the switch and through ESXi but not pfSense.

Steve

charleyp1

@stephenw10

That is correct. All wired and wireless devices are in the LAN subnet including all VMs in ESXi.

The Orbi is not connected to the main switch. It has a separate LAN port on the ESXi host.

Interesting. I have a laptop plugged into the main switch and it cannot ping one of the VMs. But it can access the pfSense VM.

Now it feels like we're getting somewhere!

charleyp1

@stephenw10
It's starting to look like all VMs aside from pfSense are unreachable from wired devices attached directly to the main switch. And they cannot reach the internet. The exception, again, being my main desktop which has a dumb switch in between it and the main switch. But it CAN access all of the VMs (via RDP), even though it cannot ping them.

The plot thickens.

stephenw10

Ok, so you have two physical NICs connected to the LAN vswitch in ESXi?

If I've understood the setup here, traffic between LAN clients and VMs should only need to cross the vswitch. That traffic never goes through pfSense, so if it's failing it must be because of something either in the main switch or the LAN vswitch.
That does assume that the LAN clients and VMs have the correct IP addresses and subnets which might not be the case if they are using DHCP?

charleyp1

@stephenw10

That is correct. (now that I think about it, maybe I should set up a link aggregate on my physical switch. Otherwise, I'm not really NIC teaming, am I?)

Interesting. I was thinking the same thing early on, but I replaced the switch with another smaller one to test that theory but the problem remained. I have not yet completely ruled out the vSwitch, and as I previously mentioned I am having another anomalous issue with one VM that no longer stays connected to the network. But vSwitches are pretty dumb and don't have much in the way of configuring so I'm not really sure where to go here in the troubleshooting process. I will start by getting a handle on the few settings that I can change to see if they will make a difference (i.e. Security, NIC teaming and Traffic shaping)

I should probably mention that I have 2 vSwitches. 1 is for the WAN port group (vSwitch1) and the other is for the LAN and Management port groups (vSwitch0). I don't actually recall why I have 2 port groups here and I might explore collapsing to a single port group for LAN. But my Management pg only has the ip for the ESX host in it so I think it might need to stay. It's been quite a while since I set this up and my memory is a bit hazy on this.

It took me a while to get back to you as I noticed that my main switch is using the default vlan of 1 and my LAN vSwitch is set to 0. I changed the vSwitch to 1 and locked myself out of my entire virtual environment, lol. Rookie move. ;-) Not really being familiar with ESX CLI, I couldn't make the command to change the VLAN back to 0 work (apparently it's deprecated after ESXi 4 or 5, and I am running 6.0, which itself is pretty far out of date). I resolved this by going back to the ESX "GUI" via KVM and changing my Management network's vlan to "4095" which is essentially the same as "0" or "no vlan" and access was restored. I have since changed my LAN network to 4095 but that didn't do anything either. My physical switch only has vlans 2-4094 available so I can't really make them match. But with the vSwitches at 4095 (which I believe is essentially "no vlan") I don't think that matters, but I'm certainly no expert here.

I am continuing to tinker with settings and will reply with any successes. Thanks again for being interested enough to help!

charleyp1

@stephenw10

Well... It seems I have been waaaaay overthinking this. I just reread your last bit about LAN clients and VMs having the correct IP addresses and subnets. So I looked at the DHCP leases in pfSense and noticed that my laptop which is plugged directly into my main switch wasn't listed. I checked the IP address and noticed I had given it a static IP. I changed it to DHCP and, lo and behold, I was suddenly able to reach the internet. It seems that I never set up a static reservation for that IP. This is a tad bit embarrassing. I will now check all of my other physically connected devices and see if they also have static IPs and either change them to DHCP or create reservations. I'm using a 192.168 network and had set the DHCP range from .10 to .245, reserving the rest for static IPs. I don't know why this suddenly became an issue after a power outage but I am content to let that mystery lie buried.
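A rough way to audit the remaining static devices is sketched below in Python, using only the standard `ipaddress` module. It flags a static address that falls outside the LAN subnet, a gateway outside that subnet, or an address that lands inside the DHCP pool. The 192.168.1.0/24 subnet and the 192.168.1.1 gateway are assumptions for illustration (only the 192.168 prefix and the .10 to .245 pool are from the setup above), and `check_static_config` is a made-up helper name.

```python
import ipaddress


def check_static_config(address, gateway, lan_network, pool_start, pool_end):
    """Return a list of problems found; an empty list means none were detected."""
    net = ipaddress.ip_network(lan_network)
    addr = ipaddress.ip_address(address)
    gw = ipaddress.ip_address(gateway)
    start = ipaddress.ip_address(pool_start)
    end = ipaddress.ip_address(pool_end)

    problems = []
    if addr not in net:
        problems.append(f"{addr} is outside the LAN subnet {net}")
    if gw not in net:
        problems.append(f"gateway {gw} is outside the LAN subnet {net}")
    if start <= addr <= end:
        problems.append(f"{addr} sits inside the DHCP pool {start}-{end}")
    return problems


if __name__ == "__main__":
    # Assumed values: the third octet and the gateway address are guesses.
    issues = check_static_config(
        address="192.168.1.50",
        gateway="192.168.1.1",
        lan_network="192.168.1.0/24",
        pool_start="192.168.1.10",
        pool_end="192.168.1.245",
    )
    for issue in issues:
        print(issue)
```

This only covers addressing mistakes. It says nothing about DNS settings or why the outage changed anything.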

I really appreciate you stepping through this with me. I at least learned a little bit more about ESXi networking and my managed switch. I tried the Link Aggregation but that was a bad idea and reversed it. It doesn't do what I thought it did. Again, thanks a ton for the help!

stephenw10

It's dangerous (or at least confusing) to talk about VLAN 0 or 1 as an actual VLAN because you almost never want that. Switches use 1 as the 'native' VLAN meaning they use that for untagged traffic internally in the switch. You should never see traffic tagged VLAN1 outside the switch. Seeing it usually means something is configured incorrectly and unexpected results may occur!
https://docs.netgate.com/pfsense/en/latest/vlan/security.html#using-the-default-vlan-1

In ESXi, VLAN 4095 means pass all VLANs, so tagged traffic on any VLAN is allowed to pass the vswitch, much like it would through most unmanaged switches.
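To make the difference between 0, 1 and 4095 concrete, here is a small Python sketch of how a port group VLAN ID on a standard ESXi vswitch is generally interpreted. The function name and the wording of the descriptions are illustrative only, not anything ESXi itself prints.

```python
def describe_port_group_vlan(vlan_id: int) -> str:
    """Describe how a standard vswitch port group treats the given VLAN ID."""
    if vlan_id == 0:
        # No tagging by the vswitch: frames enter and leave untagged.
        return "untagged: the vswitch adds no VLAN tag"
    if 1 <= vlan_id <= 4094:
        # The vswitch tags outbound frames and only accepts that tag inbound.
        return f"tagged: the vswitch tags traffic with VLAN {vlan_id}"
    if vlan_id == 4095:
        # Trunk mode: tags are left in place for the guest to handle.
        return "trunk: all VLAN tags are passed through to the guest"
    raise ValueError(f"{vlan_id} is not a valid port group VLAN ID (0-4095)")


if __name__ == "__main__":
    for vid in (0, 1, 4095):
        print(vid, "->", describe_port_group_vlan(vid))
```

Seen this way, setting the port group to 1 asks the vswitch to tag frames with VLAN 1, which does not match a physical switch port that sends and expects untagged traffic.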

If you do have some tagging happening somewhere the additional switch on that one client that works could be stripping it. Especially if it's VLAN1.
That seems unlikely though. Hard to imagine that could have been set by a power outage. Or that it would have worked before that.

Steve