LAN Interface Drops Every Few Hours

Bisho

Hi,

I have the below diagram in a hotel environment. DHCP Leases and Captive Portal users sometimes reach 1000.

Every few hours, the pfSense LAN Interface drops and becomes unpingable even from the VM Host's LAN NIC, so devices can't get DHCP IPs and users lose their internet connection.
Rebooting the pfSense VM solves the issue. The System Log doesn't show anything unusual or odd at the time of the LAN Interface getting dropped.

The VM Host PC is a high-end one with 2 physical NICs.
Both WAN and LAN are "Bridged" types and assigned Static IPs. Is this correct? Trying Host or NAT for the LAN prevented the devices from getting DHCP IPs.
No VLANs are set up in pfSense, which doesn't seem to cause any issues for devices communicating with the pfSense DHCP Server.
There's a Guest VLAN configured on the switches. It is untagged on the Core Switch's port connected to the VM Host's LAN NIC and also untagged on all the Edge Switches' ports connected to the APs. Do I need to configure this Guest VLAN on pfSense as well, even though its absence doesn't seem to affect communication?
If I want to put the Captive Portal on an OPT Interface, do I need a 3rd physical NIC on the VM Host or can it be a virtual one?

I appreciate your thoughts on this issue.

Screenshot 2024-08-23 213524.png

stephenw10

I assume you have VLANs set up in the Core switch to separate the pfSense WAN and LAN and the ISP router and edge switches?

You don't need an additional NIC in the hypervisor to add an OPT interface in pfSense. That can be a VLAN. It could be either a VLAN in the hypervisor or in pfSense directly.

When the LAN interface 'drops' what exactly happens?

Can you still ping out from pfSense directly? On both interfaces?

Steve

Bisho

@stephenw10 Thanks for your response.

Yes, I have two VLANs set up on the Core Switch. Let's call them VLAN1 for WAN (untagged on the ports of the ISP Router and the Hypervisor WAN NIC) and VLAN2 for LAN (untagged on the ports of the Hypervisor LAN NIC and APs).

When the LAN interface drops, the Hypervisor still has access to the internet via WAN. I'm still able to access the webConfigurator via the WAN IP and I see the DHCP Server Service running, but it's unable to give out IPs because the LAN Interface is dropped and not passing any traffic.
It also becomes unpingable from the Hypervisor LAN NIC.
You can ping out from pfSense to 8.8.8.8, for example, but not to 10.10.0.2 (Hypervisor NIC).

The System Log doesn't show anything unusual. I always see these three nginx errors, none of which I think is related to my LAN issue.

One is accept4() failed (53: Software caused connection abort). From what I have read, this error occurs when a user doesn't wait for an image-heavy page to load fully and clicks on a different link.

Two is SSL_do_handshake() failed (SSL: error:0A0000A4:SSL routines::too much early data) while SSL handshaking, client: 10.10.XX.XX, server: 0.0.0.0:8003

Three is limiting connections by zone "addr", client: 10.10.XX.XX, server: , request: "GET / HTTP/1.1", host: "r11.i.lencr.org"
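To check whether anything other than these three messages shows up around the time of a drop, the log lines could be tallied by type. This is only a rough sketch: the patterns are taken from the three messages quoted above, while the labels and the function names `classify` and `summarize` are made up for illustration. It assumes the log has been exported as plain text lines.

```python
import re
from collections import Counter

# Patterns copied from the three messages quoted in this post.
# The label names are illustrative, not anything pfSense or nginx defines.
PATTERNS = {
    "accept4_aborted": re.compile(
        r"accept4\(\) failed \(53: Software caused connection abort\)"
    ),
    "ssl_early_data": re.compile(
        r"SSL_do_handshake\(\) failed .*too much early data"
    ),
    "conn_limit": re.compile(r'limiting connections by zone "addr"'),
}


def classify(line):
    """Return the label of the first matching pattern, or "other"."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            return label
    return "other"


def summarize(lines):
    """Count how many log lines fall into each category."""
    return Counter(classify(line) for line in lines)
```

A non-zero "other" count near the time of a drop would be the lines worth reading closely.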

I have another interesting update. As I went with the simple setup (no OPT), using the LAN Interface for both the DHCP Server and the Captive Portal, I created a rule on the LAN to block the DHCP Leases from accessing the webConfigurator IP via ports 443, 80, and 22. I just didn't want to give some smart guests the opportunity to type in the webConfigurator URL (DHCP Gateway IP) and try to access it, although the Admin password had already been changed.

This rule did its job, but I noticed that while it was enabled, the ping from the Hypervisor NIC (10.10.0.2) to the pfSense LAN (10.10.0.1) was high (between 50ms and 500ms, sometimes more), even though the rule is not applied to the range 10.10.0.2-10.10.0.254, which lies outside the DHCP Pool Range and so shouldn't be affected.
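As a sanity check on that reasoning, the following sketch tests whether an address falls inside a DHCP pool. The pool boundaries here are hypothetical placeholders, since I haven't listed my actual pool range; only 10.10.0.2 and 10.10.0.1 come from my setup.

```python
import ipaddress

# Hypothetical pool boundaries for illustration only;
# substitute the real DHCP range configured on the LAN.
POOL_START = "10.10.1.0"
POOL_END = "10.10.7.254"


def in_pool(ip, pool_start=POOL_START, pool_end=POOL_END):
    """Return True if ip lies within the inclusive range pool_start..pool_end."""
    addr = ipaddress.ip_address(ip)
    return ipaddress.ip_address(pool_start) <= addr <= ipaddress.ip_address(pool_end)
```

With these placeholder values, `in_pool("10.10.0.2")` is False, which matches the expectation that a rule scoped to the pool should not match traffic from the Hypervisor NIC.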

After I posted my question, it crossed my mind to disable this rule, and I was surprised to see the continuous ping become excellent, no more than 20ms, and the LAN Interface hasn't dropped since (over 12 hours so far).
I'm going to keep monitoring to see if the LAN Interface will ever drop.
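For the monitoring, something like the sketch below could run on the Hypervisor and log a timestamped line per ping, so a drop or a latency spike can be matched against the System Log afterwards. It is a minimal sketch with assumptions: it shells out to the system `ping` command, the function names are made up, the 20ms threshold is just the figure I observed above, and the reply parsing assumes English-language ping output.

```python
import platform
import re
import subprocess
import time
from datetime import datetime

# Matches "time=0.412 ms" (Unix-like) and "time<1ms" (Windows) in ping replies.
RTT_RE = re.compile(r"time[=<]\s*([\d.]+)\s*ms")


def parse_rtt_ms(ping_output):
    """Extract the round-trip time in ms from ping output, or None if absent."""
    match = RTT_RE.search(ping_output)
    return float(match.group(1)) if match else None


def classify(rtt_ms, slow_ms=20.0):
    """Label a result as DOWN (no reply), SLOW (above slow_ms), or OK."""
    if rtt_ms is None:
        return "DOWN"
    return "SLOW" if rtt_ms > slow_ms else "OK"


def ping_once(host, timeout_s=5):
    """Send one echo request; return the RTT in ms, or None on failure."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return parse_rtt_ms(result.stdout)


def monitor(host="10.10.0.1", interval_s=10):
    """Ping host forever, printing one timestamped status line per attempt."""
    while True:
        rtt = ping_once(host)
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {host} {classify(rtt)} rtt_ms={rtt}", flush=True)
        time.sleep(interval_s)
```

Calling `monitor()` with its output redirected to a file would give a record of when the LAN (10.10.0.1) went from OK to SLOW or DOWN.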

I don't know what the relationship was between this rule and the LAN interface getting dropped.

Anyway, can you answer my questions? Was I right in making the VMnet0 and VMnet1 Bridged? That was the only type that worked in my environment.

Also, am I supposed to create the Core Switch VLAN2 in pfSense and make its Parent Interface the LAN with VLAN Tag 2? Even though the communication is working fine without it?

Lastly, if I decide to go with the OPT Interface implementation, is it going to be used for the DHCP Server and Captive Portal while the LAN is for management only or vice versa?

What's the best approach to achieve that? Do I need to create a 3rd VLAN on the Core and Edge Switches?

stephenw10

Yes, making them bridged in the hypervisor is probably what you want in that setup. Anything else just adds an extra layer of NAT you don't need.

No, you don't need to make any VLANs in pfSense for just WAN and LAN. The switch is passing that traffic untagged to pfSense.

If you create an OPT interface you would need to add that as a VLAN somewhere since the switch would pass that tagged. That could be in pfSense or it could be in the hypervisor with pfSense just seeing a 3rd virtual NIC.

You can use either LAN or OPT for CP clients. pfSense treats them the same. The only difference is LAN has some predefined rules.

Bisho

@stephenw10 Thank you for answering my questions.

As an update to my original issue, the pfSense has been running continuously for 3 days without any issues. The LAN hasn't dropped at all and I haven't had to restart it every 2-3 hours.

I believe the issue was related to the LAN rule I created to block the DHCP Leases from accessing the webConfigurator IP via ports 443, 80, and 22. It might have been causing a high load or some kind of conflict and causing the LAN to drop.

I assume my only option to achieve that (instead of a rule) is to add the OPT to separate the guest devices via VLAN.

One last question. Regarding the 3 errors I keep seeing in the log, anything of concern?

stephenw10

Nope, those errors in the nginx log are almost certainly nothing to worry about.