Subnet collapses periodically since 24.11-RELEASE
-
I am having a very difficult problem to solve.
Occasionally (it has happened 3 times now in 1 week) my netgate will collapse my 192.168.3.xxx LAN to 192.168.0.xxx which basically disables the entire network. VNC services, printers, the whole shebang becomes inaccessible.
Resetting the netgate pfsense router basically fixes this. This means it is physically dependent on a person to press the reset button. Not ideal, especially with people dialing in at different time zones and remotely after working hours.
When I go to DHCP Service and Refresh, the various linux servers, printers, and PCs still have 192.168.0.xx addresses. When I reboot these servers, printers, PCs, they receive a 192.168.3.xx address. This is also not good.
When I check DHCP leases under Status I don't see any 192.168.0.xx entries.
DHCP is disabled in all the wifi routers that I use strictly for wifi and nothing else so I can't assume those TP-link Archer X73s are the culprit.
Our system has been stable for over a year, which may indicate the update is to blame, but not 100% sure.
Other things I can think of
- updated to 24.11-RELEASE
- played around with HAproxy
- saw a php error one day which was a memory issue, but it has never come back either ...
I unfortunately have nothing more to go on at this point, not even to diagnose.
Would appreciate any insight
-
@vf1954 said in Subnet collapses periodically since 24.11-RELEASE:
which may indicate the update is to blame
While it is possible that something changed with the update.. I am on 24.11 with multiple networks and they are not collapsing or changing in any way..
There is zero reason why a static IP set on an interface in pfsense would ever just change to a different network. Is your network actually 192.168/16 or something or 192.168.0/22 or something and your dhcp pool is changing?
Is this network a vlan on pfsense?
If pfsense somehow changed its IP to 192.168.0 vs 192.168.3 - and your clients got a dhcp from 192.168.0 then they would be listed under leases on pfsense. Another theory of what might be going on is you have some other dhcp server on the network handing out these 192.168.0 addresses..
So if it happens again, before you go rebooting pfsense - look on pfsense for its actual address on this interface.. a simple ifconfig from the console, for example, would work if you can not get to the gui.
But pfsense doesn't just randomly change the IP address on an interface..
Let's see your current config for your dhcp and interface.. Example here is my lan and dhcp server settings.
So let's look at what yours is currently, and then when you see the problem again if you do - let's see what pfsense shows.
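For anyone following along, a minimal sketch of the ifconfig check suggested above. It reads ifconfig output on stdin and reports whether the interface still holds an IPv4 address in the expected /24. The helper name lan_addr_check and the interface name igc1 are assumptions, not taken from this thread; substitute whatever your LAN interface is actually called.

```shell
# Hypothetical helper: reads ifconfig output on stdin and reports whether
# each IPv4 address starts with the expected three-octet prefix.
lan_addr_check() {
    prefix="$1"    # e.g. 192.168.3 for 192.168.3.0/24
    awk -v p="$prefix" '
        $1 == "inet" {
            found = 1
            if (index($2, p ".") == 1) print "OK " $2
            else print "CHANGED " $2
        }
        END { if (!found) print "NO_IPV4" }
    '
}

# Usage on the console (interface name is an assumption):
#   ifconfig igc1 | lan_addr_check 192.168.3
```

If this prints OK while clients are sitting on 192.168.0.x, the address change is not coming from the pfSense interface itself.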
-
@vf1954
To start, I would choose a pre-update version in Boot Environments and run it for a while to make sure the issue is specifically with the 24.11 update and subsequent actions, rather than, say, a hardware problem.
-
@johnpoz said in Subnet collapses periodically since 24.11-RELEASE:
There is zero reason why a static IP set on an interface in pfsense would ever just change to a different network.
I agree. Hence why I am scratching my head.
Thank you for helping so quickly. Let me get you some more information:
LAN is on 192.168.3/24
I do not use vlans.
I have not ruled out a random DHCP server acting out, but currently, afaik, I have disabled all DHCP servers except the netgate. The three tp-link archers are linked together (ethernet) with easymesh and dhcp is off. What I can say is that when I reset the netgate, the leases are all 192.168.3.xx yet the PCs (that haven't been restarted) are still on 192.168.0.xx until I reboot them too.
Next time this happens, I will do as you requested: log in through the serial console and run ifconfig. The last three times I had to urgently reset it for logistic reasons.
Please see attached my configs requested!
netgate-lan1.png
-
@vf1954 you don't have to use console - that is only if you can not get to the lan interface.. But I would bet a large sum of money there is no way pfsense just out of the blue changes its static IP, which you left off your lan pic is set to static right? I have to assume that from the 192.168.3.2 /24 setting..
You sure its not rebooting and booting say a different image, like a previous one.. Just up out of the blue change its IP - yeah that makes zero sense.. Validate your uptime once it happens again before you go and reboot it.. Rebooting is really the last thing that should ever be done when troubleshooting something.. Its like a hail mary pass with 2 seconds left on the clock from the 50 yard line.. ;)
Or somehow a previous config got loaded? But normally for those to take effect you have to do a reboot. Yeah If I thought my pfsense just out of the blue changed its IP - I would be for sure scratching my head going WTF ;)
-
@johnpoz For some reason (perhaps it was the lack of !) the first image did not show (but it is clickable to show, yes, I have it on static).
I have never seen "track interface" before (which is enabled for ipv6). I don't know if that was part of the new release (or the one before).
Next time I'll bring a cable from the lan port to my own laptop to see if I can go onto the GUI. Prior attempts to see what was going on failed (save bringing out the serial cable which I didn't have time for, but I'll make time for it next time).
What I don't understand still is pfsense not communicating to the other PCs/printers to update the address when I reboot the pfsense (even if I refresh the DHCP service I would expect the ip address to update on the PC, but it doesn't).
-
@vf1954 once a client gets a lease, it doesn't care if the dhcp server is on, changes its IP range.. The only time a dhcp client cares about a dhcp server is when it tries to renew, and again, even with no dhcp server the client is happy with the IP it had.. it will start screaming faster and faster hey give me a renewal.. When the lease finally expires it will then send out a discover.
So no your clients are not going to change their ip just because you rebooted pfsense.
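To put rough numbers on this: per RFC 2131 a client by default tries to renew at half the lease time (T1) and falls back to broadcasting a rebind request at 7/8 of the lease time (T2), unless the server sends other values. A small sketch assuming those defaults; dhcp_timers is a made-up helper name and the 7200 second lease is just an example value.

```shell
# Hypothetical helper: prints the default DHCP renew/rebind/expiry points
# for a lease length given in seconds (RFC 2131 defaults: T1 = 1/2, T2 = 7/8).
dhcp_timers() {
    lease="$1"
    t1=$((lease / 2))        # client starts unicast renew attempts
    t2=$((lease * 7 / 8))    # client starts broadcast rebind attempts
    echo "renew(T1)=${t1}s rebind(T2)=${t2}s expire=${lease}s"
}

dhcp_timers 7200
# prints: renew(T1)=3600s rebind(T2)=6300s expire=7200s
```

Until the expiry point the client keeps using the address it already has, which is why a reboot of pfSense alone changes nothing on the clients.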
-
@johnpoz Okay, that is good to know. Thank you for explaining that to me. How, then, would one change a client's IP from within pfsense? (If at all...?). Seems like this is a task beyond pfsense. But if that is the case, it is even more of an issue, as one has to be physically present in front of every PC (unless the lease expires)
-
@vf1954 you can't really change a clients IP on a whim from just pfsense, you can sure give it a reservation and then if you reboot it or release renew its lease etc it would get the new IP.
Same goes if you changed your IP range, clients wouldn't get an IP from the new lease unless the client was asking for an ip, etc.
I do this pretty much any time I bring a new device online - I let it get an IP, then I change it to reservation and have the client then reboot or release/renew so it now gets the reservation.
If the device was poe, you could prob force the IP change by cycling its ethernet port on your switch.
How fast a client would move over to a reservation or a new range would depend on the length of the lease you gave it to start with - if your lease is like the default of 2 hours... Then 2 hours later all devices would be on the new IP range or be using a reservation you set for them, it could be faster.. But for sure you would know that within 2 hours all devices would have a new IP.
If your lease was like 8 days - then yeah it could take up to 8 days to move to the new IP without intervention on your part at the client.
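The release/renew step on a Linux client can be sketched roughly as follows, assuming the client uses ISC dhclient and the interface is called eth0 (both are assumptions; hosts managed by NetworkManager or systemd-networkd handle leases differently). The helper renew_lease is hypothetical and only prints the commands unless RUN=1 is set. On Windows the equivalent is ipconfig /release followed by ipconfig /renew.

```shell
# Hypothetical helper: release and re-request a DHCP lease with dhclient.
# Dry-run by default: prints the commands. Set RUN=1 (as root) to execute.
renew_lease() {
    iface="$1"
    for cmd in "dhclient -r $iface" "dhclient $iface"; do
        if [ "${RUN:-0}" = "1" ]; then
            $cmd          # intentionally unquoted so the command is word-split
        else
            echo "$cmd"
        fi
    done
}

renew_lease eth0
```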
-
@johnpoz I got a bit further. It went down three times since we last talked, this being the third. Thankfully it's midnight so I can finally test around without having to immediately restart it. Here are some findings.
Setup
Internet -> netgate/pfsense -> {wifi_router1, aruba switch}
-> wifi_router1 -> wifi_router2 -> wifi_router3
(Easy-Mesh Tp-link which disables router2/router3)
-> aruba switch -> 2nd tp-link switch
Testing
- plugging in eth cable directly from netgate LAN -> laptop (running linux) does not produce a connection.
- therefore, no access to online GUI
- access to serial shows uptime of 6 days.
- I can ping 1.1.1.1 in pfsense shell, but i cannot ping domain (DNS server is pi-hole that is now on the 192.168.0.x network)
- wifi access gave me gateway of 192.168.0.1
- logging into 192.168.0.1 sent me to the second switch
- second switch was set to manual DHCP, IP 192.168.0.1 with 0.0.0.0 as gateway (not 100% sure if it went to static IP automatically but when pfsense is back up I'll create a rule for it)
- changed 2nd TP-switch to automatically get IP from DHCP server (i.e., netgate pfsense) and restarted ...
- no more access to tp-link switch ... -.-" still on 192.168.0.x connected to wifi (with no internet access)
- ran ip route | grep default and found new gateway at 192.168.0.254 (which is a TP-link router). TP-link router not accessible as I use 3 of them with Easy-Mesh and disabled DHCP... so likely using wifi_router2
- physically disconnected switch2 and router2 forcing me to go to wifi_router1 only
- 192.168.0.254 still gateway (surprised me). Still not accessible (now I'm only using wifi_router1 which I should be able to access...)
(also, wifi_router1 is set to 192.168.3.3 in pfsense)
- not sure what other test I can run while under serial shell for pfsense...
Will restart system and ensure 2nd switch is in DHCP rules and update wifi_router firmware.
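A further check that fits here while the problem is active: look up the MAC address behind 192.168.0.254 in the laptop's neighbor table and compare it with the labels on the TP-Link units and switches. A sketch, assuming a Linux client with iproute2; mac_of is a hypothetical helper that parses `ip neigh` output.

```shell
# Hypothetical helper: reads `ip neigh` output on stdin and prints the
# link-layer (MAC) address recorded for the given IP, if any.
mac_of() {
    ip_addr="$1"
    awk -v ip="$ip_addr" '
        $1 == ip {
            for (i = 2; i < NF; i++)
                if ($i == "lladdr") { print $(i + 1); exit }
        }
    '
}

# Usage (ping the address first so the neighbor entry exists):
#   ip neigh | mac_of 192.168.0.254
```

The first three bytes of the MAC identify the vendor, which narrows down which box is answering as the 192.168.0.x gateway.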
-
Your LAN :
so 192.168.3.2/24
Why not 192.168.3.1/24 ? .2 is ok of course, any .1 to .254 is ok - but 'strange'.
@vf1954 said in Subnet collapses periodically since 24.11-RELEASE:
192.168.0.1 sent ...
Where does this network come from ? It's not a pfSense interface.
You have a router-after-router setup ? ( ! ). Why ? Again, it can be done, it can work, but why make a more complicated network like that ?
What about the good old [ISP] <=> [pfSense WAN <-> pfSense LAN] <=> switch <=> (all your PCs, APs, all other devices)
Your PCs and all other devices will use the default DHCP, so they will connect.
If you use APs:
- set them up with static IPs like 192.168.3.3, 192.168.3.4, etc.
- set their gateway to 192.168.3.2 (pfSense)
- disable the DHCP server on all APs
- set the DNS on all APs to 192.168.3.2 (pfSense)
- if your APs have a port labeled "WAN", do not use it; use a LAN port.
After all, you use the APs as APs; you don't want to use them as a 'router'. pfSense is your one and only router.
@vf1954 said in Subnet collapses periodically since 24.11-RELEASE:
plugging in eth cable directly from netgate LAN -> laptop (running linux) does not produce a connection.
Before plugging your laptop into the pfSense LAN port : check :
Is the pfSense DHCP server up and running ?
edit : on console, menu option 8, type
ps aux | grep 'kea'
If you use ISC :
ps aux | grep 'kea'
end edit.
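A note on reading the result: a plain `ps aux | grep name` also matches the grep process itself, so a single line ending in `grep name` means nothing was found. A sketch that avoids this with the bracket trick, assuming the Kea DHCPv4 daemon shows up as kea-dhcp4 and the ISC daemon as dhcpd (the usual binary names, not confirmed in this thread); dhcp_procs is a hypothetical helper that filters ps output.

```shell
# Hypothetical helper: reads `ps aux` output on stdin and prints only lines
# for a Kea or ISC DHCPv4 daemon. The [k] / [d] brackets keep the pattern
# from matching the grep command line itself.
dhcp_procs() {
    grep -E '[k]ea-dhcp4|[d]hcpd' || echo "no DHCP server process found"
}

# Usage on the console:
#   ps aux | dhcp_procs
```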
Is the laptop using DHCP client (default, it is) ?
Now, console access pfSense, menu option 8 :
tail -f /var/log/dhcpd.log
and now connect your laptop.
What shows up ?
-
@Gertjan Thank you for your wonderful reply.
I have everything up and running since I reset the pfsense.
I have router after router because I use them as an "Easy-Mesh" network so the company can traverse the entire property without dropping the signal. So the "routers" don't actually do any DHCP. If I make all 3 of them APs then I lose the Easy-Mesh functionality.
The only problem is whenever I update firmware I have to start the entire process over again because these TP Archers are not connected via WAN but LAN.
.2 was because .1 was problematic due to our ISP. Today I may revert back to .1 but meh.
I suspect something strange is occurring with the routers. So I completely re-programmed them and updated the firmware. I also set a few key components to static (like the DNS and that second switch)
If the network goes down again, I'll follow your advice with the shell prompts (I assume the second one was meant to say 'isc')? Thank you so much!
-
@vf1954 said in Subnet collapses periodically since 24.11-RELEASE:
.2 was because .1 was problematic due to our ISP
Hummmmm
You took .2 because .1 was already used ? Like "192.168.3.1" is already occupied on LAN ? WAN ? Where ? On WAN ? If so, you can't use 192.168.3.x/24 on LAN.
-
@Gertjan This was many years ago.
192.168.3.1 is not in use. But since so many clients have 192.168.3.2 hardcoded it's best to just use .2
Clearly updating the firmware didn't solve the problem.
It just happens randomly. Today at 3PM I suddenly lost wifi and ethernet access. Even more bizarre, at first only a few computers were affected, then progressively all of them.
Uptime is currently 7 days.
When I run
ps aux | grep 'isc'
I get
root 1651 0.0 0.1 4672 2256 u0 S+ 15:34 0:00.01 grep isc
running
tail -f /var/log/dhcpd.log
Produces
Sending to Solicit (multiple lines)
The actual time it takes to even get a connection is a good 45 seconds, and then I just get a ? on the wired connection on ubuntu laptop and when I go to properties of the wired connection ... no IP shows up.
When I ran the 'kea' command I got more dhcp6 output (dhcp6 is turned off in the GUI)
What is happening?
:(
-
@vf1954 said in Subnet collapses periodically since 24.11-RELEASE:
tail -f /var/log/dhcpd.log
Doing it after I reboot the netgate produces some warnings
-
@vf1954 said in Subnet collapses periodically since 24.11-RELEASE:
192.168.0.254 (which is a TP-link router)
Where is this set on your TP-Link? How is it connected to your pfSense LAN network?
-
@SteveITS It is simply plugged in, gets assigned a lan address from pfsense at 192.168.3.3, and then that's it
-
@SteveITS sorry I see what you mean.
It is set at 192.168.3.3 in the LAN settings in tplink
AND
it is set to 192.168.3.3 in pfsense dhcp static.
-
@vf1954 So, what is the .254 you mentioned?
Screencap the change in pfSense when this happens.
If the fields in pfSense aren’t changing I suspect what you’re seeing is another DHCP server. Windows, and I’m sure other clients, will show the DHCP server that was used, for example with “ipconfig /all”.
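On a Linux client running NetworkManager the same information is usually available from nmcli. A sketch, assuming the lease options are printed in the `DHCP4.OPTION[n]: name = value` form and that the interface is wlan0 (both are assumptions); dhcp_server_of is a hypothetical helper.

```shell
# Hypothetical helper: reads `nmcli -f DHCP4 device show <iface>` output on
# stdin and prints the address of the DHCP server that issued the lease.
dhcp_server_of() {
    awk '$2 == "dhcp_server_identifier" && $3 == "=" { print $4; exit }'
}

# Usage on an affected client (interface name is an assumption):
#   nmcli -f DHCP4 device show wlan0 | dhcp_server_of
```

If this prints anything other than the pfSense LAN address while a client holds a 192.168.0.x lease, that address belongs to the other DHCP server.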
-
@SteveITS said in Subnet collapses periodically since 24.11-RELEASE:
Screencap the change in pfSense when this happens.
Not sure what you mean here. Does screencap mean screenshot? Screenshot what?
The address being circulated is 192.168.0.xx, but the only other DHCP-capable device is the wifi router, and its DHCP is turned off.