Netgate 6100 unstable since upgrade to 26.03
-
Since upgrading my 6100 from 25.11 to 26.03 I've had three occasions where the router has locked up - it stopped responding on all interfaces - and I had to power cycle it to recover. After the most recent instance the system log shows the following, right up until the moment I power cycled it:
Apr 17 14:18:23 kernel pppoe: received PADO but could not find request for it
Apr 17 14:18:23 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:18:23 kernel pppoe: received PADO but could not find request for it
Apr 17 14:18:23 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:18:23 kernel pppoe: received PADO but could not find request for it
Apr 17 14:18:23 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:18:23 kernel pppoe: received PADO but could not find request for it
Apr 17 14:18:23 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:18:23 kernel pppoe: received PADO but could not find request for it
Apr 17 14:18:23 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:17:27 kernel pppoe: received PADO but could not find request for it
Apr 17 14:17:27 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:17:27 kernel pppoe: received PADO but could not find request for it
Apr 17 14:17:27 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:17:27 kernel pppoe: received PADO but could not find request for it
Apr 17 14:17:27 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:17:27 kernel pppoe: received PADO but could not find request for it
Apr 17 14:17:27 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:17:26 kernel pppoe: received PADO but could not find request for it
Apr 17 14:17:26 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:16:30 kernel pppoe: received PADO but could not find request for it
Apr 17 14:16:30 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:16:30 kernel pppoe: received PADO but could not find request for it
Apr 17 14:16:30 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:16:30 kernel pppoe: received PADO but could not find request for it
Apr 17 14:16:30 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:16:30 kernel pppoe: received PADO but could not find request for it
Apr 17 14:16:30 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:16:30 kernel pppoe: received PADO but could not find request for it
Apr 17 14:16:30 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:15:34 kernel pppoe: received PADO but could not find request for it
Apr 17 14:15:34 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:15:34 kernel pppoe: received PADO but could not find request for it
Apr 17 14:15:34 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:15:34 kernel pppoe: received PADO but could not find request for it
Apr 17 14:15:34 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:15:34 kernel pppoe: received PADO but could not find request for it
Apr 17 14:15:34 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:15:34 kernel pppoe: received PADO but could not find request for it
Apr 17 14:15:34 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3
Apr 17 14:15:34 kernel pppoe0: link state changed to DOWN
Apr 17 14:15:34 kernel if_pppoe: pppoe0: LCP keepalive timeout
Could this be related to the lock up? Are there any known issues that could cause this instability? System was rock solid on 25.11.
This is rather concerning.
-
You're using 'pppoe'.
I was told (see the xxxx post on the forum) that pppoe uses a new driver.
Go here : System > Advanced > Networking at the bottom of the page, where you can pick 'the other' one.
-
@Gertjan said in Netgate 6100 unstable since upgrade to 26.03:
You're using 'pppoe'.
I was told (see the xxxx post on the forum) that pppoe uses a new driver.
Go here : System > Advanced > Networking at the bottom of the page, where you can pick 'the other' one.
I am using the new driver (I guess the log messages do not differentiate).

-
They do log differently. Those logs are from the new if_pppoe driver. They usually show a connection where the remote side is not in the same connection state. So either the local client brought down the link but the server thinks it's still up or the other way around. Either way I'd expect it to be resolvable by reconnecting the WAN. A reboot should not be necessary.
However if it stopped responding on any interface that's something else. Those PPPoE logs could just be a symptom.
Are there any other errors logged? When it first hit the issue perhaps?
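For anyone who wants to check how these messages cluster, here is a minimal Python sketch. It is not part of pfSense; the names `parse_entries` and `bursts` are made up for illustration, and the year 2000 is only a placeholder because the log lines carry no year. It splits run-together log text on the timestamp and groups the entries into bursts, so the gap between bursts can be read off.

```python
import re
from datetime import datetime

# Timestamp shape used in the posted log, e.g. "Apr 17 14:18:23".
STAMP = r"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}"

# Four entries copied from the log posted above (newest first, as posted).
SAMPLE = (
    "Apr 17 14:16:30 kernel pppoe: received PADO but could not find request for it "
    "Apr 17 14:16:30 kernel pppoe0: host unique tag found, but it belongs to a connection in state 3 "
    "Apr 17 14:15:34 kernel pppoe0: link state changed to DOWN "
    "Apr 17 14:15:34 kernel if_pppoe: pppoe0: LCP keepalive timeout"
)


def parse_entries(text):
    """Split run-together log text into (datetime, message) tuples, oldest first."""
    entries = []
    # Split just before every "<timestamp> kernel " without consuming it.
    for piece in re.split(rf"(?={STAMP} kernel )", text):
        m = re.match(rf"({STAMP}) kernel (.*)", piece.strip(), re.S)
        if m:
            # Assumption: the log has no year, so 2000 is an arbitrary placeholder.
            when = datetime.strptime("2000 " + m.group(1), "%Y %b %d %H:%M:%S")
            entries.append((when, m.group(2).strip()))
    return sorted(entries, key=lambda e: e[0])


def bursts(entries, max_gap=5):
    """Group time-sorted entries into bursts separated by more than max_gap seconds."""
    groups = []
    for when, msg in entries:
        if groups and (when - groups[-1][-1][0]).total_seconds() <= max_gap:
            groups[-1].append((when, msg))
        else:
            groups.append([(when, msg)])
    return groups
```

Run against the full log from the first post, this kind of grouping shows bursts starting at 14:15:34, 14:16:30, 14:17:26 and 14:18:23, so roughly 56 to 57 seconds apart, which looks like a regular retry rather than random noise.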
-
@stephenw10 Nothing that looked out of the ordinary in the system log. The messages immediately before the snippet I posted were some SSH connect/disconnect from one of my monitoring hosts (which were successful). The issue seemed to start pretty much at the time those PPPoE messages were logged. The next message after that snippet was the system boot message after I had power cycled the device. If it happens again, which I suspect it will, I will try just unplugging and replugging the WAN connection to see if it recovers, but I'm not convinced that is the issue as I was using the same WAN connection, and if_pppoe, under 25.11 and never had this issue.
Is there anything I can do to get more diagnostics if/when it happens again?
-
Was it still responding at the console?
If so you could grab the ifconfig output and try to ping out from pfSense itself to something on the LAN. Try to determine if anything is still passing any traffic.
-
@stephenw10 No idea about the console. Where the unit is located makes it very hard to maintain a permanent console connection (though I will look into that for the future). The unit has 4 physical interfaces in use; 3x 2.5 Gbit igc and 1x 10 Gbit ix. One of the igc interfaces has two subnets on it via separate VLANs. One of the igc interfaces has a VLAN (911) underlying the PPPoE ISP connection.
When the issue started I tried pinging all of the internal (non WAN) interfaces and none of them were responding to pings. Sadly I wasn't able to try to ping the WAN externally as I had no connectivity and I was in a hurry to resurrect everything.
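A rough way to capture this automatically next time would be a small probe running on a LAN host that pings each firewall interface address and records the moment each one stops (or resumes) answering. The sketch below is only an assumption of how that could look: the function names are hypothetical, it relies on the host's `ping` command accepting `-c 1`, and it uses a subprocess timeout instead of any ping-specific timeout flag because those differ between operating systems.

```python
import subprocess
import time


def ping_once(host, timeout=3.0):
    """Return True if a single ICMP echo to host succeeds within timeout seconds."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def probe(hosts, ping=ping_once):
    """Ping every host once and return {host: reachable}."""
    return {host: ping(host) for host in hosts}


def transitions(previous, current, now=None):
    """Return one timestamped line per host whose state differs from previous."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return [
        f"{stamp} {host} {'UP' if up else 'DOWN'}"
        for host, up in current.items()
        if previous.get(host) != up
    ]


def watch(hosts, interval=10.0, rounds=None, ping=ping_once,
          sleep=time.sleep, emit=print):
    """Probe hosts repeatedly and emit a line whenever a host changes state.

    rounds=None loops forever; a number limits the loop (handy for testing).
    """
    state = {}
    done = 0
    while rounds is None or done < rounds:
        current = probe(hosts, ping)
        for line in transitions(state, current):
            emit(line)
        state = current
        done += 1
        sleep(interval)
    return state
```

Calling something like `watch(["192.168.1.1", "192.168.10.1"])` (made-up addresses, substitute the real interface IPs) would then leave a timestamped record showing whether all interfaces dropped at once or one after another.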
-
Hmm, well I would try to hook up the console to something if you can. That would show you just how unresponsive it is.
-
@stephenw10 I've managed to rig up a permanent USB console connection. Let's see what happens next time...
-
@stephenw10 Just to follow up on this. I may have identified the cause. I integrated my firewall with Home Assistant using the pfSense integration. This uses the pfSense XMLRPC service to query status, statistics etc. for display in Home Assistant. As part of my troubleshooting I disabled this integration in Home Assistant and since then there have been no hangs (so far at least - many days now). That suggests to me that there may be some kind of issue in the handling of the XMLRPC API in 26.03? I am leaving it disabled for now, though I believe it is fairly widely used.
-
Ah interesting. That has caused some problems in the past but I'm not aware of anything in 26.03 specifically.
Without any errors to work with it's impossible to say really. I assume nothing was shown on the console?
-
@stephenw10 Well, at the same time as I set up the permanent console I also disabled the HA pfSense integration. As there haven't been any issues since there is nothing to see on the console (or in the logs). I'm not sure I want to risk enabling the integration again as stability is paramount for me.
-
Hmm, I don't use Home Assistant here so I can't test it. Maybe you can replicate it on a test instance?
-
@stephenw10 The thing is that the issue affects my Netgate device, of which I only have one, not Home Assistant. So using a test HA instance won't really help (if that is what you meant). I don't really have any easy way to spin up a test pfSense+ instance. Personally I will live without the HA integration for now as, while nice, it is not critical for me. Maybe someone else will run into the issue and be able to capture diagnostics etc.
-
I could probably test this. @ChrisJenk, can you give me an idea of where you got the integration (I don't see one in the standard distribution), and how it is configured?
-
@dennypage It's from the Home Assistant Community Store (HACS). Info here:
https://github.com/travisghansen/hass-pfsense
I created a dedicated pfSense user for it to use and gave it the credentials for that user.
There isn't much to configure; it just has the default set of enabled sensors and metrics. I wasn't using any of the control functions.
My suspicion is that over time (takes many days, perhaps weeks) the frequent polling either causes some kind of resource leak (though nothing was obvious; memory and CPU were fine) or some corner case concurrency issue causes some kind of lock up. Each time my Netgate unit locked up it was not even responding to pings, so it was a pretty hard lock up.
Since disabling the integration and blocking its access to pfSense (I locked the user) I haven't had the issue. Not conclusive proof but somewhat suspicious.
-
@ChrisJenk said in Netgate 6100 unstable since upgrade to 26.03:
It just has the default set of enabled sensors and metrics.
Just to be sure I understand, you didn't change (enable/disable) any of the controls or sensors? Everything in Home Assistant is still at defaults?
I have it up and running with defaults against one of my pfSense VMs.
-
Nice. I guess watch for memory leaks somewhere.
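One way to watch for a slow leak is to sample some counter periodically and look at its long-term trend instead of eyeballing graphs. The following is a minimal sketch under stated assumptions: which counter to sample (memory percentage, or anything else) is left open, the function names are hypothetical, and the threshold is an arbitrary example value.

```python
def slope_per_day(samples):
    """Least-squares slope of (unix_seconds, value) samples, in value units per day."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t * 86400  # per second -> per day


def looks_like_leak(samples, threshold_per_day=0.5):
    """True if the fitted trend rises faster than threshold_per_day (arbitrary default)."""
    return slope_per_day(samples) > threshold_per_day
```

A counter that creeps up by a steady amount per day would show a clearly positive slope, while one that just wobbles around a fixed level would come out near zero.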
-
@dennypage I can't recall if I enabled any additional sensors but here is a list of all the pfSense integration entities that were enabled on my system when the problem was occurring.
I'd be surprised if it were a memory leak since one of the things the integration monitors is memory usage and it was always very stable at 10-11%.

-
@ChrisJenk said in Netgate 6100 unstable since upgrade to 26.03:
I can't recall if I enabled any additional sensors but here is a list of all the pfSense integration entities that were enabled on my system when the problem was occurring.
Okay, thanks. Looks like you had all the excess filesystems turned off, but otherwise it looks like default.
One thing, the last entity on the list (the one entitled "Update")... I assume that is a HACS-specific thing? I don't use HACS, so I just want to be sure.