New PPPoE backend, some feedback
-
@stephenw10
Updated to the latest public beta (25.03.b.20250429.1329) but no change in the symptoms or the GUI presentation of the true interface status. I captured system logs and dmesg but found nothing notable compared with the previous ones. It took a few retries to get itself going again.
[25.03-BETA][@Router-7.me]/root: pppcfg pppoe0
dev: igc0 state: session
sid: 0x1552 PADI retries: 5 PADR retries: 0 time: 02:57:46
sppp: phase network authproto auto authname "xxxxxxxxxx@idnet" peerproto auto
dns: 212.69.40.23 212.69.36.23
[25.03-BETA][@Router-7.me]/root:
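For anyone wanting to check the real session state from a script rather than trusting the GUI, here is a minimal Python sketch that pulls the fields out of pasted pppcfg output. The field labels come purely from the sample above; the function name `parse_pppcfg` and the dictionary keys are my own invention, and other builds may print things differently.

```python
import re

# Sample text copied from the pppcfg output above (line breaks are a guess).
SAMPLE = """dev: igc0 state: session
sid: 0x1552 PADI retries: 5 PADR retries: 0 time: 02:57:46
sppp: phase network authproto auto authname "xxxxxxxxxx@idnet" peerproto auto
dns: 212.69.40.23 212.69.36.23"""

# Hypothetical key names mapped to patterns based only on the sample above.
PATTERNS = {
    "dev": r"\bdev: (\S+)",
    "state": r"\bstate: (\S+)",
    "sid": r"\bsid: (0x[0-9a-fA-F]+)",
    "padi_retries": r"PADI retries: (\d+)",
    "padr_retries": r"PADR retries: (\d+)",
    "time": r"\btime: (\d+:\d{2}:\d{2})",
    "phase": r"\bphase (\S+)",
}


def parse_pppcfg(text):
    """Return a dict of the fields found in pasted pppcfg output."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            fields[name] = match.group(1)
    return fields


if __name__ == "__main__":
    info = parse_pppcfg(SAMPLE)
    # Treating "session" + "network" as "really up" is an assumption.
    really_up = info.get("state") == "session" and info.get("phase") == "network"
    print(info, really_up)
```

Treating `state: session` together with `phase network` as "fully up" is an assumption on my part, not something confirmed in this thread.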
-
On 25.03.b.20250429.1329
With the “Do not wait for a RA” box unchecked (my usual config is to have this box checked for reasons long since forgotten but sure to bite me at some point) the PPPoE interface symptoms, when selecting disconnect / reconnect, appear to have gone.
Not really sure why this box makes a difference but I have yet to see pfSense trip over itself since testing it with it unchecked.
I can see the additional PPP logging that has been added (rather than just reflecting when I last used the old PPPoE backend). Not sure what value it has added just yet as the logs are just filled with this:
if_pppoe: pppoe0 (8864) state=3, session=0x16bb output -> f8:13:08:xx:xx:ea, len=92
if_pppoe: pppoe0 (8864) state=3, session=0x16bb output -> f8:13:08:xx:xx:ea, len=347
if_pppoe: pppoe0 (8864) state=3, session=0x16bb output -> f8:13:08:xx:xx:ea, len=424
if_pppoe: pppoe0 (8864) state=3, session=0x16bb output -> f8:13:08:xx:xx:ea, len=37
if_pppoe: pppoe0 (8864) state=3, session=0x16bb output -> f8:13:08:xx:xx:ea, len=64
etc...
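Since the log is mostly these per-packet lines, one way to make it readable is to collapse them into counts. Below is a rough Python sketch that groups the lines by interface, state, session and direction and totals the lengths. The regular expression is built only from the sample lines above, and the field names (`ethertype`, `direction`, `peer`) are my own guesses at what each column means (8864 is presumably the PPPoE session EtherType).

```python
import re
from collections import Counter

# Pattern derived from the sample log lines above; field names are assumptions.
LINE_RE = re.compile(
    r"if_pppoe: (?P<ifname>\S+) \((?P<ethertype>[0-9a-fA-F]+)\) "
    r"state=(?P<state>\d+), session=(?P<session>0x[0-9a-fA-F]+) "
    r"(?P<direction>\w+) -> (?P<peer>\S+), len=(?P<length>\d+)"
)


def summarise(log_text):
    """Map (ifname, state, session, direction) -> (line count, total len)."""
    counts = Counter()
    total_len = Counter()
    for match in LINE_RE.finditer(log_text):
        key = (match["ifname"], match["state"], match["session"], match["direction"])
        counts[key] += 1
        total_len[key] += int(match["length"])
    return {key: (counts[key], total_len[key]) for key in counts}


if __name__ == "__main__":
    sample = """\
if_pppoe: pppoe0 (8864) state=3, session=0x16bb output -> f8:13:08:xx:xx:ea, len=92
if_pppoe: pppoe0 (8864) state=3, session=0x16bb output -> f8:13:08:xx:xx:ea, len=347
if_pppoe: pppoe0 (8864) state=3, session=0x16bb output -> f8:13:08:xx:xx:ea, len=424
if_pppoe: pppoe0 (8864) state=3, session=0x16bb output -> f8:13:08:xx:xx:ea, len=37
if_pppoe: pppoe0 (8864) state=3, session=0x16bb output -> f8:13:08:xx:xx:ea, len=64
"""
    for key, (count, length) in summarise(sample).items():
        print(key, count, length)
```

Lines that do not match the pattern are simply ignored, so anything genuinely interesting (state changes, errors) would still need to be picked out of the raw log separately.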
A further observation, with a layer of the onion now removed, is that additional services (e.g. Avahi, pfBlocker, VPNs etc) spend considerable time, while the PPPoE interface is starting its session, tying pfSense in knots by trying to re-initialise themselves at each and every stage of the PPPoE connection process.
Rather than waiting for the PPPoE interface to be fully up, they clutter up the process with each (very short-lived) up/down, port open/closed or renumbering of the interface. This is somewhat similar to how the GUI seems to think the interface is 'up' for WAN / PPPoE when it is in the middle of restarting the session. It is as if everything expects the PPPoE to be up and running before if_pppoe has signalled that it has completed the task.
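To illustrate what I mean by "waiting for the interface to be fully up", here is a small Python sketch of a settle (hold-down) timer: a service would only be re-initialised once the link has stayed up for a set number of seconds, so short-lived up/down flaps are ignored. This is purely an illustration of the idea; it is not how pfSense or if_pppoe handles events, and the function name, event format and timings are all made up.

```python
def settled_up_times(events, settle, end_time):
    """Return the times at which the link has been up for `settle` seconds.

    events   -- iterable of (timestamp_seconds, is_up) tuples (hypothetical format)
    settle   -- how long the link must stay up before services are restarted
    end_time -- end of the observation window
    """
    fired = []
    pending_since = None  # time of the 'up' event still waiting to settle
    link_up = False
    # A final 'down' sentinel at end_time flushes any pending 'up' event.
    for ts, is_up in sorted(events) + [(end_time, False)]:
        if pending_since is not None and ts - pending_since >= settle:
            fired.append(pending_since + settle)
            pending_since = None
        if is_up and not link_up:
            pending_since = ts      # link just came up, start the timer
        elif not is_up:
            pending_since = None    # link dropped, cancel the timer
        link_up = is_up
    return fired


if __name__ == "__main__":
    # Three quick flaps during session start-up, then the link stays up.
    flaps = [(0, True), (1, False), (2, True), (3, False), (4, True)]
    print(settled_up_times(flaps, settle=5, end_time=60))
```

With the flapping sequence in the example, a service reacting to every 'up' event would restart three times, whereas the settle timer triggers only once, five seconds after the final 'up'.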
ifconfig pppoe0 debug, dmesg -a and system log available on request.
-
With the “Do not wait for a RA” box unchecked (my usual config is to have this box checked for reasons long since forgotten but sure to bite me at some point)
Ok, un-forgotten quite quickly. Whilst leaving the RA box unchecked works for taking the PPPoE interface down and up again, it screws up a full reboot instead.
Without the “Do not wait for a RA” box checked, on a full reboot the interface and the PPPoE appear to be up and running on the GUI but no actual internet traffic is passed for a further 4 or 5 minutes or more.
Start Time:
May 3 17:06:55 kernel ---<<BOOT>>---
May 3 17:06:55 syslogd kernel boot file is /boot/kernel/kernel
May 3 17:05:09 syslogd exiting on signal 15
May 3 17:05:09 reboot 97088 rebooted by root
To this point when pfSense thinks it is ready (and normally where it should be up and running) but cannot reach outside:
May 3 17:07:50 kernel done.
May 3 17:07:48 php-cgi 68067 notify_monitor.php: Could not send the message to xxxxxxx@xxxxxxx.me -- Error: Failed to connect to mail.haveworx.co.uk:587 [SMTP: Failed to connect socket: php_network_getaddresses: getaddrinfo for mail.haveworx.co.uk failed: Name does not resolve (code: -1, response: )]
To this point, where traffic does actually flow:
May 3 17:11:44 php-fpm 44318 /rc.newwanipv6: Resyncing OpenVPN instances for interface WAN.
May 3 17:11:44 check_reload_status 680 Reloading filter
May 3 17:11:35 php_pfb 5699 [pfBlockerNG] filterlog daemon started
May 3 17:11:35 php_pfb 4074 [pfBlockerNG] filterlog daemon started
May 3 17:11:35 php-fpm 44318 /rc.newwanipv6: rc.newwanipv6: on (IP address: 2a02:xxx:feed:xxxx:xxxx:xxxx:xxxx:xx06) (interface: wan) (real interface: pppoe0).
May 3 17:11:35 php-fpm 44318 /rc.newwanipv6: rc.newwanipv6: Info: starting on pppoe0 due to REQUEST.
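To put numbers on the delay, here is a quick Python sketch that works out the gaps between the timestamps quoted above. Syslog stamps carry no year, so the year is an assumption (2025, going by the beta build date), and the helper names are my own.

```python
from datetime import datetime

# Assumption: the logs are from 2025 (syslog timestamps omit the year).
ASSUMED_YEAR = 2025


def parse_stamp(stamp):
    """Parse a syslog-style 'May 3 17:06:55' stamp into a datetime."""
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(start, end):
    """Whole seconds between two syslog-style stamps."""
    return int((parse_stamp(end) - parse_stamp(start)).total_seconds())


if __name__ == "__main__":
    boot = "May 3 17:06:55"           # kernel ---<<BOOT>>---
    ready = "May 3 17:07:50"          # kernel done.
    ipv6_start = "May 3 17:11:35"     # rc.newwanipv6 starting on pppoe0
    filter_reload = "May 3 17:11:44"  # Reloading filter

    print("boot -> ready:", elapsed_seconds(boot, ready), "s")
    print("ready -> rc.newwanipv6:", elapsed_seconds(ready, ipv6_start), "s")
    print("ready -> filter reload:", elapsed_seconds(ready, filter_reload), "s")
    print("boot -> filter reload:", elapsed_seconds(boot, filter_reload), "s")
```

On these stamps that is 55 seconds from boot to "done", then a further 234 seconds (3 min 54 s) before the filter reload, so 289 seconds (4 min 49 s) from boot to traffic flowing.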
So I guess we still have a problem but we can move the problem somewhere else.
-
Hmm, interesting.
I'd expect not to have that checked because the dhcp is set to go over PPPoE. It should only try to pull a lease once the PPPoE is up and the remote server sends an RA over it. But that does depend on how frequently the ISP sends them. One of the other issues we are seeing is with ISPs that send RAs at high frequency, like 10s intervals, and trigger events at each.
But I suspect the difference here is that the old backend only marks the interface up once it's actually connected, whereas if_pppoe is seen as UP as soon as it's created. If dhcp6c doesn't wait for an RA it will immediately try and fail and then... get stuck in some fail-loop!
We are changing that behaviour now so it may be fixed in the next build anyway.
-
@stephenw10 said in New PPPoE backend, some feedback:
Hmm, interesting.
I expect to not have that checked because the dhcp is set to go over PPPoE. It should only try to pull a lease once the PPPoE is up and remote server sends an RA over it.
Looking forward to the changes.
My ISP's RAs are sent reasonably infrequently, so once the PPPoE session is up the client router (pfSense) should send an RS upstream and get the RA straight back. Occasionally an RA is captured first but typically the RA used will be triggered by the RS.
The days of waiting obediently for an RA should be confined to history (well, whenever the replacement RFC came out, which is a number of years ago now). ISPs that deliberately machine-gun out unsolicited RAs should be sent a burning copy of the standards.
-
The 171.diff patch really improves things. New text file with logs, dmesg -a and my remaining comments sent direct.
-
Hmm, I can't replicate that FQDN access issue. How does it fail when you do that?
-
The GUI stalls and I get this:
With the fqdn it's like as soon as the WAN is lost it forgets that local access is still available. Perhaps unbound is restarting and local look-ups are dropped but I'm not really sure of the cause. I'm not using Kea, if that is a factor.
If I use the GUI via the IP address instead and take the PPPoE interface down / up then the GUI stays alive.
-
Hmm, yeah that does seem like it must be an Unbound issue. I guess I'm not seeing it here because I'm not using that box for DNS...
-
Just a reference: There is another thread where a user reports an issue with connecting to the webGUI (503 error) after upgrading from 24.11 to 25.03-BETA and then switching on the if_pppoe module:
-
@patient0
This may be a manifestation of the same bug: https://forum.netgate.com/topic/197119/dns-resolver-exiting-when-loading-pfblocker-25-03-b-20250409-2208. Some ISPs send RA packets too aggressively, and due to a bug, pfSense starts endlessly restarting related services and daemons. On certain hardware, it's even possible that PHP hangs as a result.
-
Yup I'd bet it's that ^. Should be fixed in the next beta build.
-
@w0w said in New PPPoE backend, some feedback:
@patient0
This may be a manifestation of the same bug: Some ISPs send RA packets too aggressively...
Thankfully my ISP is very mild with the RAs (and complies with the standards), so it is very rare for the process to be triggered by an RA; it is almost exclusively an RS from pfSense kicking it all off.
The dns-resolver losing its mind when pfBlocker does its thing would probably explain why the fqdn gets tossed.
Whilst it doesn't solve everything the 171.diff experimental patch has really calmed things down on boot & interface status change. Looking forward to all this being collected in a new beta. This is all looking positive.
-
-
Here's the file to test. This is not the final fix that will be in the build though.
171.diff
-
@stephenw10
Oh yes, I tested an earlier version too, but this one at least works with the latest snapshot.
It looks promising.