new if_pppoe Backend - getting HA/CARP to work like in MPD

w0w

@perrin
This is just switching on maintenance mode on the primary, nothing unusual.

crl

Hi,
I really appreciate the time you put into this. Thanks for sharing.

I have installed the solution. After analyzing the logs, the sequence is clear:

CARP transition detected
Slave starts PPPoE session successfully at first
ISP rejects authentication with "Too many sessions": it refuses a second PPPoE login because the old session from my master pfSense is still alive
Slave keeps retrying but still no luck (I even waited for 2-3 minutes)

So the slave's WAN is never up.

How to fix / work around? Add a GUI option for a startup delay on the slave, so that when the CARP state changes, pfSense waits 20 seconds before starting PPPoE.

MAC spoofing also came to mind, but an ISP can use a variety of signals to track PPPoE sessions:

PPP username/session state (most important)
PPPoE session ID on their BRAS
CPE MAC address / modem association

w0w

@crl
I have experimented with different variants, and I can say that using a delay is not a good solution, as I mentioned earlier, because the firewall status can change during that delay. The logic needs improvement, but I don’t have enough time to work on it right now.
My script version handles this case much better, but it’s slower and not fully synchronized with status changes.

The only approach I see is to avoid breaking the connection immediately when the backup status is detected. Instead, register the status and start a time-based trigger that checks the status again before executing: it quits if the current status has not changed, or proceeds with the action if it has changed, judged against the first registered status. The same applies to the master: monitor it using a time-based trigger synchronized with the first status change, and either quit if the status is unchanged or perform the action and then exit. This sounds simple but it is not, because we also need to ignore status changes after the first change is detected, and start listening again some time after everything has settled. All of this makes me think the logic becomes too complicated, with too much code needed to serve this implementation.
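
The "register, wait, re-check" idea can be sketched roughly as follows. This is only one possible reading of the description, assuming the action is performed only if the CARP state re-read after the delay still matches the state that was first registered. The class name, the action strings and the `read_state` callback are hypothetical and are not taken from any of the scripts discussed in this thread.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PendingChange:
    state: str     # CARP state registered when the first event arrived
    due_at: float  # time at which the state is re-checked


class DebouncedCarpHandler:
    """Register the first state change, then confirm it after a delay."""

    def __init__(self, delay: float, read_state: Callable[[], str]):
        self.delay = delay
        self.read_state = read_state  # hypothetical: returns "MASTER" or "BACKUP"
        self.pending: Optional[PendingChange] = None

    def on_event(self, state: str, now: float) -> None:
        # Further changes are ignored while a check is already scheduled.
        if self.pending is None:
            self.pending = PendingChange(state, now + self.delay)

    def tick(self, now: float) -> Optional[str]:
        """Return the action to perform, or None if nothing should happen."""
        if self.pending is None or now < self.pending.due_at:
            return None
        registered = self.pending.state
        self.pending = None  # re-arm for the next change
        if self.read_state() != registered:
            return None      # state flipped back in the meantime: quit
        return "pppoe_up" if registered == "MASTER" else "pppoe_down"
```

Passing the clock in as `now` keeps the sketch free of real sleeps, so the timing rules can be checked without waiting.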

perrin

@crl said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

ISP rejects authentication with "Too many sessions": it refuses a second PPPoE login because the old session from my master pfSense is still alive
Slave keeps retrying but still no luck (I even waited for 2-3 minutes)

Hi,
The same applies to my ISP. I also get a denied login at first when the slave comes up. In my case, though, the ISP times out the old master session within a few minutes, allowing the slave to connect.

Whenever the master fails "badly", it is unable to end the session cleanly, which always leaves the slave unable to establish a connection for an initial period.

@crl said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

So the slave's WAN is never up.

I did not think about this case when designing the plugin because, as I understand PPPoE, there is an LCP keepalive that makes the ISP time out a stale session after some time. My ISP does that within seconds. Maybe your ISP has set that timeout quite long.

You could try to set the same MAC address on both firewalls for the PPPoE interface and see if that helps. The session definitely is still in a different state but maybe it helps with your ISP.

The most elegant solution, however, would be to synchronize the PPPoE session ID and configuration values (IP addresses, gateways and so forth) between master and slave and have the slave pick up the current session. But that won't work without patching if_pppoe itself, which might be out of scope...

w0w

@perrin
How does your HA pair react if you put the master node into maintenance mode via Status → CARP → Enable Persistent Maintenance Mode (or whatever it’s called)?

perrin

@w0w Enabling Maintenance Mode on the master raises its skew, thus transitioning it from MASTER to BACKUP. pppoe-ha picks up the BACKUP state and disables the interface accordingly.

Since I don't have a problem moving the PPPoE session, in my case the failover works as expected.

Maybe @crl should try that and see

a) if if_pppoe correctly closes the session on the master prior to disabling the interface and
b) if his backup can correctly establish a new PPPoE session

crl

Please check this workaround:
Github Issue - ISP side 'Too many sessions' keeping backup pfsense's WAN down

It solves only one use case:
- OK: entering and leaving CARP maintenance mode on manual trigger

- Solution requested: if a WAN cable is pulled (between the WAN switch and either pfSense device) or a pfSense machine goes down, perform the MASTER --> BACKUP transition and connect PPPoE on the BACKUP. Should the MASTER come back, it shall take back the MASTER role and reconnect PPPoE on the MASTER.

crl

I tried to summarize what is going on during the switchover experiments. This is one example.

w0w

@crl
This 2:20 looks familiar to me...
@crl, @perrin do you both have dual stack pppoe?

perrin

@w0w said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

@crl, @perrin do you both have dual stack pppoe?

In my case yes, dual stack v4 and v6.

@crl said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

I tried to summarize what is going on during the switchover experiments. This is one example.

Some of these issues might be related to configuration and/or the default behavior of pfSense (e.g. when PPPoE fails and you're expecting a CARP switch).
Do these things work as expected when you are using the old time-based scripts?

w0w

@perrin

Yes, in my setup things work somewhat differently, as you noticed. There are at least a few reasons. Most importantly, every time PPPoE comes up, the VIPs get reconfigured and CARP reinitializes. I suspect this behavior is related to IPv6 and the fact that the LAN uses the Track Interface option to obtain its IPv6 address, but I’m not certain. I’m currently trying to track down the root cause—or perhaps it’s an “incompatible” configuration.

How does this behave on your side? As I understand it, bringing up PPPoE does not trigger VIP reconfiguration/CARP initialization for you, right?

perrin

@w0w said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

@perrin

How does this behave on your side? As I understand it, bringing up PPPoE does not trigger VIP reconfiguration/CARP initialization for you, right?

No, with my config no VIP reconfig takes place when PPPoE comes up. In my case PPPoE is running in a vlan from the provider side and I've added the carp VIP on the "physical" interface, so without a vlan tag. This only triggers when a firewall goes down or the interface goes down, which in my case is exactly what I am expecting it to do.

In my case I am running two Proxmox hosts, each running a virtual pfSense, one being master and one being slave.
The most common reason I need failover to happen is when we are rebooting one of the Proxmox hosts due to software upgrades. In this case the master pfSense would be shut down cleanly and the slave takes over all interfaces with the PPPoE being one of them.

w0w

@perrin said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

In my case I am running two Proxmox hosts, each running a virtual pfSense, one being master and one being slave.

I am running the same configuration. Looks like I have found something related to this VIP reconfiguration issue. I will do some tests and report back if I find anything else.

w0w

I've experimented a lot with the code; here is what I did to make it work with my “buggy” config: pppoe_ha_event.php.

The biggest difference is that we shouldn’t run pfSctl -c 'interface reload <friendly>' (e.g., wan) if the PPPoE interface already exists. We only do that if, for some reason, the interface doesn’t exist. The shell script does the same, by the way.
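
A minimal sketch of that bring-up choice, assuming the caller already knows whether the PPPoE interface exists. The function name and the example interface names (`pppoe0`, `wan`) are illustrative; only the two commands themselves come from the description above.

```python
from typing import List


def master_bringup_command(pppoe_exists: bool, real_if: str, friendly: str) -> List[str]:
    """Pick the command used to bring PPPoE up when this node becomes MASTER."""
    if pppoe_exists:
        # Interface is already there: just raise it, no full reload.
        return ["ifconfig", real_if, "up"]
    # Fallback: let pfSense recreate the interface via a reload.
    return ["pfSctl", "-c", f"interface reload {friendly}"]
```

Returning an argument list rather than a shell string means the result could be handed to `subprocess.run` directly, without quoting concerns.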
Changes:

MASTER bring-up path updated: on MASTER we now first try ifconfig <real pppoeX> up if the PPPoE interface already exists; if it doesn’t, we fall back to pfSctl -c 'interface reload <friendly>' (e.g., wan). (Original only triggered the pfSctl reload path.)
CARP event suppression window: after switching to MASTER, the script temporarily ignores further CARP events (~60 seconds total in two 30s steps) to prevent flapping during stabilization.
Staged targeted reconciles: after ~30s (still MASTER) run a focused reconcile; after another ~30s run a safety reconcile. These checks act only if state truly differs (see next point).
Smarter reconcile rules: if MASTER and PPPoE already has a valid IPv4 P2P or global IPv6 address, do nothing; if BACKUP, ensure the real PPPoE iface is down.
BACKUP/INIT handling refined: on BACKUP/INIT we bring the real PPPoE interface down. On INIT we first re-read the actual CARP state and only bring the real PPPoE iface down if the current state is truly BACKUP. In practice the script ignores the INIT state; only BACKUP brings pppoeX down.
Quiet periodic health check: every 5 minutes, perform a low-noise reconcile (skipped during the suppression window) to keep the state honest in case an event was missed for some reason. This feature is currently broken, and I don't think it is needed anyway.
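
The reconcile rules from the list above could be expressed as a small decision function. This is a sketch, not the code in pppoe_ha_event.php: the function name and return values are hypothetical, and the `bring_up` branch for a MASTER without an address is an assumption, since the list only states when nothing is done.

```python
def reconcile_action(carp_state: str, has_ipv4_p2p: bool,
                     has_global_ipv6: bool, iface_up: bool) -> str:
    """Decide what a reconcile pass should do; act only if state truly differs."""
    if carp_state == "MASTER":
        if has_ipv4_p2p or has_global_ipv6:
            return "none"      # already connected, leave it alone
        return "bring_up"      # assumption: MASTER without an address reconnects
    if carp_state == "BACKUP":
        # Ensure the real PPPoE interface is down on the backup node.
        return "bring_down" if iface_up else "none"
    return "none"              # INIT or unknown: do nothing
```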

@perrin
I apologize for the possibly clunky AI-assisted code changes—I hope it works for you too. For now it’s been running quite stably on my side. Failover is instant and stable. Thank you for bringing it to life in a more acceptable form than what I had.

zjamali

@w0w Can these changes be merged into the original git repo so I can test them?

w0w

@zjamali

Go to Diagnostics → Edit File and select
/usr/local/sbin/pppoe_ha_event.php
You can just replace the content of the file with the one stored in the archive.

perrin

Thanks for updating the script and testing.

@w0w said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

MASTER bring-up path updated: on MASTER we now first try ifconfig <real pppoeX> up if the PPPoE interface already exists; if it doesn’t, we fall back to pfSctl -c 'interface reload <friendly>' (e.g., wan). (Original only triggered the pfSctl reload path.)

Does that work in your case? In my tests, doing an ifconfig xxx up did not connect the interface. Can you confirm that ifconfig up is sufficient in your case?

@w0w said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

BACKUP/INIT handling refined: on BACKUP/INIT we bring the real PPPoE interface down. On INIT we first re-read the actual CARP state and only bring the real PPPoE iface down if the current state is truly BACKUP. In practice the script ignores the INIT state; only BACKUP brings pppoeX down.

I remember that ignoring INIT state caused a problem which leads to both firewalls trying to connect to PPPoE, that is why I handled INIT in the same way as BACKUP to prevent an unclear state.
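
The two INIT policies under discussion can be put side by side in a short sketch. The enum and function names are hypothetical; the mapping only restates what is described in this thread (INIT handled like BACKUP versus INIT ignored).

```python
from enum import Enum


class InitPolicy(Enum):
    AS_BACKUP = "as_backup"  # INIT handled the same way as BACKUP
    IGNORE = "ignore"        # INIT ignored, only BACKUP brings pppoeX down


def action_for_carp_event(state: str, policy: InitPolicy) -> str:
    """Map a CARP state event to an action on the PPPoE interface."""
    if state == "MASTER":
        return "up"
    if state == "BACKUP":
        return "down"
    if state == "INIT":
        return "down" if policy is InitPolicy.AS_BACKUP else "none"
    raise ValueError(f"unexpected CARP state: {state}")
```

With `AS_BACKUP`, a node passing through INIT never holds PPPoE up, which avoids both firewalls dialing at once; with `IGNORE`, a brief INIT before BACKUP does not trigger an extra interface-down.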

@w0w said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

Smarter reconcile rules: if MASTER and PPPoE already has a valid IPv4 P2P or global IPv6 address, do nothing; if BACKUP, ensure the real PPPoE iface is down.

That is already the current functionality of the function get_pppoe_status

@w0w said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

Staged targeted reconciles: after ~30s (still MASTER) run a focused reconcile; after another ~30s run a safety reconcile. These checks act only if state truly differs (see next point).

I'd really love to get around any time-delay-based method. Time delays are never accurate under all circumstances and can cause issues with different configurations. The way it is implemented is quite stable, using a file for syncing the script calls, but it would be much cleaner if we could avoid running background tasks in case of failover. I'd like to handle the devd events as purely as pfSense itself does internally, with pure pfSctl calls.

Can we try to understand why the time delay in your configuration is the better approach as compared to the pure event based approach?

w0w

@perrin said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

does that work in your case?

At first it didn’t work because the status changed from INIT to BACKUP within just a few milliseconds, and every time the PHP script followed it and brought pppoeX down.

It seems this caused if_pppoe or some pfSense code to get stuck in an unknown state; sometimes I even noticed that the IPv6 address remained on the interface.

Now it is working just fine, reconnecting in just seconds.

@perrin said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

I remember that ignoring INIT state caused a problem which leads to both firewalls trying to connect to PPPoE, that is why I handled INIT in the same way as BACKUP to prevent an unclear state.

It seems this never happened to me, but maybe I need to run more tests.

@perrin said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

That is already the current functionality of the function get_pppoe_status

Yep, possible that this part is unnecessary or AI just listed one of my earliest changes for some reason. Will check it later.

@perrin said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

I'd really love to get around any time-delay-based method. Time delays are never accurate under all circumstances and can cause issues with different configurations. The way it is implemented is quite stable, using a file for syncing the script calls, but it would be much cleaner if we could avoid running background tasks in case of failover. I'd like to handle the devd events as purely as pfSense itself does internally, with pure pfSctl calls.

Can we try to understand why the time delay in your configuration is the better approach as compared to the pure event based approach?

I think this is an incorrect description of what actually happens. The script still handles devd events as before. However, when a master event occurs, it brings pppoex up without delay. Then, it ignores devd events for 30 seconds to give the system some time to stabilize, and afterwards checks the status.

If any devd events were missed during this time, we simply repeat the reconciliation process and again ignore events for 30 seconds to allow the system to stabilize. After that, the script continues listening for events.

This logic can definitely be improved.

In my case, I can't just listen to events continuously, because after connecting to the ISP, I receive a backup status for a very short time. This causes the firewall to enter a continuous loop of connecting and disconnecting.
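
The suppression window described in this post might look roughly like the sketch below. It assumes a 30-second window opened by a handled MASTER event, and a re-check at the end of the window that opens one more window if events were dropped. The class and method names are hypothetical and the real script may differ.

```python
class SuppressionGate:
    """Drop CARP events for a while after MASTER so the system can settle."""

    def __init__(self, window: float = 30.0):
        self.window = window
        self.until = float("-inf")  # no window active initially
        self.missed = False

    def on_event(self, state: str, now: float) -> bool:
        """Return True if the event should be handled right away."""
        if now < self.until:
            self.missed = True      # remember that something was dropped
            return False
        if state == "MASTER":
            self.until = now + self.window
            self.missed = False
        return True

    def on_window_end(self, now: float) -> bool:
        """Call when the window elapses; True means reconcile and wait again."""
        if now < self.until or not self.missed:
            return False
        self.missed = False
        self.until = now + self.window
        return True
```

The short-lived BACKUP status that arrives right after the ISP connection falls inside the window and is dropped, which is what breaks the connect/disconnect loop.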

w0w

https://github.com/woffko/pfSense-pppoe-ha/blob/main/pfSense-pkg-pppoe-ha/stage/usr/local/sbin/pppoe_ha_event.php

Slightly improved code and logic.

w0w

New package for 25.11 is ready for testing.

 pkg add -f 'https://raw.githubusercontent.com/woffko/pfSense-pppoe-ha/refs/heads/main/pfSense-pkg-pppoe-ha-0.1.3.pkg'