Netgate Discussion Forum

    new if_pppoe Backend - getting HA/CARP to work like in MPD

    Development
    • w0w

      Tested, no luck. WAN stays Pending and only acquires IPv4; IPv6 never comes up. It seems that ifconfig pppoe0 up alone is not enough; you need to run something like:
      /usr/local/sbin/pfSctl -c 'interface reload wan'
      to bring it up correctly.

      Here’s my script’s logic; see “Bringing PPPoE up” and “Verifying PPPoE” for how the link is brought up and checked afterwards.

      • Purpose & scope

        • A CARP-aware PPPoE watchdog for pfSense: it tracks the node’s CARP role (MASTER/BACKUP) and reacts by starting/validating or stopping PPPoE, and (on BACKUP) restoring missing VIPs (VHID 5).
        • All syslog lines use a uniform prefix PPPoE_script_message00X: with ascending IDs.
      • Tunables / constants

        • LOCKFILE=/var/run/run.sh.lock — stores PID (line 1) and last known CARP role (line 2).
        • PPPOE_IF=pppoe0, LAN_IF=lagg2.
        • Role detection is by grepping ifconfig ${LAN_IF} for MASTER vhid 5; VIP presence check greps for vhid 5.
        • PHP_BIN=/usr/local/bin/php.
        • Internal flag PPPOE_ALREADY_STARTED tracks whether the script has (re)started PPPoE in this run.
      • Singleton guard

        • On start, check_already_running() reads the lockfile; if the recorded PID is alive (ps -p), the script exits to avoid multiple instances.
      • Optional discovery

        • find_pppoe_info() grabs the first pppoeN interface and its IPv4 address (kept for parity with older versions; not strictly required elsewhere).
      • Main loop (role monitor)

        • start_monitoring():

          • Logs launch (001) and initializes CUR_ROLE from the lockfile if present.

          • Every 30 seconds:

            • Derives NEW_ROLE from ifconfig ${LAN_IF} (MASTER vhid 5 → MASTER; otherwise BACKUP).

            • If role unchanged → continue silently (no log spam).

            • If role changed:

              • On MASTER: call handle_master_carp(); log 002.
              • On BACKUP: call handle_non_master_carp(); log 003.
            • Update lockfile with current PID and the new role.

      • MASTER path (handle_master_carp)

        • If PPPoE hasn’t been (re)started in this run, call handle_pppoe_start().
        • Otherwise, log/verify link (004) via check_pppoe().
      • BACKUP path (handle_non_master_carp)

        1. Shut PPPoE down if any pppoeN exists:

          • Wait 10s, ifconfig ${PPPOE_IF} down, set PPPOE_ALREADY_STARTED=false, log 005.
        2. Ensure VIPs (VHID 5) exist on LAN_IF:

          • If vhid 5 is missing, log 006 and re-install VIPs by running:

            • php -r 'require_once "/etc/inc/interfaces.inc"; interfaces_vips_configure();'
      • Bringing PPPoE up (handle_pppoe_start)

        • Wait 130s to let CARP converge.

        • If no pppoeN exists:

          • Log 007, run pfSctl -c 'interface reload wan', set PPPOE_ALREADY_STARTED=true.
        • If pppoe0 exists and is UP:

          • Log 008, do nothing.
        • If it exists but is not UP:

          • ifconfig ${PPPOE_IF} up, log 009.
      • Verifying PPPoE (check_pppoe)

        • Wait 180s (grace period).

        • If no pppoeN is present:

          • Log 010, try pfSctl -c 'interface reload wan'.
          • On success: set PPPOE_ALREADY_STARTED=true; on failure log 011 and return error.
      • Logging policy

        • Routine role polls are quiet; logs emit only when the CARP role flips, plus the specific action logs (004–011) triggered by that transition.
      • Entry points

        • start: runs singleton check, optional discovery, then the monitoring loop.
        • stop: placeholder (prints a message; no teardown).
        • Otherwise: prints usage and exits non-zero.
      • Operational timings (summary)

        • Poll interval: 30s.
        • MASTER bring-up grace: 130s (CARP settle).
        • PPPoE verification grace: 180s.
        • BACKUP PPPoE down delay: 10s.
      • Key side effects

        • Keeps a persistent record of PID + last role in the lockfile.
        • Ensures PPPoE is up and stable on MASTER, down on BACKUP.
        • Auto-repairs missing VIPs (VHID 5) on BACKUP via pfSense PHP API.
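
The role detection and singleton guard from the outline above could look roughly like this. It is a minimal sketch, not the actual script: the function bodies are assumptions, role_from_ifconfig, already_running, write_lock and last_role are hypothetical names, and kill -0 is swapped in for the ps -p check mentioned above.

```shell
#!/bin/sh
# Sketch only: constants follow the outline, function bodies are assumptions.
LOCKFILE="${LOCKFILE:-/var/run/run.sh.lock}"
LAN_IF="${LAN_IF:-lagg2}"
VHID=5

# Print MASTER if the given ifconfig output contains "MASTER vhid 5", else BACKUP.
# The text is passed as an argument so the parser can be tried without a real
# interface. The trailing space keeps "vhid 5" from matching "vhid 50".
role_from_ifconfig() {
    if printf '%s\n' "$1" | grep -q "MASTER vhid ${VHID} "; then
        echo MASTER
    else
        echo BACKUP
    fi
}

# Role of this node as seen on the LAN interface.
current_role() {
    role_from_ifconfig "$(ifconfig "$LAN_IF" 2>/dev/null)"
}

# Return 0 if line 1 of the lockfile holds the PID of a live process.
already_running() {
    [ -r "$LOCKFILE" ] || return 1
    pid=$(sed -n '1p' "$LOCKFILE")
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat the lock as stale
    esac
    kill -0 "$pid" 2>/dev/null     # signal 0 only tests whether the PID exists
}

# Line 1: our PID; line 2: last known CARP role.
write_lock() {
    printf '%s\n%s\n' "$$" "$1" > "$LOCKFILE"
}

# Print the role stored on line 2 of the lockfile, if there is one.
last_role() {
    [ -r "$LOCKFILE" ] && sed -n '2p' "$LOCKFILE"
}
```

Keeping the parser separate from the ifconfig call makes it easy to feed it captured output when debugging a role that is detected wrongly.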

      • perrin @w0w

        @w0w said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

        /usr/local/sbin/pfSctl -c 'interface reload wan'

        this is exactly what the script is doing. See GitHub.

        Please note that the IPv6 issue has nothing to do with my script; it seems to be related to the general if_pppoe troubles.

        • w0w @perrin

          @perrin said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

          this is exactly what the script is doing. See GitHub

          👍

          But there is something else, I don't know… safety timer, maybe. My script almost never fails to get the thing up and running, at least now. I will test it more and will give you more feedback this week, I hope.

          • perrin @w0w

            @w0w said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

            But there is something else, I don't know… safety timer, maybe.

            That might be because your script runs every 30 seconds, so there is a random delay between the CARP state change and the time your script runs.
            In my case the script runs immediately with the CARP change.

            To test the behavior, you could add a sleep to /usr/local/sbin/pppoe_ha_event (the shell script wrapper, not the PHP script) before the exec line, e.g.:

            #!/bin/sh
            # Test only: delay the handler by 5 seconds after each CARP event.
            sleep 5
            exec /usr/local/bin/php -q /usr/local/sbin/pppoe_ha_event.php "$@"

            Let me know if that changes the behavior on your firewall.

            • w0w @perrin

              @perrin said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

              In my case the script runs immediately with the CARP change.

              And that’s a real problem. CARP can flap on the interface several times per second, so you end up running multiple commands and hit a race condition. You need to detect CARP state changes, but without reacting to every single one: check the state a bit later, and run your commands only once you see that the final state has really changed. So I don't think the sleep in the event wrapper can solve this problem; each event just waits in the queue and still runs.

              2025-09-21 14:18:27.478723+03:00 	check_reload_status 	656 	Configuring interface wan
              2025-09-21 14:18:27.473171+03:00 	pppoe-ha 	37352 	VHID 5 MASTER - UP wan (pppoe0)
              2025-09-21 14:18:27.472332+03:00 	pppoe-ha 	37352 	Handle CARP command for 5@lagg2 - MASTER
              2025-09-21 14:18:27.451908+03:00 	php-fpm 	3406 	/rc.carpbackup: HA cluster member "(10.0.90.5@lagg2): (LAN)" has resumed CARP state "BACKUP" for vhid 10
              2025-09-21 14:18:27.132013+03:00 	check_reload_status 	656 	Carp master event
              2025-09-21 14:18:27.110669+03:00 	pppoe-ha 	28694 	no mappings for VHID 10; ignoring
              2025-09-21 14:18:27.110296+03:00 	pppoe-ha 	28694 	Handle CARP command for 10@lagg2 - BACKUP
              2025-09-21 14:18:27.073719+03:00 	php-fpm 	62482 	/rc.carpbackup: HA cluster member "(10.0.87.5@ixl3.87): (WIFIAP)" has resumed CARP state "BACKUP" for vhid 9
              2025-09-21 14:18:27.073279+03:00 	php-fpm 	62482 	/rc.carpbackup: Suppressing repeat e-mail notification message.
              2025-09-21 14:18:26.804464+03:00 	check_reload_status 	656 	Carp backup event
              2025-09-21 14:18:26.779787+03:00 	pppoe-ha 	17333 	no mappings for VHID 10; ignoring
              2025-09-21 14:18:26.779198+03:00 	pppoe-ha 	17333 	Handle CARP command for 10@lagg2 - INIT
              2025-09-21 14:18:26.769846+03:00 	php-fpm 	3406 	/rc.carpbackup: HA cluster member "(10.0.87.5@ixl3.87): (WIFIAP)" has resumed CARP state "BACKUP" for vhid 9
              2025-09-21 14:18:26.470904+03:00 	php-fpm 	51124 	/rc.carpbackup: HA cluster member "(10.0.100.155@lagg0): (WAN2)" has resumed CARP state "BACKUP" for vhid 7
              2025-09-21 14:18:26.360247+03:00 	check_reload_status 	656 	Carp backup event
              2025-09-21 14:18:26.332786+03:00 	pppoe-ha 	5832 	no mappings for VHID 9; ignoring
              2025-09-21 14:18:26.332425+03:00 	pppoe-ha 	5832 	Handle CARP command for 9@ixl3.87 - BACKUP
              2025-09-21 14:18:26.150680+03:00 	php-fpm 	3406 	/rc.carpbackup: HA cluster member "(10.0.100.155@lagg0): (WAN2)" has resumed CARP state "BACKUP" for vhid 7
              2025-09-21 14:18:25.999747+03:00 	kernel 	- 	carp: 10@lagg2: BACKUP -> MASTER (preempting a slower master)
              2025-09-21 14:18:25.999658+03:00 	kernel 	- 	carp: 7@lagg0: BACKUP -> MASTER (preempting a slower master)
              2025-09-21 14:18:25.999518+03:00 	kernel 	- 	carp: 5@lagg2: BACKUP -> MASTER (preempting a slower master)
              2025-09-21 14:18:25.968532+03:00 	check_reload_status 	656 	Carp backup event
              2025-09-21 14:18:25.951088+03:00 	pppoe-ha 	99836 	no mappings for VHID 9; ignoring
              2025-09-21 14:18:25.950701+03:00 	pppoe-ha 	99836 	Handle CARP command for 9@ixl3.87 - INIT
              2025-09-21 14:18:25.857771+03:00 	php-fpm 	3406 	/rc.carpbackup: HA cluster member "(10.0.77.5@lagg2): (LAN)" has resumed CARP state "BACKUP" for vhid 5
              2025-09-21 14:18:25.857471+03:00 	php-fpm 	3406 	/rc.carpbackup: Suppressing repeat e-mail notification message.
              2025-09-21 14:18:25.701909+03:00 	php-fpm 	7887 	/rc.filter_synchronize: Beginning XMLRPC sync data to https://10.0.88.2:443/xmlrpc.php.
              2025-09-21 14:18:25.701214+03:00 	php-fpm 	7887 	/rc.filter_synchronize: XMLRPC versioncheck: 24.1 -- 24.1
              2025-09-21 14:18:25.701112+03:00 	php-fpm 	7887 	/rc.filter_synchronize: XMLRPC reload data success with https://10.0.88.2:443/xmlrpc.php (pfsense.host_firmware_version).
              2025-09-21 14:18:25.667673+03:00 	check_reload_status 	656 	Carp backup event
              2025-09-21 14:18:25.650236+03:00 	pppoe-ha 	93377 	no mappings for VHID 7; ignoring
              2025-09-21 14:18:25.649873+03:00 	pppoe-ha 	93377 	Handle CARP command for 7@lagg0 - BACKUP
              2025-09-21 14:18:25.545307+03:00 	php-fpm 	16208 	/rc.carpbackup: HA cluster member "(10.0.77.5@lagg2): (LAN)" has resumed CARP state "BACKUP" for vhid 5
              2025-09-21 14:18:25.541200+03:00 	php-fpm 	7887 	/rc.filter_synchronize: Beginning XMLRPC sync data to https://10.0.88.2:443/xmlrpc.php.
              2025-09-21 14:18:25.368345+03:00 	check_reload_status 	656 	Carp backup event
              2025-09-21 14:18:25.351847+03:00 	pppoe-ha 	92360 	no mappings for VHID 7; ignoring
              2025-09-21 14:18:25.351495+03:00 	pppoe-ha 	92360 	Handle CARP command for 7@lagg0 - INIT
              2025-09-21 14:18:25.076109+03:00 	check_reload_status 	656 	Carp backup event
              2025-09-21 14:18:25.046639+03:00 	pppoe-ha 	89037 	VHID 5 BACKUP - DOWN wan (pppoe0)
              2025-09-21 14:18:25.045844+03:00 	pppoe-ha 	89037 	Handle CARP command for 5@lagg2 - BACKUP
              2025-09-21 14:18:24.772146+03:00 	check_reload_status 	656 	Carp backup event
              2025-09-21 14:18:24.743736+03:00 	pppoe-ha 	86518 	VHID 5 INIT - DOWN wan (pppoe0)
              2025-09-21 14:18:24.742953+03:00 	pppoe-ha 	86518 	Handle CARP command for 5@lagg2 - INIT
              2025-09-21 14:18:24.530618+03:00 	kernel 	- 	carp: 10@lagg2: INIT -> BACKUP (initialization complete)
              2025-09-21 14:18:24.530574+03:00 	kernel 	- 	carp: 10@lagg2: BACKUP -> INIT (hardware interface up)
              2025-09-21 14:18:24.530524+03:00 	kernel 	- 	carp: 9@ixl3.87: INIT -> BACKUP (initialization complete)
              2025-09-21 14:18:24.530480+03:00 	kernel 	- 	carp: 9@ixl3.87: BACKUP -> INIT (hardware interface up)
              2025-09-21 14:18:24.530434+03:00 	kernel 	- 	carp: 7@lagg0: INIT -> BACKUP (initialization complete)
              2025-09-21 14:18:24.530364+03:00 	kernel 	- 	carp: 7@lagg0: BACKUP -> INIT (hardware interface up)
              2025-09-21 14:18:24.530296+03:00 	kernel 	- 	carp: 5@lagg2: INIT -> BACKUP (initialization complete)
              2025-09-21 14:18:24.530178+03:00 	kernel 	- 	carp: 5@lagg2: BACKUP -> INIT (hardware interface up)
              2025-09-21 14:18:24.463858+03:00 	check_reload_status 	656 	Carp backup event
              2025-09-21 14:18:24.450717+03:00 	check_reload_status 	656 	Syncing firewall
              2025-09-21 14:18:24.322569+03:00 	php-fpm 	51124 	/status_carp.php: Configuration Change: admin@10.0.77.3 (Local Database): Leave CARP maintenance mode
              2025-09-21 14:17:52.159134+03:00 	pppoe-ha 	89953 	VHID 5 BACKUP - DOWN wan (pppoe0)
              2025-09-21 14:17:52.121279+03:00 	pppoe-ha 	89953 	Reconcile: evaluating 1 mapping(s)
              2025-09-21 14:12:34.906579+03:00 	php-fpm 	18188 	/rc.filter_synchronize: XMLRPC reload data success with https://10.0.88.2:443/xmlrpc.php (pfsense.restore_config_section).
              2025-09-21 14:12:31.710678+03:00 	php-fpm 	18188 	/rc.filter_synchronize: Beginning XMLRPC sync data to https://10.0.88.2:443/xmlrpc.php.
              2025-09-21 14:12:31.710120+03:00 	php-fpm 	18188 	/rc.filter_synchronize: XMLRPC versioncheck: 24.1 -- 24.1
              2025-09-21 14:12:31.710030+03:00 	php-fpm 	18188 	/rc.filter_synchronize: XMLRPC reload data success with https://10.0.88.2:443/xmlrpc.php (pfsense.host_firmware_version).
              2025-09-21 14:12:31.559469+03:00 	php-fpm 	18188 	/rc.filter_synchronize: Beginning XMLRPC sync data to https://10.0.88.2:443/xmlrpc.php.
              2025-09-21 14:12:30.477082+03:00 	check_reload_status 	656 	Syncing firewall 
              
              

              Edit:

              I think the following logic would work for me and for anyone else:

              1. On the first CARP event for a (VHID@iface), start a Collect window of 5-10 s.
              2. During the Collect window, just record each new role (MASTER/BACKUP), always keeping only the latest.
              3. After Collect ends, start a Silence window of 5-10 s.
                • If any new event arrives in this window, restart: go back to step 1 (a new Collect window).
                • If no events arrive for the whole window, consider the state settled.
              4. Act once on the last recorded role:
                • If last = MASTER → bring PPPoE up (only if not already up).
                • If last = BACKUP → bring PPPoE down (only if not already down).
              5. Record the applied role to avoid repeating the same action later.

              Safety add-ons (still simple):
                • Boot grace: skip everything for the first ~150 s after boot.
                • Demotion guard: if net.inet.carp.demotion > 0, postpone the action and re-check later.
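
To make the Collect/Silence idea concrete, here is a rough sketch that replays a list of timestamped events and reports when the state would count as settled. Everything in it is an assumption for illustration: settle_events, apply_role and demotion_active are hypothetical names, the echo lines stand in for the real PPPoE commands, and the window lengths are passed in as parameters.

```shell
#!/bin/sh
# settle_events COLLECT SILENCE
# Reads "epoch_seconds ROLE" lines (ascending time) on stdin and prints one
# "settle_time ROLE" line for every point at which the state counts as settled.
settle_events() {
    collect=$1
    silence=$2
    start=''
    last=''
    while read -r ts role; do
        if [ -z "$start" ]; then
            start=$ts                                    # first event opens Collect
        elif [ "$ts" -ge $((start + collect + silence)) ]; then
            echo "$((start + collect + silence)) $last"  # earlier burst had settled
            start=$ts                                    # this event opens a new Collect
        elif [ "$ts" -ge $((start + collect)) ]; then
            start=$ts                                    # event during Silence: restart
        fi
        last=$role                                       # keep only the latest role
    done
    if [ -n "$start" ]; then
        echo "$((start + collect + silence)) $last"
    fi
}

APPLIED_ROLE=''

# Act once per settled role; repeated or unknown roles are ignored.
apply_role() {
    [ "$1" = "$APPLIED_ROLE" ] && return 0
    case "$1" in
        MASTER) echo "bring PPPoE up" ;;      # placeholder for the real bring-up
        BACKUP) echo "bring PPPoE down" ;;    # placeholder for the real teardown
        *) return 0 ;;                        # INIT and anything else: do nothing
    esac
    APPLIED_ROLE=$1
}

# Demotion guard: true when the counter is above 0. The value can be passed in
# for testing; otherwise it is read from the sysctl named in the post.
demotion_active() {
    [ "${1:-$(sysctl -n net.inet.carp.demotion 2>/dev/null || echo 0)}" -gt 0 ]
}
```

With 5 s windows, the VHID 5 burst in the log above (INIT at :24, BACKUP at :25, MASTER at :27) collapses into a single MASTER action at :34 instead of three separate reactions.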

              • perrin @w0w

                @w0w said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

                And that’s a real problem. CARP can flip/flap on the interface several times per second

                I don't think this should happen. Normally CARP should be very stable in a working environment. On all the firewalls I manage, CARP interfaces never flap without a reason, and the only reason is a failure of some network device in between the firewalls or of one of the firewalls itself.

                From the log you sent, with an event each second, there seems to be something wrong with your config. On my firewalls I don't see a single CARP event in days or weeks.

                • w0w @perrin

                  @perrin
                  This is just switching on maintenance mode on the primary, nothing unusual.

                  • crl

                    Hi,
                    I really appreciate the time you put into this. Thanks for sharing.

                    I have installed the solution. Analyzing the logs shows the following:

                    • CARP transition detected.
                    • Slave starts the PPPoE session successfully at first.
                    • ISP rejects authentication with "Too many sessions": it refuses a second PPPoE login because the old session from my master pfSense is still alive.
                    • Slave keeps retrying, but with no luck (I even waited for 2-3 minutes).

                    So the slave's WAN is never up.

                    How to fix or work around this? One idea: add a GUI option for a startup delay on the slave, so that when the CARP state changes, pfSense waits 20 seconds before starting PPPoE.
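
A startup delay of this kind might be sketched as below. This is only an illustration under assumptions: start_after_delay and its two hooks are hypothetical names, nothing like it exists in the plugin, and the retry count and interval in the usage line are made up. The guard hook re-checks the CARP role before every attempt, so a node that lost MASTER during the wait does not dial.

```shell
#!/bin/sh
# start_after_delay DELAY ATTEMPTS INTERVAL GUARD START
# Waits DELAY seconds, then calls START up to ATTEMPTS times, INTERVAL seconds
# apart. GUARD runs before every attempt.
# Returns 0 on success, 1 if every attempt failed, 2 if GUARD refused.
start_after_delay() {
    delay=$1
    attempts=$2
    interval=$3
    guard=$4
    start=$5
    sleep "$delay"
    n=1
    while [ "$n" -le "$attempts" ]; do
        "$guard" || return 2            # role changed during the wait: give up
        "$start" && return 0            # session came up: done
        n=$((n + 1))
        [ "$n" -le "$attempts" ] && sleep "$interval"
    done
    return 1
}

# Hypothetical hooks. The start hook must return 0 only once the session is
# really up; how to verify that is left open here.
is_master()   { ifconfig lagg2 | grep -q 'MASTER vhid 5 '; }
pppoe_start() { /usr/local/sbin/pfSctl -c 'interface reload wan'; }

# Example (assumed values): wait 20 s, then try up to 6 times, 30 s apart.
# start_after_delay 20 6 30 is_master pppoe_start
```

Retrying after the initial wait matters when the ISP needs longer than the delay to drop the stale session, since a single delayed attempt would still fail.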

                    MAC spoofing also came to mind, but an ISP can use a variety of signals to track PPPoE sessions:

                    • PPP username/session state (most important)
                    • PPPoE session ID on their BRAS
                    • CPE MAC address / modem association
                    • w0w @crl

                      @crl
                      I have experimented with different variants, and I can say that using a delay is not a good solution, as I mentioned earlier, because the firewall status can change during that delay. The logic needs improvement, but I don’t have enough time to work on it right now.
                      My script version handles this case much better, but it’s slower and not fully synchronized with status changes.

                      The only approach I see is to avoid breaking the connection immediately when the BACKUP status is detected. Instead, register the status and start a time-based trigger that checks the status again before executing; it quits if the current status has not changed, or proceeds with the action, based on the first registered status, if it has changed. The same applies to the master: monitor it using a time-based trigger synchronized with the first status change, and quit if the status is unchanged, or perform the action and then exit. This sounds simple but it is not, because we also need to ignore status changes after the first change is detected and start again some time later, once everything has happened. All of this makes me think that the logic becomes too complicated and takes too much code for this implementation.

                      • perrin @crl

                        @crl said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

                        ISP rejects authentication with Too many sessions. ISP is refusing a second PPPoE login because the old session from my master pfSense is still alive
                        -Slave keeps retrying repeatedly but still no luck
                        (I even waited for 2-3 minutes).

                        Hi,
                        The same applies to my ISP. I also get a denied login at first when the slave comes up. Only in my case the ISP times out the old master session within a few minutes, allowing the slave to connect.

                        Whenever the master fails "badly", it cannot end the session cleanly, so the slave is always unable to establish a connection for an initial period.

                        @crl said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

                        So the slave's WAN is never up.

                        I did not think about this case when designing the plugin because, as I understand PPPoE, there is something called LCP keepalive which times out a stale session at the ISP after some time. My ISP does that within seconds. Maybe your ISP has set that timeout to a rather long value.

                        You could try to set the same MAC address on both firewalls for the PPPoE interface and see if that helps. The session definitely is still in a different state but maybe it helps with your ISP.

                        The most elegant solution, however, would be to synchronize the PPPoE session ID and configuration values (IP addresses, gateways and so forth) between master and slave and have the slave pick up the current session. But that won't work without patching if_pppoe itself, which might be out of scope...

                        • w0w @perrin

                          @perrin
                          How does your HA pair react if you put the master node into maintenance mode via Status → CARP → Enable Persistent Maintenance Mode (or whatever it’s called)?

                          • perrin @w0w

                            @w0w Enabling Maintenance Mode on the master raises its skew, thus transitioning it from MASTER to BACKUP. pppoe-ha picks up the BACKUP state and disables the interface accordingly.

                            Since I don't have a problem moving the PPPoE session, in my case the failover works as expected.

                            Maybe @crl should try that and see

                            a) if if_pppoe correctly closes the session on the master prior to disabling the interface and
                            b) if his backup can correctly establish a new PPPoE session

                            • crl @perrin

                              Please check this workaround:
                              Github Issue - ISP side 'Too many sessions' keeping backup pfsense's WAN down

                              It solves only one use case:
                              • OK: entering and leaving CARP maintenance mode on a manual trigger.
                              • Solution requested: if a WAN cable is pulled (between the WAN switch and either of the pfSense devices) or if the pfSense machine is down, perform the MASTER --> BACKUP transition and connect PPPoE on the BACKUP. Should the MASTER come back, it shall take back the MASTER role and reconnect PPPoE on the MASTER.

                              • crl @crl

                                I tried to summarize what is going on during the switchover experiments. This is one example.

                                2a61333b-245d-4e7b-8640-dfe047400ef5-image.png

                                • w0w @crl

                                  @crl
                                  This 2:20 looks familiar to me...
                                  @crl, @perrin do you both have dual stack pppoe?

                                  • perrin @w0w

                                    @w0w said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

                                    @crl, @perrin do you both have dual stack pppoe?

                                    In my case yes, dual stack IPv4 and IPv6.

                                    @crl said in new if_pppoe Backend - getting HA/CARP to work like in MPD:

                                    I tried to summarize what is going on during the switchover experiments. This is one example.

                                    2a61333b-245d-4e7b-8640-dfe047400ef5-image.png

                                    Some of these issues might be related to the configuration and/or the default behavior of pfSense (e.g. when PPPoE fails and you're expecting a CARP switch).
                                    Do these things work as expected when you are using the old time-based scripts?

                                    • w0w @perrin

                                      @perrin

                                      Yes, in my setup things work somewhat differently, as you noticed. There are at least a few reasons. Most importantly, every time PPPoE comes up, the VIPs get reconfigured and CARP reinitializes. I suspect this behavior is related to IPv6 and the fact that the LAN uses the Track Interface option to obtain its IPv6 address, but I’m not certain. I’m currently trying to track down the root cause—or perhaps it’s an “incompatible” configuration.

                                      How does this behave on your side? As I understand it, bringing up PPPoE does not trigger VIP reconfiguration/CARP initialization for you, right?

                                      Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.