Bug: Persistent Carp Maintenance Mode not effective through version update



  • After I upgraded a box from 2.3.3-RELEASE-p1 to 2.3.4-RELEASE with Persistent CARP Maintenance Mode on, it came up with that mode still on, according to the text on that button, but it nonetheless tried to take over half of the interfaces: the half for which, in normal operation, the twin system is the master.

    This was one of a pair of systems where I'd put the one into Carp Maintenance because after months of CARP working fine they'd ended up going Master/Master, this despite the dedicated line between them still taking pings just fine.


  • Rebel Alliance Developer Netgate

    Sounds like you have a Layer 2 issue that is the root of your problem.

    Maintenance mode doesn't disable CARP, it sets the skew to 254. A node in maintenance mode can still be master if it sees no other heartbeats on the wire, or none faster than its own 254 skew.
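
The effect of the skew can be sketched numerically. This assumes the usual CARP timing rule, in which a node advertises every advbase + advskew/256 seconds and the node advertising most frequently holds MASTER. The function names and the example values (advbase 1, skews 0 and 254) are illustrative and not taken from the configuration discussed in this thread.

```python
from typing import Optional


def advert_interval(advbase: int, advskew: int) -> float:
    """Seconds between CARP advertisements: advbase + advskew/256."""
    return advbase + advskew / 256.0


def expected_master(local_skew: int, peer_skew: Optional[int], advbase: int = 1) -> str:
    """Return which node should hold MASTER for one VIP.

    peer_skew=None models the case where no heartbeats from the peer
    reach this node at all.
    """
    if peer_skew is None:
        # Nothing heard on the wire: the node promotes itself, even at skew 254.
        return "local"
    local = advert_interval(advbase, local_skew)
    peer = advert_interval(advbase, peer_skew)
    # Simplification: on a tie the peer is assumed to keep MASTER.
    return "local" if local < peer else "peer"


if __name__ == "__main__":
    print(advert_interval(1, 254))     # interval of a node in maintenance mode
    print(expected_master(254, 0))     # peer heartbeats arrive
    print(expected_master(254, None))  # peer heartbeats never arrive
```

Under these assumptions a node in maintenance mode yields only while it actually hears a faster peer; if the peer's heartbeats are lost, it still takes MASTER.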



  • Thanks. The skew would explain it taking the one set of IPs, on which it is backup. Not sure how it explains it not taking the other set, on which it is normally master. Both CARP tests are over the same NICs and wire.

    We have the two identical Dell servers connected directly, between identical NICs. Pings work in both directions. Indeed CARP had worked fine for months. But it had recently failed, and both systems viewed themselves as primaries, even though pings continued to work in both directions on the wire between them. However, there were also sporadic xml_rpc failures going back before then.

    What's most likely? One of the Dells has a faulty NIC, despite pings being fine? Or there's some problem with the prior pfSense version? Are there standard BSD tools for diagnosing NICs that will run on pfSense?


  • Rebel Alliance Developer Netgate

    Most likely is a switch problem, it is failing to properly carry all heartbeats between both nodes. It is almost certainly not a pfSense/FreeBSD/OS issue.



  • jimp, thanks for the suggestion, but it cannot be a switch problem. As I said before, there is no switch in the circuit. The two Dells running pfSense are directly connected, redundantly so using lag.

    It could be the driver for the NICs in the Dells has a problem. Or it could be both NICs on one Dell or the other have a hardware problem. Or it could be both cables have a problem – we've replaced those a few days ago so we'll see, although the problem has been intermittent.



  • Does anyone know where to find BSD tools that will run on pfSense to diagnose why the lagg interface defined here is not 100% dependable?


  • Rebel Alliance Developer Netgate

    The issue you describe is almost always the switch, no matter how much you'd like to believe otherwise. Something is preventing the two nodes from exchanging multicast traffic when the interfaces come up, and only the switch could stop that.



    1. What are the tools available to test the signals?

    2. Is there a way to configure pfSense to handle IP failover not through testing on the specific interface, but on the crossover between the systems? That is, if the primary and secondary pfSense boxes can see each other on their direct connect, each takes its allotted VIPs; if they can't see each other, each attempts to take over from the other?

    On (2), yes I get that this has a possibility of bad split-brain behavior. On the other hand, with a LAGG between them, that isn't such a likely scenario. And it gets around all switch problems with passing VRRP/CARP packets.


  • Rebel Alliance Developer Netgate

    1. Packet captures, perhaps from a switch span/mirror port.

    2. No.
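
When reading such a capture, the fields worth checking in each heartbeat are the VHID and the advertised base and skew. Below is a minimal sketch of decoding them. It assumes the 36-byte CARP advertisement layout used by the BSD implementations (version/type byte, vhid, advskew, authlen, one reserved byte, advbase, checksum, 8-byte counter, 20-byte HMAC). The function name and the sample bytes are made up for illustration.

```python
import struct

CARP_HEADER_LEN = 36  # assumed size of one CARP advertisement, in bytes


def parse_carp_header(payload: bytes) -> dict:
    """Decode the fixed fields at the start of a CARP advertisement.

    `payload` is the IP payload of a protocol-112 packet, i.e. the bytes
    that follow the IP header in a capture.
    """
    if len(payload) < CARP_HEADER_LEN:
        raise ValueError(f"expected at least {CARP_HEADER_LEN} bytes, got {len(payload)}")
    ver_type, vhid, advskew, authlen, _pad, advbase, cksum = struct.unpack(
        "!BBBBBBH", payload[:8]
    )
    return {
        "version": ver_type >> 4,   # high nibble on the wire
        "type": ver_type & 0x0F,    # low nibble; 1 = advertisement
        "vhid": vhid,
        "advskew": advskew,
        "authlen": authlen,
        "advbase": advbase,
        "checksum": cksum,
    }


if __name__ == "__main__":
    # Fabricated sample: version 2, type 1, vhid 1, skew 254, advbase 1.
    sample = bytes([0x21, 1, 254, 7, 0, 1, 0, 0]) + bytes(28)
    print(parse_carp_header(sample))
```

Comparing the decoded `vhid` and `advskew` values captured on each node shows whether both sides' heartbeats actually arrive and which side is advertising faster.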



  • The problem isn't the switches. The problem is the ports on the pfSense 2.3.4 system are not in promiscuous mode. At Diagnostics > Packet Capture > CARP without Enable Promiscuous Mode checked no CARP packets from the other system are seen. With it checked, they are seen.

    Unfortunately there's no option on the Interface configuration menu to set promiscuous mode on. What are the options using available FreeBSD utilities to set the Dells we're running this on to have the NICs in promiscuous mode? What's the proper place in pfSense to have this occur at boot?



  • @whitwye:

    Thanks. The skew would explain it taking the one set of IPs, on which it is backup. Not sure how it explains it not taking the other set, on which it is normally master. Both CARP tests are over the same NICs and wire.

    Maybe I'm not reading this correctly, but are you trying to do active-active? The only supported configuration for pfSense CARP (last time I checked, at least) is with all the VIPs master on the primary box.


  • Rebel Alliance Developer Netgate

    @whitwye:

    The problem isn't the switches. The problem is the ports on the pfSense 2.3.4 system are not in promiscuous mode. At Diagnostics > Packet Capture > CARP without Enable Promiscuous Mode checked no CARP packets from the other system are seen. With it checked, they are seen.

    The interfaces do not need to be in promiscuous mode for normal CARP operation. Your assumptions are still flawed.



  • Turning on promiscuous mode = CARP broadcasts received

    Having promiscuous mode off = CARP broadcasts not received

    The correlation here seems to rise to the level of causation. Are there other adjustments that would enable the CARP broadcasts to be received short of having promiscuous mode on? If so, please tell us what they are.

    Meanwhile, since promiscuous mode on = CARP broadcasts received, and since CARP broadcasts received is a necessary condition for CARP to work, how do I have pfSense keep promiscuous mode consistently on for all the interfaces on which I'm using CARP?

    The functions for this look to be in /etc/inc/interfaces.inc. But short of reverse engineering the whole thing, I'm sure there's an answer readily available to those who really understand the workings here. Here's hoping one of them will share it.


  • Rebel Alliance Developer Netgate

    Your diagnostic only covers what you see in tcpdump, not the system as a whole. If you never received CARP heartbeats without promiscuous mode, you would always see dual master 100% of the time and it wouldn't work in some cases and not others.

    You're still making statements based on flawed assumptions.
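
The reasoning in the post above can be sketched as a small timing model. It assumes a backup promotes itself once it has heard no heartbeat for 3 × advbase + advskew/256 seconds (the rule believed to apply in FreeBSD's CARP; treat it as an assumption). The function names, the default skew of 100, and the timestamps are hypothetical.

```python
from typing import Iterable


def master_down_timeout(advbase: int, advskew: int) -> float:
    """Assumed silence, in seconds, after which a backup promotes itself."""
    return 3 * advbase + advskew / 256.0


def promotes_itself(
    heartbeat_times: Iterable[float],
    observe_until: float,
    advbase: int = 1,
    advskew: int = 100,
) -> bool:
    """True if any gap between received heartbeats exceeds the timeout.

    Times are seconds since the interface came up (t = 0).
    """
    timeout = master_down_timeout(advbase, advskew)
    last = 0.0
    for t in sorted(heartbeat_times):
        if t - last > timeout:
            return True
        last = t
    # Also check the silence between the last heartbeat and the end of the window.
    return observe_until - last > timeout


if __name__ == "__main__":
    print(promotes_itself([], 10))                # no heartbeats ever received
    print(promotes_itself(range(1, 11), 10))      # one heartbeat per second
    print(promotes_itself([1, 2, 8, 9], 10))      # intermittent loss
```

In this model a node that never receives heartbeats is always dual master, while intermittent loss produces the works-sometimes behaviour described earlier in the thread.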



  • @dotdash:

    Maybe I'm not reading this correctly, but are you trying to do active-active? The only supported configuration for pfSense CARP (last time I checked, at least) is with all the VIPs master on the primary box.

    Yes. And it was working with 2.3.3. There are two WANs. In normal operation the VIPs for one WAN are on one pfSense box, the other on the other. It's active-passive for the pfsync configuration, and the VIPs are active-passive for each interface. But each WAN's VIPs are in normal operation on a different pfSense box.

    I've used CARP for years on Linux firewalls in similar configurations. There's no inherent limitation here, as long as the signal IDs are kept straight. Do you have a reference for this not being supported on pfSense? We've got a Gold subscription mostly to read the current book, but frankly it's sparse. Doesn't mention how pfSense decides which interfaces to put into promiscuous mode, or how to control that, for instance. Doesn't even mention that it can be a factor in CARP succeeding or not – when obviously it can, although I'm quite open to learning it's only required if something else which perhaps isn't clearly documented isn't done. Still it's frustrating not to have control of basic hardware configuration, because it's been automated with assumptions that don't even produce consistent results -- here consistency being whether on identical systems, with identical configurations, promiscuous mode is on or not on each interface.



  • @jimp:

    Your diagnostic only covers what you see in tcpdump, not the system as a whole. If you never received CARP heartbeats without promiscuous mode, you would always see dual master 100% of the time and it wouldn't work in some cases and not others.

    You're still making statements based on flawed assumptions.

    Pardon my assumptions, but you are ignoring plain facts:

    Turn on promiscuous mode on an interface, and CARP is seen. Turn it off, and CARP is not seen. Since promiscuous mode is enabled on some interfaces but not others, that accounts for the inconsistency I mentioned earlier. The details there are that, of multiple interfaces using CARP and VIPs, those where promiscuous mode is on on both sides work. Those where it is not on on the second system for that set of VIPs do consistently go dual master 100% of the time. You are making assumptions about my experience based on my having been nonspecific about that, since I hadn't spotted the promiscuous mode correlation before.

    Please, can we find someone to join the conversation who understands how the promiscuous mode configuration is handled, apparently differently, in the assumptions of different pfSense versions?


  • Rebel Alliance Developer Netgate

    active-active is not a supported configuration, never has been, and using different WANs on each CARP node is also not supported.

    And since you don't like taking sound advice from people who do know how CARP works quite well, I'm done with this thread. Have a good day.



  • I'm not using different WANs on each CARP node. I'm having failover work in a different direction for the two WANs. This was working with 2.3.3. With 2.3.4 it fails apparently because for inexplicable reasons 2.3.4 makes different decisions about which interfaces should be in promiscuous mode.

    Is there anywhere in the book or wiki where it says failover of one CARP interface's VIPs cannot be in a different direction between two pfSense boxes than the failover between another CARP interface's VIPs? There's no logical reason inherent to CARP itself which would require such a constraint. Again, I've been using CARP for over a decade on Linux firewalls, in exactly this way. No problem.

    Now, since pfSense does handle the promiscuous mode assignment inconsistently between different versions, and since this, for my use, makes 2.3.4 a problem, it would be good to know how to take control of that assignment. Can someone who understands what it's doing there point me towards the right place to control this from?

    Thanks.



  • The pfSense docs do mention promiscuous mode being required in the context of hypervisors:

    https://doc.pfsense.org/index.php/CARP_Configuration_Troubleshooting

    We're not on a hypervisor here; we're running on metal. But that doesn't mean that our physical NICs, like those virtual switches, don't require promiscuous mode on to work. This may be a flaw in our switches. If so, turning promiscuous mode on for them is the workaround. How do I do that please?



  • @whitwye:

    Again, I've been using CARP for over a decade on Linux firewalls, in exactly this way. No problem.

    Interesting, I know there are some ports of CARP to Linux, but AFAIK, they have only been around for seven or eight years.
    Yes, you can do active-active on OpenBSD, FreeBSD, and probably Linux (no experience), but we are telling you what is supported on pfSense. If you think it worked for you on 2.3.3, you may want to go back to that version, or perhaps your configuration is more suited to an unfettered CARP implementation on BSD/Linux. I personally think you are mistaken about promiscuous mode, but do take a look at the FreeBSD man page for ifconfig if you want to pursue that further. I have no experience with active-active clusters, so I'm afraid I can't be of any help there. Good luck; as you've managed to drive off one of the most knowledgeable and nicest users here, I'm not sure you are going to get any satisfactory answers on this thread.


  • LAYER 8 Netgate

    active/active is not a supported configuration. All VIPs on one node should be MASTER. All VIPs on the other should be BACKUP. If not your configuration is invalid.

    Promiscuous mode is not required to receive CARP heartbeats.

    Promiscuous mode in the hypervisors is so the hypervisor will pass the traffic to the VM for alternate MAC addresses and really has nothing to do with pfSense, but the "switch" in that case. Which is what has been pointed out to you as the almost certain cause of your problems multiple times regarding your environment but you refuse to listen.

    You will not find a list of all the stupid things people try to do that they can't do in the book. It would be a billion pages long.
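
The hypervisor point above can be made concrete with the addresses involved. This is a sketch assuming standard IPv4 CARP addressing: heartbeats go to multicast group 224.0.0.18, and each VIP answers on the virtual MAC 00:00:5e:00:01:<VHID>. The helper names are illustrative.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet destination MAC.

    The MAC is 01:00:5e followed by the low 23 bits of the group address.
    """
    ip = ipaddress.IPv4Address(group)
    if not ip.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(ip) & 0x7FFFFF
    octets = [0x01, 0x00, 0x5E, (low23 >> 16) & 0x7F, (low23 >> 8) & 0xFF, low23 & 0xFF]
    return ":".join(f"{o:02x}" for o in octets)


def carp_virtual_mac(vhid: int) -> str:
    """Virtual MAC used by an IPv4 CARP VIP with the given VHID."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be between 1 and 255")
    return f"00:00:5e:00:01:{vhid:02x}"


if __name__ == "__main__":
    print(multicast_mac("224.0.0.18"))  # destination MAC of heartbeat frames
    print(carp_virtual_mac(1))          # MAC that traffic to the VIP is sent to
```

Heartbeat frames carry the multicast destination MAC, which a NIC accepts once the host joins the group, so no promiscuous mode is needed on the receiving interface. Frames to the virtual MAC are what a hypervisor's virtual switch may drop unless it is told to pass traffic for additional MAC addresses.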

