Secondary pfSense randomly setting itself as CARP MASTER



  • Hi,

    I have 2 pfSenses, and around 16 networks set up with CARP, including the WAN.
    Both nodes run version 2.4.4.
    The nodes are Hyper-V VMs with 4 cores each; CPU usage rarely exceeds 10%.
    All the interfaces are connected via switch, and are visible from each other.

    When the pfSense nodes boot up, everything works as intended: the primary has all the interfaces as MASTER and the secondary has all the interfaces as BACKUP. It can continue to work like this for some time, hours or days.

    The problem arises (I'm not 100% sure, but I believe so) when I make changes on the primary: a NAT rule, a firewall rule, modifications to pfBlocker, etc. When I apply the changes, the pfSense web page hangs for a few seconds until it responds again. After that everything works fine except the WAN CARP interface, where both the primary and the secondary now show MASTER state, so they keep disrupting the connection states. The rest of the interfaces stay the same, with the primary as MASTER and the secondary as BACKUP.

    This problem cannot be solved until the secondary is restarted, in which case everything goes back to normal, until I make more changes on the primary and it keeps it busy thinking, at which point the secondary will again adopt the master role, and then both will be in master until the secondary is restarted again.

    I looked at the logs, but they only indicate the secondary has assumed the role of Master, without saying a reason.

    The primary has an advertising frequency of 1, and a skew of 0, and the secondary has an advertising frequency of 1 and a skew of 100
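
    As a rough sketch of the timing these values imply: a CARP master advertises every advbase + advskew/256 seconds. The backup takeover timeout below (3 * advbase + advskew/256) is my understanding of the FreeBSD implementation and should be treated as an assumption, and the function names are purely illustrative.

```python
def advert_interval(advbase: int, advskew: int) -> float:
    """Seconds between advertisements sent by a CARP master."""
    return advbase + advskew / 256


def master_down_timeout(advbase: int, advskew: int) -> float:
    """Seconds of silence after which a backup takes over.

    Assumption: FreeBSD-style timer of 3 * advbase + advskew / 256.
    """
    return 3 * advbase + advskew / 256


# Values from this thread: primary 1/0, secondary 1/100.
primary_interval = advert_interval(1, 0)
secondary_timeout = master_down_timeout(1, 100)
print(f"primary advertises every {primary_interval} s")
print(f"secondary takes over after {secondary_timeout} s of silence")
```

    Under that assumption, the primary would have to go quiet on the WAN for a little over three seconds before the secondary promotes itself, which is in the same range as the few seconds the web page stays busy.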

    The pfsync's CARP demotion factor adjustment is at the default value of 0

    I even tested both VMs on the same Hyper-v host, with the interfaces on the same virtual switch, to make sure there were no connectivity issues between them, but with the same results.

    Any suggestions as to what may be happening, or where I can look for clues?

    Thank you guys


  • LAYER 8 Netgate

    The only way a secondary would assume CARP MASTER is if it stopped receiving heartbeats from the primary. Packet capture on the primary and secondary for CARP packets and do whatever it is you do that causes it to malfunction. If the CARP packets are sent from the primary and not received by the secondary, then it's something in your infrastructure.


  • LAYER 8 Netgate

    @CPrat said in Secondary pfSense randomly setting itself as CARP MASTER:

    The pfsync's CARP demotion factor adjustment is at the default value of 0

    pfsync has nothing to do with CARP demotion. See the sticky post in this category for an explanation.



  • @Derelict Thank you for your suggestions. Will definitely perform a capture when I reproduce the problem.

    So far, with the secondary turned off, I see this:

    10:57:10.028098 IP [My primary node IP] > 224.0.0.18: CARPv2-advertise 36: vhid=22 advbase=1 advskew=0 authlen=7 counter=17221242551398897455
    10:57:10.404865 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
    10:57:10.404881 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
    10:57:11.038133 IP [My primary node IP] > 224.0.0.18: CARPv2-advertise 36: vhid=22 advbase=1 advskew=0 authlen=7 counter=17221242551398897455
    10:57:11.255657 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
    10:57:11.255682 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
    10:57:12.048095 IP [My primary node IP] > 224.0.0.18: CARPv2-advertise 36: vhid=22 advbase=1 advskew=0 authlen=7 counter=17221242551398897455
    10:57:12.066538 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
    10:57:12.066565 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
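
    To check for missed heartbeats in a text capture like this one, a small script can measure the gap between consecutive advertisements per sender. This is only a sketch: the regular expression is written against the line format shown above, and the function name advert_gaps is made up for the example.

```python
import re
from collections import defaultdict
from datetime import datetime

# The capture lines from this post, copied verbatim.
CAPTURE = """\
10:57:10.028098 IP [My primary node IP] > 224.0.0.18: CARPv2-advertise 36: vhid=22 advbase=1 advskew=0 authlen=7 counter=17221242551398897455
10:57:10.404865 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
10:57:10.404881 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
10:57:11.038133 IP [My primary node IP] > 224.0.0.18: CARPv2-advertise 36: vhid=22 advbase=1 advskew=0 authlen=7 counter=17221242551398897455
10:57:11.255657 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
10:57:11.255682 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
10:57:12.048095 IP [My primary node IP] > 224.0.0.18: CARPv2-advertise 36: vhid=22 advbase=1 advskew=0 authlen=7 counter=17221242551398897455
10:57:12.066538 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
10:57:12.066565 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
"""

LINE = re.compile(
    r"^(\d\d:\d\d:\d\d\.\d+) IP (.+?) > 224\.0\.0\.18: CARPv(\d)-advertise"
)


def advert_gaps(text: str) -> dict:
    """Map (sender, version) to the gaps in seconds between its advertisements."""
    times = defaultdict(list)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            times[(match.group(2), int(match.group(3)))].append(stamp)
    return {
        sender: [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        for sender, stamps in times.items()
    }


gaps = advert_gaps(CAPTURE)
for sender, values in gaps.items():
    print(sender, "max gap:", max(values))
```

    Running the same script on a capture taken on the secondary while the problem is being triggered would show whether the primary's advertisements ever pause for several seconds.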

    Is there a way to make the CARP module's logging more verbose so I can get a better understanding of what is going on? I'd also like to see whether my network provider has some device that uses CARP and is interfering with this. I have already asked them.

    Thank you


  • LAYER 8 Netgate

    CARP is very similar to VRRP. I would expect they are using that instead.

    They should coexist just fine.

    Wireshark might make more sense in decoding what is really out there.

    You can switch between decoding as CARP or VRRP by right-clicking a frame and choosing which to use to decode protocol 112 using Decode as....

    1f1905d7-0f69-49b7-b406-b2be4b3160d5-image.png



  • @Derelict Thank you for your answers.
    I've done some more packet capture and I found something:
    I am able to reproduce the issue consistently. As soon as I make a change on the firewall, like a NAT rule or a firewall rule, the secondary starts acting as primary for the WAN (not for the other VLANs, though). I am unsure whether modifying other specific things triggers the problem, but I've seen it at other times, so I strongly suspect that the trigger is the primary being "busy with some task".
    I gave the pfSenses more resources (4vCPU (usage around 10%) and 2GB of RAM (usage at around 30%)) but the problem still persists.
    The primary pfSense never stops sending CARP packets that are picked up by the secondary's packet capture, but the secondary still never resumes its role as a backup for that VLAN. I am attaching a pic of that, where I can see, from the secondary packet capture, the CARP messages from the primary and the secondary, with the correct advskew.
    Screen Shot 2019-06-14 at 12.44.21 AM.png

    I also took a packet capture from the primary at a time the secondary was turned off, to see the VRRP decoding in Wireshark, and I noticed something very strange:

    First, my provider's packet shows the router's own IP as the source, and the address shown as the provider's shared IP is the correct virtual one.

    Screen Shot 2019-06-20 at 11.24.34 AM.png

    In my case, though, the packet comes from my pfSense, but the IP addresses at the bottom are random addresses, and I have no idea where they come from. I checked them, and they all belong to different countries and providers. They are definitely not the IP I have configured on my CARP.

    Screen Shot 2019-06-20 at 11.29.29 AM.png

    I don't know if both things are related, but I was hoping somebody can shed some light here.

    Thank you.


  • LAYER 8 Netgate

    Tell wireshark to decode protocol 112 as CARP, not VRRP and you'll stop chasing phantoms. If you have to look at a capture containing both protocols, as far as I know you will have to switch back and forth.
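
    A sketch of why the VRRP decode produces phantom addresses, assuming the usual CARP header layout (version/type, vhid, advskew, authlen, pad, advbase, checksum, 64-bit counter, 20-byte HMAC) and the VRRPv2 layout (version/type, VRID, priority, address count, auth type, interval, checksum, addresses). The HMAC bytes and checksum are zero placeholders here, not real values, and the helper names are invented for the example.

```python
import ipaddress
import struct


def build_carp_advert(vhid, advskew, advbase, counter, md=bytes(20)):
    """Build a 36-byte CARP advertisement (checksum and HMAC are placeholders)."""
    version_type = (2 << 4) | 1  # CARP version 2, type 1 (advertisement)
    authlen = 7                  # counter + HMAC, counted in 32-bit words
    header = struct.pack(
        "!BBBBBBHQ", version_type, vhid, advskew, authlen, 0, advbase, 0, counter
    )
    return header + md


def read_as_carp(payload: bytes) -> dict:
    return {
        "version": payload[0] >> 4,
        "vhid": payload[1],
        "advskew": payload[2],
        "advbase": payload[5],
    }


def read_as_vrrp(payload: bytes) -> dict:
    """Interpret the same bytes as VRRPv2: byte 3 becomes the address count."""
    count = payload[3]
    addresses = [
        str(ipaddress.IPv4Address(payload[8 + 4 * i:12 + 4 * i]))
        for i in range(count)
    ]
    return {
        "version": payload[0] >> 4,
        "vrid": payload[1],
        "priority": payload[2],
        "addresses": addresses,
    }


# vhid, skew, base and counter are taken from the tcpdump output earlier in the thread.
packet = build_carp_advert(22, 0, 1, 17221242551398897455)
print(read_as_carp(packet))
print(read_as_vrrp(packet))
```

    Read as VRRP, authlen=7 is taken as "7 IP addresses", and the counter and HMAC bytes are printed as if they were those addresses, which would explain the unrelated IPs from all over the world.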

    Countless, countless people use CARP and make changes to their firewall without dropping CARP MASTER. This is something unique to your environment.

    You masking things out is not helping us help you.



  • @Derelict Okay, I decoded it as CARP, and I see the capture with it now.

    I figured it's something unique to my environment, since I did not have this problem in other places, but basically the only difference that I can find is that this one has these VRRP packets there as well.

    Here is an untouched capture. The CARP decode correctly shows ID 22 and skew 0 for the primary, and the secondary shows skew 240.

    The red frames are the VRRP ones from my provider.

    Screen Shot 2019-06-20 at 12.40.38 PM.png


  • LAYER 8 Netgate

    The skew of 240 means something is not right on the secondary, like it has been demoted. The default advskew for the secondary should be 100. Is there anything unusual showing on the secondary's Status > CARP page?

    The secondary should not be sending any CARP advertisements if it is receiving anything with a lower advskew. It is not happy about something.
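
    The rule can be pictured with a deliberately simplified model (this is not the kernel code, and it ignores preemption, demotion and tie-breaking): a node in MASTER state that hears an advertisement with a shorter interval than its own should step down to BACKUP. The function names are illustrative.

```python
def advert_interval(advbase: int, advskew: int) -> float:
    """Advertisement interval in seconds implied by advbase/advskew."""
    return advbase + advskew / 256


def master_should_yield(own: tuple, heard: tuple) -> bool:
    """Simplified model: yield if the advertisement heard is more frequent than ours.

    Both arguments are (advbase, advskew) pairs.
    """
    return advert_interval(*heard) < advert_interval(*own)


# Secondary advertising skew 240 hears the primary's skew 0.
secondary_yields = master_should_yield(own=(1, 240), heard=(1, 0))
# Primary advertising skew 0 hears the secondary's skew 240.
primary_yields = master_should_yield(own=(1, 0), heard=(1, 240))
print(secondary_yields, primary_yields)
```

    By this model a secondary at skew 240 that really processes the primary's skew 0 advertisements should give up MASTER, so a secondary that stays MASTER is either not processing them or is unhappy for another reason.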

    VRRP can coexist on the same subnet as CARP with no problems. You will need to be sure you are using a host ID such that the CARP MACs differ from anyone on the same subnet using CARP or VRRP.



  • @Derelict The secondary's skew is at the default (100) so I am also unsure of why it shows as 240. That is only when it's acting up as a Master. If I reboot it, it does not advertise.

    Here is an image of the config on the secondary
    Screen Shot 2019-06-20 at 1.06.07 PM.png

    And on the primary

    Screen Shot 2019-06-20 at 1.01.31 PM.png

    In the following capture from the secondary pfSense you can also see the primary's advertisements. I will reproduce the problem later today and get another capture showing what happens when both are acting up as Master.

    The IP 162 is my provider's, the pfSense virtual CARP IP is 164, the primary pfSense is 165, and the secondary is 166.

    Screen Shot 2019-06-20 at 1.08.27 PM.png

    When you talk about the host ID, do you mean the VHID of the CARP? I will change it as well just in case. I also confirmed that the MAC address of the WAN on each pfSense is different, and Wireshark also shows a different MAC address for the pfSense:

    Screen Shot 2019-06-20 at 1.21.04 PM.png

    And from my provider

    Screen Shot 2019-06-20 at 1.21.14 PM.png

    When you said the secondary is not happy about something: is there somewhere I can raise the log level for CARP events to see why the secondary is not happy?

    Thank you


  • LAYER 8 Netgate

    There is no need to just change things unless evidence indicates it is a problem.

    Both CARP and VRRP derive the virtual MAC address from configured settings. The Virtual Host ID in the case of CARP and the VRID in the case of VRRP. 00-00-5E-00-01-XX, where XX is the ID in hex.

    They need to be unique on the broadcast domain or, like any case where you have two devices on the same broadcast domain using the same MAC address, there will be problems. If there is not a known collision there is no reason to change anything.
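
    A minimal sketch of that derivation, using the 00-00-5E-00-01-XX pattern above. The VRID values in the collision check are hypothetical, since the provider's VRID is not stated in this thread.

```python
def virtual_mac(host_id: int) -> str:
    """Virtual MAC for a CARP VHID or an IPv4 VRRP VRID (1-255)."""
    if not 1 <= host_id <= 255:
        raise ValueError("VHID/VRID must be between 1 and 255")
    return f"00:00:5e:00:01:{host_id:02x}"


def colliding_ids(carp_vhids, vrrp_vrids) -> list:
    """IDs that would produce the same virtual MAC on one broadcast domain."""
    return sorted(set(carp_vhids) & set(vrrp_vrids))


wan_mac = virtual_mac(22)                # VHID 22 from this thread
no_clash = colliding_ids([22], [1])      # hypothetical provider VRID 1
clash = colliding_ids([22], [22, 5])     # hypothetical provider VRIDs 22 and 5
print(wan_mac, no_clash, clash)
```

    For VHID 22 the last octet is 0x16, so a collision would only exist if another CARP or VRRP group on the WAN segment also used ID 22.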

    I would avoid being clicky-clicky here and make changes based on evidence.


  • LAYER 8 Netgate

    What shows on Status > CARP on the secondary?



  • @Derelict Okay, then since the VRRP and CARP ID seem to be different, there is no apparent need for me to change anything.

    I agree that I don't need to change anything if there is no evidence.

    The CARP status on the secondary shows now all Backup, and when the problem happens, it shows the WAN as Master. I have 19 VLANs configured with CARP

    Here is the current status:

    Screen Shot 2019-06-20 at 2.22.59 PM.png


  • LAYER 8 Netgate

    Any weird messages like a demotion set on that page or anything?

    When a secondary that should be BACKUP goes to MASTER, it is pretty much invariably because it has stopped receiving heartbeats on that network from the master.

    What does this show on both primary and secondary:

    sysctl -a | grep carp



  • @Derelict No weird messages, and I don't recall seeing any of such messages when the secondary is acting up. Nonetheless, I will reproduce the problem again tonight and show you if any message like that shows up.

    This is why I was surprised to see (from a packet capture) that the secondary is picking up the messages from the primary when they are both showing up as master, and yet for some reason it is not going back to being Backup.

    Here is the result of the commands on both. The primary is pfSense01 and the secondary is pfSense02

    [2.4.4-RELEASE][admin@pfSense01.localdomain]/root: sysctl -a | grep carp
    device carp
    net.inet.carp.ifdown_demotion_factor: 240
    net.inet.carp.senderr_demotion_factor: 0
    net.inet.carp.demotion: 0
    net.inet.carp.log: 1
    net.inet.carp.preempt: 1
    net.inet.carp.allow: 1
    net.pfsync.carp_demotion_factor: 0

    [2.4.4-RELEASE][admin@pfSense02.localdomain]/root: sysctl -a | grep carp
    device carp
    net.inet.carp.ifdown_demotion_factor: 240
    net.inet.carp.senderr_demotion_factor: 0
    net.inet.carp.demotion: 0
    net.inet.carp.log: 1
    net.inet.carp.preempt: 1
    net.inet.carp.allow: 1
    net.pfsync.carp_demotion_factor: 0
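
    With this many lines it is easy to miss a difference by eye, so here is a small sketch that parses both outputs and reports any key whose value differs. The helper names are illustrative, and the two strings are the outputs pasted above.

```python
PFSENSE01 = """\
device carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 0
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.allow: 1
net.pfsync.carp_demotion_factor: 0
"""

PFSENSE02 = PFSENSE01  # the secondary's output above is identical


def parse_sysctl(text: str) -> dict:
    """Turn 'name: value' lines into a dict, skipping lines without a value."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:
            settings[key] = value
    return settings


def diff_settings(a: dict, b: dict) -> dict:
    """Keys whose values differ, mapped to (value_in_a, value_in_b)."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


differences = diff_settings(parse_sysctl(PFSENSE01), parse_sysctl(PFSENSE02))
print(differences or "no differences")
```

    The value most worth watching when the secondary misbehaves is net.inet.carp.demotion, since a non-zero value there would mean the node has demoted itself.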


  • LAYER 8 Netgate

    That all looks perfectly normal.

    The capture posted clearly shows the secondary using an advskew of 240, yet you say the advskews are all set at 100 as they should be.

    Another interesting data point would be the output of ifconfig -a at the time.



  • @Derelict If you see the other captures posted, it shows a skew of 100 configured on the secondary's GUI, it's strange that it says 240 through the console.

    Screen Shot 2019-06-20 at 2.57.29 PM.png

    Also, I logged into the primary, and it triggered the secondary picking up as Master, so I had to reboot it, but I got you a screenshot first, and the capture of the commands you asked, while the secondary was acting as Master.

    Screen Shot 2019-06-20 at 2.51.59 PM.png

    This was while both the primary and the secondary were showing up as Master for the WAN:

    [2.4.4-RELEASE][admin@pfSense01.localdomain]/root: sysctl -a | grep carp
    device carp
    net.inet.carp.ifdown_demotion_factor: 240
    net.inet.carp.senderr_demotion_factor: 0
    net.inet.carp.demotion: 0
    net.inet.carp.log: 1
    net.inet.carp.preempt: 1
    net.inet.carp.allow: 1
    net.pfsync.carp_demotion_factor: 0

    [2.4.4-RELEASE][admin@pfSense02.localdomain]/root: sysctl -a | grep carp
    device carp
    net.inet.carp.ifdown_demotion_factor: 240
    net.inet.carp.senderr_demotion_factor: 0
    net.inet.carp.demotion: 0
    net.inet.carp.log: 1
    net.inet.carp.preempt: 1
    net.inet.carp.allow: 1
    net.pfsync.carp_demotion_factor: 0

    For the interface settings, I created a pastebin since the output is giant, and to not completely clog this thread:
    Again, While the secondary was acting as Master

    https://pastebin.com/4dek4qGJ



  • The most relevant that I see is with the WAN interface:

    The primary:

    hn1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=48001b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,LINKSTATE,TXCSUM_IPV6>
    ether 00:15:5d:34:44:15
    hwaddr 00:15:5d:34:44:15
    inet6 fe80::215:5dff:fe34:4415%hn1 prefixlen 64 scopeid 0x6
    inet 205.251.108.165 netmask 0xffffffe0 broadcast 205.251.108.191
    inet 205.251.108.169 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.170 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.171 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.172 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.173 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.174 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.175 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.176 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.177 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.178 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.179 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.180 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.181 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.182 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.183 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.184 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.164 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect (10Gbase-T <full-duplex>)
    status: active
    carp: MASTER vhid 22 advbase 1 advskew 0

    The secondary:

    hn1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=48001b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,LINKSTATE,TXCSUM_IPV6>
    ether 00:15:5d:b5:91:0f
    hwaddr 00:15:5d:b5:91:0f
    inet6 fe80::215:5dff:feb5:910f%hn1 prefixlen 64 scopeid 0x6
    inet 205.251.108.166 netmask 0xffffffe0 broadcast 205.251.108.191
    inet 205.251.108.169 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.170 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.172 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.173 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.174 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.175 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.176 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.177 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.178 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.179 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.180 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.181 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.182 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.183 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.184 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.164 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
    inet 205.251.108.171 netmask 0xffffffe0 broadcast 205.251.108.191
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect (10Gbase-T <full-duplex>)
    status: active
    carp: MASTER vhid 22 advbase 1 advskew 254

    For some reason, the last inet (205.251.108.171) does not show with a vhid on the secondary, and the first one is the interface IP address, so I think it's normal it doesn't show a vhid
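
    The same comparison can be done mechanically. The sketch below parses the inet lines of both outputs and lists addresses that carry a vhid on the primary but not on the secondary. The two strings are short excerpts of the outputs above, and the function name is made up for the example.

```python
import re

PRIMARY = """\
inet 205.251.108.165 netmask 0xffffffe0 broadcast 205.251.108.191
inet 205.251.108.170 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.171 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.164 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
"""

SECONDARY = """\
inet 205.251.108.166 netmask 0xffffffe0 broadcast 205.251.108.191
inet 205.251.108.170 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.164 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.171 netmask 0xffffffe0 broadcast 205.251.108.191
"""

INET = re.compile(r"inet (\S+) netmask \S+ broadcast \S+(?: vhid (\d+))?")


def carp_addresses(ifconfig_text: str) -> dict:
    """Map each IPv4 address to its vhid, or None if the line has no vhid."""
    result = {}
    for line in ifconfig_text.splitlines():
        match = INET.search(line)
        if match:
            vhid = match.group(2)
            result[match.group(1)] = int(vhid) if vhid else None
    return result


primary = carp_addresses(PRIMARY)
secondary = carp_addresses(SECONDARY)

# Addresses bound to a vhid on the primary but missing a vhid (or absent) on the secondary.
not_carp_on_secondary = sorted(
    addr for addr, vhid in primary.items()
    if vhid is not None and secondary.get(addr) is None
)
print(not_carp_on_secondary)
```

    On the excerpts above this flags only 205.251.108.171, which suggests comparing how that one virtual IP is defined on each node.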


  • LAYER 8 Netgate

    The secondary thinks it has an interface down and has been demoted (hence advskew 254)

    You have something screwed up somewhere. Sorry but with what we have that is the best I can do here.

    My guess is something in Hyper-V. Hard to say. But these problems are almost always Layer 2 problems.



  • @Derelict Just so I understand it: would this down interface be the virtual one ending in 171, since it's not showing the vhid? There is only one interface, the WAN, and it's showing as active. The others are IP Aliases on that interface, created in pfSense.

    The only reason I doubt it could be anything in Hyper-V is that these same machines were all working fine until I switched to a different datacenter provider, so something must have changed somewhere: either they are interfering with some traffic, or I configured something wrong.

    Even now that it's showing as backup, it's showing an advskew of 254

    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
    media: Ethernet autoselect (10Gbase-T <full-duplex>)
    status: active
    carp: BACKUP vhid 22 advbase 1 advskew 254

  • LAYER 8 Netgate

    On a healthy system the primary would be showing skew 0, the secondary skew 100.

    Check the system log for entries related to why it is changing the skew to 254.

