Secondary pfSense randomly setting itself as CARP MASTER
-
The only way a secondary would assume CARP MASTER is if it stopped receiving heartbeats from the primary. Run a packet capture for CARP packets on both the primary and the secondary, then do whatever it is that causes the malfunction. If the CARP packets are sent by the primary but not received by the secondary, then the problem is something in your infrastructure.
-
@CPrat said in Secondary pfSense randomly setting itself as CARP MASTER:
The pfsync's CARP demotion factor adjustment is at the default value of 0
pfsync has nothing to do with CARP demotion. See the sticky post in this category for an explanation.
-
@Derelict Thank you for your suggestions. Will definitely perform a capture when I reproduce the problem.
So far, with the secondary turned off, I see this:
10:57:10.028098 IP [My primary node IP] > 224.0.0.18: CARPv2-advertise 36: vhid=22 advbase=1 advskew=0 authlen=7 counter=17221242551398897455
10:57:10.404865 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
10:57:10.404881 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
10:57:11.038133 IP [My primary node IP] > 224.0.0.18: CARPv2-advertise 36: vhid=22 advbase=1 advskew=0 authlen=7 counter=17221242551398897455
10:57:11.255657 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
10:57:11.255682 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
10:57:12.048095 IP [My primary node IP] > 224.0.0.18: CARPv2-advertise 36: vhid=22 advbase=1 advskew=0 authlen=7 counter=17221242551398897455
10:57:12.066538 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
10:57:12.066565 IP [Net provider IP] > 224.0.0.18: CARPv3-advertise 12:
Is there also a way to make the CARP module's logging more verbose, so I can get a better understanding of what is going on? I'd like to see whether my network provider has some device that uses CARP and is interfering with this. I already asked them.
Thank you
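For anyone sorting through a capture like this, here is a minimal Python sketch that tallies who is advertising on 224.0.0.18 and with which vhid and advskew. It only assumes the tcpdump text format shown above. The `summarize` helper and the 192.0.2.x addresses in the sample are hypothetical stand-ins for the masked addresses, not anything taken from the real capture.

```python
import re
from collections import Counter

# Matches tcpdump lines like the ones pasted above. The vhid/advbase/advskew
# fields are optional because some frames are printed without them.
ADVERT = re.compile(
    r"^(?P<time>\S+) IP (?P<src>.+?) > 224\.0\.0\.18: "
    r"(?P<proto>CARPv\d+)-advertise \d+:"
    r"(?: vhid=(?P<vhid>\d+) advbase=(?P<advbase>\d+) advskew=(?P<advskew>\d+))?"
)


def summarize(lines):
    """Count advertisements per (source, protocol label, vhid, advskew)."""
    counts = Counter()
    for line in lines:
        m = ADVERT.match(line.strip())
        if m is None:
            continue  # not an advertisement line
        vhid = int(m["vhid"]) if m["vhid"] else None
        skew = int(m["advskew"]) if m["advskew"] else None
        counts[(m["src"], m["proto"], vhid, skew)] += 1
    return counts


# Hypothetical sample: documentation addresses replace the masked ones.
sample = [
    "10:57:10.028098 IP 192.0.2.165 > 224.0.0.18: CARPv2-advertise 36: "
    "vhid=22 advbase=1 advskew=0 authlen=7 counter=17221242551398897455",
    "10:57:10.404865 IP 192.0.2.162 > 224.0.0.18: CARPv3-advertise 12:",
    "10:57:11.038133 IP 192.0.2.165 > 224.0.0.18: CARPv2-advertise 36: "
    "vhid=22 advbase=1 advskew=0 authlen=7 counter=17221242551398897455",
]

for key, count in summarize(sample).items():
    print(key, count)
```

If two different sources show up with the same vhid, that is the situation worth looking at more closely.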
-
CARP is very similar to VRRP. I would expect they are using that instead.
They should coexist just fine.
Wireshark might make more sense in decoding what is really out there.
You can switch between decoding as CARP or VRRP by right-clicking a frame and choosing which to use to decode protocol 112 using Decode as....
-
@Derelict Thank you for your answers.
I've done some more packet capture and I found something:
I am able to reproduce the issue consistently. As soon as I make a change on the firewall, such as a NAT rule or a firewall rule, the secondary starts acting as master for the WAN (but not for the other VLANs). I am unsure whether modifying other specific things triggers the problem, but I've seen it at other times, so I highly suspect that the trigger is the primary being "busy with some task".
I gave the pfSenses more resources (4vCPU (usage around 10%) and 2GB of RAM (usage at around 30%)) but the problem still persists.
The primary pfSense never stops sending CARP packets that are picked up by the secondary's packet capture, but the secondary still never resumes its role as a backup for that VLAN. I am attaching a pic of that, where I can see, from the secondary packet capture, the CARP messages from the primary and the secondary, with the correct advskew.
I also took a packet capture from the primary at a time the secondary was turned off, to see the VRRP decoding in Wireshark, and I noticed something very strange:
My provider's packet, first, shows the router's own IP, and where it says provider's shared IP is the correct virtual one.
In my case, though, the packet comes from my pfSense, but the IP addresses at the bottom are random addresses, and I have no idea where they come from. I looked them up and they all belong to different countries and providers. They are definitely not the IP I have configured on my CARP.
I don't know if both things are related, but I was hoping somebody can shed some light here.
Thank you.
-
Tell Wireshark to decode protocol 112 as CARP, not VRRP, and you'll stop chasing phantoms. If you have to look at a capture containing both protocols, as far as I know you will have to switch back and forth.
Countless, countless people use CARP and make changes to their firewall without dropping CARP MASTER. This is something unique to your environment.
You masking things out is not helping us help you.
-
@Derelict Okay, I decoded it as CARP, and I see the capture with it now.
I figured it's something unique to my environment, since I did not have this problem in other places, but basically the only difference I can find is that this one has these VRRP packets there as well.
Here is an untouched capture. The CARP decode correctly shows ID 22 and skew 0 for the primary, and the secondary shows skew 240.
The red frames are the VRRP ones from my provider.
-
The skew of 240 means something is not right on the secondary, like it has been demoted. The default advskew for the secondary should be 100. Is there anything unusual showing on the secondary's Status > CARP page?
The secondary should not be sending any CARP advertisements if it is receiving anything with a lower advskew. It is not happy about something.
VRRP can coexist on the same subnet as CARP with no problems. You will need to be sure you are using a host ID such that the CARP MACs differ from anyone on the same subnet using CARP or VRRP.
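To make the skew numbers concrete: carp(4) describes advskew as being measured in 1/256 of a second and added to advbase, and the node that advertises most frequently is preferred as master. A minimal Python sketch of that arithmetic follows. The function name is just for illustration.

```python
def advertisement_interval(advbase: int, advskew: int) -> float:
    """Seconds between advertisements: advbase plus advskew/256 seconds."""
    if not 1 <= advbase <= 255:
        raise ValueError("advbase must be between 1 and 255")
    if not 0 <= advskew <= 254:
        raise ValueError("advskew must be between 0 and 254")
    return advbase + advskew / 256


# Skew values mentioned in this thread, all with advbase 1.
for label, skew in (("primary", 0), ("secondary default", 100), ("seen in capture", 240)):
    print(f"{label}: {advertisement_interval(1, skew):.3f} s")
```

A node with a larger skew advertises less often, so it should stay BACKUP for as long as it keeps hearing the lower-skew node.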
-
@Derelict The secondary's skew is at the default (100) so I am also unsure of why it shows as 240. That is only when it's acting up as a Master. If I reboot it, it does not advertise.
Here is an image of the config on the secondary
And on the primary
In the following capture from the secondary pfSense you can also see the primary's advertisements. I will reproduce the problem later today and get another capture showing what happens when both are acting up as Master.
The IP 162 is my provider's, the pfSense virtual CARP IP is 164, the primary pfSense is 165, and the secondary is 166.
When you talk about the Host ID you mean the VHID of the CARP? I will change it as well just in case. I also confirmed the MAC address of the WAN on each pfSense is different, and Wireshark also shows a different MAC address from the pfSense
And from my provider
When you said the secondary is not happy about something, is there somewhere I can raise the log level for CARP events, to see why the secondary is not happy?
Thank you
-
There is no need to just change things unless evidence indicates it is a problem.
Both CARP and VRRP derive the virtual MAC address from a configured setting: the Virtual Host ID (VHID) in the case of CARP and the VRID in the case of VRRP. The MAC is 00-00-5E-00-01-XX, where XX is the ID in hex.
They need to be unique on the broadcast domain or, like any case where you have two devices on the same broadcast domain using the same MAC address, there will be problems. If there is not a known collision there is no reason to change anything.
I would avoid being clicky-clicky here and make changes based on evidence.
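As a quick illustration of the 00-00-5E-00-01-XX rule, here is a minimal Python sketch that derives the virtual MAC from an ID and checks two sets of IDs for overlap. The function names are made up for this example, and the second ID set in the usage lines is a hypothetical value, not the provider's real VRID.

```python
def virtual_mac(router_id: int) -> str:
    """Virtual MAC for a CARP VHID or VRRP VRID: 00-00-5E-00-01-XX, XX in hex."""
    if not 1 <= router_id <= 255:
        raise ValueError("ID must be between 1 and 255")
    return "00-00-5E-00-01-{:02X}".format(router_id)


def colliding_macs(our_ids, their_ids):
    """Virtual MACs that both sets of IDs would produce on the same segment."""
    ours = {virtual_mac(i) for i in our_ids}
    theirs = {virtual_mac(i) for i in their_ids}
    return sorted(ours & theirs)


print(virtual_mac(22))            # VHID 22 from this thread
print(colliding_macs({22}, {1}))  # {1} is a hypothetical neighbour ID
```

An empty list from `colliding_macs` means the two groups derive different virtual MACs, which is the "no known collision" case.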
-
What shows on Status > CARP on the secondary?
-
@Derelict Okay, then since the VRRP and CARP ID seem to be different, there is no apparent need for me to change anything.
I agree that I don't need to change anything if there is no evidence.
The CARP status on the secondary shows now all Backup, and when the problem happens, it shows the WAN as Master. I have 19 VLANs configured with CARP
Here is the current status:
-
Any weird messages like a demotion set on that page or anything?
When a secondary that should be BACKUP goes to MASTER, it is pretty much invariably because it has stopped receiving heartbeats on that network from the master.
What does this show on both primary and secondary:
sysctl -a | grep carp
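With the output saved from each node, the two can be compared side by side instead of by eye. A minimal Python sketch follows, assuming only the usual `name: value` lines that sysctl prints. The helper names and the sample values are hypothetical.

```python
def parse_sysctl(text: str) -> dict:
    """Turn 'name: value' lines into a dict, skipping anything else."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep and name.startswith("net."):
            values[name.strip()] = value.strip()
    return values


def differences(primary: dict, secondary: dict) -> dict:
    """Settings whose values differ between the nodes, as (primary, secondary)."""
    names = sorted(set(primary) | set(secondary))
    return {
        n: (primary.get(n), secondary.get(n))
        for n in names
        if primary.get(n) != secondary.get(n)
    }


# Hypothetical sample values, only to show the shape of the result.
node_a = parse_sysctl("net.inet.carp.demotion: 0\nnet.inet.carp.preempt: 1\n")
node_b = parse_sysctl("net.inet.carp.demotion: 240\nnet.inet.carp.preempt: 1\n")
print(differences(node_a, node_b))
```

An empty result means both nodes report the same CARP sysctls, so the difference has to be somewhere else.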
-
@Derelict No weird messages, and I don't recall seeing any of such messages when the secondary is acting up. Nonetheless, I will reproduce the problem again tonight and show you if any message like that shows up.
This is why I was surprised to see (from a packet capture) that the secondary is picking up the messages from the primary while they are both showing as master, and yet for some reason it is not going back to being Backup.
Here is the result of the commands on both. The primary is pfSense01 and the secondary is pfSense02
[2.4.4-RELEASE][admin@pfSense01.localdomain]/root: sysctl -a | grep carp
device carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 0
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.allow: 1
net.pfsync.carp_demotion_factor: 0
[2.4.4-RELEASE][admin@pfSense02.localdomain]/root: sysctl -a | grep carp
device carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 0
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.allow: 1
net.pfsync.carp_demotion_factor: 0
-
That all looks perfectly normal.
The capture posted clearly shows the secondary using an advskew of 240, yet you say the advskews are all set at 100 as they should be.
Another interesting data point would be the output of
ifconfig -a
at the time.
-
@Derelict If you look at the other captures posted, they show a skew of 100 configured in the secondary's GUI, so it's strange that it says 240 through the console.
Also, I logged into the primary and that triggered the secondary taking over as Master, so I had to reboot it. Before rebooting I got you a screenshot and the output of the commands you asked for, taken while the secondary was acting as Master.
This was while both the primary and the secondary were showing as Master for the WAN:
[2.4.4-RELEASE][admin@pfSense01.localdomain]/root: sysctl -a | grep carp
device carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 0
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.allow: 1
net.pfsync.carp_demotion_factor: 0
[2.4.4-RELEASE][admin@pfSense02.localdomain]/root: sysctl -a | grep carp
device carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 0
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.allow: 1
net.pfsync.carp_demotion_factor: 0
For the interface settings, I created a pastebin, since the output is giant and I don't want to completely clog this thread.
Again, this was while the secondary was acting as Master:
https://pastebin.com/4dek4qGJ
-
The most relevant that I see is with the WAN interface:
The primary:
hn1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=48001b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,LINKSTATE,TXCSUM_IPV6>
ether 00:15:5d:34:44:15
hwaddr 00:15:5d:34:44:15
inet6 fe80::215:5dff:fe34:4415%hn1 prefixlen 64 scopeid 0x6
inet 205.251.108.165 netmask 0xffffffe0 broadcast 205.251.108.191
inet 205.251.108.169 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.170 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.171 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.172 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.173 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.174 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.175 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.176 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.177 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.178 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.179 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.180 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.181 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.182 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.183 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.184 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.164 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-T <full-duplex>)
status: active
carp: MASTER vhid 22 advbase 1 advskew 0
The secondary:
hn1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=48001b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,LINKSTATE,TXCSUM_IPV6>
ether 00:15:5d:b5:91:0f
hwaddr 00:15:5d:b5:91:0f
inet6 fe80::215:5dff:feb5:910f%hn1 prefixlen 64 scopeid 0x6
inet 205.251.108.166 netmask 0xffffffe0 broadcast 205.251.108.191
inet 205.251.108.169 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.170 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.172 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.173 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.174 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.175 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.176 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.177 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.178 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.179 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.180 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.181 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.182 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.183 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.184 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.164 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.171 netmask 0xffffffe0 broadcast 205.251.108.191
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-T <full-duplex>)
status: active
carp: MASTER vhid 22 advbase 1 advskew 254
For some reason, the last inet (205.251.108.171) does not show a vhid on the secondary. The first one is the interface IP address, so I think it's normal that it doesn't show a vhid.
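A difference like the missing vhid on .171 is easy to overlook in a long listing. The following minimal Python sketch compares the `inet` lines of two `ifconfig` outputs and reports addresses whose vhid differs. The helper names are made up for this example, and the sample text is a shortened excerpt of the lines above, not the full output.

```python
def carp_addresses(ifconfig_text: str) -> dict:
    """Map each IPv4 address to its vhid, or None when no vhid is shown."""
    result = {}
    for line in ifconfig_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "inet":
            vhid = None
            if "vhid" in fields[:-1]:
                vhid = int(fields[fields.index("vhid") + 1])
            result[fields[1]] = vhid
    return result


def vhid_mismatches(primary: dict, secondary: dict) -> dict:
    """Addresses present on both nodes whose vhid differs: (primary, secondary)."""
    shared = primary.keys() & secondary.keys()
    return {a: (primary[a], secondary[a]) for a in sorted(shared) if primary[a] != secondary[a]}


# Shortened excerpt of the two listings above.
primary_text = """\
inet 205.251.108.165 netmask 0xffffffe0 broadcast 205.251.108.191
inet 205.251.108.170 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.171 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
"""
secondary_text = """\
inet 205.251.108.166 netmask 0xffffffe0 broadcast 205.251.108.191
inet 205.251.108.170 netmask 0xffffffe0 broadcast 205.251.108.191 vhid 22
inet 205.251.108.171 netmask 0xffffffe0 broadcast 205.251.108.191
"""

print(vhid_mismatches(carp_addresses(primary_text), carp_addresses(secondary_text)))
```

The node-specific interface addresses (.165 and .166) are ignored because each exists on only one node.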
-
The secondary thinks it has an interface down and has been demoted (hence advskew 254)
You have something screwed up somewhere. Sorry but with what we have that is the best I can do here.
My guess is something in Hyper-V. Hard to say. But these problems are almost always Layer 2 problems.
-
@Derelict Just so I understand: would the interface that is down be the virtual one ending in 171, since it's not showing the vhid? There is only one interface, the WAN, and it's showing as active. The others are IP Aliases of that interface created in pfSense.
The only reason I doubt it could be anything in Hyper-V is that these same machines were all working fine until I switched to a different datacenter provider. So something must have changed somewhere: either they are messing something up with some traffic, or I configured something wrong.
Even now that it's showing as backup, it's showing an advskew of 254
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-T <full-duplex>)
status: active
carp: BACKUP vhid 22 advbase 1 advskew 254
-
On a healthy system the primary would be showing skew 0, the secondary skew 100.
Check the system log for entries related to why it is changing the skew to 254.