CARP VIP periodic packet loss

Snoopy

I'm seeing very strange behaviour in pfSense 1.2.3, and I can't isolate the problem.

The config:
ISP switch - pfsense - wan_ip .6 – 192.168.0.1/24 user lan
carp_vip .5 -- DMZ interface for mail server with local ip 10.0.0.5
carp_vip .4 -- DMZ2 interface for vpn with local ip 10.0.1.2

DMZ and DMZ2 have NAT 1:1 rules.

It was working fine for years until yesterday. Since then pings from outside to vip .5 are periodically lost: it responds for 87 seconds, then there is no reply for 27 seconds, then it comes back for another 87 secs...
Pings from internal lan to mailserver's local ip 10.0.0.5 are not lost. Pings from outside to .4 vpn box are also fine.

Tried all this, and nothing helped:
changed physical interface on pfsense for DMZ
tried laptop instead of real mail server

Also tried disabling carp_vip .5 and putting a laptop with the same address .5 directly on the ISP switch: no loss!
Put everything back, then ran a packet capture on the WAN side, but when pings are lost, I don't even see the ICMP request coming from the ISP switch (should I?)

The ISP says everything's fine on their side. They also claim that the ARP entry for .5 has not expired when packets are lost. But if that is true, then why can't I see the ICMP request? Is the problem on my side or theirs?

The only thing I changed on pfSense around the time the problem began was one firewall rule on the DMZ interface (I changed the source IP in a traffic block rule).

rickbaran

I know that this is not a direct answer, but I would look at upgrading. There were so many fixes in 2.x compared to 1.2.x. I had a whole lot of weird things happening in 1.2.x, and all of my issues went away after the upgrade; most importantly, no new issues appeared.

Snoopy

Yup, probably it's time.

For the time being, I just moved the mail server to another external IP, and it works fine. I will still experiment some more on the old one and put some bogus server behind it, because I still have a feeling that it's the ISP's fault.

cmb

Sounds a lot like an IP conflict, or potentially a MAC conflict (if you or your ISP are running CARP or VRRP somewhere else on the same broadcast domain with the same VHID).

You should definitely upgrade, but it's highly unlikely the described scenario will be any different.
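
One way to look for the kind of conflict described above is to count who is sending CARP/VRRP advertisements for each VHID. Normally only the current master advertises, so two different source addresses for the same vrid would suggest a second master or a foreign device using the same VHID. The following is only a sketch: the function name `carp_advertisers` is made up for illustration, and it assumes tcpdump text output in the "VRRPv2, Advertisement, vrid N" form.

```shell
# Hypothetical helper: summarize CARP/VRRP advertisements by vrid and source.
# Reads tcpdump text output on stdin and prints one line per (vrid, source) pair.
carp_advertisers() {
    awk '/Advertisement/ && / vrid / {
        src = $3                          # third field is the source address
        for (i = 1; i <= NF; i++)
            if ($i == "vrid") {           # the value follows the "vrid" keyword
                vrid = $(i + 1)
                sub(/,$/, "", vrid)       # strip the trailing comma
            }
        seen[vrid " " src]++
    }
    END {
        for (k in seen) print "vrid", k, "adverts:", seen[k]
    }' | sort
}

# Example use on the firewall (em1 is a placeholder interface name):
#   tcpdump -ni em1 -c 100 proto carp | carp_advertisers
```

More than one output line for the same vrid is the thing to look for. An IP conflict without CARP involved would not show up here and would need an ARP-level check instead.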

Reiner030

Hi,

upgrading can't help - I have the same problem with pfSense 2.1-BETA1 … :(

I have a DMZ setup for 2 buildings... the DMZ area is a public AS.
The "public" router pair in building 1 has the .1 and works great from all firewalls.
The "public" router pair in building 2 has the .254 and works poorly...

I first noticed it when my master router in building 2 crashed and the slave router took over the .254.
The only machine that could ping the CARP IP was the slave itself. All other firewalls got no response.

Now that the master is up again they get answers, but with loss percentages between 18% and 52%.
The only machine affected by packet loss is the holder of the .254 (even the slave sees losses) :(
Important: in the other direction everything works completely without packet loss, so it can't be a local networking problem
(it's a VLAN, and all other normal traffic has no problems either).

When I found this problem I debugged with "tcpdump -ni em1 icmp" on the slave holding .254 => no ICMP pings arrived at the machine, but pings to .252 came through 100%.
So I checked with "arp <ip>" which MAC it resolves to... the correct CARP MAC was shown...
To rule out a second host answering for this IP, I cleared the ARP cache with "arp -d -a" and pinged again.
But again the right CARP MAC appeared in the ARP cache... tested several times from different sources.

Here is a short overview of the current ping responses from the internal master in building 1 to .254:


[2.1-BETA1][root@fw1-jws1.local]/root(3): ping xx.xx.176.254
PING xx.xx.176.254 (xx.xx.176.254): 56 data bytes
64 bytes from xx.xx.176.254: icmp_seq=0 ttl=64 time=2.343 ms
64 bytes from xx.xx.176.254: icmp_seq=2 ttl=64 time=2.262 ms
64 bytes from xx.xx.176.254: icmp_seq=3 ttl=64 time=2.167 ms
64 bytes from xx.xx.176.254: icmp_seq=6 ttl=64 time=2.308 ms
64 bytes from xx.xx.176.254: icmp_seq=7 ttl=64 time=2.403 ms
64 bytes from xx.xx.176.254: icmp_seq=9 ttl=64 time=2.502 ms
64 bytes from xx.xx.176.254: icmp_seq=10 ttl=64 time=7.149 ms
^C
--- xx.xx.176.254 ping statistics ---
11 packets transmitted, 7 packets received, 36.4% packet loss
round-trip min/avg/max/stddev = 2.167/3.019/7.149/1.689 ms

And this is what is seen on the interface of the master in building 2 holding IP .254:


[2.1-BETA1][root@gw1.zws8.local]/root(13): tcpdump -ni em1 icmp | grep -E "seq [0-9]{1,3},"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em1, link-type EN10MB (Ethernet), capture size 96 bytes
13:09:09.795925 IP xx.xx.176.5 > xx.xx.176.254: ICMP echo request, id 4933, seq 0, length 64
13:09:09.795958 IP xx.xx.176.254 > xx.xx.176.5: ICMP echo reply, id 4933, seq 0, length 64
13:09:11.820866 IP xx.xx.176.5 > xx.xx.176.254: ICMP echo request, id 4933, seq 2, length 64
13:09:11.820879 IP xx.xx.176.254 > xx.xx.176.5: ICMP echo reply, id 4933, seq 2, length 64
13:09:12.830401 IP xx.xx.176.5 > xx.xx.176.254: ICMP echo request, id 4933, seq 3, length 64
13:09:12.830418 IP xx.xx.176.254 > xx.xx.176.5: ICMP echo reply, id 4933, seq 3, length 64
13:09:15.859130 IP xx.xx.176.5 > xx.xx.176.254: ICMP echo request, id 4933, seq 6, length 64
13:09:15.859143 IP xx.xx.176.254 > xx.xx.176.5: ICMP echo reply, id 4933, seq 6, length 64
13:09:16.868789 IP xx.xx.176.5 > xx.xx.176.254: ICMP echo request, id 4933, seq 7, length 64
13:09:16.868802 IP xx.xx.176.254 > xx.xx.176.5: ICMP echo reply, id 4933, seq 7, length 64
13:09:17.264433 IP xx.xx.176.6 > xx.xx.176.254: ICMP echo request, id 45327, seq 355, length 60
13:09:17.264452 IP xx.xx.176.254 > xx.xx.176.6: ICMP echo reply, id 45327, seq 355, length 60
13:09:18.274343 IP xx.xx.176.6 > xx.xx.176.254: ICMP echo request, id 45327, seq 611, length 60
13:09:18.274356 IP xx.xx.176.254 > xx.xx.176.6: ICMP echo reply, id 45327, seq 611, length 60
13:09:18.888109 IP xx.xx.176.5 > xx.xx.176.254: ICMP echo request, id 4933, seq 9, length 64
13:09:18.888121 IP xx.xx.176.254 > xx.xx.176.5: ICMP echo reply, id 4933, seq 9, length 64
13:09:19.284002 IP xx.xx.176.6 > xx.xx.176.254: ICMP echo request, id 45327, seq 867, length 60
13:09:19.284017 IP xx.xx.176.254 > xx.xx.176.6: ICMP echo reply, id 45327, seq 867, length 60
^C714 packets captured
4302 packets received by filter
0 packets dropped by kernel
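
The echo requests from xx.xx.176.5 that show up in this capture line up with the sequence numbers that got replies in the ping run, which fits the observation that the lost requests never reach the interface. To make that comparison less error-prone, the gaps in a ping run can be listed automatically. This is a minimal sketch: `missing_seqs` is a made-up helper name, and it only detects gaps up to the highest sequence number that was answered.

```shell
# Hypothetical helper: list the icmp_seq values missing from ping output on stdin.
missing_seqs() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^icmp_seq=[0-9]+$/) {
                n = substr($i, 10) + 0    # number after "icmp_seq="
                got[n] = 1
                if (n > max) max = n
            }
    }
    END {
        out = ""; sep = ""
        for (s = 0; s <= max; s++)
            if (!(s in got)) { out = out sep s; sep = " " }
        print out
    }'
}

# Example use (the address is a placeholder):
#   ping -c 20 192.0.2.254 | missing_seqs
```

The printed sequence numbers can then be checked against the "seq N" values in the tcpdump output on the target to see whether the lost requests arrived at all.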

And the ARP cache matches again… this is a CARP MAC, and the right one...


[2.1-BETA1][root@fw1-jws1.local]/root(4): arp xx.xx.176.254
xx.xx.176.254 (xx.xx.176.254) at 00:00:5e:00:01:d5 on em0_vlan7 expires in 809 seconds [vlan]


[2.1-BETA1][root@gw1.zws8.local]/root(14): tcpdump -ni em1 proto carp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em1, link-type EN10MB (Ethernet), capture size 96 bytes
...
13:41:04.203621 IP xx.xx.176.253 > 224.0.0.18: VRRPv2, Advertisement, vrid 213, prio 0, authtype none, intvl 1s, length 36

VHID/vrid 213 => MAC :D5 …
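
The mapping from VHID to MAC can be checked quickly: CARP uses the VRRP virtual MAC prefix 00:00:5e:00:01, with the VHID as the last octet in hex. The function name `carp_mac` below is just an illustrative helper, not part of pfSense.

```shell
# Hypothetical helper: derive the CARP/VRRP virtual MAC for a given VHID.
# The first five octets are fixed; the last octet is the VHID in hex.
carp_mac() {
    # $1: VHID as a decimal number (1-255)
    printf '00:00:5e:00:01:%02x\n' "$1"
}

carp_mac 213
```

For VHID 213 this prints 00:00:5e:00:01:d5, which matches the MAC shown in the arp output above.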

Curious: if I ping it on my other internal transfer net, it works great, too (.251 is the virtual IP for gw1-zws8.local on it):


[2.1-BETA1][root@fw1-jws1.local]/root(7): ping 192.168.6.251
PING 192.168.6.251 (192.168.6.251): 56 data bytes
64 bytes from 192.168.6.251: icmp_seq=0 ttl=64 time=3.487 ms
64 bytes from 192.168.6.251: icmp_seq=1 ttl=64 time=2.282 ms
64 bytes from 192.168.6.251: icmp_seq=2 ttl=64 time=2.066 ms
64 bytes from 192.168.6.251: icmp_seq=3 ttl=64 time=2.157 ms
64 bytes from 192.168.6.251: icmp_seq=4 ttl=64 time=2.184 ms
64 bytes from 192.168.6.251: icmp_seq=5 ttl=64 time=2.125 ms
64 bytes from 192.168.6.251: icmp_seq=6 ttl=64 time=2.136 ms
64 bytes from 192.168.6.251: icmp_seq=7 ttl=64 time=2.613 ms
64 bytes from 192.168.6.251: icmp_seq=8 ttl=64 time=2.410 ms
64 bytes from 192.168.6.251: icmp_seq=9 ttl=64 time=2.508 ms
64 bytes from 192.168.6.251: icmp_seq=10 ttl=64 time=2.635 ms
^C
--- 192.168.6.251 ping statistics ---
11 packets transmitted, 11 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 2.066/2.418/3.487/0.389 ms

Bests

Reiner

Reiner030

Sorry, I found the cause of the problem from my previous post…
Several days before this faulty behaviour started, another admin had moved my test VM to a different ESX server that had not been "fixed", and I had forgotten about it:
http://doc.pfsense.org/index.php/CARP_Configuration_Troubleshooting#VMware_ESX.2FESXi_Users

Perhaps this troubleshooting page helps the original poster, too ...

Bests

Reiner