Bug in ICMP monitoring for failover



  • Hi everyone,

    I had this problem with pfSense 1.2.2.
    I decided to give 1.2.3-RC1 a try, and it's still there.

    I have 2 DSL connections. I'm using the load balancer in failover mode.

    In the load balancer I have two gateways: one that uses WAN and fails over to WAN2 (I called it WANifupWAN2else), and one that uses WAN2 and fails over to WAN (I called it WAN2ifupWANelse).

    I use the first hops on my ISPs to poll the 2 connections:
    x1.x2.x3.x4 to check WAN
    y1.y2.y3.y4 to check WAN2

    Both of them respond perfectly to echo requests (I ran a ping on both from the pfSense machine for many hours with no packet loss).

    However, I got this in the system logs (newest entries first):

    Jun 23 17:56:39 slbd[436]: Service WAN2ifupWANelse changed status, reloading filter policy
    Jun 23 17:56:39 slbd[436]: Service WANifupWAN2else changed status, reloading filter policy
    Jun 23 17:56:39 slbd[436]: ICMP poll succeeded for y1.y2.y3.y4, marking service UP
    Jun 23 17:56:39 slbd[436]: ICMP poll succeeded for x1.x2.x3.x4, marking service UP
    Jun 23 17:56:29 slbd[436]: Service WAN2ifupWANelse changed status, reloading filter policy
    Jun 23 17:56:29 slbd[436]: Service WANifupWAN2else changed status, reloading filter policy
    Jun 23 17:56:29 slbd[436]: ICMP poll failed for x1.x2.x3.x4, marking service DOWN
    Jun 23 17:56:29 slbd[436]: ICMP poll failed for x1.x2.x3.x4, marking service DOWN
    Jun 23 17:55:28 slbd[436]: Service WAN2ifupWANelse changed status, reloading filter policy
    Jun 23 17:55:28 slbd[436]: Service WANifupWAN2else changed status, reloading filter policy
    Jun 23 17:55:28 slbd[436]: ICMP poll succeeded for y1.y2.y3.y4, marking service UP
    Jun 23 17:55:28 slbd[436]: ICMP poll succeeded for x1.x2.x3.x4, marking service UP
    Jun 23 17:55:23 slbd[436]: Service WAN2ifupWANelse changed status, reloading filter policy
    Jun 23 17:55:23 slbd[436]: Service WANifupWAN2else changed status, reloading filter policy
    Jun 23 17:55:23 slbd[436]: ICMP poll failed for y1.y2.y3.y4, marking service DOWN
    Jun 23 17:55:23 slbd[436]: ICMP poll failed for y1.y2.y3.y4, marking service DOWN
    Jun 23 17:54:17 slbd[436]: Service WAN2ifupWANelse changed status, reloading filter policy
    Jun 23 17:54:17 slbd[436]: Service WANifupWAN2else changed status, reloading filter policy
    Jun 23 17:54:17 slbd[436]: ICMP poll succeeded for y1.y2.y3.y4, marking service UP
    Jun 23 17:54:17 slbd[436]: ICMP poll succeeded for x1.x2.x3.x4, marking service UP
    Jun 23 17:54:12 slbd[436]: Service WAN2ifupWANelse changed status, reloading filter policy
    Jun 23 17:54:12 slbd[436]: Service WANifupWAN2else changed status, reloading filter policy
    Jun 23 17:54:11 slbd[436]: ICMP poll failed for x1.x2.x3.x4, marking service DOWN
    Jun 23 17:54:11 slbd[436]: ICMP poll failed for x1.x2.x3.x4, marking service DOWN

    As you can see, the failover services keep changing status for no apparent reason.

    I decided to take a look under the hood and found this script: /usr/local/sbin/slbd.sh, which takes an IP address as an argument and pings it.
    I modified the script to generate some custom logs and found the bug:

    Most of the time, the argument passed to this script is either x1.x2.x3.x4 or y1.y2.y3.y4 (I think it should never be anything else).
    BUT from time to time (every minute or so) the argument is a corrupt IP address, something like "x1.x2..y3.y4", which looks like a mix of the two monitored addresses.
    So I get this error in my generated logs: "ping: cannot resolve x1.x2..y3.y4: Unknown server error"
    And the service is marked DOWN at this exact moment.
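
    A minimal sketch of the kind of argument check that can expose this sort of corruption. It is not the actual content of slbd.sh; the function names is_valid_ipv4 and check_target are hypothetical, and the addresses in the usage note are documentation placeholders:

    ```shell
    #!/bin/sh
    # Hypothetical helpers for illustration only; NOT taken from slbd.sh.

    # Return 0 if $1 looks like a dotted-quad IPv4 address, 1 otherwise.
    is_valid_ipv4() {
        # Four groups of 1-3 digits separated by single dots, nothing else.
        printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
        # Split on dots and check that every octet is at most 255.
        old_ifs=$IFS
        IFS=.
        set -- $1
        IFS=$old_ifs
        for octet in "$@"; do
            [ "$octet" -le 255 ] || return 1
        done
        return 0
    }

    # Print a line saying whether the polled target is well formed.
    # In a real script this could be sent to syslog instead of stdout.
    check_target() {
        if is_valid_ipv4 "$1"; then
            echo "ok: $1"
        else
            echo "malformed target: '$1'"
        fi
    }
    ```

    Calling check_target 192.0.2.1 reports the address as ok, while check_target "192.0..113.1" reports a malformed target, which is the pattern seen in the custom logs.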

    I don't know what calls slbd.sh, but I believe the caller has a bug.

    Can you tell me where to look? I would be happy to help fix this bug!





  • Reda: it sounds like what you're hitting is an odd FreeBSD bug with missing ICMP replies. It appears we have a workaround: switching from slbd to a custom-patched apinger. Try a 1.2.3 snapshot from http://snapshots.pfsense.org and report back.

