Bug in apinger halts load balancing and failover



  • on pfSense 1.2.3-RELEASE

    Running in failover mode between WAN and OPT1, I noticed that once in a while monitoring of the pool stopped after the following error was logged in slbd.log:

    Dec 18 01:15:53 webxaccelerator apinger: 208.67.222.222: Lost packet count mismatch (-20!=0)!
    Dec 18 01:15:53 webxaccelerator apinger: 208.67.222.222: Received packets buffer: ################################################## ####################

    ps aux | grep apinger shows that apinger is no longer running. This causes failover and load balancing to stop working, since no process is monitoring the interfaces.

    Looking at the source code of apinger.c, we see on line 854 that apinger exits on this error.

                    if (t->recently_lost!=really_lost){
                            fprintf(f,"  lost packet count mismatch (%i!=%i)!\n",t->recently_lost,really_lost);
                            logit("%s: Lost packet count mismatch (%i!=%i)!",t->name,t->recently_lost,really_lost);
                            logit("%s: Received packets buffer: %s %s\n",t->name,buf2,buf1);
                            err=1;
                    }
                    free(buf1);
                    free(buf2);

                    fprintf(f,"\n");
            }
            fclose(f);
            if (err) abort();

    Patching apinger.c as follows

    vmmail3# diff apinger.c apinger.c.orig
    858,859c858
    < t->recently_lost = really_lost = 0;
    < // err=1;
    ---
    > err=1;

    prevents apinger from exiting on error. Load balancing and failover now work as expected even when the mismatch is detected and logged:

    Dec 18 20:52:37 webxaccelerator apinger: 208.67.222.222: Lost packet count mismatch (-21!=0)!
    Dec 18 20:52:37 webxaccelerator apinger: 208.67.222.222: Received packets buffer: ################################################## ####################
    Dec 18 21:05:55 webxaccelerator apinger: ALARM: 208.67.220.220(208.67.220.220)  *** down ***
    Dec 18 21:06:03 webxaccelerator apinger: alarm canceled: 208.67.220.220(208.67.220.220)  *** down ***
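
    To illustrate the idea behind the patch outside of apinger itself, here is a minimal, self-contained sketch. Everything in it is a simplified assumption: struct target_stats, count_lost() and check_and_reset() are hypothetical names rather than apinger's real ones, and the buffer is assumed to use '#' for a received packet and any other character for a lost one. Like the patch, it logs the mismatch and zeroes the counter instead of aborting.

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Hypothetical, simplified stand-in for apinger's per-target state. */
    struct target_stats {
        const char *name;
        int recently_lost;   /* running loss counter kept by the pinger */
    };

    /* Recount lost packets from the window string.
     * Assumed encoding: '#' = received, anything else = lost. */
    static int count_lost(const char *window)
    {
        int lost = 0;
        assert(window != NULL);
        for (const char *p = window; *p != '\0'; p++)
            if (*p != '#')
                lost++;
        return lost;
    }

    /* Compare the running counter with the recount. On a mismatch,
     * log it and reset the counter to zero (as the patch does)
     * rather than calling abort(). Returns 1 if a reset happened. */
    static int check_and_reset(struct target_stats *t, const char *window)
    {
        int really_lost = count_lost(window);
        if (t->recently_lost == really_lost)
            return 0;
        fprintf(stderr, "%s: Lost packet count mismatch (%i!=%i)!\n",
                t->name, t->recently_lost, really_lost);
        t->recently_lost = 0;
        return 1;
    }
    ```

    With a counter of -20 and a buffer that is all '#', as in the log above, check_and_reset() logs once, resets the counter to 0, and the next check passes without any further message.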

    Hope this helps.

    --luis



  • So I have looked at this a little more closely.  The version of apinger included in pfPorts seems to have the same issue.  Basically, if an inconsistency is found in the number of packets lost, apinger exits.  In my mind apinger should ** NEVER ** exit.

    It seems that the apinger in pfPorts is used when building pfSense 2.0.  1.2.3-RELEASE uses the FreeBSD ports version.

    Following is a patch against the FreeBSD ports version of apinger that resolves my issue with failover pools halting when inconsistent packet loss is detected.  I don't currently do any work with 2.0, but it would be good if one of the maintainers applied the same change to the apinger included in pfPorts.

    --- apinger.c  2010-12-21 08:47:22.000000000 +0000
    +++ apinger.c.new      2010-12-21 08:47:15.000000000 +0000
    @@ -787,7 +787,6 @@
    time_t tm;
    int i,qp,really_lost;
    char *buf1,*buf2;
    -int err=0;

    if (config->status_file==NULL) return;

    @@ -855,7 +854,7 @@
                            fprintf(f,"  lost packet count mismatch (%i!=%i)!\n",t->recently_lost,really_lost);
                            logit("%s: Lost packet count mismatch (%i!=%i)!",t->name,t->recently_lost,really_lost);
                            logit("%s: Received packets buffer: %s %s\n",t->name,buf2,buf1);
    -                      err=1;
    +                      t->recently_lost = really_lost = 0;
                    }
                    free(buf1);
                    free(buf2);
    @@ -863,7 +862,6 @@
                    fprintf(f,"\n");
            }
            fclose(f);
    -      if (err) abort();
    }

    #ifdef FORKED_RECEIVER



  • lsoltero,

    No wonder the BETA4 build that I've been testing resorted to only one link after a while. I hope the port gets patched as soon as possible. Thank you for identifying this.

    regards ..



  • I posted a bug report here.
    http://redmine.pfsense.org/issues/1127

    Hopefully someone will look at it and take appropriate action.  Otherwise your only option is to build a DevISO and then patch apinger yourself.

    Good luck.

    --luis



  • I use pfSense 1.2.3 embedded - can you recommend a workaround to automate restart of apinger once it has exited?

    Regards,
    gergero



  • The best way to address this is to apply the patch to apinger.c, recompile, and then replace the apinger executable that comes with 1.2.3-RELEASE with the new version.  This is the simplest way to fix this issue.

    I have uploaded a version of the patched apinger for pfSense 1.2.3 here:

    http://www.globalmarinenet.com/downloads/wxa/apinger

    you are welcome to use that if you like.  Download the new apinger and copy it to /usr/local/sbin on your box.

    My understanding is that maintenance on 1.2.3 has stopped so you will need to apply the patch manually.
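
    For anyone who cannot swap the binary, the restart workaround asked about above would first need a way to detect that apinger has died. Here is a minimal sketch, assuming apinger writes its PID to a pid file. The path is an assumption and is passed in by the caller, and read_pid(), process_alive() and needs_restart() are hypothetical helpers. It relies on the POSIX kill() call with signal 0, which only checks whether the process exists. How apinger is actually restarted on a given box is deliberately left out.

    ```c
    #include <assert.h>
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Read a PID from a pid file. Returns -1 if the file is
     * missing or does not contain a positive number. */
    static long read_pid(const char *pidfile)
    {
        long pid = -1;
        FILE *f;

        assert(pidfile != NULL);
        f = fopen(pidfile, "r");
        if (f == NULL)
            return -1;
        if (fscanf(f, "%ld", &pid) != 1 || pid <= 0)
            pid = -1;
        fclose(f);
        return pid;
    }

    /* Returns 1 if a process with this PID exists. Signal 0 sends
     * nothing; kill() only performs the existence/permission check.
     * EPERM means the process exists but belongs to another user. */
    static int process_alive(long pid)
    {
        if (pid <= 0)
            return 0;
        if (kill((pid_t)pid, 0) == 0)
            return 1;
        return errno == EPERM;
    }

    /* Returns 1 if the daemon behind the pid file is not running. */
    static int needs_restart(const char *pidfile)
    {
        return !process_alive(read_pid(pidfile));
    }
    ```

    A small loop or cron job could call needs_restart() periodically and start apinger again when it returns 1. This only masks the crash, so patching remains the cleaner fix.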

    take care.

    --luis



  • lsoltero,

    by any chance, can I apply that file you uploaded to v2BETA4?

    regards,

    Najib



  • I don't think so, but you can try…  I have no experience with 2.0, but I did look at the apinger.c code in pfTools and it is quite different from that found in the FreeBSD ports.  I did patch and compile the 2.0 version and tried to run it on 1.2.3, and watched it core dump on startup.

    Here is the patch for the 2.0 version in pfTools:


    --- apinger.c 2010-12-21 08:41:44.000000000 +0000
    +++ apinger.c.new 2010-12-23 15:54:35.000000000 +0000
    @@ -805,7 +805,6 @@
     time_t tm;
     int i,qp,really_lost;
     char *buf1,*buf2;
    -int err=0;

     if (config->status_file==NULL) return;

    @@ -867,7 +866,7 @@
     if (t->recently_lost!=really_lost){
     logit("Target \"%s\": Lost packet count mismatch (%i(recently_lost) != %i(really_lost))!",t->name,t->recently_lost,really_lost);
     logit("Target \"%s\": Received packets buffer: %s %s\n",t->name,buf2,buf1);
    -err=1;
    +t->recently_lost = really_lost = 0;
     }
     free(buf1);
     free(buf2);
    @@ -875,7 +874,6 @@
     fprintf(f,"\n");
     }
     fclose(f);
    -if (err) abort();
     }

     void main_loop(void){


    You can download a binary version of this from
    http://www.globalmarinenet.com/downloads/wxa/apinger2.0

    You will need to download this and copy it to /usr/local/sbin/apinger on your box.  Note that I can't test this since I am not running 2.0.  This version of apinger was compiled under FreeBSD 7.3, so I am not sure whether it will run under 8.0.  Try it out and let us know if it works...

    --luis



  • luis,

    I saw your bug post on redmine. Perhaps this issue will be rectified in the next upgrade image (tomorrow). Waiting for that, I shall :-)



  • @lsoltero:

    I have uploaded a version of the patched apinger for pfSense 1.2.3 to here…

    THANK YOU!!!

