False "gateway down" alarms in syslog



  • Hello,

    I updated to 2.2.4 a couple of weeks ago.  I was using 2.1.5.  Not long after the update, my syslog began filling up with alarms that both my WAN gateways are dropping…simultaneously.  And then in the same exact second, they come back up.  See below -

    Sep 16 08:45:18 apinger: alarm canceled: GW_WAN(8.8.4.4) *** GW_WANdown ***
    Sep 16 08:45:18 apinger: alarm canceled: TELSTRA_DSL_DHCP(8.8.8.8) *** down ***
    Sep 16 08:45:18 apinger: ALARM: TELSTRA_DSL_DHCP(8.8.8.8) *** down ***
    Sep 16 08:45:18 apinger: ALARM: GW_WAN(8.8.4.4) *** GW_WANdown ***
    Sep 16 08:40:18 apinger: alarm canceled: GW_WAN(8.8.4.4) *** GW_WANdown ***
    Sep 16 08:40:18 apinger: alarm canceled: TELSTRA_DSL_DHCP(8.8.8.8) *** down ***
    Sep 16 08:40:18 apinger: ALARM: TELSTRA_DSL_DHCP(8.8.8.8) *** down ***
    Sep 16 08:40:18 apinger: ALARM: GW_WAN(8.8.4.4) *** GW_WANdown ***

    I tried changing the IP address they are pinging, but the same thing happens. I'm not noticing any interruption in service, but it is annoying because I no longer get an accurate alert if/when the internet connection really does drop.  Also, I checked the gateway status and it was showing ping replies less than 1ms from Google's DNS. My internet connection is not that fast :)  Average for us is 40ms-80ms.  I restarted apinger and now the response ping times are back in the 40-80ms range like they should.  But, I'm still getting the false alarms about the gateways dropping and coming back up within the same second.  Any insight into this would be greatly appreciated.  I've always been so grateful for all the help this forum provides.

    Cheers
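For anyone wanting to confirm the same pattern in their own logs, here is a minimal sketch that flags gateways whose alarm is raised and canceled within the same second. It assumes the log lines look exactly like the excerpt quoted above; the regular expression and the function name `same_second_flaps` are illustrative and not part of apinger or pfSense.

```python
import re
from collections import defaultdict

# Matches only the apinger line layout quoted in this thread (an assumption);
# other syslog layouts are silently skipped.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+apinger:\s+"
    r"(?P<event>ALARM|alarm canceled):\s+"
    r"(?P<gw>[^\s(]+)\((?P<target>[^)]+)\)"
)


def same_second_flaps(lines):
    """Return sorted (timestamp, gateway) pairs where an alarm was both
    raised and canceled within the same one-second timestamp."""
    events = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            events[(match["ts"], match["gw"])].add(match["event"])
    return sorted(
        key for key, seen in events.items()
        if seen == {"ALARM", "alarm canceled"}
    )


SAMPLE = """\
Sep 16 08:45:18 apinger: alarm canceled: GW_WAN(8.8.4.4) *** GW_WANdown ***
Sep 16 08:45:18 apinger: alarm canceled: TELSTRA_DSL_DHCP(8.8.8.8) *** down ***
Sep 16 08:45:18 apinger: ALARM: TELSTRA_DSL_DHCP(8.8.8.8) *** down ***
Sep 16 08:45:18 apinger: ALARM: GW_WAN(8.8.4.4) *** GW_WANdown ***
"""

flaps = same_second_flaps(SAMPLE.splitlines())
for timestamp, gateway in flaps:
    print(f"{timestamp}: {gateway} went down and came back in the same second")
```

Timestamps are compared as plain strings, so this only detects down/up pairs that share an identical logged second; it says nothing about why they occur.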



  • Hello, I had the very same problem this morning after changing my provider gateway.

    It turns out there was a very aggressive "ping" setting in the gateway monitoring section. It was set to 10, where the default suggests 1.

    After I changed it back to 1, I had no more false alarms.

    Go to "System -> Routing". Edit your gateway, and click on the "Advanced" button. You'll have access to the "Probe Interval" setting, which was the culprit for me.

    HTH.

    Nicolas



  • Thank you for your response.  I checked my settings and probe interval was set to the default of 1.  The unknown culprit appears to be something different for me.



  • Here's the settings I've been using for like ever with no troubles.  But I have rock solid FiOS (FTTH) service.




  • I truly do appreciate your responses, but I do have a question. As you can see in my initial post, I have a dual-WAN configuration with two separate ISPs.  Within the same second, both WAN connections report they are down, and then report they are back up.  I'm more than willing to change my config, but what would cause them both to show this odd behavior?  This only began happening after upgrading from 2.1.5 to 2.2.4.



  • @NOYB:

    Here's the settings I've been using for like ever with no troubles.  But I have rock solid FiOS (FTTH) service.

    Hello, for my understanding, I'd like to know how you calculated those values? Is there any method, or is this by trial and error?

    Thank you.


  • @nikolaii:

    I'd like to know how you calculated those values? Is there any method, or is this by trial and error?

    You keep experimenting till you make apinger STFU. :D



  • I calculated them, though it was a long time ago, so I can't really give much detail.  I think I started with the maximum down time desired before an alarm (35 seconds), then decided how many consecutive failed probes to require (3).  That gives a 15-second probe interval and 35 seconds down:
    Probe1@T0, Probe2@T15, Probe3@T30, Down@T35.

    Way back then I added the "use calculated value" options to the form which are calculated based on probe interval and down settings.  Don't recall the details of the calculation.  You'd have to look at the code for that.

    I'd say start by deciding on decent probe interval and down settings.  Then select use calculated options to see what it provides and try that for a while.  Then adjust from there.
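The probe timing described in this post (15-second interval, 35 seconds down) can be checked with a few lines of arithmetic. This is only a sketch of that reasoning, not the calculation the "use calculated value" option or apinger itself performs; the function name `probe_schedule` is made up for illustration.

```python
def probe_schedule(probe_interval, down):
    """Return the probe send times (seconds after T0) that fall before
    the down threshold. Illustrative arithmetic only, not apinger's logic."""
    if probe_interval <= 0 or down <= 0:
        raise ValueError("probe_interval and down must be positive")
    return list(range(0, down, probe_interval))


# Values taken from the post: 15-second probes, alarm after 35 seconds.
times = probe_schedule(15, 35)
print("Probes sent at:", times)
print("Failed probes required before the alarm:", len(times))
```

With these inputs, three probes (at T0, T15 and T30) fit before the 35-second threshold, which matches the sequence given above.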



  • Nice, I'll try it, thank you for the explanation.


    Nicolas

