Apinger only working on wan 8/6/13 64bit snapshot



  • On the gateway status widget, WANGW first shows a crazy-high latency value (screenshot attached), then a few seconds later goes to "pending". OPT1 always shows 0ms. Pinging from a LAN client routed through either interface gives normal values.
    2.1-RC1 (i386)
    built on Tue Aug 6 16:41:59 EDT 2013
    FreeBSD 8.3-RELEASE-p9
    apinger.conf:

    
    # pfSense apinger configuration file. Automatically Generated!
    
    ## User and group the pinger should run as
    user "root"
    group "wheel"
    
    ## Mailer to use (default: "/usr/lib/sendmail -t")
    #mailer "/var/qmail/bin/qmail-inject"
    
    ## Location of the pid-file (default: "/var/run/apinger.pid")
    pid_file "/var/run/apinger.pid"
    
    ## Format of timestamp (%s macro) (default: "%b %d %H:%M:%S")
    #timestamp_format "%Y%m%d%H%M%S"
    
    status {
    	## File where the status information should be written to
    	file "/var/run/apinger.status"
    	## Interval between file updates
    	## when 0 or not set, file is written only when SIGUSR1 is received
    	interval 5s
    }
    
    ########################################
    # RRDTool status gathering configuration
    # Interval between RRD updates
    rrd interval 60s;
    
    ## These parameters can be overridden in a specific alarm configuration
    alarm default {
    	command on "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' "
    	command off "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' "
    	combine 10s
    }
    
    ## "Down" alarm definition.
    ## This alarm will be fired when target doesn't respond for 30 seconds.
    alarm down "down" {
    	time 10s
    }
    
    ## "Delay" alarm definition.
    ## This alarm will be fired when responses are delayed more than 200ms
    ## it will be canceled, when the delay drops below 100ms
    alarm delay "delay" {
    	delay_low 200ms
    	delay_high 500ms
    }
    
    ## "Loss" alarm definition.
    ## This alarm will be fired when packet loss goes over 20%
    ## it will be canceled, when the loss drops below 10%
    alarm loss "loss" {
    	percent_low 10
    	percent_high 20
    }
    
    target default {
    	## How often the probe should be sent
    	interval 1s
    
    	## How many replies should be used to compute average delay
    	## for controlling "delay" alarms
    	avg_delay_samples 10
    
    	## How many probes should be used to compute average loss
    	avg_loss_samples 50
    
    	## The delay (in samples) after which loss is computed
    	## without this delays larger than interval would be treated as loss
    	avg_loss_delay_samples 20
    
    	## Names of the alarms that may be generated for the target
    	alarms "down","delay","loss"
    
    	## Location of the RRD
    	#rrd file "/var/db/rrd/apinger-%t.rrd"
    }
    alarm loss "WANGWloss" {
    	percent_low 40
    	percent_high 50
    }
    alarm delay "WANGWdelay" {
    	delay_low 4000ms
    	delay_high 5000ms
    }
    alarm down "WANGWdown" {
    	time 30s
    }
    target "8.8.4.4" {
    	description "WANGW"
    	srcip "10.49.82.1"
    	interval 2s
    	alarms override "WANGWloss","WANGWdelay","WANGWdown";
    	rrd file "/var/db/rrd/WANGW-quality.rrd"
    }
    
    alarm loss "OPT1GWloss" {
    	percent_low 40
    	percent_high 50
    }
    alarm delay "OPT1GWdelay" {
    	delay_low 4000ms
    	delay_high 5000ms
    }
    alarm down "OPT1GWdown" {
    	time 30s
    }
    target "8.8.8.8" {
    	description "OPT1GW"
    	srcip "10.49.81.1"
    	interval 2s
    	alarms override "OPT1GWloss","OPT1GWdelay","OPT1GWdown";
    	rrd file "/var/db/rrd/OPT1GW-quality.rrd"
    }
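
    A quick way to cross-check what apinger itself believes, independent of the widget, is to read the status file named in the status{} block above and compare it with a manual ping from the same source address (values taken from the target sections above):

    # apinger rewrites this every 5s, per the "interval 5s" setting above
    cat /var/run/apinger.status
    # manual probe from the same source IP apinger uses for WANGW
    ping -S 10.49.82.1 -c 5 8.8.4.4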
    
    




  • Working fine here.
    Try stopping and restarting apinger.
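
    For anyone doing this from a shell instead of the Status > Services page, a rough sketch (the config path and binary location are assumptions based on pfSense 2.1 defaults, so verify them on your box first):

    # stop the running instance
    /usr/bin/killall apinger
    # start it again against the generated config
    # (pfSense 2.1 normally writes it to /var/etc/apinger.conf)
    /usr/local/sbin/apinger -c /var/etc/apinger.conf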




  • I stopped/restarted apinger on both systems that I had upgraded. The gateway status widget latencies are now showing fine. I had upgraded 3 systems to Aug 6 snapshot - the 2 with multi-gateways had these symptoms. The 1 with only 1 gateway did not have a problem. Not a big enough sample size to decide if multiple gateways being monitored is the real trigger for the "feature". Will report tomorrow if I see the latency numbers go silly again.


  • Banned

    @phil.davis:

    I had upgraded 3 systems to Aug 6 snapshot - the 2 with multi-gateways had these symptoms. The 1 with only 1 gateway did not have a problem. Not a big enough sample size to decide if multiple gateways being monitored is the real trigger for the "feature". Will report tomorrow if I see the latency numbers go silly again.

    Well, if an IPv6 tunnel counts as multi-gateway, then count me in.


  • Rebel Alliance Developer Netgate

    It seems that the longer the delay to the gateway, the more likely the problem is to compound over time.

    I set the monitor IP of one of my gateways in a VM last night to 8.8.8.8, and within an hour it was reporting delays in the thousands of ms when in reality it was ~50ms.

    There is also still an issue with changing monitor IPs requiring a manual restart of apinger.


  • Banned

    @jimp:

    It seems that the longer the delay to the gateway, the more likely the problem is to compound over time.
    I set the monitor IP of one of my gateways in a VM last night to 8.8.8.8, and within an hour it was reporting delays in the thousands of ms when in reality it was ~50ms.

    Pretty much the same here. If I actually monitor the real GW, it does not happen. However, monitoring the real GW is rather useless for me; I need to monitor real internet connectivity, not a device a couple of metres away from the firewall.



  • We have the same problem. Pinging the default gateway is normal (1.3ms). When we monitor the next-hop gateway, apinger shows a ping of over 600ms, while a ping from the terminal shows 1.4ms. After a restart of apinger the values are correct again.



  • Hi, we have now used the gateway IP for monitoring and we have the problem on the backup firewall, too. The ms stack up over time: it starts at 1ms and after some hours apinger is over 2000ms. After restarting apinger it starts at 1ms again and the value creeps back up over time.



  • My (calculated) ping times are growing, too.
    My RRD graphs from the last 6 months are now squashed to less than 1 pixel.



  • Same problem here.

    Is it possible to cut out the affected part of the RRD graph?



  • I have a strong feeling my problem is related: https://redmine.pfsense.org/issues/3138
    Multi-WAN has been going fubar since I switched from RC0 to RC1 a couple of days ago.

    I can also confirm that the reported ping increases along a linear curve.



  • @DrCain:

    Is it possible to cut out the affected part of the RRD graph?

    You could export the RRD to XML, edit the XML to reset the values of the affected part of the graph, then import the XML back into the RRD.

    Export / Import RRD Database
    /usr/local/bin/rrdtool dump rrddatabase xmldumpfile
    /usr/local/bin/rrdtool restore -f xmldumpfile rrddatabase
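
    As a concrete sketch, using the WANGW quality RRD path that appears in the config earlier in the thread (adjust the filename to whichever RRD is affected):

    cd /var/db/rrd
    # dump the database to XML
    /usr/local/bin/rrdtool dump WANGW-quality.rrd > WANGW-quality.xml
    # edit the XML by hand and replace the bogus <v>...</v> samples
    # in the affected time range with NaN
    vi WANGW-quality.xml
    # write it back, overwriting the old database
    /usr/local/bin/rrdtool restore -f WANGW-quality.xml WANGW-quality.rrd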



  • I had upgraded a multi-WAN site from 6 Aug 16:41:59 EDT 2013 to the latest snapshot yesterday (so I guess it would have been about a 12 Aug snapshot).
    The 6 Aug snapshot was the one when apinger was added to the Services Status list, and apinger started counting up big numbers in the latency field. I was hoping that the later snap would fix everything.
    The site was remote from me, and reported "no/intermittent internet". It did seem that OpenVPN links to it were coming and going. I couldn't get on to it long enough to see anything real. From the descriptions, it was probably constantly failing over from 1 gateway to the other and back, and/or thinking that both gateways were down…
    I got them to switch slices and reboot, so it is back on the 6 Aug snapshot. When I logged in just now the latency figures on the OPT1 gateway were showing silly high numbers. I have disabled gateway monitoring on both gateways, and things have stabilised. For the moment, there will be no auto-failover at this site.
    Unfortunately I can't give any better information, and for obvious reasons I don't want to roll forward at this site just now!
    How are the apinger changes going? Do others have multi-WAN test systems that can be used as guinea-pigs?


  • Rebel Alliance Developer Netgate

    I have four gateways on three interfaces on a test VM and it was OK there, but they aren't "real" WANs.

    Can you give any more information about your exact gateway config there?



  • WAN - DHCP, attached to a WiMax device that has its own private IP and NATs out to internet. (Gets an address 10.1.1.x from the WiMax DHCP server)
    OPT1 - static private IP to a TP-Link ADSL router, which again NATs out to the real internet.

    WANGW - Monitor IP 8.8.8.8 - latency thresholds 4000 to 5000ms - packet loss thresholds 40 to 50% - probe interval 2 sec - down 30 sec.

    OPT1GW - Monitor IP 8.8.4.4 - latency thresholds 4000 to 5000ms - packet loss thresholds 40 to 50% - probe interval 2 sec - down 30 sec.

    These connections have reasonably high latency normally, and when saturating the links with downloads the latency would normally go high, hence the wacky high gateway monitoring parameters to prevent gateways from being declared down when they are in fact "working".

    Unfortunately I can't describe the exact symptoms, since it was all handled by phone call with instructions on how to go back. The CF card multi-slice thing is very useful. As per the previous post, I do know that links were coming and going, as I observed OpenVPN site-to-site links establishing for a minute or so, then dropping out.

    I am at another site with multi-WAN at the moment. If I can gain a little confidence that apinger in the latest build is working OK and seems to be controlling failover OK, then I can upgrade here this evening and will be around to monitor it the next few days. This site is on a 31 Jul snap, which was before the recent apinger changes. So I will easily be able to switch back slices if needed. (I am not at home with a real test box)


  • Rebel Alliance Developer Netgate

    I pulled up another VM that has a better multi-WAN config and it was still OK there.

    Though when I was experiencing problems before the latest round of fixes, it was worse with high-latency gateways, so it's possible that the issue is compounded by the actual latency there. To reproduce it you may have to artificially induce the same level of latency.
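
    For anyone trying to reproduce this, one way to fake a high-latency gateway on a FreeBSD/pfSense test VM is a dummynet pipe on the monitor IP. This is only a rough sketch, assuming the dummynet/ipfw modules can be loaded on that box (on pfSense a GUI limiter would do the same job); the monitor IP is taken from the earlier posts:

    # dummynet pulls in ipfw as a dependency; note that a stock FreeBSD
    # kernel defaults ipfw to deny-all, while pfSense kernels default to accept
    kldload dummynet
    # a pipe adding 150ms of one-way propagation delay
    ipfw pipe 1 config delay 150
    # push ICMP to/from the monitor IP through the pipe
    ipfw add 100 pipe 1 icmp from me to 8.8.8.8
    ipfw add 110 pipe 1 icmp from 8.8.8.8 to me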



  • @jimp:

    I pulled up another VM that has a better multi-WAN config and it was still OK there.

    Though when I was experiencing problems before the latest round of fixes, it was worse with high-latency gateways, so it's possible that the issue is compounded by the actual latency there. To reproduce it you may have to artificially induce the same level of latency.

    Did you try to test failover?
    As I stated in this thread http://forum.pfsense.org/index.php/topic,65455.0.html, on RC1-20130812 failover does not work anymore (in my case).
    Thanks
    FV



  • I have 2 pfSense systems with this.

    1. pfSense:
    2.1-RC1  (amd64)
    built on Thu Aug 8 14:25:22 EDT 2013
    FreeBSD 8.3-RELEASE-p9
    1 WAN (0.4ms, always green). Apinger has shown 0ms, which is wrong, since the update (pfsense1_WAN.png).
    2 OpenVPN servers (23ms + 16ms) which have growing latencies. The corresponding clients on the other side are green.

    2. pfSense:
    2.1-RC1  (amd64)
    built on Wed Aug 7 20:59:21 EDT 2013
    FreeBSD 8.3-RELEASE-p9
    2 WANs: static WAN (1.4ms) + DSL (22ms). The DSL has growing latency; the static WAN shows less latency than it really has (pfsense2_WAN.png).
    2 OpenVPN servers. Both have growing latency.
    1 OpenVPN client, which has growing latency, too.


  • Rebel Alliance Developer Netgate

    Those snapshots are known to have apinger issues, upgrade to a current snapshot.



  • I forgot to mention:
    The LAN shows strange values, too.


  • Rebel Alliance Developer Netgate

    @vielfede:

    @jimp:

    I pulled up another VM that has a better multi-WAN config and it was still OK there.

    Though when I was experiencing problems before the latest round of fixes, it was worse with high-latency gateways, so it's possible that the issue is compounded by the actual latency there. To reproduce it you may have to artificially induce the same level of latency.

    Did you try to test failover?
    As I stated in this thread http://forum.pfsense.org/index.php/topic,65455.0.html, on RC1-20130812 failover does not work anymore (in my case).
    Thanks
    FV

    It does appear as though the filter reload at the end of the apinger event isn't doing what it should there. I'll need to run some more tests to narrow it down though.
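
    For anyone wanting to poke at that part of the chain by hand, you can fire the same filter-reload command the generated "alarm default" block uses (see the config in the first post) and watch the filter reload status page in the webGUI to see whether a reload actually happens:

    # same command apinger runs via pfSctl when an alarm fires or clears
    /usr/local/sbin/pfSctl -c 'filter reload'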



  • I updated pfsense1.
    During the first few minutes I didn't see growing latencies.
    But WAN still shows 0ms in the RRD, which is less than the real 0.4ms.


  • Rebel Alliance Developer Netgate

    The lack of failover working seems to be this:
    http://redmine.pfsense.org/issues/3146



  • 2.1-RC1 (i386)
    built on Wed Aug 14 14:47:24 EDT 2013
    FreeBSD 8.3-RELEASE-p9

    Looking good so far. Someone was downloading on our 1Mbps link for an hour or so and latency went up to around 930ms. When the download finished, the latency dropped back to under 200ms. The backup link latency is hovering around 300ms. During all this time there was no "panic" from apinger, check_reload_status or anything else trying to fail the links over.

    At another site, latency on one link is ranging from 400 to 1100ms (people working it hard) and another sits around 120ms (less used). apinger is coping fine.



  • After updating to the latest release
    2.1-RC1  (amd64)
    built on Thu Aug 15 03:12:29 EDT 2013
    FreeBSD 8.3-RELEASE-p9

    the fast interfaces still show 0ms instead of 0.4ms on the dashboard and in the RRD graphs.



  • Hi

    To ggzengel: do you have "Disable Gateway Monitoring" checked?



  • @ggzengel:

    After updating to the latest release
    2.1-RC1  (amd64)
    built on Thu Aug 15 03:12:29 EDT 2013
    FreeBSD 8.3-RELEASE-p9

    the fast interfaces still show 0ms instead of 0.4ms on the dashboard and in the RRD graphs.

    Same here, 2 of my 3 gateways show 0ms but they should show respectively around 14ms and 1ms.
    My main gateway (WAN) is showing 1ms, which is correct.



  • It's not exactly 0ms and it goes up on some interfaces.


  • 2.1-RC1 (i386)
    built on Thu Aug 15 16:30:19 EDT 2013
    FreeBSD 8.3-RELEASE-p9

    2 multi-WAN systems are on this snap now. Gateway status is reporting reasonable latency numbers, and RRD quality numbers also look OK. I only have IPv4 and 2 gateways on each, so I can't speak for IPv6 or more complex systems with more gateways.



  • @jimp:

    The lack of failover working seems to be this:
    http://redmine.pfsense.org/issues/3146

    2.1-RC1 (amd64)
    built on Thu Aug 15 16:30:12 EDT 2013
    FreeBSD 8.3-RELEASE-p9

    Failover (internet) is working again!

    Unfortunately squid proxy failover (http://forum.pfsense.org/index.php/topic,60977.0.html) does not.
    Maybe this is off topic (it's not due to apinger issues), but I can't find any help to get it working on 2.1.


  • Rebel Alliance Developer Netgate

    It's not related. Keep it out of this thread.



  • @ggzengel:

    After updating to the latest release
    2.1-RC1  (amd64)
    built on Thu Aug 15 03:12:29 EDT 2013
    FreeBSD 8.3-RELEASE-p9

    the fast interfaces still show 0ms instead of 0.4ms on the dashboard and in the RRD graphs.

    Same here.  But on today's snapshot.
    2.1-RC1  (i386)
    built on Fri Aug 16 16:28:22 EDT 2013

    Up until now it had been working okay.
    It works with a manually entered alternate monitoring address, but not with the interface's own gateway. The WAN interface is working okay, but the optional interface is reporting 0ms. Reality is about 0.35ms.

    Both interfaces are VLANs on the same physical interface (bfe0).


  • Banned

    Still showing 0ms everywhere with the Aug 17 snapshot…


  • Rebel Alliance Developer Netgate

    After Ermal's last changes it should have been back to normal there; it was on my test VM, where I was seeing 0.3-0.8ms reported. I rebuilt apinger again on the snapshot builders in case I missed rebuilding it on one of them yesterday. Try it again later tonight/tomorrow when the next snap shows up.



  • On the latest snapshot
      2.1-RC1 (amd64)
    built on Sun Aug 18 19:11:51 EDT 2013
    FreeBSD 8.3-RELEASE-p9

    my IPv6 tunnel is wrong: 0.3ms instead of the ~70ms it normally is.

    It was fine on the 16th snapshot.


  • Banned

    Yeah, the thing seems to invent completely random figures between 0 and 1ms. I haven't seen any sane, reality-matching value since the major rewrites started.



  • @grandrivers:

    On the latest snapshot
      2.1-RC1 (amd64)
    built on Sun Aug 18 19:11:51 EDT 2013
    FreeBSD 8.3-RELEASE-p9

    my IPv6 tunnel is wrong: 0.3ms instead of the ~70ms it normally is.

    It was fine on the 16th snapshot.

    Just upgraded to
    2.1-RC1 (i386)
    built on Thu Aug 22 23:23:02 EDT 2013
    FreeBSD 8.3-RELEASE-p10

    from Aug 1 snapshot.

    My IPv6 tunnel now shows 0.2ms instead of the usual 12-14ms.



  • Sounds like whatever is being pinged, it ain’t your gateway…



  • @kejianshi:

    Sounds like whatever is being pinged, it ain’t your gateway…

    Nice catch, you're right. It should ping out on gif0 but it's just pinging itself on lo0:

    tcpdump -nvvi lo0 icmp6
    tcpdump: listening on lo0, link-type NULL (BSD loopback), capture size 96 bytes
    09:14:14.703602 IP6 (hlim 64, next-header ICMPv6 (58) payload length: 24) 2001:4dd0:xxxx:xxxx::2 > 2001:4dd0:xxxx:xxxx::2: [icmp6 sum ok] ICMP6, echo request, length 24, seq 26906
    09:14:14.703847 IP6 (hlim 64, next-header ICMPv6 (58) payload length: 24) 2001:4dd0:xxxx:xxxx::2 > 2001:4dd0:xxxx:xxxx::2: [icmp6 sum ok] ICMP6, echo reply, length 24, seq 26906
    
    

    According to the config it should not:

    target "2001:4dd0:xxxx:xxxx::1" {
            description "WAN_SixXS"
            srcip "2001:4dd0:xxxx:xxxx::2"
            alarms override "loss","delay","down";
            rrd file "/var/db/rrd/WAN_SixXS-quality.rrd"
    }
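
    For anyone else chasing the same symptom, it is worth checking which route the kernel resolves the monitor IP to and whether any probes actually appear on the tunnel interface (addresses as in the target block above):

    # which route/interface the monitor address resolves to
    route -n get -inet6 2001:4dd0:xxxx:xxxx::1
    # probes should show up here if apinger really goes out the tunnel
    tcpdump -ni gif0 icmp6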
    


  • I'd say your follow-up was nicer than my guess - glad it panned out.
    Well - That makes perfect sense now.  :D
    I'd guess knowing that makes it easier for a dev to fix?

