Monitor ping latency and loss
-
IPv6 monitoring was and still is disabled by the checkbox "disable gateway monitoring" at https://pfs/system_gateways_edit.php?id=0
But you are right, now that I look, there are two dpingers:
[2.3-BETA][admin@pfs.dv.loc]/cf: ps Ax | grep dpinger
51961 - Is 0:01.77 /usr/local/bin/dpinger -S -r 0 -i WAN_DHCP6 -B fe80::290:fbff:fe38:8497%em1 -p /var/run/dpinger_WAN_DHCP6_fe80::290:fbff:fe38:8497%em1_fe80::5257:a8ff:fe89:b0e2%em1.pid -u /var/run/dpinger_WAN_DHCP6_fe80::290:fbff:fe38:8497%em1_fe80::5257:a8ff
62228 - Is 0:01.87 /usr/local/bin/dpinger -S -r 0 -i WAN_DHCP -B 71.233.154.224 -p /var/run/dpinger_WAN_DHCP_71.233.154.224_71.233.152.1.pid -u /var/run/dpinger_WAN_DHCP_71.233.154.224_71.233.152.1.sock -C /etc/rc.gateway_alarm -d 0 -s 250 -l 1250 -t 30000 -A 10
Sorry about that, I should have checked the actual process list instead of assuming it wasn't running.
I'll post the results again in a few
-
IPv6 LL:
[2.3-BETA][admin@pfs.dv.loc]/cf: dpinger -f -r 10s fe80::5257:a8ff:fe89:b0e2
send_interval 250ms loss_interval 1250ms time_period 30000ms report_interval 10000ms data_len 0 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% dest_addr fe80::5257:a8ff:fe89:b0e2 bind_addr (none) identifier ""
14659 11939 0
15019 12223 0
15085 11109 0
16633 14669 0
16121 13499 0
16041 13662 0
13686 7455 0
13782 8917 0
13642 8216 0
14659 9107 0
15134 10632 0
16066 12615 0
15534 12652 8
15298 12005 8
13991 9251 6
^C
[2.3-BETA][admin@pfs.dv.loc]/cf:
IPv6 Public IP:
[2.3-BETA][admin@pfs.dv.loc]/cf: dpinger -f -r 10s 2607:f8b0:4005:801::200e
send_interval 250ms loss_interval 1250ms time_period 30000ms report_interval 10000ms data_len 0 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% dest_addr 2607:f8b0:4005:801::200e bind_addr (none) identifier ""
89395 3131 0
89407 4049 0
88941 3580 0
88805 3473 0
88830 2878 0
88961 2866 0
89016 2783 0
88906 2767 0
89167 3101 0
89163 3686 0
89305 3767 0
88996 3479 0
89461 3356 0
89635 3537 0
89853 4000 0
^C
[2.3-BETA][admin@pfs.dv.loc]/cf:
-
Oops, forgot the -d 64:
[2.3-BETA][admin@pfs.dv.loc]/cf: dpinger -f -r 10s -d 64 fe80::5257:a8ff:fe89:b0e2
send_interval 250ms loss_interval 1250ms time_period 30000ms report_interval 10000ms data_len 64 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% dest_addr fe80::5257:a8ff:fe89:b0e2 bind_addr (none) identifier ""
16337 19045 0
15300 15477 0
14745 13578 0
14373 9940 0
13590 8118 0
13658 7215 0
13473 6929 0
14460 7709 0
14571 7854 0
13612 6162 0
13185 5405 0
13867 9135 1
15311 10579 1
17019 15728 1
16142 14864 0
^C
[2.3-BETA][admin@pfs.dv.loc]/cf: dpinger -f -r 10s -d 64 2607:f8b0:4005:801::200e
send_interval 250ms loss_interval 1250ms time_period 30000ms report_interval 10000ms data_len 64 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% dest_addr 2607:f8b0:4005:801::200e bind_addr (none) identifier ""
89253 2606 0
88872 2390 0
88757 2494 0
88706 3397 0
88796 3486 0
89358 4099 0
89397 3623 0
89521 4422 0
88998 3887 0
88971 3974 0
89412 3744 0
89431 3736 0
89325 3230 0
88870 2562 0
88752 2493 0
^C
[2.3-BETA][admin@pfs.dv.loc]/cf:
-
The results for the Google address look like what you would expect, with fairly consistent latency. The results for the Comcast device don't look quite as nice, with high variance and periods of packet loss. This could be either your connection or the Cisco at Comcast dropping ICMP.
It would be great if you could run some longer tests against fe80::5257:a8ff:fe89:b0e2: several minutes' worth, both with and without the -d option. We would really like to know whether -d has an effect on loss.
Another worthwhile test would be against the next hop or two inside Comcast. Use traceroute to find the next hop:
traceroute6 2607:f8b0:4005:801::200e
and test against the next address in the chain. It should be a public address.
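For example, once the traceroute shows the first address or two past your gateway, the same dpinger commands from before can be pointed at it (the angle-bracket placeholder just stands in for whatever address traceroute reports):
dpinger -f -r 10s <next_hop_address>
dpinger -f -r 10s -d 64 <next_hop_address>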
Thank you for all the testing!
-
Split this to its own thread since it's definitely different from the thread where it started.
The latency looks legit. The fact that you're seeing ping loss when dpinger is running is almost certainly because you're triggering an ICMP rate limit. The default 250 ms probe interval (4 pings per second) is somewhat aggressive for targeting routers. Bump your probe interval up to 1000 ms, and/or use something other than your gateway as your monitor IP, and I suspect the significant loss all goes away. Try Google Public DNS, 2600::, or something else on the Internet that isn't a router as your monitor IP.
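If you want to try that quickly from the shell, something like this would do it (just a sketch; 2001:4860:4860::8888 is Google Public DNS, and -s is dpinger's send interval in milliseconds):
dpinger -f -r 10s -s 1000 2001:4860:4860::8888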
That's outside of Denny's suggestions; he knows the internals of dpinger better than I do (he wrote it, after all :)) and may see something from additional testing. But I very seriously doubt this is indicative of anything other than your ISP router rate-limiting ICMP.
-
Sorry, in case you didn't already know: I am temporarily offline with another issue related to today's upgrade (see my other post). Once I get back up I'll continue this.
-
Since getting back online I have been running dpinger on the IPv6 gateway, but using the google.com public IP as the monitor address (with -d 10), and it has been very stable.
I turned it off to do these tests.
Now that I have a stable IPv6 gateway I have been able to see that one of my other problems (in another post, https://forum.pfsense.org/index.php?topic=106208.0 ) is definitely related to the interface statistics widget.
Link local IP with -d
[2.3-BETA][admin@pfs.dv.loc]/root: dpinger -f -r 10s -d 64 fe80::5257:a8ff:fe89:b0e2
send_interval 250ms loss_interval 1250ms time_period 30000ms report_interval 10000ms data_len 64 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% dest_addr fe80::5257:a8ff:fe89:b0e2 bind_addr (none) identifier ""
14008 6398 0
13836 6881 0
13453 6543 0
13113 6426 0
13494 7079 0
13378 7074 0
14127 8412 0
14910 16067 0
15107 16361 0
15555 20331 0
15092 19053 0
14871 18811 0
14499 13864 0
14847 13298 0
15149 13637 0
15347 14008 0
14326 9305 0
13545 8265 0
14023 8757 0
14381 11362 0
14038 11216 0
13192 10802 0
14114 17362 0
14945 18234 0
15362 18714 0
18882 23396 0
19619 24044 0
19839 24304 0
15871 13585 0
13975 9185 0
14085 9097 0
14304 9652 0
14512 9870 0
14329 7914 0
14387 9690 0
14375 12533 0
13599 12129 0
12693 10011 0
12395 6899 0
13920 14267 0
15117 15398 0
15645 16837 0
14264 11001 0
14895 12955 0
17974 20858 0
17758 21027 0
^C
Link local IP without -d
[2.3-BETA][admin@pfs.dv.loc]/root: dpinger -f -r 10s fe80::5257:a8ff:fe89:b0e2
send_interval 250ms loss_interval 1250ms time_period 30000ms report_interval 10000ms data_len 0 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% dest_addr fe80::5257:a8ff:fe89:b0e2 bind_addr (none) identifier ""
12444 5070 0
11482 4919 0
13104 11378 0
13735 12420 0
15329 12942 0
14050 8850 0
16376 16276 0
16275 18579 0
16811 18699 0
14073 12078 0
13069 6939 0
12599 5807 0
12104 5173 0
12102 5230 0
12813 6943 0
13663 7752 0
14128 8018 0
13620 7172 0
13454 6745 0
13640 9103 0
14064 9150 0
15322 10590 0
16247 10187 0
17799 15492 0
17030 14685 0
15119 13693 0
13211 6178 0
13366 6446 0
15334 11999 0
15869 12207 0
14903 12175 0
14446 11830 0
15264 15047 0
16435 15549 0
17561 17752 0
16065 12439 0
16419 13778 0
14187 10027 0
14101 9178 0
15950 16336 0
16494 16696 0
16737 17325 0
14376 8842 0
14282 8341 0
14246 7509 0
15594 16329 0
16022 16985 0
15209 16735 0
^C
Public IP with -d
[2.3-BETA][admin@pfs.dv.loc]/root: dpinger -f -r 10s -d 64 2607:f8b0:4005:801::200e
send_interval 250ms loss_interval 1250ms time_period 30000ms report_interval 10000ms data_len 64 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% dest_addr 2607:f8b0:4005:801::200e bind_addr (none) identifier ""
88552 2257 0
98457 30095 0
97666 26356 0
97538 26164 0
91732 11478 0
88796 2628 0
88612 2608 0
88412 2238 0
88268 2811 0
88514 3088 0
88244 3684 0
88229 3385 0
88362 3458 0
88366 2292 0
88349 2083 0
88169 1578 0
88533 2911 0
88523 3216 0
88284 3236 0
88259 2521 0
88617 3048 0
88858 3089 0
88547 2881 0
88251 2319 0
88602 3799 0
88506 3750 0
88691 3678 0
88334 1873 0
88586 2055 0
88450 2518 0
88205 2653 0
88228 3536 0
88211 3555 0
88174 3596 0
87894 2816 0
87672 2103 0
87966 2805 0
88126 2878 0
88153 2774 0
88189 2429 0
88097 2081 0
88316 2588 0
88167 2123 0
88528 2649 0
^C
Public IP without -d
[2.3-BETA][admin@pfs.dv.loc]/root: dpinger -f -r 10s 2607:f8b0:4005:801::200e
send_interval 250ms loss_interval 1250ms time_period 30000ms report_interval 10000ms data_len 0 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% dest_addr 2607:f8b0:4005:801::200e bind_addr (none) identifier ""
87623 1885 0
87902 2028 0
88007 2101 0
87901 1914 0
87917 1950 0
87956 2063 0
88231 2210 0
88005 1933 0
88333 2739 0
88367 2756 0
88879 2978 0
89099 3454 0
88983 3516 0
88998 3557 0
88379 2795 0
88392 2900 0
88011 2778 0
88102 2599 0
88075 2453 0
88077 2650 0
88027 2346 0
88109 2267 0
88159 1838 0
87944 1816 0
87894 1961 0
87967 1981 0
88184 2174 0
88754 3252 0
88687 3424 0
88872 3531 0
88579 2821 0
88791 3121 0
88974 3240 0
88846 3169 0
88685 2995 0
88184 3024 0
88070 2816 0
87912 2479 0
88253 2735 0
88313 2857 0
88300 3051 0
87628 2027 0
87614 2399 0
87880 2646 0
88342 2707 0
88253 2438 0
88186 2450 0
88386 2802 0
88460 2861 0
88673 2852 0
88334 2698 0
88323 2506 0
87873 2078 0
^C
Traceroute6 to the public IP (normal UDP traceroute6 seems to be blocked, so using -I for ICMP):
[2.3-BETA][admin@pfs.dv.loc]/root: traceroute6 -In 2607:f8b0:4005:801::200e
traceroute6 to 2607:f8b0:4005:801::200e (2607:f8b0:4005:801::200e) from 2001:558:6017:a8:7cfa:9c15:41e5:c6eb, 64 hops max, 16 byte packets
 1  * 2001:558:6017:a8::1  25.958 ms  10.079 ms
 2  2001:558:202:2e2::1  10.612 ms  7.810 ms  9.953 ms
 3  2001:558:200:21d::2  12.619 ms  12.216 ms  14.074 ms
 4  2001:558:200:25b::1  19.298 ms  16.744 ms  26.302 ms
 5  2001:558:0:f6b6::1  24.025 ms  *  24.267 ms
 6  2001:558:0:f8d9::2  23.889 ms  24.043 ms  23.948 ms
 7  2001:559::36a  22.777 ms  22.817 ms  24.350 ms
 8  2001:4860::1:0:6572  25.901 ms  22.521 ms  24.048 ms
 9  2001:4860::8:0:4398  23.376 ms  45.874 ms  24.486 ms
10  2001:4860::8:0:9154  88.674 ms  94.894 ms  88.993 ms
11  2001:4860::8:0:b2bb  87.796 ms  88.490 ms  87.804 ms
12  2001:4860::8:0:79e6  61.584 ms  61.960 ms  62.068 ms
13  2001:4860::8:0:8bb4  87.295 ms  89.450 ms  86.295 ms
14  2001:4860::1:0:ae01  84.998 ms  86.090 ms  87.851 ms
15  2001:4860:0:1::1317  99.077 ms  88.395 ms  88.430 ms
16  2607:f8b0:4005:801::200e  90.592 ms  87.070 ms  88.283 ms
2nd public hop with -d
[2.3-BETA][admin@pfs.dv.loc]/root: dpinger -f -r 10s -d 64 2001:558:202:2e2::1
send_interval 250ms loss_interval 1250ms time_period 30000ms report_interval 10000ms data_len 64 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% dest_addr 2001:558:202:2e2::1 bind_addr (none) identifier ""
11710 4558 2
11257 3742 1
10826 3368 0
10490 2478 0
10051 2198 0
10161 2188 0
10379 2060 0
11080 3314 0
11020 3151 0
11039 3504 0
10659 2844 0
10689 3136 0
10363 2721 0
10343 2295 0
10549 2169 0
10973 3174 0
10934 3143 0
10788 3066 0
10355 2204 0
10317 2270 0
10420 2368 0
10741 2936 0
10854 2830 0
10790 2746 0
10741 2030 0
10822 2180 0
10760 2334 0
10832 2542 0
10504 2748 0
10430 2679 0
10335 2607 0
10784 2629 0
11077 2769 0
11171 2561 0
10953 2170 0
11008 1947 0
11317 2499 0
11124 2821 0
10769 2704 0
10633 2363 0
10920 2670 0
11243 2864 0
10976 2827 0
10737 2362 0
10515 2480 0
10450 2644 0
10437 2603 0
10294 2339 0
^C
2nd public hop without -d
[2.3-BETA][admin@pfs.dv.loc]/root: dpinger -f -r 10s 2001:558:202:2e2::1
send_interval 250ms loss_interval 1250ms time_period 30000ms report_interval 10000ms data_len 0 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% dest_addr 2001:558:202:2e2::1 bind_addr (none) identifier ""
10353 2435 0
10283 2120 0
10453 2062 0
10588 1873 0
10835 2057 0
10728 2267 0
11157 3068 0
10996 3066 0
11199 2931 0
11078 2971 0
11230 3771 0
11170 3759 0
10742 3166 0
10645 2395 0
10496 2418 0
10769 2862 0
10697 2519 0
10772 2547 0
10428 2070 0
10771 2777 0
10893 3073 0
11205 3374 0
10983 2884 0
11173 2908 0
11346 3172 0
11319 3332 0
10837 2721 0
10600 2049 0
10703 2032 0
10911 2153 0
11050 2391 0
10826 2471 0
10757 2529 0
10589 2117 0
10944 2584 0
10809 2508 0
10714 2689 0
10594 2617 0
10675 2805 0
10526 2600 0
10838 3033 0
^C
-
Wouldn't you know it… as soon as you start serious testing it's stable. :)
I found a consumer-grade Comcast install that I can test with, and am running some extended tests. Back to you soon.
-
Using the consumer Comcast connection, I was able to reproduce the same behavior you were seeing yesterday. With the first hop, I see a high latency standard deviation and occasional periods of substantial packet loss. However, with the second hop, I see much better results: lower average latency, much lower standard deviation, and no packet loss. In short, the second hop is a much better target for monitoring than the first hop.
The longer version… I believe that the ICMP loss is the result of overrunning the CPU in the first-hop unit when it is the target. In extended runs with the first hop as the target, I start to see dramatic increases in standard deviation and mild packet loss beginning around a 350 ms send interval. When I get down to 200 ms, the average latency is up substantially (+20%), and loss hovers around 25%.
On the other hand, if I use the second hop as the target, I see very consistent results all the way down to 1 ms (1000 requests/replies per second). The reason this works is that while the packets are traversing the first-hop unit, they do not need evaluation by its CPU. In other words, the first-hop unit has no problem forwarding the ICMP packets, but it cannot keep up with processing and replying to them because of CPU limitations.
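If you want to reproduce that kind of sweep yourself, something along these lines would do it (a rough sketch; it assumes timeout(1) is available in your FreeBSD build, otherwise just Ctrl-C each run, and <first_hop_address> is whatever your traceroute shows):
# run each send interval for about two and a half minutes, gentlest first;
# each run prints a rolling two-minute average every 30 seconds
for s in 1000 500 350 250 200; do
    echo "=== send interval ${s} ms ==="
    timeout 150 dpinger -f -t 120s -r 30s -s ${s} <first_hop_address>
done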
In order to choose a target to monitor, what I would recommend is to test the first, second or even third hop and choose the most consistent one. A good way to compare would be to run dpinger like so:
dpinger -f -t 120s -r 10s <target_ip>
This will average results over two minutes. Let it run for a few minutes and watch it. What you are looking for is zero or near-zero loss, and a standard deviation that stays consistently as far below the average latency as possible.
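For example, substituting the second public hop from the traceroute you posted:
dpinger -f -t 120s -r 10s 2001:558:202:2e2::1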
This is an example of what you do not want to see (from first hop testing):
18899 12755 0
18569 10800 0
18199 9901 0
18117 11468 0
17802 11091 4
17952 11818 5
17501 11216 5
17267 10700 6
17734 12981 6
17411 12493 7
17538 12494 7
17705 13386 8
17664 13276 10
17653 13341 11
17689 13552 11
17583 13522 12
17776 14050 12
17522 13601 12
17584 13590 13
17567 13667 13
17205 11873 13
17391 12016 14
17471 12113 14
17703 12193 15
17324 10729 14
16927 10586 14
This is an example of what you would like to see (from second hop testing):
13361 4333 0
12690 3368 0
12988 3836 0
13268 4182 0
13240 4090 0
13174 3949 0
13121 4195 0
13170 4193 0
13131 4280 0
13285 4523 0
13328 4473 0
13243 4345 0
13153 4247 0
13350 4473 0
13368 4494 0
13330 4549 0
13338 4489 0
13353 4646 0
13398 4533 0
13469 4585 0
13390 4444 0
13278 4176 0
13180 4151 0
13260 4252 0
13271 4332 0
13203 4224 0
13236 4316 0
13312 4313 0
Note that the average latency to the second hop is 25% less than to the first hop. :)
Based on what you posted previously, your second hop looked good. I would start with that as my monitor address and keep an eye on the quality graphs over time.
-
"In order to choose a target to monitor, what I would recommend is to test the first, second or even third hop and choose the most consistent one."
If you do choose something a couple of hops away, but not on the "general internet", then you need to be sure it will consistently be in the path from you through your ISP's network to the internet. If the ISP changes their router addresses or routing topology, then things are going to go bad on that WAN for no real reason.
I would either choose a close monitor IP that can be relied on, or something out on the general internet.
Actually, you usually want to know that the whole path from WANx through the ISP's network to the general internet is good, so using a monitor IP that should always be reachable and always respond is generally "a good thing".
-
Phil makes a very good point. If you choose something at Comcast other than your first hop, you will want to check periodically to ensure it's still an appropriate monitor address.
For what it's worth, I have this in my own home situation. I am on Comcast Business, and with the equipment setup I have, the first hop is on-premises. Over the years there have been various issues that required Comcast to swap out their on-site equipment or re-provision the service, and each time the second-hop address has changed. So far, it hasn't happened without a service call being involved, but I still check periodically with traceroute.
-
I have the same situation at work, and since the on-site equipment is normally going to be fine during an outage, we want to monitor an off-site IP to determine whether the WAN is up.
I have wondered how hard it would be to use two hosts, or even an array of hosts that you know are close, for up/down monitoring, plus one or two more remote IPs; give each a weight, and have rules for things like "more than 50% of them are highly latent" or "weighted packet loss above a certain percentage", etc.
It seems like it would be about as easy/fun as juggling kittens but it would be pretty useful if you could do it right…
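Just to make the idea concrete, here is a very rough sketch of that kind of weighted check from the shell (the addresses, weights and the 50% threshold are all made up for illustration; this isn't anything dpinger or pfSense does today):
#!/bin/sh
# Probe each monitor host, weight its loss, and report the weighted total.
# "address weight" pairs below are purely illustrative.
weighted_loss=$(
  for entry in "2001:4860:4860::8888 1" "2001:558:202:2e2::1 2" "fe80::5257:a8ff:fe89:b0e2%em1 3"; do
    set -- $entry
    addr=$1; weight=$2
    # FreeBSD ping6 prints e.g. "10 packets transmitted, 10 packets received, 0.0% packet loss"
    loss=$(ping6 -c 10 "$addr" 2>/dev/null | sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p')
    echo "${loss:-100} ${weight}"   # treat "no answer at all" as 100% loss
  done | awk '{ l += $1 * $2; w += $2 } END { if (w) printf "%.1f\n", l / w }'
)
echo "weighted loss: ${weighted_loss}%"
# a rule might then say: mark the WAN degraded if the weighted loss exceeds 50%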
Anyway, I have found what I think is one of Comcast's more central neighborhood/city gateways, which should be a reliable host to use, and I will try using that for dpinger monitoring.
A nice thing I have found with Comcast is that they have reverse DNS entries for much of their infrastructure, and you can increment/decrement the numbers in those hostnames to find other IPs to investigate. Some of them even have both IPv4 and IPv6 addresses for the same hostname!
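For example, using drill from the FreeBSD base system (the modified hostname below is a placeholder; the real Comcast naming pattern will differ):
drill -x 2001:558:202:2e2::1
drill <neighbouring-router-name>.comcast.net AAAA
drill <neighbouring-router-name>.comcast.net A
The first command reverse-resolves a known hop to get its name; then tweak the number in that name and resolve it forward for AAAA and A records.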
Good stuff, thanks for the help!
-
This suggests that there is some value in being able to configure a gateway monitor IP of 'nth hop to a given IP address', with the actual IPv4/IPv6 address to monitor on the interface being determined after interface creation and again after every link down to link up transition. This approach is necessarily imperfect, as routing can change dynamically, but it would be more foolproof than periodic manual redetermination of the monitor address.
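A rough sketch of how that determination could work on the firewall itself (nothing like this exists in pfSense today; the hop number and reference address below are arbitrary examples):
#!/bin/sh
# Re-derive the monitor address as "hop N towards a reference IP", e.g. after each link-up event.
HOP=2
REF=2607:f8b0:4005:801::200e
monitor=$(traceroute6 -In -m ${HOP} "${REF}" | awk -v hop="${HOP}" '$1 == hop { print $2; exit }')
case "$monitor" in
  ""|"*") echo "could not determine hop ${HOP}, keeping previous monitor address" ;;
  *)      echo "monitor address for hop ${HOP}: ${monitor}" ;;
esac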
I can't monitor my ISP's IPv6 gateway address because it doesn't respond to ICMPv6 echo requests. In scenarios like this, I look for a recursive DNS server that is topologically close to the gateway and monitor that instead. DNS servers are unlikely to change address because the address will be 'baked in' to so many configurations.
Ironically, this approach means that I land up monitoring the IPv6 address of another ISP's DNS server. My connection terminates on my ISP's London gateways, but my ISP's DNS servers are all in Rochdale, around 200 miles from London.