NanoBSD CPU usage high on MultiWan, PHP processes abound
-
Running 2.1 nanobsd on an Alix 2D13 with two WAN's both connected to cable modems. Currently not trying to do any load balancing, just routing a single LAN address through the secondary WAN (60/5) to seed a cloud backup to Backblaze using a gateway assignment in the firewall rules. The other WAN is also a cable connection (20/2) that is picked up by the default firewall rule.
When I am just doing speedtests on the secondary WAN, everything is fine. However, when I saturate the upstream pipe with over 5mbps of bandwidth from that client, a bunch of additional PHP processes appear and hog my CPU.
If I use the client to limit its own upstream bandwidth to less than 5mbps, this does not seem to happen. No traffic shaping is on, and I have disabled VPN's so only a minimum of services are running. The same behavior is present when viewed over SSH without any clients connected to the pfSense web server.
I'm open to ideas…
arri
Service Description apinger Gateway Monitoring Daemon dhcpd DHCP Service dnsmasq DNS Forwarder ntpd NTP clock sync
Unsaturated Upstream TOP:
PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND 71888 root 1 57 0 36724K 15704K piperd 0:30 8.98% php 301 root 1 76 20 3352K 476K kqread 17.2H 0.00% check_reload_status 29800 root 1 44 0 3412K 788K select 62:56 0.00% syslogd 33099 root 1 44 0 6980K 2724K kqread 12:13 0.00% lighttpd 64439 root 1 44 0 3264K 584K piperd 1:48 0.00% logger 22566 root 1 44 0 3264K 872K select 1:37 0.00% apinger 7316 root 1 64 20 3264K 660K piperd 1:37 0.00% logger 64427 root 1 44 0 10988K 6060K bpf 1:07 0.00% tcpdump 7105 root 1 64 20 9964K 5640K RUN 0:57 0.00% tcpdump 30621 root 1 44 0 3356K 404K nanslp 0:35 0.00% cron
Saturated Upstream TOP:
PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND 76893 root 1 123 20 26632K 15340K RUN 0:01 26.95% php 76987 root 1 123 20 26632K 14820K RUN 0:01 26.95% php 76960 root 1 123 20 26632K 14760K RUN 0:01 26.95% php 77121 root 1 123 20 26632K 14692K RUN 0:01 26.95% php 301 root 1 123 20 3352K 476K RUN 17.2H 25.98% check_reload_status 71888 root 1 57 0 36724K 15216K piperd 0:03 12.99% php 29800 root 1 44 0 3412K 788K select 62:56 0.00% syslogd 33099 root 1 44 0 6980K 2712K kqread 12:13 0.00% lighttpd 37426 root 1 44 0 3300K 1304K select 4:27 0.00% miniupnpd 64439 root 1 44 0 3264K 584K piperd 1:48 0.00% logger
-
It appears possible that the issue may be related to check_reload_status as my logs show it is trying to restart ipsec and openvpn despite both services being disabled currently.
arri
Feb 28 15:05:07 check_reload_status: Restarting ipsec tunnels Feb 28 15:05:07 check_reload_status: Restarting OpenVPN tunnels/interfaces Feb 28 15:05:07 check_reload_status: Reloading filter Feb 28 15:05:13 kernel: pid 96238 (php), uid 0, was killed: out of swap space Feb 28 15:05:15 kernel: pid 14856 (php), uid 0, was killed: out of swap space Feb 28 15:05:17 php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_STUDIO_DHCP. Feb 28 15:05:20 kernel: pid 15154 (php), uid 0, was killed: out of swap space Feb 28 15:05:22 kernel: pid 16934 (php), uid 0, was killed: out of swap space Feb 28 15:05:23 php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_STUDIO_DHCP. Feb 28 15:06:23 check_reload_status: updating dyndns WAN_STUDIO_DHCP Feb 28 15:06:23 check_reload_status: Restarting ipsec tunnels Feb 28 15:06:23 check_reload_status: Restarting OpenVPN tunnels/interfaces Feb 28 15:06:23 check_reload_status: Reloading filter Feb 28 15:06:29 kernel: pid 17453 (php), uid 0, was killed: out of swap space Feb 28 15:06:31 kernel: pid 39488 (php), uid 0, was killed: out of swap space
-
Mount read/write
/etc/rc.conf_mount_rw
Use truss to watch pid. Pastebin the log maybe.
truss -p pidofphp > /pidofphp.log
Set back to readonly after you after done troubleshooting
/etc/rc.conf_mount_ro
It appears possible that the issue may be related to check_reload_status as my logs show it is trying to restart ipsec and openvpn despite both services being disabled currently.
arri
Feb 28 15:05:07 check_reload_status: Restarting ipsec tunnels Feb 28 15:05:07 check_reload_status: Restarting OpenVPN tunnels/interfaces Feb 28 15:05:07 check_reload_status: Reloading filter Feb 28 15:05:13 kernel: pid 96238 (php), uid 0, was killed: out of swap space Feb 28 15:05:15 kernel: pid 14856 (php), uid 0, was killed: out of swap space Feb 28 15:05:17 php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_STUDIO_DHCP. Feb 28 15:05:20 kernel: pid 15154 (php), uid 0, was killed: out of swap space Feb 28 15:05:22 kernel: pid 16934 (php), uid 0, was killed: out of swap space Feb 28 15:05:23 php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_STUDIO_DHCP. Feb 28 15:06:23 check_reload_status: updating dyndns WAN_STUDIO_DHCP Feb 28 15:06:23 check_reload_status: Restarting ipsec tunnels Feb 28 15:06:23 check_reload_status: Restarting OpenVPN tunnels/interfaces Feb 28 15:06:23 check_reload_status: Reloading filter Feb 28 15:06:29 kernel: pid 17453 (php), uid 0, was killed: out of swap space Feb 28 15:06:31 kernel: pid 39488 (php), uid 0, was killed: out of swap space
That looks like a pretty good lead on what is going wrong. I would go with this before using truss.
Well what do you get when you do a ps aux? What are all the processes?
-
The OpenVPN etc messages are just telling that it is looking through OpenVPN etc for anything that needs restarting. Probably those should not be emitted on systems that have zero OpenVPN server/clients etc, as it worries people.
It does indicate that a link is being failed out. Probably your WAN2 is getting high latency on its gateway monitoring when you saturate its uplink. pfSense "apinger" sees the high latency and decides the link has "failed" and takes a whole bunch of actions. On hardware with only 256MB of memory (like yours) too many things happen at once, and some processes get "killed". I believe that a lighter PHP is being used in 2.2 and that something will happen to serialize actions more on low-memory systems. Your problem is a problem I have and live with at the moment, when it happens.
The main thing to do is System->Routing and edit each gateway, click Advanced and increase the latency and packet loss limits. I put "crazy" latency figures like 4000ms to 5000ms and packet loss 40% to 50%. That way "apinger" only takes a WAN down if it really is way outside of reasonableness.
And in your case you can control the user flow on that WAN, so keep it to a level that does not quite saturate the link. -
The OpenVPN etc messages are just telling that it is looking through OpenVPN etc for anything that needs restarting. Probably those should not be emitted on systems that have zero OpenVPN server/clients etc, as it worries people.
It does indicate that a link is being failed out. Probably your WAN2 is getting high latency on its gateway monitoring when you saturate its uplink. pfSense "apinger" sees the high latency and decides the link has "failed" and takes a whole bunch of actions. On hardware with only 256MB of memory (like yours) too many things happen at once, and some processes get "killed". I believe that a lighter PHP is being used in 2.2 and that something will happen to serialize actions more on low-memory systems. Your problem is a problem I have and live with at the moment, when it happens.
The main thing to do is System->Routing and edit each gateway, click Advanced and increase the latency and packet loss limits. I put "crazy" latency figures like 4000ms to 5000ms and packet loss 40% to 50%. That way "apinger" only takes a WAN down if it really is way outside of reasonableness.
And in your case you can control the user flow on that WAN, so keep it to a level that does not quite saturate the link.You get php pinning the CPU as well?
maybe if those log entries are benign try doing a truss of those php processes and give us the logs. So we can see what the heck they are doing or getting stuck doing.
-
Phil, you are correct regarding the secondary WAN ping times when it is saturated, I should have mentioned that upfront. I'll try your trick in the Gateway dialog. As you'll find at the end of this reply, ps aux found some action additionally supporting your theory.
Bryan, I am unable to "catch" the spawning pid's in action. They appear in top, and in a separate terminal I try the truss as quickly as possible but all I ever capture is
SIGNAL 9 (SIGKILL) process exit, rval = 0
or
truss: can not attach to target process: No such process
I did run it against the persistent PHP process that is driving the web configurator and noted that if I touched the configurator in a web client it would spit out a bunch of stuff; however, that seems to be expected and it idles down without action on the configurator unlike these spawning php processes.
ps aux provided a bit more info however; not really knowing what those processes were for I found this on the forum: https://forum.pfsense.org/index.php?topic=65696.0 which implies they may be function of interface up/down events which harkens back to the apinger solution. Yet another reason to try that.
Thanks everyone.
arri
ps aux
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND root 10 48.0 0.0 0 8 ?? RL 26Sep13 212662:36.55 [idle] root 31622 15.0 0.5 3352 1180 ?? RNs 2:44PM 2:46.81 /usr/local/sbin/check_reload_sta root 84862 14.0 5.8 25468 14080 ?? RN 3:02PM 0:00.56 /usr/local/bin/php -f /etc/rc.dy root 85483 14.0 6.1 26632 14668 ?? RN 3:02PM 0:00.61 /usr/local/bin/php -f /etc/rc.fi root 85084 13.0 6.2 26632 14900 ?? RN 3:02PM 0:00.64 /usr/local/bin/php -f /etc/rc.ne root 85300 13.0 6.0 26632 14512 ?? RN 3:02PM 0:00.60 /usr/local/bin/php -f /etc/rc.op
-
Phil, you are correct regarding the secondary WAN ping times when it is saturated, I should have mentioned that upfront. I'll try your trick in the Gateway dialog. As you'll find at the end of this reply, ps aux found some action additionally supporting your theory.
Bryan, I am unable to "catch" the spawning pid's in action. They appear in top, and in a separate terminal I try the truss as quickly as possible but all I ever capture is
SIGNAL 9 (SIGKILL) process exit, rval = 0
or
truss: can not attach to target process: No such process
I did run it against the persistent PHP process that is driving the web configurator and noted that if I touched the configurator in a web client it would spit out a bunch of stuff; however, that seems to be expected and it idles down without action on the configurator unlike these spawning php processes.
ps aux provided a bit more info however; not really knowing what those processes were for I found this on the forum: https://forum.pfsense.org/index.php?topic=65696.0 which implies they may be function of interface up/down events which harkens back to the apinger solution. Yet another reason to try that.
Thanks everyone.
arri
ps aux
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND root 10 48.0 0.0 0 8 ?? RL 26Sep13 212662:36.55 [idle] root 31622 15.0 0.5 3352 1180 ?? RNs 2:44PM 2:46.81 /usr/local/sbin/check_reload_sta root 84862 14.0 5.8 25468 14080 ?? RN 3:02PM 0:00.56 /usr/local/bin/php -f /etc/rc.dy root 85483 14.0 6.1 26632 14668 ?? RN 3:02PM 0:00.61 /usr/local/bin/php -f /etc/rc.fi root 85084 13.0 6.2 26632 14900 ?? RN 3:02PM 0:00.64 /usr/local/bin/php -f /etc/rc.ne root 85300 13.0 6.0 26632 14512 ?? RN 3:02PM 0:00.60 /usr/local/bin/php -f /etc/rc.op
Could you make your terminal wider so we can see the full command also if you paste the full output of all processes.
Trying killing apinger and testing again? Do you see the same issue?
-
- Add the combine 10s lines to each alarm that don't have them currently in /etc/inc/gwlb.inc
- killall apinger
- System -> Routing -> WAN Gateway -> Save -> Apply Changes should regenerate your apinger conf and start apinger again if you killed it above.
/etc/inc/gwlb.inc
## These parameters can be overridden in a specific alarm configuration alarm default { command on "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' " command off "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' " combine 10s } ## "Down" alarm definition. ## This alarm will be fired when target doesn't respond for 30 seconds. alarm down "down" { time {$apinger_default['down']}s } ## "Delay" alarm definition. ## This alarm will be fired when responses are delayed more than 200ms ## it will be canceled, when the delay drops below 100ms alarm delay "delay" { delay_low {$apinger_default['latencylow']}ms delay_high {$apinger_default['latencyhigh']}ms combine 10s } ## "Loss" alarm definition. ## This alarm will be fired when packet loss goes over 20% ## it will be canceled, when the loss drops below 10% alarm loss "loss" { percent_low {$apinger_default['losslow']} percent_high {$apinger_default['losshigh']} combine 10s }
With zero combine lines on Loss and Delay it will act on it every failing probe, in this case every second by default. This means those processes will spawn every second. I have set the combine to 10s which is 1/10th of the load but you may see fit to change it further.
Can you post your Status -> System Logs -> Gateway ? It would help to see how often you are seeing the alarms generated.
-
Could you make your terminal wider so we can see the full command also if you paste the full output of all processes.
Trying killing apinger and testing again? Do you see the same issue?
With apinger running, default gateway settings (WAN_STUDIO_DHCP is my secondary WAN) and saturated upstream on secondary WAN:
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND root 10 23.0 0.0 0 8 ?? RL 26Sep13 214570:47.75 [idle] root 31622 20.0 0.5 3352 1124 ?? RNs Sat02PM 101:23.76 /usr/local/sbin/check_reload_status root 46807 20.0 6.9 29612 16756 ?? RN 8:53AM 0:00.82 /usr/local/bin/php -f /etc/rc.newipsecdns root 47089 20.0 6.8 28552 16500 ?? RN 8:53AM 0:00.79 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 46727 19.0 6.6 28604 15868 ?? RN 8:53AM 0:00.72 /usr/local/bin/php -f /etc/rc.dyndns.update WAN_STUDIO_DHCP root 47057 19.0 6.9 29612 16724 ?? RN 8:53AM 0:00.82 /usr/local/bin/php -f /etc/rc.openvpn WAN_STUDIO_DHCP root 11 9.9 0.0 0 104 ?? WL 26Sep13 2586:06.04 [intr] root 45930 5.0 0.0 0 0 ?? Z< 8:52AM 0:01.14 <defunct>root 9 1.0 0.0 0 8 ?? DL 26Sep13 135:44.08 [pagedaemon] root 0 0.0 0.0 0 72 ?? DLs 26Sep13 31:40.50 [kernel] root 1 0.0 0.0 1888 104 ?? RLs 26Sep13 0:10.52 /sbin/init -- root 2 0.0 0.0 0 8 ?? DL 26Sep13 0:09.09 [g_event] root 3 0.0 0.0 0 8 ?? DL 26Sep13 4:19.84 [g_up] root 4 0.0 0.0 0 8 ?? DL 26Sep13 8:59.75 [g_down] root 5 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [crypto] root 6 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [crypto returns] root 7 0.0 0.0 0 8 ?? DL 26Sep13 8:34.62 [pfpurge] root 8 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [xpt_thrd] root 12 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [ng_queue] root 13 0.0 0.0 0 8 ?? DL 26Sep13 1070:47.50 [yarrow] root 14 0.0 0.0 0 64 ?? DL 26Sep13 4:21.51 [usb] root 15 0.0 0.0 0 8 ?? DL 26Sep13 0:00.44 [vmdaemon] root 16 0.0 0.0 0 8 ?? DL 26Sep13 0:31.04 [idlepoll] root 17 0.0 0.0 0 8 ?? DL 26Sep13 0:00.34 [pagezero] root 18 0.0 0.0 0 8 ?? DL 26Sep13 1:45.12 [bufdaemon] root 19 0.0 0.0 0 8 ?? DL 26Sep13 1:42.96 [vnlru] root 20 0.0 0.0 0 8 ?? DL 26Sep13 13:54.07 [syncer] root 21 0.0 0.0 0 8 ?? DL 26Sep13 1:54.32 [softdepflush] root 44 0.0 0.0 0 8 ?? DL 26Sep13 2:31.01 [md0] root 49 0.0 0.0 0 8 ?? DL 26Sep13 42:32.58 [md1] root 320 0.0 0.3 3936 672 ?? Is 26Sep13 0:00.11 /sbin/devd root 7105 0.0 3.4 13036 8252 ?? SN Tue12AM 1:50.67 /usr/sbin/tcpdump -s 256 -v -S -l -n -e -ttt -i pflog0 root 7316 0.0 0.2 3264 580 ?? SN Tue12AM 3:04.83 logger -t pf -p local0.info root 10090 0.0 0.4 5284 940 ?? Is 26Sep13 0:00.02 /usr/sbin/sshd root 10260 0.0 0.2 3480 584 ?? Is 26Sep13 0:00.23 /usr/local/sbin/sshlockout_pf 15 root 12250 0.0 0.8 8036 2008 ?? Ss 8:45AM 0:00.39 sshd: admin@pts/0 (sshd) root 15511 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 17614 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.15 <defunct>root 19285 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.95 <defunct>root 19966 0.0 0.5 3264 1264 ?? Ss 8:51AM 0:00.05 /usr/local/sbin/apinger -c /var/etc/apinger.conf root 20286 0.0 0.5 4556 1236 ?? I 8:51AM 0:00.02 /usr/local/bin/rrdtool - root 22304 0.0 0.2 3384 500 ?? Is 26Sep13 0:06.30 /usr/sbin/inetd -wW -R 0 -a 127.0.0.1 /var/etc/inetd.conf root 22716 0.0 0.0 3276 0 ?? IWs - 0:00.00 /usr/local/sbin/qstats -p /var/run/qstats.pid root 27886 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.15 <defunct>root 28812 0.0 0.3 3324 824 ?? INs Wed08AM 0:00.17 dhclient: vr2 [priv] (dhclient) root 28889 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 29800 0.0 0.3 3412 784 ?? Ss 26Sep13 68:59.23 /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -f /var/etc/syslog.conf root 30621 0.0 0.2 3356 404 ?? Ss 26Sep13 0:36.08 /usr/sbin/cron -s root 31959 0.0 0.0 3352 0 ?? IWN - 0:00.00 check_reload_status: Monitoring daemon of check_reload_status root 33099 0.0 1.0 6980 2444 ?? S 26Sep13 12:24.38 /usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf root 33128 0.0 0.1 24436 344 ?? Is 26Sep13 0:00.61 /usr/local/bin/php root 33841 0.0 0.0 24436 0 ?? IWs - 0:00.00 /usr/local/bin/php _dhcp 35784 0.0 0.4 3324 952 ?? SNs Wed08AM 0:07.59 dhclient: vr2 (dhclient) dhcpd 42173 0.0 2.4 11456 5856 ?? Ss Fri02PM 0:34.93 /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.p root 42891 0.0 3.0 30708 7248 ?? I 8:50AM 0:19.96 /usr/local/bin/php root 45336 0.0 0.3 1512 768 ?? IN 8:52AM 0:00.00 sleep 60 root 49780 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/bin/ping_hosts.sh root 49928 0.0 0.1 3264 232 ?? I 26Sep13 0:29.68 minicron: helper /usr/local/bin/ping_hosts.sh (minicron) root 50095 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /etc/rc.expireaccounts root 50711 0.0 0.0 3264 0 ?? IW - 0:00.00 minicron: helper /etc/rc.expireaccounts (minicron) root 50841 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid /etc/rc.update_alias_url_data root 50977 0.0 0.0 3264 0 ?? IW - 0:00.00 minicron: helper /etc/rc.update_alias_url_data (minicron) root 53184 0.0 0.3 3480 832 ?? Is 26Sep13 0:00.23 /usr/local/sbin/sshlockout_pf 15 root 57050 0.0 0.8 8036 2036 ?? Ss 8:46AM 0:00.41 sshd: admin@pts/1 (sshd) root 59543 0.0 0.5 3324 1128 ?? Is Fri12PM 0:00.01 dhclient: vr1 [priv] (dhclient) _dhcp 64959 0.0 0.5 3324 1268 ?? Ss Fri12PM 0:03.86 dhclient: vr1 (dhclient) root 67059 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 67915 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/sbin/dhcpleases -l /var/dhcpd/var/db/dhcpd.leases -d key.lan -p /var/run/dnsmasq.pid -h /var/et root 68923 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.93 <defunct>nobody 69584 0.0 1.0 5512 2328 ?? S Fri12PM 0:51.65 [dnsmasq] root 71162 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 73060 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.95 <defunct>root 74771 0.0 0.2 3644 588 ?? IN Fri12PM 1:31.40 /bin/sh /var/db/rrd/updaterrd.sh root 77144 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 78118 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 79237 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 81062 0.0 0.9 24436 2116 ?? I 8:51AM 0:00.00 /usr/local/bin/php root 89469 0.0 2.6 6280 6300 ?? SNs Fri06AM 0:28.27 /usr/local/bin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.pid root 90300 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.22 <defunct>root 52851 0.0 0.0 3720 0 u0 IWs - 0:00.00 login [pam] (login) root 53321 0.0 0.0 3644 0 u0 IW - 0:00.00 -sh (sh) root 54058 0.0 0.2 3644 440 u0 I+ 26Sep13 0:00.02 /bin/sh /etc/rc.initial root 12597 0.0 0.0 3644 0 0 IWs - 0:00.00 /bin/sh /etc/rc.initial root 23302 0.0 0.0 4696 0 0 IW - 0:00.00 /bin/tcsh root 56804 0.0 0.8 3660 1828 0 S+ 8:46AM 0:01.38 top root 47353 0.0 0.5 3404 1244 1 R+ 8:53AM 0:00.01 ps aux root 57546 0.0 0.0 3644 0 1 IWs - 0:00.00 /bin/sh /etc/rc.initial root 60441 0.0 0.9 4696 2164 1 S 8:46AM 0:00.07 /bin/tcsh</defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct>
killall apinger:
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND root 10 91.0 0.0 0 8 ?? RL 26Sep13 214577:21.24 [idle] root 11 9.9 0.0 0 104 ?? WL 26Sep13 2586:40.74 [intr] root 0 0.0 0.0 0 72 ?? DLs 26Sep13 31:40.88 [kernel] root 1 0.0 0.0 1888 104 ?? ILs 26Sep13 0:10.52 /sbin/init -- root 2 0.0 0.0 0 8 ?? DL 26Sep13 0:09.09 [g_event] root 3 0.0 0.0 0 8 ?? DL 26Sep13 4:19.84 [g_up] root 4 0.0 0.0 0 8 ?? DL 26Sep13 8:59.76 [g_down] root 5 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [crypto] root 6 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [crypto returns] root 7 0.0 0.0 0 8 ?? DL 26Sep13 8:34.64 [pfpurge] root 8 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [xpt_thrd] root 9 0.0 0.0 0 8 ?? DL 26Sep13 135:45.19 [pagedaemon] root 12 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [ng_queue] root 13 0.0 0.0 0 8 ?? DL 26Sep13 1070:52.75 [yarrow] root 14 0.0 0.0 0 64 ?? DL 26Sep13 4:21.52 [usb] root 15 0.0 0.0 0 8 ?? DL 26Sep13 0:00.44 [vmdaemon] root 16 0.0 0.0 0 8 ?? DL 26Sep13 0:31.04 [idlepoll] root 17 0.0 0.0 0 8 ?? DL 26Sep13 0:00.34 [pagezero] root 18 0.0 0.0 0 8 ?? DL 26Sep13 1:45.13 [bufdaemon] root 19 0.0 0.0 0 8 ?? DL 26Sep13 1:42.96 [vnlru] root 20 0.0 0.0 0 8 ?? DL 26Sep13 13:54.14 [syncer] root 21 0.0 0.0 0 8 ?? DL 26Sep13 1:54.32 [softdepflush] root 44 0.0 0.0 0 8 ?? DL 26Sep13 2:31.02 [md0] root 49 0.0 0.0 0 8 ?? DL 26Sep13 42:32.68 [md1] root 320 0.0 0.3 3936 672 ?? Is 26Sep13 0:00.11 /sbin/devd root 1746 0.0 0.3 1512 772 ?? IN 9:01AM 0:00.00 sleep 60 root 7105 0.0 3.5 13036 8328 ?? SN Tue12AM 1:50.80 /usr/sbin/tcpdump -s 256 -v -S -l -n -e -ttt -i pflog0 root 7316 0.0 0.2 3264 580 ?? SN Tue12AM 3:05.05 logger -t pf -p local0.info root 10090 0.0 0.4 5284 940 ?? Is 26Sep13 0:00.02 /usr/sbin/sshd root 10260 0.0 0.2 3480 584 ?? Is 26Sep13 0:00.23 /usr/local/sbin/sshlockout_pf 15 root 12250 0.0 0.8 8036 2008 ?? Ss 8:45AM 0:00.48 sshd: admin@pts/0 (sshd) root 15511 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 17614 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.15 <defunct>root 19285 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.95 <defunct>root 22304 0.0 0.2 3384 500 ?? Is 26Sep13 0:06.30 /usr/sbin/inetd -wW -R 0 -a 127.0.0.1 /var/etc/inetd.conf root 22716 0.0 0.0 3276 0 ?? IWs - 0:00.00 /usr/local/sbin/qstats -p /var/run/qstats.pid root 27886 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.15 <defunct>root 28812 0.0 0.3 3324 824 ?? INs Wed08AM 0:00.17 dhclient: vr2 [priv] (dhclient) root 28889 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 29800 0.0 0.3 3412 784 ?? Ss 26Sep13 68:59.77 /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -f /var/etc/syslog.conf root 30621 0.0 0.2 3356 536 ?? Is 26Sep13 0:36.08 /usr/sbin/cron -s root 31622 0.0 0.5 3352 1124 ?? INs Sat02PM 101:46.00 /usr/local/sbin/check_reload_status root 31959 0.0 0.0 3352 0 ?? IWN - 0:00.00 check_reload_status: Monitoring daemon of check_reload_status root 33099 0.0 1.0 6980 2436 ?? S 26Sep13 12:24.40 /usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf root 33128 0.0 0.1 24436 344 ?? Is 26Sep13 0:00.61 /usr/local/bin/php root 33841 0.0 0.0 24436 0 ?? IWs - 0:00.00 /usr/local/bin/php _dhcp 35784 0.0 0.4 3324 960 ?? SNs Wed08AM 0:07.60 dhclient: vr2 (dhclient) dhcpd 42173 0.0 2.5 11456 5968 ?? Is Fri02PM 0:34.97 /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.p root 42891 0.0 2.9 30708 7036 ?? I 8:50AM 0:19.96 /usr/local/bin/php root 49780 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/bin/ping_hosts.sh root 49928 0.0 0.1 3264 244 ?? I 26Sep13 0:29.68 minicron: helper /usr/local/bin/ping_hosts.sh (minicron) root 50095 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /etc/rc.expireaccounts root 50711 0.0 0.0 3264 0 ?? IW - 0:00.00 minicron: helper /etc/rc.expireaccounts (minicron) root 50841 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid /etc/rc.update_alias_url_data root 50977 0.0 0.0 3264 0 ?? IW - 0:00.00 minicron: helper /etc/rc.update_alias_url_data (minicron) root 53184 0.0 0.3 3480 832 ?? Is 26Sep13 0:00.23 /usr/local/sbin/sshlockout_pf 15 root 57050 0.0 0.8 8036 2020 ?? Ss 8:46AM 0:00.46 sshd: admin@pts/1 (sshd) root 59543 0.0 0.5 3324 1128 ?? Is Fri12PM 0:00.01 dhclient: vr1 [priv] (dhclient) _dhcp 64959 0.0 0.5 3324 1268 ?? Ss Fri12PM 0:03.87 dhclient: vr1 (dhclient) root 67059 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 67915 0.0 0.3 3264 724 ?? Is Fri12PM 0:01.09 /usr/local/sbin/dhcpleases -l /var/dhcpd/var/db/dhcpd.leases -d key.lan -p /var/run/dnsmasq.pid -h /var/et root 68923 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.93 <defunct>nobody 69584 0.0 1.0 5512 2392 ?? S Fri12PM 0:51.72 [dnsmasq] root 71162 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 72926 0.0 8.1 32628 19536 ?? IN 8:54AM 0:01.13 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 73060 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.95 <defunct>root 74771 0.0 0.3 3644 640 ?? IN Fri12PM 1:31.58 /bin/sh /var/db/rrd/updaterrd.sh root 77144 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 78118 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 79237 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 81062 0.0 0.9 24436 2116 ?? I 8:51AM 0:00.00 /usr/local/bin/php root 89469 0.0 2.6 6280 6300 ?? SNs Fri06AM 0:28.32 /usr/local/bin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.pid root 90300 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.22 <defunct>root 52851 0.0 0.0 3720 0 u0 IWs - 0:00.00 login [pam] (login) root 53321 0.0 0.0 3644 0 u0 IW - 0:00.00 -sh (sh) root 54058 0.0 0.2 3644 440 u0 I+ 26Sep13 0:00.02 /bin/sh /etc/rc.initial root 12597 0.0 0.0 3644 0 0 IWs - 0:00.00 /bin/sh /etc/rc.initial root 23302 0.0 0.0 4696 0 0 IW - 0:00.00 /bin/tcsh root 56804 0.0 0.8 3660 1828 0 S+ 8:46AM 0:02.90 top root 2131 0.0 0.5 3404 1240 1 R+ 9:01AM 0:00.01 ps aux root 57546 0.0 0.0 3644 0 1 IWs - 0:00.00 /bin/sh /etc/rc.initial root 60441 0.0 0.9 4696 2248 1 S 8:46AM 0:00.09 /bin/tcsh</defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct>
-
Can you post your Status -> System Logs -> Gateway ? It would help to see how often you are seeing the alarms generated.
This is a few minutes of my log utilizing the default Gateway settings with apinger running and saturated upstream bandwidth on the secondary WAN; seems to fire around three times per minute:
Mar 3 09:06:30 apinger: Starting Alarm Pinger, apinger(17477) Mar 3 09:06:40 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:06:45 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:06:59 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:03 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:14 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:21 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:32 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:38 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:48 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:55 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:06 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:12 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:22 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:28 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:39 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:45 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:55 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:01 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:11 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:18 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:29 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:34 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:45 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:51 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:01 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:07 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:17 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:23 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:33 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:40 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:50 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:56 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:11:07 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:11:13 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down ***
-
Could you make your terminal wider so we can see the full command also if you paste the full output of all processes.
Trying killing apinger and testing again? Do you see the same issue?
With apinger running, default gateway settings (WAN_STUDIO_DHCP is my secondary WAN) and saturated upstream on secondary WAN:
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND root 10 23.0 0.0 0 8 ?? RL 26Sep13 214570:47.75 [idle] root 31622 20.0 0.5 3352 1124 ?? RNs Sat02PM 101:23.76 /usr/local/sbin/check_reload_status root 46807 20.0 6.9 29612 16756 ?? RN 8:53AM 0:00.82 /usr/local/bin/php -f /etc/rc.newipsecdns root 47089 20.0 6.8 28552 16500 ?? RN 8:53AM 0:00.79 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 46727 19.0 6.6 28604 15868 ?? RN 8:53AM 0:00.72 /usr/local/bin/php -f /etc/rc.dyndns.update WAN_STUDIO_DHCP root 47057 19.0 6.9 29612 16724 ?? RN 8:53AM 0:00.82 /usr/local/bin/php -f /etc/rc.openvpn WAN_STUDIO_DHCP root 11 9.9 0.0 0 104 ?? WL 26Sep13 2586:06.04 [intr] root 45930 5.0 0.0 0 0 ?? Z< 8:52AM 0:01.14 <defunct>root 9 1.0 0.0 0 8 ?? DL 26Sep13 135:44.08 [pagedaemon] root 0 0.0 0.0 0 72 ?? DLs 26Sep13 31:40.50 [kernel] root 1 0.0 0.0 1888 104 ?? RLs 26Sep13 0:10.52 /sbin/init -- root 2 0.0 0.0 0 8 ?? DL 26Sep13 0:09.09 [g_event] root 3 0.0 0.0 0 8 ?? DL 26Sep13 4:19.84 [g_up] root 4 0.0 0.0 0 8 ?? DL 26Sep13 8:59.75 [g_down] root 5 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [crypto] root 6 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [crypto returns] root 7 0.0 0.0 0 8 ?? DL 26Sep13 8:34.62 [pfpurge] root 8 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [xpt_thrd] root 12 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [ng_queue] root 13 0.0 0.0 0 8 ?? DL 26Sep13 1070:47.50 [yarrow] root 14 0.0 0.0 0 64 ?? DL 26Sep13 4:21.51 [usb] root 15 0.0 0.0 0 8 ?? DL 26Sep13 0:00.44 [vmdaemon] root 16 0.0 0.0 0 8 ?? DL 26Sep13 0:31.04 [idlepoll] root 17 0.0 0.0 0 8 ?? DL 26Sep13 0:00.34 [pagezero] root 18 0.0 0.0 0 8 ?? DL 26Sep13 1:45.12 [bufdaemon] root 19 0.0 0.0 0 8 ?? DL 26Sep13 1:42.96 [vnlru] root 20 0.0 0.0 0 8 ?? DL 26Sep13 13:54.07 [syncer] root 21 0.0 0.0 0 8 ?? DL 26Sep13 1:54.32 [softdepflush] root 44 0.0 0.0 0 8 ?? DL 26Sep13 2:31.01 [md0] root 49 0.0 0.0 0 8 ?? DL 26Sep13 42:32.58 [md1] root 320 0.0 0.3 3936 672 ?? Is 26Sep13 0:00.11 /sbin/devd root 7105 0.0 3.4 13036 8252 ?? SN Tue12AM 1:50.67 /usr/sbin/tcpdump -s 256 -v -S -l -n -e -ttt -i pflog0 root 7316 0.0 0.2 3264 580 ?? SN Tue12AM 3:04.83 logger -t pf -p local0.info root 10090 0.0 0.4 5284 940 ?? Is 26Sep13 0:00.02 /usr/sbin/sshd root 10260 0.0 0.2 3480 584 ?? Is 26Sep13 0:00.23 /usr/local/sbin/sshlockout_pf 15 root 12250 0.0 0.8 8036 2008 ?? Ss 8:45AM 0:00.39 sshd: admin@pts/0 (sshd) root 15511 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 17614 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.15 <defunct>root 19285 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.95 <defunct>root 19966 0.0 0.5 3264 1264 ?? Ss 8:51AM 0:00.05 /usr/local/sbin/apinger -c /var/etc/apinger.conf root 20286 0.0 0.5 4556 1236 ?? I 8:51AM 0:00.02 /usr/local/bin/rrdtool - root 22304 0.0 0.2 3384 500 ?? Is 26Sep13 0:06.30 /usr/sbin/inetd -wW -R 0 -a 127.0.0.1 /var/etc/inetd.conf root 22716 0.0 0.0 3276 0 ?? IWs - 0:00.00 /usr/local/sbin/qstats -p /var/run/qstats.pid root 27886 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.15 <defunct>root 28812 0.0 0.3 3324 824 ?? INs Wed08AM 0:00.17 dhclient: vr2 [priv] (dhclient) root 28889 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 29800 0.0 0.3 3412 784 ?? Ss 26Sep13 68:59.23 /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -f /var/etc/syslog.conf root 30621 0.0 0.2 3356 404 ?? Ss 26Sep13 0:36.08 /usr/sbin/cron -s root 31959 0.0 0.0 3352 0 ?? IWN - 0:00.00 check_reload_status: Monitoring daemon of check_reload_status root 33099 0.0 1.0 6980 2444 ?? S 26Sep13 12:24.38 /usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf root 33128 0.0 0.1 24436 344 ?? Is 26Sep13 0:00.61 /usr/local/bin/php root 33841 0.0 0.0 24436 0 ?? IWs - 0:00.00 /usr/local/bin/php _dhcp 35784 0.0 0.4 3324 952 ?? SNs Wed08AM 0:07.59 dhclient: vr2 (dhclient) dhcpd 42173 0.0 2.4 11456 5856 ?? Ss Fri02PM 0:34.93 /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.p root 42891 0.0 3.0 30708 7248 ?? I 8:50AM 0:19.96 /usr/local/bin/php root 45336 0.0 0.3 1512 768 ?? IN 8:52AM 0:00.00 sleep 60 root 49780 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/bin/ping_hosts.sh root 49928 0.0 0.1 3264 232 ?? I 26Sep13 0:29.68 minicron: helper /usr/local/bin/ping_hosts.sh (minicron) root 50095 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /etc/rc.expireaccounts root 50711 0.0 0.0 3264 0 ?? IW - 0:00.00 minicron: helper /etc/rc.expireaccounts (minicron) root 50841 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid /etc/rc.update_alias_url_data root 50977 0.0 0.0 3264 0 ?? IW - 0:00.00 minicron: helper /etc/rc.update_alias_url_data (minicron) root 53184 0.0 0.3 3480 832 ?? Is 26Sep13 0:00.23 /usr/local/sbin/sshlockout_pf 15 root 57050 0.0 0.8 8036 2036 ?? Ss 8:46AM 0:00.41 sshd: admin@pts/1 (sshd) root 59543 0.0 0.5 3324 1128 ?? Is Fri12PM 0:00.01 dhclient: vr1 [priv] (dhclient) _dhcp 64959 0.0 0.5 3324 1268 ?? Ss Fri12PM 0:03.86 dhclient: vr1 (dhclient) root 67059 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 67915 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/sbin/dhcpleases -l /var/dhcpd/var/db/dhcpd.leases -d key.lan -p /var/run/dnsmasq.pid -h /var/et root 68923 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.93 <defunct>nobody 69584 0.0 1.0 5512 2328 ?? S Fri12PM 0:51.65 [dnsmasq] root 71162 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 73060 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.95 <defunct>root 74771 0.0 0.2 3644 588 ?? IN Fri12PM 1:31.40 /bin/sh /var/db/rrd/updaterrd.sh root 77144 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 78118 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 79237 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 81062 0.0 0.9 24436 2116 ?? I 8:51AM 0:00.00 /usr/local/bin/php root 89469 0.0 2.6 6280 6300 ?? SNs Fri06AM 0:28.27 /usr/local/bin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.pid root 90300 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.22 <defunct>root 52851 0.0 0.0 3720 0 u0 IWs - 0:00.00 login [pam] (login) root 53321 0.0 0.0 3644 0 u0 IW - 0:00.00 -sh (sh) root 54058 0.0 0.2 3644 440 u0 I+ 26Sep13 0:00.02 /bin/sh /etc/rc.initial root 12597 0.0 0.0 3644 0 0 IWs - 0:00.00 /bin/sh /etc/rc.initial root 23302 0.0 0.0 4696 0 0 IW - 0:00.00 /bin/tcsh root 56804 0.0 0.8 3660 1828 0 S+ 8:46AM 0:01.38 top root 47353 0.0 0.5 3404 1244 1 R+ 8:53AM 0:00.01 ps aux root 57546 0.0 0.0 3644 0 1 IWs - 0:00.00 /bin/sh /etc/rc.initial root 60441 0.0 0.9 4696 2164 1 S 8:46AM 0:00.07 /bin/tcsh</defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct>
killall apinger:
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND root 10 91.0 0.0 0 8 ?? RL 26Sep13 214577:21.24 [idle] root 11 9.9 0.0 0 104 ?? WL 26Sep13 2586:40.74 [intr] root 0 0.0 0.0 0 72 ?? DLs 26Sep13 31:40.88 [kernel] root 1 0.0 0.0 1888 104 ?? ILs 26Sep13 0:10.52 /sbin/init -- root 2 0.0 0.0 0 8 ?? DL 26Sep13 0:09.09 [g_event] root 3 0.0 0.0 0 8 ?? DL 26Sep13 4:19.84 [g_up] root 4 0.0 0.0 0 8 ?? DL 26Sep13 8:59.76 [g_down] root 5 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [crypto] root 6 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [crypto returns] root 7 0.0 0.0 0 8 ?? DL 26Sep13 8:34.64 [pfpurge] root 8 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [xpt_thrd] root 9 0.0 0.0 0 8 ?? DL 26Sep13 135:45.19 [pagedaemon] root 12 0.0 0.0 0 8 ?? DL 26Sep13 0:00.00 [ng_queue] root 13 0.0 0.0 0 8 ?? DL 26Sep13 1070:52.75 [yarrow] root 14 0.0 0.0 0 64 ?? DL 26Sep13 4:21.52 [usb] root 15 0.0 0.0 0 8 ?? DL 26Sep13 0:00.44 [vmdaemon] root 16 0.0 0.0 0 8 ?? DL 26Sep13 0:31.04 [idlepoll] root 17 0.0 0.0 0 8 ?? DL 26Sep13 0:00.34 [pagezero] root 18 0.0 0.0 0 8 ?? DL 26Sep13 1:45.13 [bufdaemon] root 19 0.0 0.0 0 8 ?? DL 26Sep13 1:42.96 [vnlru] root 20 0.0 0.0 0 8 ?? DL 26Sep13 13:54.14 [syncer] root 21 0.0 0.0 0 8 ?? DL 26Sep13 1:54.32 [softdepflush] root 44 0.0 0.0 0 8 ?? DL 26Sep13 2:31.02 [md0] root 49 0.0 0.0 0 8 ?? DL 26Sep13 42:32.68 [md1] root 320 0.0 0.3 3936 672 ?? Is 26Sep13 0:00.11 /sbin/devd root 1746 0.0 0.3 1512 772 ?? IN 9:01AM 0:00.00 sleep 60 root 7105 0.0 3.5 13036 8328 ?? SN Tue12AM 1:50.80 /usr/sbin/tcpdump -s 256 -v -S -l -n -e -ttt -i pflog0 root 7316 0.0 0.2 3264 580 ?? SN Tue12AM 3:05.05 logger -t pf -p local0.info root 10090 0.0 0.4 5284 940 ?? Is 26Sep13 0:00.02 /usr/sbin/sshd root 10260 0.0 0.2 3480 584 ?? Is 26Sep13 0:00.23 /usr/local/sbin/sshlockout_pf 15 root 12250 0.0 0.8 8036 2008 ?? Ss 8:45AM 0:00.48 sshd: admin@pts/0 (sshd) root 15511 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 17614 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.15 <defunct>root 19285 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.95 <defunct>root 22304 0.0 0.2 3384 500 ?? Is 26Sep13 0:06.30 /usr/sbin/inetd -wW -R 0 -a 127.0.0.1 /var/etc/inetd.conf root 22716 0.0 0.0 3276 0 ?? IWs - 0:00.00 /usr/local/sbin/qstats -p /var/run/qstats.pid root 27886 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.15 <defunct>root 28812 0.0 0.3 3324 824 ?? INs Wed08AM 0:00.17 dhclient: vr2 [priv] (dhclient) root 28889 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 29800 0.0 0.3 3412 784 ?? Ss 26Sep13 68:59.77 /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -f /var/etc/syslog.conf root 30621 0.0 0.2 3356 536 ?? Is 26Sep13 0:36.08 /usr/sbin/cron -s root 31622 0.0 0.5 3352 1124 ?? INs Sat02PM 101:46.00 /usr/local/sbin/check_reload_status root 31959 0.0 0.0 3352 0 ?? IWN - 0:00.00 check_reload_status: Monitoring daemon of check_reload_status root 33099 0.0 1.0 6980 2436 ?? S 26Sep13 12:24.40 /usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf root 33128 0.0 0.1 24436 344 ?? Is 26Sep13 0:00.61 /usr/local/bin/php root 33841 0.0 0.0 24436 0 ?? IWs - 0:00.00 /usr/local/bin/php _dhcp 35784 0.0 0.4 3324 960 ?? SNs Wed08AM 0:07.60 dhclient: vr2 (dhclient) dhcpd 42173 0.0 2.5 11456 5968 ?? Is Fri02PM 0:34.97 /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.p root 42891 0.0 2.9 30708 7036 ?? I 8:50AM 0:19.96 /usr/local/bin/php root 49780 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/bin/ping_hosts.sh root 49928 0.0 0.1 3264 244 ?? I 26Sep13 0:29.68 minicron: helper /usr/local/bin/ping_hosts.sh (minicron) root 50095 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /etc/rc.expireaccounts root 50711 0.0 0.0 3264 0 ?? IW - 0:00.00 minicron: helper /etc/rc.expireaccounts (minicron) root 50841 0.0 0.0 3264 0 ?? IWs - 0:00.00 /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid /etc/rc.update_alias_url_data root 50977 0.0 0.0 3264 0 ?? IW - 0:00.00 minicron: helper /etc/rc.update_alias_url_data (minicron) root 53184 0.0 0.3 3480 832 ?? Is 26Sep13 0:00.23 /usr/local/sbin/sshlockout_pf 15 root 57050 0.0 0.8 8036 2020 ?? Ss 8:46AM 0:00.46 sshd: admin@pts/1 (sshd) root 59543 0.0 0.5 3324 1128 ?? Is Fri12PM 0:00.01 dhclient: vr1 [priv] (dhclient) _dhcp 64959 0.0 0.5 3324 1268 ?? Ss Fri12PM 0:03.87 dhclient: vr1 (dhclient) root 67059 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 67915 0.0 0.3 3264 724 ?? Is Fri12PM 0:01.09 /usr/local/sbin/dhcpleases -l /var/dhcpd/var/db/dhcpd.leases -d key.lan -p /var/run/dnsmasq.pid -h /var/et root 68923 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.93 <defunct>nobody 69584 0.0 1.0 5512 2392 ?? S Fri12PM 0:51.72 [dnsmasq] root 71162 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.14 <defunct>root 72926 0.0 8.1 32628 19536 ?? IN 8:54AM 0:01.13 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 73060 0.0 0.0 0 0 ?? Z< Sat02PM 0:00.95 <defunct>root 74771 0.0 0.3 3644 640 ?? IN Fri12PM 1:31.58 /bin/sh /var/db/rrd/updaterrd.sh root 77144 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 78118 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 79237 0.0 0.0 32628 0 ?? IWN - 0:00.00 /usr/local/bin/php -f /etc/rc.filter_configure_sync root 81062 0.0 0.9 24436 2116 ?? I 8:51AM 0:00.00 /usr/local/bin/php root 89469 0.0 2.6 6280 6300 ?? SNs Fri06AM 0:28.32 /usr/local/bin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.pid root 90300 0.0 0.0 0 0 ?? Z< Sat02PM 0:01.22 <defunct>root 52851 0.0 0.0 3720 0 u0 IWs - 0:00.00 login [pam] (login) root 53321 0.0 0.0 3644 0 u0 IW - 0:00.00 -sh (sh) root 54058 0.0 0.2 3644 440 u0 I+ 26Sep13 0:00.02 /bin/sh /etc/rc.initial root 12597 0.0 0.0 3644 0 0 IWs - 0:00.00 /bin/sh /etc/rc.initial root 23302 0.0 0.0 4696 0 0 IW - 0:00.00 /bin/tcsh root 56804 0.0 0.8 3660 1828 0 S+ 8:46AM 0:02.90 top root 2131 0.0 0.5 3404 1240 1 R+ 9:01AM 0:00.01 ps aux root 57546 0.0 0.0 3644 0 1 IWs - 0:00.00 /bin/sh /etc/rc.initial root 60441 0.0 0.9 4696 2248 1 S 8:46AM 0:00.09 /bin/tcsh</defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct></defunct>
Good that is as expected
Can you post your Status -> System Logs -> Gateway ? It would help to see how often you are seeing the alarms generated.
This is a few minutes of my log utilizing the default Gateway settings with apinger running and saturated upstream bandwidth on the secondary WAN; seems to fire around three times per minute:
Mar 3 09:06:30 apinger: Starting Alarm Pinger, apinger(17477) Mar 3 09:06:40 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:06:45 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:06:59 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:03 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:14 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:21 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:32 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:38 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:48 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:07:55 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:06 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:12 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:22 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:28 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:39 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:45 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:08:55 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:01 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:11 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:18 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:29 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:34 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:45 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:09:51 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:01 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:07 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:17 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:23 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:33 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:40 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:50 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:10:56 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:11:07 apinger: ALARM: WAN_STUDIO_DHCP(208.67.220.220) *** down *** Mar 3 09:11:13 apinger: alarm canceled: WAN_STUDIO_DHCP(208.67.220.220) *** down ***
Crazy its the down alarm. Wasn't expecting that but makes sense for the alarm actions. Do you run any traffic shaping? Can you ping the same target as apinger and do the same load on upstream and post results?
Are you WAN connections the same? Do you see the same problem with saturated upstream on your main connection?
-
Crazy its the down alarm. Wasn't expecting that but makes sense for the alarm actions. Do you run any traffic shaping? Can you ping the same target as apinger and do the same load on upstream and post results?
Are you WAN connections the same? Do you see the same problem with saturated upstream on your main connection?
I deleted my limiters when I started noticing issues, so no traffic shaping for these tests.
I'm not sure I understand the ping test you're asking for. Do you want me to just post ping results sent from the secondary WAN during upstream saturation?
WAN connections are not the same but come from the same company; I do see the same issue on both. Primary is a 20/2 cable connection from the same company as the secondary with a 60/5. However, interestingly the modems are assigned different gateways by the cable company's DHCP. I'd love to load balance the upstream so I can get this Backblaze initial seed over with, but I suspect the Backblaze client doesn't open multiple connections.
-
Crazy its the down alarm. Wasn't expecting that but makes sense for the alarm actions. Do you run any traffic shaping? Can you ping the same target as apinger and do the same load on upstream and post results?
Are you WAN connections the same? Do you see the same problem with saturated upstream on your main connection?
I deleted my limiters when I started noticing issues, so no traffic shaping for these tests.
I'm not sure I understand the ping test you're asking for. Do you want me to just post ping results sent from the secondary WAN during upstream saturation?
WAN connections are not the same but come from the same company; I do see the same issue on both. Primary is a 20/2 cable connection from the same company as the secondary with a 60/5. However, interestingly the modems are assigned different gateways by the cable company's DHCP. I'd love to load balance the upstream so I can get this Backblaze initial seed over with, but I suspect the Backblaze client doesn't open multiple connections.
Yes just do a ping or winmtr to somewhere as you saturate. It would be good to see exactly what is going on. What are you using to saturate it anyway? I cannot for the life of me reproduce this here.
-
I'm using the Mac Backblaze client to saturate the upstream bandwidth. Backblaze is a cloud based backup host. The Mac is connected to the Alix router through two unmanaged switches, and a firewall rule sends it through the secondary WAN. Everything else on the network goes through the primary WAN using the catchall default firewall rule.
Interestingly, pinging Google and OpenDNS servers using the configurator's tool using the secondary WAN as the source is unsurprisingly ugly (>3000ms and plenty of packet loss); however, pinging the same servers from the Mac that's being routed through the firewall to the secondary WAN show reasonable pings and no packet loss.
-
I'm using the Mac Backblaze client to saturate the upstream bandwidth. Backblaze is a cloud based backup host. The Mac is connected to the Alix router through two unmanaged switches, and a firewall rule sends it through the secondary WAN. Everything else on the network goes through the primary WAN using the catchall default firewall rule.
Interestingly, pinging Google and OpenDNS servers using the configurator's tool using the secondary WAN as the source is unsurprisingly ugly (>3000ms and plenty of packet loss); however, pinging the same servers from the Mac that's being routed through the firewall to the secondary WAN show reasonable pings and no packet loss.
Can you post your /var/etc/apinger.conf ?
-
Why would the pfSense pingtimes from the secondary WAN be so much crazier than the ones from a LAN side client routed through the secondary WAN? Is the Mac providing priority to those packets before they get to the router?
This is the apinger.conf I'm using right now that is keeping php from going crazy on me:
# pfSense apinger configuration file. Automatically Generated! ## User and group the pinger should run as user "root" group "wheel" ## Mailer to use (default: "/usr/lib/sendmail -t") #mailer "/var/qmail/bin/qmail-inject" ## Location of the pid-file (default: "/var/run/apinger.pid") pid_file "/var/run/apinger.pid" ## Format of timestamp (%s macro) (default: "%b %d %H:%M:%S") #timestamp_format "%Y%m%d%H%M%S" status { ## File where the status information should be written to file "/var/run/apinger.status" ## Interval between file updates ## when 0 or not set, file is written only when SIGUSR1 is received interval 5s } ######################################## # RRDTool status gathering configuration # Interval between RRD updates rrd interval 60s; ## These parameters can be overridden in a specific alarm configuration alarm default { command on "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' " command off "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' " combine 10s } ## "Down" alarm definition. ## This alarm will be fired when target doesn't respond for 30 seconds. alarm down "down" { time 10s } ## "Delay" alarm definition. ## This alarm will be fired when responses are delayed more than 200ms ## it will be canceled, when the delay drops below 100ms alarm delay "delay" { delay_low 200ms delay_high 500ms } ## "Loss" alarm definition. ## This alarm will be fired when packet loss goes over 20% ## it will be canceled, when the loss drops below 10% alarm loss "loss" { percent_low 10 percent_high 20 } target default { ## How often the probe should be sent interval 1s ## How many replies should be used to compute average delay ## for controlling "delay" alarms avg_delay_samples 10 ## How many probes should be used to compute average loss avg_loss_samples 50 ## The delay (in samples) after which loss is computed ## without this delays larger than interval would be treated as loss avg_loss_delay_samples 20 ## Names of the alarms that may be generated for the target alarms "down","delay","loss" ## Location of the RRD #rrd file "/var/db/rrd/apinger-%t.rrd" } target "208.67.222.222" { description "WAN" srcip "174.134.xxx.xx" alarms override "loss","delay","down"; rrd file "/var/db/rrd/WAN-quality.rrd" } alarm loss "WAN_STUDIO_DHCPloss" { percent_low 30 percent_high 40 } alarm delay "WAN_STUDIO_DHCPdelay" { delay_low 4000ms delay_high 5000ms } alarm down "WAN_STUDIO_DHCPdown" { time 120s } target "208.67.220.220" { description "WAN_STUDIO_DHCP" srcip "174.135.xxx.xx" interval 20s alarms override "WAN_STUDIO_DHCPloss","WAN_STUDIO_DHCPdelay","WAN_STUDIO_DHCPdown"; rrd file "/var/db/rrd/WAN_STUDIO_DHCP-quality.rrd" }
Whereas, this one is more the default configuration which causes the extra processes to take over:
# pfSense apinger configuration file. Automatically Generated! ## User and group the pinger should run as user "root" group "wheel" ## Mailer to use (default: "/usr/lib/sendmail -t") #mailer "/var/qmail/bin/qmail-inject" ## Location of the pid-file (default: "/var/run/apinger.pid") pid_file "/var/run/apinger.pid" ## Format of timestamp (%s macro) (default: "%b %d %H:%M:%S") #timestamp_format "%Y%m%d%H%M%S" status { ## File where the status information should be written to file "/var/run/apinger.status" ## Interval between file updates ## when 0 or not set, file is written only when SIGUSR1 is received interval 5s } ######################################## # RRDTool status gathering configuration # Interval between RRD updates rrd interval 60s; ## These parameters can be overridden in a specific alarm configuration alarm default { command on "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' " command off "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' " combine 10s } ## "Down" alarm definition. ## This alarm will be fired when target doesn't respond for 30 seconds. alarm down "down" { time 10s } ## "Delay" alarm definition. ## This alarm will be fired when responses are delayed more than 200ms ## it will be canceled, when the delay drops below 100ms alarm delay "delay" { delay_low 200ms delay_high 500ms } ## "Loss" alarm definition. ## This alarm will be fired when packet loss goes over 20% ## it will be canceled, when the loss drops below 10% alarm loss "loss" { percent_low 10 percent_high 20 } target default { ## How often the probe should be sent interval 1s ## How many replies should be used to compute average delay ## for controlling "delay" alarms avg_delay_samples 10 ## How many probes should be used to compute average loss avg_loss_samples 50 ## The delay (in samples) after which loss is computed ## without this delays larger than interval would be treated as loss avg_loss_delay_samples 20 ## Names of the alarms that may be generated for the target alarms "down","delay","loss" ## Location of the RRD #rrd file "/var/db/rrd/apinger-%t.rrd" } target "208.67.222.222" { description "WAN" srcip "174.134.xxx.xx" alarms override "loss","delay","down"; rrd file "/var/db/rrd/WAN-quality.rrd" } target "208.67.220.220" { description "WAN_STUDIO_DHCP" srcip "174.135.xxx.xx" alarms override "loss","delay","down"; rrd file "/var/db/rrd/WAN_STUDIO_DHCP-quality.rrd" }
-
I am not sure but this is getting interesting! May be some sort of priority thing. I won't be able to do too much more work as I am flying across the country to do some IT work.
-
Safe travels. Feel free to assume that I'm making some rookie mistake and float it out there for me to check.
I rebooted the router and sent another client into the secondary WAN to test the ping through it and found the same erratic behaviour that performing the ping directly from the secondary WAN interface on the router was causing. Then I went back to the original client that was saturating that WAN and had previously not had issue with pings to find that now it was also consistent with the poor ping performance.
I am unable to replicate the state now where I had good ping times from the client that was saturating the upstream while seeing bad ones from the router on that interface. The good news is that they are all inline with each other at least!
arri
-
Safe travels. Feel free to assume that I'm making some rookie mistake and float it out there for me to check.
I rebooted the router and sent another client into the secondary WAN to test the ping through it and found the same erratic behaviour that performing the ping directly from the secondary WAN interface on the router was causing. Then I went back to the original client that was saturating that WAN and had previously not had issue with pings to find that now it was also consistent with the poor ping performance.
I am unable to replicate the state now where I had good ping times from the client that was saturating the upstream while seeing bad ones from the router on that interface. The good news is that they are all inline with each other at least!
arri
Indeed. I missed two flights. I have hate airlines. Canceled that contract. Horrible day.
- Raise ICMP priority
- Lower interval or use combine to lower alarm repetitions
- Traffic shape so you leave a bit of space for the ICMP
- Try setting the down time higher.
-
You might try a 2.1.1 snapshot that has many fixes for various things in it.
https://doc.pfsense.org/index.php/2.1.1_New_Features_and_Changes
The are still pre-release snapshots though so new bugs may be added. ;)Steve
-
The Mac is connected to the Alix router through two unmanaged switches, and a firewall rule sends it through the secondary WAN.
Maybe the rule is only for TCP/UDP and ICMP ping from the Mac is still going out the default gateway? That would explain the ping being so good. You could traceroute from Mac to an external host to see which way the ICMP is really going.
It does sound like the link is so saturated that the apinger monitoring is struggling to see any decent numbers and is deciding the link is bad. -
The Mac is connected to the Alix router through two unmanaged switches, and a firewall rule sends it through the secondary WAN.
Maybe the rule is only for TCP/UDP and ICMP ping from the Mac is still going out the default gateway? That would explain the ping being so good. You could traceroute from Mac to an external host to see which way the ICMP is really going.
It does sound like the link is so saturated that the apinger monitoring is struggling to see any decent numbers and is deciding the link is bad.You are correct, I noticed before the reboot that my rule was limited to TCP/UDP, and updated it to "any." That explains the new consistency. Nice catch!
It's working fine with the ludicrous numbers for monitoring delays and decreasing the number of checks that gateway does. If I find the time, I'll give a snapshot a try as this is the box for doing that on.
arri
-
The Mac is connected to the Alix router through two unmanaged switches, and a firewall rule sends it through the secondary WAN.
Maybe the rule is only for TCP/UDP and ICMP ping from the Mac is still going out the default gateway? That would explain the ping being so good. You could traceroute from Mac to an external host to see which way the ICMP is really going.
It does sound like the link is so saturated that the apinger monitoring is struggling to see any decent numbers and is deciding the link is bad.You are correct, I noticed before the reboot that my rule was limited to TCP/UDP, and updated it to "any." That explains the new consistency. Nice catch!
It's working fine with the ludicrous numbers for monitoring delays and decreasing the number of checks that gateway does. If I find the time, I'll give a snapshot a try as this is the box for doing that on.
arri
I am not sure if its going to change anything. Turn off gateway monitor on that wan and then test with backblaze and ping out and see if you see a difference. There could be something going on where the added load on the box is degrading the connection further or something. If you get the same loss/latency without apinger running on that WAN then you are dealing with normal saturation? This will need to be fixed using one of the things listed in my above post.
-
I am not sure if its going to change anything. Turn off gateway monitor on that wan and then test with backblaze and ping out and see if you see a difference. There could be something going on where the added load on the box is degrading the connection further or something. If you get the same loss/latency without apinger running on that WAN then you are dealing with normal saturation? This will need to be fixed using one of the things listed in my above post.
I finally got around to trying this for you. With the gateway monitor off on the secondary wan, and fully saturated upstream traffic out of it, ping times from both the configurator page using the secondary wan gateway and from the client routed through the firewall (this time all traffic not just tcp/udp;) are from two to five seconds.
I think it's clearly just a congested interface the way Backblaze saturates it through their SSL tunnel.
-
I am not sure if its going to change anything. Turn off gateway monitor on that wan and then test with backblaze and ping out and see if you see a difference. There could be something going on where the added load on the box is degrading the connection further or something. If you get the same loss/latency without apinger running on that WAN then you are dealing with normal saturation? This will need to be fixed using one of the things listed in my above post.
I finally got around to trying this for you. With the gateway monitor off on the secondary wan, and fully saturated upstream traffic out of it, ping times from both the configurator page using the secondary wan gateway and from the client routed through the firewall (this time all traffic not just tcp/udp;) are from two to five seconds.
I think it's clearly just a congested interface the way Backblaze saturates it through their SSL tunnel.
Is that same, worse or better than results with apinger on? Would have been interesting if apinger perpetuates the problem with the highly increased load.
-
Having the same CPU rising behaviour on our PCEngine ALIX with Nano Intall (2.1.4 stable).
In France, we experiment loads of trouble over xDSL connections. Mainly loss, caused even by a bad synchro or by a user that get the line to saturate because of big downloads / uploads.
This causes pf to experience a hight CPU load when GW is considered as offline by PF.We did the trick of gateway polling in "Routing->Gateways->Edit gateway" :
- Advanced->Packet Loss Thresholds = 20% / 40% (default 10% / 20%)
- Probe Interval = 5s (default = 1s)
- Down = 60s (default = 10s)For what we experienced so far with those values is a better responsivness of PHP UI, and RRD graph shows a fall of CPU load. Still having those settings for test for few hours on PF that are experimenting DSL sync difficulties.
It looks good so far, and looks like increasing apinger tests and faillure decision, gives the ALIX more time to execute what it has to execute, and CPU graph falls dramatically (so far…).
Sounds to be a good and quick idea to play with the values above.