SLBD using entire CPU
-
I know there is a thread that discusses this, but I was unable to post to it (inactive for too long?).
I have been trying to configure load balancing across multiple WANs. Everything seems to be functioning OK, except that slbd gets stuck at 100% CPU use. This doesn't take down the machine, since I have two CPUs.
I saw in a thread that this "would be fixed in 1.2 beta 1", but I am running 1.2 RC3 and still having this problem.
If someone could point me towards any information or offer any advice it would be greatly appreciated.
-
If it's any help, I've included the output from the top command. The last time this happened, a few hours ago, I kill -9'ed the process and then started it again by hand.
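For reference, the manual restart was roughly this (just a sketch; the slbd invocation is copied from the ps output later in this thread, so adjust the config path to your own setup):

killall -9 slbd
/usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000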
The system is functioning and everything seems OK (although the network load is very low at the moment anyway).

$ top
last pid: 15825; load averages: 1.06, 1.01, 1.00 up 0+17:23:12 20:40:52
35 processes: 2 running, 33 sleeping
Mem: 27M Active, 9316K Inact, 29M Wired, 16M Buf, 675M Free
Swap: 2048M Total, 2048M Free

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
7073 root 1 115 0 1924K 1104K CPU0 0 57:17 98.93% slbd
57045 root 1 -8 0 15228K 13288K piperd 0 0:08 0.93% php
411 root 1 4 0 3976K 3536K kqread 1 2:06 0.44% lighttpd
54436 root 1 4 0 14768K 12340K accept 0 0:11 0.39% php
85342 root 1 8 20 1864K 1304K wait 0 0:15 0.00% sh
188 root 1 96 0 1440K 1072K select 1 0:08 0.00% syslogd
705 root 1 8 20 1272K 716K nanslp 1 0:07 0.00% check_reload_status
324 root 1 -58 0 3716K 1912K bpf 0 0:06 0.00% tcpdump
1584 dhcpd 1 96 0 2540K 2172K select 0 0:04 0.00% dhcpd
325 root 1 -8 0 1276K 728K piperd 1 0:03 0.00% logger
9271 root 6 20 0 1924K 1140K kserel 0 0:01 0.00% slbd
83441 nobody 1 116 20 1472K 1128K select 0 0:01 0.00% dnsmasq
683 _ntp 1 96 0 1340K 1052K select 0 0:00 0.00% ntpd
380 proxy 1 4 0 704K 452K kqread 0 0:00 0.00% pftpx
694 root 1 8 0 1384K 1016K nanslp 0 0:00 0.00% cron
83681 proxy 1 4 20 704K 504K kqread 0 0:00 0.00% pftpx
109 root 1 96 0 504K 360K select 1 0:00 0.00% devd
684 root 1 96 0 1376K 1048K select 0 0:00 0.00% ntpd

-
Just to confirm that this still seems to be a problem, here's the output from our pfSense firewall.
Dual Pentium III Xeon
256MB RDRAM
3 Intel Pro/100 NICs

# top
last pid: 19867; load averages: 41.03, 40.61, 40.37 up 2+23:14:45 08:16:23
83 processes: 39 running, 43 sleeping, 1 zombie
CPU states: 7.6% user, 0.0% nice, 91.7% system, 0.7% interrupt, 0.0% idle
Mem: 58M Active, 14M Inact, 30M Wired, 15M Buf, 137M Free
Swap: 512M Total, 512M Free

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
87681 root 1 132 0 2064K 1120K RUN 0 10:58 5.13% slbd
33521 root 1 132 0 2060K 1116K RUN 0 245:02 5.03% slbd
22389 root 1 132 0 2064K 1120K RUN 1 120:53 5.03% slbd
5023 root 1 132 0 2068K 1124K RUN 0 273:17 4.98% slbd
18288 root 1 132 0 2068K 1124K RUN 0 178:08 4.98% slbd
74050 root 1 132 0 2064K 1120K RUN 0 143:04 4.93% slbd
47444 root 1 132 0 2064K 1120K RUN 0 337:37 4.83% slbd
60666 root 1 132 0 2064K 1120K RUN 0 321:13 4.83% slbd
98236 root 1 132 0 2068K 1124K RUN 0 193:23 4.79% slbd
68726 root 1 132 0 2064K 1120K RUN 0 146:49 4.79% slbd
55706 root 1 132 0 2068K 1124K RUN 1 549:09 4.74% slbd
80004 root 1 131 0 2068K 1128K RUN 0 462:31 4.74% slbd
8586 root 1 131 0 2068K 1128K RUN 0 397:24 4.74% slbd
17741 root 1 131 0 2064K 1124K RUN 0 259:00 4.74% slbd
69227 root 1 132 0 2064K 1124K RUN 0 214:44 4.74% slbd
7107 root 1 131 0 2064K 1120K RUN 0 186:13 4.74% slbd
22395 root 1 132 0 2064K 1120K RUN 1 174:30 4.74% slbd
61055 root 1 132 0 2064K 1120K RUN 0 151:43 4.74% slbd
29996 root 1 131 0 2064K 1120K RUN 0 116:23 4.74% slbd
80622 root 1 131 0 2064K 1120K RUN 0 92:01 4.74% slbd
4439 root 1 132 0 2068K 1124K RUN 0 41:14 4.74% slbd
88900 root 1 131 0 2068K 1128K RUN 0 437:57 4.69% slbd
25584 root 1 131 0 2064K 1120K RUN 0 171:57 4.69% slbd
78407 root 1 132 0 2064K 1120K RUN 1 140:01 4.69% slbd
6488 root 1 132 0 2064K 1120K RUN 1 40:14 4.69% slbd
1442 root 1 131 0 2068K 1124K RUN 0 411:39 4.64% slbd
40966 root 1 132 0 2064K 1120K RUN 0 348:16 4.64% slbd
48921 root 1 131 0 2068K 1128K RUN 0 230:18 4.64% slbd
23749 root 1 131 0 2064K 1120K RUN 0 71:48 4.64% slbd
4887 root 1 132 0 2064K 1120K CPU1 0 40:59 4.64% slbd
32569 root 1 132 0 2068K 1124K RUN 1 29:48 4.59% slbd
89364 root 1 132 0 2064K 1120K RUN 0 10:18 4.59% slbd
53988 root 1 132 0 2068K 1124K RUN 1 556:30 4.54% slbd
80933 root 1 131 0 2064K 1120K RUN 0 91:50 4.54% slbd
2640 root 1 132 20 2068K 1132K RUN 0 420:57 2.29% slbd
2794 root 1 132 20 2068K 1132K RUN 1 418:49 2.15% slbd
19867 root 1 128 0 2444K 1664K CPU0 0 0:00 0.51% top
667 root 1 96 0 1280K 716K select 0 0:40 0.00% choparp
607 root 1 4 0 3408K 2548K kqread 0 0:36 0.00% lighttpd
1344 root 1 -8 20 1868K 1308K piperd 1 0:22 0.00% sh
1138 root 1 8 0 1720K 1156K wait 0 0:16 0.00% sh
309 root 1 -58 0 4308K 2588K bpf 0 0:14 0.00% tcpdump
192 root 1 96 0 1460K 1092K select 0 0:12 0.00% syslogd
225 root 1 96 0 2804K 1792K select 0 0:06 0.00% mpd
17433 root 1 8 0 14752K 11796K nanslp 0 0:06 0.00% php
1137 root 1 96 0 1372K 1056K select 0 0:05 0.00% miniupnpd
616 root 1 4 0 22912K 20528K accept 0 0:03 0.00% php
310 root 1 -8 0 1276K 728K piperd 0 0:03 0.00% logger
1286 root 1 116 20 2880K 2372K select 0 0:03 0.00% racoon
570 proxy 1 4 0 704K 452K kqread 0 0:02 0.00% pftpx
Relevant part of ps output:
# ps wwaux
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 25584 5.6 0.4 2064 1120 ?? R Mon02PM 172:04.36 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 19970 5.1 0.5 2064 1124 ?? R 8:16AM 0:10.30 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 22389 5.0 0.4 2064 1120 ?? R Tue12AM 121:01.22 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 87681 4.8 0.4 2064 1120 ?? R 4:51AM 11:05.18 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 89364 4.8 0.4 2064 1120 ?? R 5:03AM 10:25.35 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 40966 4.8 0.4 2064 1120 ?? R Sun08PM 348:23.30 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 80622 4.8 0.4 2064 1120 ?? R Tue07AM 92:08.38 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 4439 4.7 0.5 2068 1124 ?? R 7:58PM 41:21.34 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 88900 4.7 0.5 2068 1128 ?? R Sun03PM 438:04.66 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 7107 4.6 0.4 2064 1120 ?? R Mon12PM 186:20.32 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 17741 4.6 0.5 2064 1124 ?? R Mon03AM 259:07.83 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 61055 4.6 0.4 2064 1120 ?? R Mon06PM 151:50.78 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 78407 4.6 0.4 2064 1120 ?? R Mon08PM 140:08.98 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 5023 4.5 0.5 2068 1124 ?? R Mon02AM 273:24.07 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 22395 4.5 0.4 2064 1120 ?? R Mon02PM 174:37.21 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 74050 4.5 0.4 2064 1120 ?? R Mon07PM 143:11.92 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 6488 4.5 0.4 2064 1120 ?? RL 8:13PM 40:21.46 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 23749 4.5 0.4 2064 1120 ?? R 12:04PM 71:55.50 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 29996 4.5 0.4 2064 1120 ?? R Tue01AM 116:30.97 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 8586 4.4 0.5 2068 1128 ?? R Sun05PM 397:31.74 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 33521 4.4 0.4 2060 1116 ?? R Mon05AM 245:09.41 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 53988 4.4 0.5 2068 1124 ?? R Sun12PM 556:37.39 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 55706 4.4 0.5 2068 1124 ?? R Sun12PM 549:16.78 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 68726 4.4 0.4 2064 1120 ?? R Mon07PM 146:56.54 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 80933 4.4 0.4 2064 1120 ?? R Tue07AM 91:57.58 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 1442 4.4 0.5 2068 1124 ?? R Sun04PM 411:46.58 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 47444 4.4 0.4 2064 1120 ?? R Sun08PM 337:44.57 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 48921 4.4 0.5 2068 1128 ?? R Mon06AM 230:26.46 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 69227 4.4 0.5 2064 1124 ?? R Mon08AM 214:51.33 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 80004 4.4 0.5 2068 1128 ?? R Sun02PM 462:38.85 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 98236 4.4 0.5 2068 1124 ?? R Mon11AM 193:30.23 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 18288 4.3 0.5 2068 1124 ?? R Mon01PM 178:15.61 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 60666 4.3 0.4 2064 1120 ?? R Sun10PM 321:20.62 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 32569 4.2 0.5 2068 1124 ?? R 11:15PM 29:55.74 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 4887 4.2 0.4 2064 1120 ?? R 8:01PM 41:06.42 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 2640 2.2 0.5 2068 1132 ?? RN Sun09AM 421:01.26 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
root 2794 1.8 0.5 2068 1132 ?? RN Sun09AM 418:53.32 /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
I noticed in cvstrac that a script was created to kill and restart slbd every 5 hours, but it doesn't seem to actually be working on my system, as the multiple slbd processes in the ps output seem to indicate.
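A couple of quick checks for anyone who wants to verify this on their own box (hypothetical commands, assuming the restart job is scheduled in the system crontab):

grep slbd /etc/crontab          # is the restart job scheduled at all?
ps ax | grep '[s]lbd' | wc -l   # how many slbd processes are alive right now?

(The [s]lbd pattern keeps grep from matching its own process.)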
Thanks
-
I found the ticket number, 1316, and the check-ins: 18733, 18734 and 18735.
The script does look like it would work. Was something broken in RC3? I didn't try to load balance with previous versions. I'm going to manually execute that script and see if that fixes anything.
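One way to see what the script actually does when run by hand is to trace it (the path below is the one shown further down this thread):

sh -x /usr/local/sbin/reset_slbd.sh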
Maybe the script isn't being called?

-
I am using RC3 and I do not see this problem.
-
Status:

Loadb gateway
  opt2  Online  Last change Nov 15
  wan   Online  Last change Nov 15
Failoverlb gateway
  opt2  Online  Last change Nov 15
  wan   Online  Last change Nov 15

Config:

Loadb gateway
  opt2  192.168.100.1 (this is a cable modem and we need to change the monitor IP)
  wan   a.b.c.1
Failoverlb gateway
  opt2  192.168.100.1
  wan   a.b.c.1

top:
last pid: 52013; load averages: 0.25, 0.12, 0.09 up 0+02:29:12 11:31:10
32 processes: 1 running, 31 sleeping
Mem: 32M Active, 8572K Inact, 25M Wired, 14M Buf, 114M Free
Swap:

PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
51634 root 1 -8 0 14796K 12004K piperd 0:03 4.07% php
861 root 1 4 0 3328K 2464K kqread 0:14 0.00% lighttpd
6391 root 5 20 0 1908K 1128K kserel 0:08 0.00% slbd
866 root 1 4 0 22384K 19896K accept 0:05 0.00% php
250 root 1 96 0 1440K 1072K select 0:04 0.00% syslogd
1129 root 1 8 20 1768K 1208K wait 0:03 0.00% sh
366 root 1 -58 0 4208K 2492K bpf 0:02 0.00% tcpdump
367 root 1 -8 0 1276K 728K piperd 0:02 0.00% logger
919 nobody 1 96 0 1472K 1108K select 0:02 0.00% dnsmasq
1274 root 1 8 20 1272K 716K nanslp 0:01 0.00% check_reload_status
1155 dhcpd 1 96 0 2268K 1892K select 0:00 0.00% dhcpd
285 root 1 96 0 2804K 1788K select 0:00 0.00% mpd
1245 _ntp 1 96 0 1340K 1052K select 0:00 0.00% ntpd
872 root 1 8 0 14200K 4644K wait 0:00 0.00% php
786 proxy 1 4 0 704K 452K kqread 0:00 0.00% pftpx
808 proxy 1 4 0 704K 504K kqread 0:00 0.00% pftpx
1248 root 1 8 0 1384K 1032K nanslp 0:00 0.00% cron
862 root 1 8 0 14200K 4644K wait 0:00 0.00% php

The CPU is a VIA (probably a C7 or maybe a C3).
-
I've been trying to execute the killall slbd command. In its default state (sending SIGTERM) the processes don't exit. killall -9 slbd does seem to work (but that is kill -9!).
Anyway, does anyone have any advice on changing the script to use killall -9 slbd?
I'm not sure that this is a good idea…
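For what it's worth, the change would be something like this (just a sketch; it only escalates to -9 for processes that ignore SIGTERM, and the 5-second grace period is arbitrary):

killall slbd                  # ask politely first (SIGTERM)
sleep 5
killall -9 slbd 2>/dev/null   # force anything still stuck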
-
Maybe you should reinstall using the latest image? Probably some update went wrong, which is why the script you have is not working.
-
This install was done to a clean hard drive and configured from scratch. I currently have RC3 installed also.
I'm not sure what you're recommending. Would you like me to put a newer snapshot on?
-
I was recommending a clean install from scratch.
:)
-
I have to say that my system is a completely clean install from the released 1.2 RC3, and I'm seeing the same problems. I tried killall -9 slbd and that worked on my system as well, but the regular script doesn't work at all.
-
I think I'm going to try modifying that script and see what happens.
I'm not too happy about using kill -9 every 5 hours to fix a problem though…
Any other ideas?

-
I changed the script so that it kill -9's.
I ran it by hand and it worked. Now it's time to wait a few hours and see if the problem is "fixed".

$ cat /usr/local/sbin/reset_slbd.sh
#!/bin/sh
if [ `ps awux | grep slbd | wc -l` -gt 0 ]; then
        killall slbd
        killall -9 slbd
        /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
fi
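One quirk worth noting: `ps awux | grep slbd` also matches the grep process itself, so the count is never zero and the restart always fires. That's harmless for this workaround, but if you only want to restart when slbd is actually running, a tightened test would look something like this (a sketch):

#!/bin/sh
# [s]lbd prevents grep from matching its own command line
if [ `ps awux | grep '[s]lbd' | wc -l` -gt 0 ]; then
        killall -9 slbd
        /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
fi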
-
I have the same problem here…
http://forum.pfsense.org/index.php/topic,6852.0.html
I have two boxes; I changed the script on the main one and left the other one with the old script.
After a few hours, I can see only 2 slbd processes on the main one and 14 on the second one... So, that did the trick.
The script was changed like this, as the second killall command is not needed:
#!/bin/sh
if [ `ps awux | grep slbd | wc -l` -gt 0 ]; then
        killall -9 slbd
        killall slbd
        /usr/local/sbin/slbd -c/var/etc/slbd.conf -r5000
fi

Regards,
-
Things appear stable, although I have very little traffic being routed through the pool.
It looks like this 'fix' works. I'm definitely not a developer, but should something like this be considered for integration into the source tree?
I'm not sure who to even talk to about this…
-
I've started moving more and more traffic back into the load balancing pools…
slbd is getting stuck at full usage again, even with the modified script! I think the script isn't being run often enough.
This is rapidly turning into less of a fix and more of a workaround. I want to make this right. Is anyone still having this problem, or am I going nuts??
I currently have the majority of my traffic going out my primary WAN port without going through a pool. The rest (light web browsing from a few users) goes into a pool which has its own two WAN ports.
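If the interval is the problem, the schedule could be tightened, e.g. to hourly (a hypothetical line, assuming the job lives in the system crontab):

0 * * * * root /usr/local/sbin/reset_slbd.sh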
-
It's not a fix, it's a workaround until we can properly test and implement an alternative to slbd. We know what the problem is; unfortunately it's pretty much impossible to solve. The solution is ditching slbd for hoststated, which will be done in a future version.
-
Also, this workaround does seem to work for the vast majority of people.
wjs: how much load are you pushing to cause it to break down so easily?
-
cmb,
Thanks very much for pushing that change; I saw it on the cvstrac. Right now it's only my web browsing that's going into the WAN pool. The primary WAN port, which is not part of the pool at the moment, has a good bit of traffic. Last night we had about 1MB/s continuous, sometimes going up to about 10MB/s when someone would pull down something big.
The CPU load hovers under 15% or 20%, but I think most of that is because I've got the whole dashboard open.
I am only getting one process at a time maxing out before the script kicks in, so the system never goes to full load (dual CPU system). I'm not sure this answered your question…
If there is anything I can do to help get "hoststated" working for the next version let me know.