WebGUI dying under heavy load (Internal Server Error 500 etc.) ?
-
Hi,
First of all, big big congrats for an incredibly great project, with big big potential. 8)
During some tests made today (everything behaved extremely well up to this one so far), I stressed a Soekris 4801 (128 MB RAM) a little with a simple FTP transfer of a 200 MB file, using pfSense as a local router / simple firewall (allow all OPT1->LAN).
Without pfSense: 80+ Mbit/s; with pfSense RC1a it's about 30 Mbit/s, so the results are in line with other posts found here.
But, wanting to look at CPU loads during that transfer (from the FTP server machine on the LAN interface), I noticed that the built-in web server was completely dead during the transfer.
I got, systematically and reproducibly:
- timeouts
- Internal Server Error 500
- network unreachable
Once the FTP transfer was completed, everything went back to normal (internal web server reachable and responsive).
Even if this pushes the Soekris to its limits, it shouldn't happen on any platform: the admin web interface should remain accessible at all times, for security reasons. If you have a massive attack or transfer going through your firewall, you want to be able to access it to check what's going on… :-\
Hope this is not a repost; I didn't find anything in the forum about this problem after hours of searching and reading (discovering, btw, wow, great things…)
-
Are you using the traffic shaper?
-
The CPU load is 100% during that time and interrupts load the machine too much. This is not an issue if your hardware is fast enough to handle the load you put on it. However, there is a solution to this: turn on "device polling" at system>advanced. This way you shut down the interrupt storm. However, this mode is not highly tuned for optimum performance yet. See http://wiki.pfsense.com/wikka.php?wakka=Tuning for suggestions on how to get more performance out of this mode.
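For the curious: the GUI checkbox boils down to FreeBSD's standard polling knobs. A rough sketch of the by-hand equivalent (the exact sysctl names vary between FreeBSD versions, so treat this as illustrative, not as what pfSense runs verbatim):

# Kernel must be built with "options DEVICE_POLLING" (pfSense's already is).
echo 'kern.hz=1000' >> /boot/loader.conf   # poll the NICs 1000 times/s (loader tunable, needs a reboot)
sysctl kern.polling.enable=1               # turn polling on (the 4.x/5.x-style global knob)
sysctl kern.polling.user_frac=50           # % of CPU reserved for userland (e.g. the webGUI) vs. packet processing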
-
Wow ! Thanks for the fast and complete answers. This project rocks!
Are you using the traffic shaper?
Nope… basic default settings, plus only 4 basic changes to be able to connect 2 computers:
- assigned + activated the OPT1 port
- gave the OPT1 port the address 192.168.2.1/24
- activated the DHCP server on OPT1 with the range 192.168.2.100-199
- added 1 rule: ALLOW ALL OPT1->LAN.
The CPU load is 100% during that time and interrupts load the machine too much. This is not an issue if your hardware is fast enough to handle the load you put on it. However, there is a solution to this: turn on "device polling" at system>advanced. This way you shut down the interrupt storm. However, this mode is not highly tuned for optimum performance yet. See http://wiki.pfsense.com/wikka.php?wakka=Tuning for suggestions on how to get more performance out of this mode.
Thanks, I will give it a try tomorrow, and post back findings. ;)
I haven't tried yet on a 1 GHz-class PC, but I suppose that with 100 Mbit/s of traffic and small(est) packets, you would hit the same issue on most platforms… ? ::)
<dreaming>Wouldn't it be possible to keep performance at its best, both for latency in low-traffic conditions and for throughput/webGUI reachability in high-traffic conditions, by automatically switching this setting on the fly depending on traffic (e.g. if interrupt rate > 1000/s => go into poll mode)?</dreaming>
Guessing that you depend on FreeBSD there… But FreeBSD must have had the same issue a long time ago, when CPUs were slow… and solved it (how)?
Really, <important>I would prefer that the webGUI remain accessible under heavy (attack) conditions too.</important>
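For what it's worth, the interrupt rate such an automatic switch would key on is easy to watch from the shell with stock FreeBSD tools (the on-the-fly toggle itself is a different story, as the reply below notes):

vmstat -i          # cumulative interrupt counts and average rates per device
systat -vmstat 1   # live view; watch the interrupt figures climb during a transfer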
-
A tweaked device polling mode should be the best solution for this problem. However, a Soekris or WRAP might not be the platform of your choice if you really need OPT1 to LAN throughput. Check our recommended vendor list at http://pfsense.com/index.php?id=40 for other hardware. Dynamically going to device polling mode is not possible, afaik.
Btw, I have VIA C3 Nehemiah 1 GHz miniITX-based systems that can deliver 100 Mbit/s wirespeed at around 80% CPU load (last time I tested).
-
The CPU load is 100% during that time and interrupts load the machine too much. This is not an issue if your hardware is fast enough to handle the load you put on it. However, there is a solution to this: turn on "device polling" at system>advanced. This way you shut down the interrupt storm. However, this mode is not highly tuned for optimum performance yet. See http://wiki.pfsense.com/wikka.php?wakka=Tuning for suggestions on how to get more performance out of this mode.
Hi,
Here are the measurements with and without poll mode. The webGUI (system>advanced) says:

Use device polling
Device polling is a technique that lets the system periodically poll network devices for new data instead of relying on interrupts. This can reduce CPU load and therefore increase throughput, at the expense of a slightly higher forwarding delay (the devices are polled 1000 times per second). Not all NICs support polling; see the pfSense homepage for a list of supported cards.

FTP get without polling (same as yesterday): 2910 & 2751 kbytes/s (2 measurements)
CPU load in non-polling mode: 100% (when it finally displays something…; sometimes I also get xml version="1.0" encoding="iso-8859-1"?> instead of the load)
FTP get with polling (default settings): 1578 & 1562 & 1426 kbytes/s (3 measurements, the last one while browsing the built-in webserver successfully)
More interestingly, the CPU load displayed in polling mode is about 60%… ?! That's strange…

I then applied the recommended changes from the wiki to the 2 config files and got the following results:
FTP get with polling: 2900 & 2914 kbytes/s
CPU load 100% (displayed only just after the FTP transfer finished) and NO webGUI reachability during the transfer.

A tweaked device polling mode should be the best solution for this problem. However, a Soekris or WRAP might not be the platform of your choice if you really need OPT1 to LAN throughput. Check our recommended vendor list at http://pfsense.com/index.php?id=40 for other hardware. Dynamically going to device polling mode is not possible, afaik.
Btw, I have VIA C3 Nehemiah 1 GHz miniITX-based systems that can deliver 100 Mbit/s wirespeed at around 80% CPU load (last time I tested).
Please don't get me wrong, I am not shooting for maximum performance here (30 Mbit/s, i.e. about 2000 max-sized packets/s, is OK for "entry-level" hardware).
But I am shooting for graceful, fair and secure degradation under overload conditions (and overload may happen to any system of any size, under normal operations or under heavy attack conditions like DDoS, or other short but heavy packet loads ;) ). Right now pfSense does not seem to reach those goals:
- graceful: no abrupt disruption of parts of the service, whatever happens
- fair: admin and other users keep a chance of access during user- or attacker-generated overload (on any kind of system)
- secure: the system stays administrable, so one can at least detect the attack or the source of the overload.
Please correct me if I'm wrong. It looks at first glance like the processes/threads needed for the webGUI hardly get any CPU cycles under heavy load. ???
Looking at the numbers, I'm confident a 1 GHz system will hold 100 Mbit/s without 100% CPU usage with maximum-size 1500-byte Ethernet frames, but I doubt it will hold that throughput with minimum-size 64-byte packets. And in that case it might fail in the same way on the security aspect (the admin web interface staying accessible).
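The rough arithmetic behind that doubt, runnable from any shell (ignoring Ethernet framing overhead):

echo '100000000 / 8 / 1500' | bc   # ~8333 pps at 100 Mbit/s with 1500-byte frames
echo '100000000 / 8 / 64' | bc     # ~195312 pps with 64-byte frames: ~23x the per-packet work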
It would be interesting to benchmark the throughput in packets/second and the latency for other packet sizes… Does anyone have any numbers to share :) ?
What are your thoughts on these issues ?
Btw. Kudos to the devs, the more I play with the system, the more I'm amazed by the current state and potential: great work ! :)
-
Please run this from a shell (option 8) and retest:
ps awux | grep lighttpd | awk '{ print $2 }' | xargs renice -20
-
Please run this from a shell (option 8) and retest:
ps awux | grep lighttpd | awk '{ print $2 }' | xargs renice -20
Thanks for the command, cool pipe… :)
OK, here are some more results:
case "B" (polling, reproducing as before):
ftp get: 2876, 2913, 2903, 2952 kbytes/s
Here is the top output (hitting S to display system processes and C for CPU %), copied just after the transfer (top also freezes during the transfer):
last pid: 6432;  load averages: 5.04, 2.88, 2.46    up 0+00:48:23  11:41:20
76 processes: 9 running, 55 sleeping, 12 waiting
CPU states: 0.1% user, 0.2% nice, 1.3% system, 98.3% interrupt, 0.0% idle
Mem: 11M Active, 12M Inact, 16M Wired, 12M Buf, 78M Free
Swap:

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
14 root 1 -44 -163 0K 8K RUN 5:46 85.11% swi1: net
29 root 1 171 52 0K 8K RUN 37:22 15.38% idlepoll
15 root 1 -16 0 0K 8K - 0:03 0.05% yarrow
11 root 1 171 52 0K 8K RUN 0:21 0.00% idle
28 root 1 171 52 0K 8K pgzero 0:13 0.00% pagezero
12 root 1 -32 -151 0K 8K WAIT 0:12 0.00% swi4: clock sio
4514 root 1 96 0 2332K 1560K RUN 0:06 0.00% top
21 root 1 -64 -183 0K 8K WAIT 0:04 0.00% irq14: ata0
908 root 1 -8 20 1648K 1156K piperd 0:02 0.00% sh
2779 root 1 96 0 5564K 2588K select 0:02 0.00% sshd
4 root 1 -8 0 0K 8K - 0:01 0.00% g_down
723 root 1 4 0 3076K 2188K kqread 0:01 0.00% lighttpd
1009 root 1 8 20 240K 144K wait 0:01 0.00% check_reload_status
2 root 1 -8 0 0K 8K - 0:01 0.00% g_event
3 root 1 -8 0 0K 8K - 0:01 0.00% g_up
935 root 1 8 -88 1328K 796K nanslp 0:01 0.00% watchdogd
729 root 1 8 0 7496K 3780K wait 0:01 0.00% php
38 root 1 -16 0 0K 8K - 0:01 0.00% schedcpu
262 root 1 96 0 1344K 976K select 0:01 0.00% syslogd
58 root 1 -8 0 0K 8K mdwait 0:00 0.00% md1
50 root 1 -8 0 0K 8K mdwait 0:00 0.00% md0
794 root 1 8 0 1636K 1160K wait 0:00 0.00% sh
31 root 1 20 0 0K 8K syncer 0:00 0.00% syncer
888 root 1 -8 0 0K 8K mdwait 0:00 0.00% md2
383 root 1 -58 0 3664K 1744K bpf 0:00 0.00% tcpdump
728 root 1 8 0 7496K 3780K wait 0:00 0.00% php
3030 root 1 20 0 3544K 2500K pause 0:00 0.00% tcsh
33 root 1 -16 0 0K 8K sdflus 0:00 0.00% softdepflush
1012 root 1 8 0 1568K 1240K wait 0:00 0.00% login
2951 root 1 8 0 1648K 1200K wait 0:00 0.00% sh
30 root 1 -16 0 0K 8K psleep 0:00 0.00% bufdaemon
32 root 1 -4 0 0K 8K vlruwt 0:00 0.00% vnlru
1 root 1 8 0 568K 364K wait 0:00 0.00% init
867 proxy 1 96 0 656K 452K RUN 0:00 0.00% pftpx
765 nobody 1 96 0 1328K 948K select 0:00 0.00% dnsmasq
937 root 1 8 0 1304K 984K nanslp 0:00 0.00% cron
853 proxy 1 4 0 656K 412K kqread 0:00 0.00% pftpx
384 root 1 -8 0 1196K 684K piperd 0:00 0.00% logger
1014 root 1 5 0 1644K 1196K ttyin 0:00 0.00% sh
1013 root 1 8 0 1640K 1132K wait 0:00 0.00% sh
6429 root 1 116 20 1632K 1168K RUN 0:00 0.00% sh
930 dhcpd 1 96 0 2136K 1808K select 0:00 0.00% dhcpd
6431 root 1 116 20 1492K 624K RUN 0:00 0.00% netstat
6432 root 1 -8 20 1432K 936K piperd 0:00 0.00% awk
Compared to idle:
last pid: 6353;  load averages: 1.15, 2.35, 2.27    up 0+00:46:15  11:39:12
72 processes: 4 running, 56 sleeping, 12 waiting
CPU states: 0.4% user, 0.4% nice, 94.6% system, 4.7% interrupt, 0.0% idle
Mem: 11M Active, 12M Inact, 15M Wired, 12M Buf, 78M Free
Swap:

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
29 root 1 171 52 0K 8K RUN 35:59 89.50% idlepoll
14 root 1 -44 -163 0K 8K RUN 5:05 3.12% swi1: net
11 root 1 171 52 0K 8K RUN 0:21 0.00% idle
28 root 1 171 52 0K 8K pgzero 0:13 0.00% pagezero
12 root 1 -32 -151 0K 8K WAIT 0:12 0.00% swi4: clock sio
4514 root 1 96 0 2264K 1492K RUN 0:05 0.00% top
21 root 1 -64 -183 0K 8K WAIT 0:04 0.00% irq14: ata0
15 root 1 -16 0 0K 8K - 0:03 0.00% yarrow
908 root 1 8 20 1648K 1156K wait 0:02 0.00% sh
2779 root 1 96 0 5564K 2588K select 0:01 0.00% sshd
4 root 1 -8 0 0K 8K - 0:01 0.00% g_down
723 root 1 4 0 3076K 2188K kqread 0:01 0.00% lighttpd
1009 root 1 8 20 240K 144K nanslp 0:01 0.00% check_reload_status
2 root 1 -8 0 0K 8K - 0:01 0.00% g_event
3 root 1 -8 0 0K 8K - 0:01 0.00% g_up
935 root 1 8 -88 1328K 796K nanslp 0:01 0.00% watchdogd
729 root 1 8 0 7496K 3780K wait 0:01 0.00% php
38 root 1 -16 0 0K 8K - 0:01 0.00% schedcpu
262 root 1 96 0 1344K 972K select 0:00 0.00% syslogd
58 root 1 -8 0 0K 8K mdwait 0:00 0.00% md1
50 root 1 -8 0 0K 8K mdwait 0:00 0.00% md0
794 root 1 8 0 1636K 1160K wait 0:00 0.00% sh
31 root 1 20 0 0K 8K syncer 0:00 0.00% syncer
888 root 1 -8 0 0K 8K mdwait 0:00 0.00% md2
383 root 1 -58 0 3664K 1744K bpf 0:00 0.00% tcpdump
728 root 1 8 0 7496K 3780K wait 0:00 0.00% php
3030 root 1 20 0 3544K 2500K pause 0:00 0.00% tcsh
1012 root 1 8 0 1568K 1240K wait 0:00 0.00% login
33 root 1 -16 0 0K 8K sdflus 0:00 0.00% softdepflush
2951 root 1 8 0 1648K 1200K wait 0:00 0.00% sh
1 root 1 8 0 568K 364K wait 0:00 0.00% init
30 root 1 -16 0 0K 8K psleep 0:00 0.00% bufdaemon
32 root 1 -4 0 0K 8K vlruwt 0:00 0.00% vnlru
867 proxy 1 4 0 656K 452K kqread 0:00 0.00% pftpx
765 nobody 1 96 0 1328K 948K select 0:00 0.00% dnsmasq
937 root 1 8 0 1304K 984K nanslp 0:00 0.00% cron
853 proxy 1 4 0 656K 412K kqread 0:00 0.00% pftpx
384 root 1 -8 0 1196K 684K piperd 0:00 0.00% logger
1014 root 1 5 0 1644K 1196K ttyin 0:00 0.00% sh
1013 root 1 8 0 1640K 1132K wait 0:00 0.00% sh
930 dhcpd 1 96 0 2136K 1808K select 0:00 0.00% dhcpd
319 root 1 96 0 2832K 2184K select 0:00 0.00% sshd
9 root 1 -16 0 0K 8K psleep 0:00 0.00% pagedaemon
155 root 1 96 0 488K 344K select 0:00 0.00% devd
I then issued your command:
ps awux | grep lighttpd | awk '{ print $2 }' | xargs renice -20
723: old priority 0, new priority -20
renice: 6545: getpriority: No such process (process 6545 was the grep in the pipeline matching itself, already gone by the time renice ran ;) , so it worked).
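For what it's worth, that self-match can be avoided by writing the pattern so that grep's own ps entry doesn't match it, a classic shell trick:

ps awux | grep '[l]ighttpd' | awk '{ print $2 }' | xargs renice -20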
But no changes compared to the previous top (except that lighttpd had a nice value of -20): still a system freeze (no top refresh, no webGUI).
Tried renicing process 14 (swi1: net), but it didn't change its priority.
I then switched off poll mode in the advanced settings and got the following get throughputs:
ftp get: 2904, 2895 kbytes/s.
The top output looked as follows:
last pid: 8736;  load averages: 10.42, 4.50, 3.17    up 0+01:02:24  11:55:21
80 processes: 10 running, 58 sleeping, 12 waiting
CPU states: 19.6% user, 9.1% nice, 70.6% system, 0.8% interrupt, 0.0% idle
Mem: 13M Active, 12M Inact, 16M Wired, 13M Buf, 76M Free
Swap:

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
14 root 1 -44 -163 0K 8K WAIT 8:13 58.06% swi1: net
20 root 1 -68 -187 0K 8K WAIT 0:30 37.94% irq10: sis0 sis1+
11 root 1 171 52 0K 8K RUN 1:50 5.86% idle
8720 root 1 8 0 7656K 5172K nanslp 0:00 0.34% php
8651 root 1 -8 0 7668K 5184K piperd 0:00 0.24% php
8597 root 1 -8 0 7668K 5184K piperd 0:00 0.20% php
908 root 1 -8 20 1648K 1156K piperd 0:03 0.15% sh
8628 root 1 -8 0 7668K 5180K piperd 0:00 0.10% php
8698 root 1 139 20 1444K 972K RUN 0:00 0.05% awk
29 root 1 171 52 0K 8K RUN 45:29 0.00% idlepoll
28 root 1 171 52 0K 8K RUN 0:17 0.00% pagezero
12 root 1 -32 -151 0K 8K RUN 0:15 0.00% swi4: clock sio
21 root 1 -64 -183 0K 8K WAIT 0:05 0.00% irq14: ata0
15 root 1 -16 0 0K 8K - 0:04 0.00% yarrow
2779 root 1 96 0 5564K 2588K select 0:02 0.00% sshd
4 root 1 -8 0 0K 8K - 0:02 0.00% g_down
3 root 1 -8 0 0K 8K - 0:01 0.00% g_up
723 root 1 4 -20 3100K 2212K kqread 0:01 0.00% lighttpd
1009 root 1 8 20 240K 144K nanslp 0:01 0.00% check_reload_status
2 root 1 -8 0 0K 8K - 0:01 0.00% g_event
8259 root 1 96 0 2332K 1560K RUN 0:01 0.00% top
935 root 1 8 -88 1328K 796K nanslp 0:01 0.00% watchdogd
38 root 1 -16 0 0K 8K - 0:01 0.00% schedcpu
729 root 1 8 0 7496K 3780K wait 0:01 0.00% php
58 root 1 -8 0 0K 8K mdwait 0:01 0.00% md1
262 root 1 96 0 1344K 980K select 0:01 0.00% syslogd
50 root 1 -8 0 0K 8K mdwait 0:01 0.00% md0
794 root 1 8 0 1636K 1160K wait 0:00 0.00% sh
31 root 1 20 0 0K 8K syncer 0:00 0.00% syncer
888 root 1 -8 0 0K 8K mdwait 0:00 0.00% md2
383 root 1 -58 0 3664K 1752K bpf 0:00 0.00% tcpdump
3030 root 1 20 0 3548K 2520K pause 0:00 0.00% tcsh
728 root 1 8 0 7496K 3780K wait 0:00 0.00% php
33 root 1 -16 0 0K 8K sdflus 0:00 0.00% softdepflush
1012 root 1 8 0 1568K 1240K wait 0:00 0.00% login
30 root 1 -16 0 0K 8K psleep 0:00 0.00% bufdaemon
32 root 1 -4 0 0K 8K vlruwt 0:00 0.00% vnlru
867 proxy 1 4 0 656K 452K kqread 0:00 0.00% pftpx
2951 root 1 8 0 1648K 1200K wait 0:00 0.00% sh
853 proxy 1 4 0 656K 412K kqread 0:00 0.00% pftpx
1 root 1 8 0 568K 364K wait 0:00 0.00% init
937 root 1 8 0 1304K 984K nanslp 0:00 0.00% cron
384 root 1 -8 0 1196K 684K piperd 0:00 0.00% logger
1014 root 1 5 0 1644K 1196K ttyin 0:00 0.00% sh
So there is no difference in throughput, and the interrupt task is not taking 100%…
I then disabled the firewall in the advanced settings to check:
ftp get: 6637, 5548, 6720, 5711 kbytes/s…
But the webGUI and top are still blocked! (Without pfSense, direct connection, I get about 8800 kbytes/s.)
Here is the top output:
last pid: 24362;  load averages: 1.87, 0.64, 0.34    up 0+02:13:39  13:06:36
72 processes: 2 running, 57 sleeping, 13 waiting
CPU states: 3.0% user, 0.2% nice, 14.8% system, 74.0% interrupt, 8.0% idle
Mem: 11M Active, 11M Inact, 14M Wired, 13M Buf, 80M Free
Swap:

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
20 root 1 -68 -187 0K 8K WAIT 1:16 54.25% irq10: sis0 sis1+
11 root 1 171 52 0K 8K RUN 62:38 26.51% idle
15 root 1 -16 0 0K 8K - 0:09 2.34% yarrow
28 root 1 171 52 0K 8K pgzero 0:43 0.63% pagezero
29 root 1 171 52 0K 8K pollid 45:29 0.00% idlepoll
14 root 1 -44 -163 0K 8K WAIT 8:14 0.00% swi1: net
12 root 1 -32 -151 0K 8K WAIT 0:34 0.00% swi4: clock sio
16379 root 1 96 0 2264K 1504K RUN 0:18 0.00% top
21 root 1 -64 -183 0K 8K WAIT 0:09 0.00% irq14: ata0
908 root 1 8 20 1668K 628K wait 0:06 0.00% sh
723 root 1 4 -20 3140K 1372K kqread 0:05 0.00% lighttpd
4 root 1 -8 0 0K 8K - 0:04 0.00% g_down
1009 root 1 8 20 240K 140K nanslp 0:03 0.00% check_reload_status
3 root 1 -8 0 0K 8K - 0:03 0.00% g_up
2 root 1 -8 0 0K 8K - 0:03 0.00% g_event
16124 root 1 116 20 5564K 2660K select 0:02 0.00% sshd
935 root 1 8 -88 1328K 356K nanslp 0:02 0.00% watchdogd
729 root 1 8 0 7496K 848K wait 0:02 0.00% php
38 root 1 -16 0 0K 8K - 0:02 0.00% schedcpu
58 root 1 -8 0 0K 8K mdwait 0:01 0.00% md1
50 root 1 -8 0 0K 8K mdwait 0:01 0.00% md0
31 root 1 20 0 0K 8K syncer 0:01 0.00% syncer
262 root 1 96 0 1344K 564K select 0:01 0.00% syslogd
794 root 1 8 0 1636K 560K wait 0:01 0.00% sh
888 root 1 -8 0 0K 8K mdwait 0:01 0.00% md2
383 root 1 -58 0 3664K 472K bpf 0:01 0.00% tcpdump
728 root 1 8 0 7496K 848K wait 0:00 0.00% php
33 root 1 -16 0 0K 8K sdflus 0:00 0.00% softdepflush
32 root 1 -4 0 0K 8K vlruwt 0:00 0.00% vnlru
30 root 1 -16 0 0K 8K psleep 0:00 0.00% bufdaemon
853 proxy 1 4 0 656K 272K kqread 0:00 0.00% pftpx
867 proxy 1 4 0 656K 324K kqread 0:00 0.00% pftpx
10132 nobody 1 96 0 1328K 984K select 0:00 0.00% dnsmasq
937 root 1 8 0 1304K 520K nanslp 0:00 0.00% cron
1012 root 1 8 0 1568K 328K wait 0:00 0.00% login
16348 root 1 20 0 3452K 2424K pause 0:00 0.00% tcsh
1 root 1 8 0 568K 224K wait 0:00 0.00% init
384 root 1 -8 0 1196K 180K piperd 0:00 0.00% logger
16315 root 1 8 0 1648K 1236K wait 0:00 0.00% sh
1014 root 1 5 0 1644K 236K ttyin 0:00 0.00% sh
9 root 1 -16 0 0K 8K psleep 0:00 0.00% pagedaemon
1013 root 1 8 0 1640K 232K wait 0:00 0.00% sh
155 root 1 96 0 488K 308K select 0:00 0.00% devd
10153 dhcpd 1 96 0 2136K 1804K select 0:00 0.00% dhcpd
Then I switched the polling back on (while keeping advanced->firewall off):
ftp get: 6548, 6496, 5880 kbytes/s (without webGUI accesses)
But "top" refreshes now work, and so does the webGUI, very slowly.
ftp get: 5244, 5118, 5063, 5177 kbytes/s with 1 webGUI system status refresh during the transfer.
ftp get (without top or a webGUI page open): 6606, 6617, 6152 kbytes/s

Here is the top output in the case without webGUI:
last pid: 26417;  load averages: 1.80, 1.70, 0.97    up 0+02:20:46  13:13:43
76 processes: 6 running, 58 sleeping, 12 waiting
CPU states: 5.7% user, 3.4% nice, 85.7% system, 5.3% interrupt, 0.0% idle
Mem: 12M Active, 11M Inact, 14M Wired, 13M Buf, 79M Free
Swap:

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
29 root 1 171 52 0K 8K RUN 46:39 45.61% idlepoll
14 root 1 -44 -163 0K 8K RUN 8:38 34.67% swi1: net
15 root 1 -16 0 0K 8K - 0:12 2.29% yarrow
28 root 1 171 52 0K 8K pgzero 0:46 0.54% pagezero
16379 root 1 96 0 2332K 1572K RUN 0:22 0.34% top
11 root 1 171 52 0K 8K RUN 64:56 0.00% idle
20 root 1 -68 -187 0K 8K WAIT 1:46 0.00% irq10: sis0 sis1+
1014 root 1 5 0 1648K 588K ttyin 1:09 0.00% sh
12 root 1 -32 -151 0K 8K WAIT 0:37 0.00% swi4: clock sio
21 root 1 -64 -183 0K 8K WAIT 0:10 0.00% irq14: ata0
723 root 1 4 -20 3152K 1384K kqread 0:05 0.00% lighttpd
4 root 1 -8 0 0K 8K - 0:04 0.00% g_down
1009 root 1 8 20 240K 140K nanslp 0:03 0.00% check_reload_status
3 root 1 -8 0 0K 8K - 0:03 0.00% g_up
2 root 1 -8 0 0K 8K - 0:03 0.00% g_event
16124 root 1 116 20 5564K 2660K select 0:03 0.00% sshd
935 root 1 8 -88 1328K 356K nanslp 0:02 0.00% watchdogd
729 root 1 8 0 7496K 848K wait 0:02 0.00% php
38 root 1 -16 0 0K 8K - 0:02 0.00% schedcpu
58 root 1 -8 0 0K 8K mdwait 0:02 0.00% md1
50 root 1 -8 0 0K 8K mdwait 0:01 0.00% md0
262 root 1 96 0 1344K 564K select 0:01 0.00% syslogd
31 root 1 20 0 0K 8K syncer 0:01 0.00% syncer
794 root 1 8 0 1636K 560K wait 0:01 0.00% sh
888 root 1 -8 0 0K 8K mdwait 0:01 0.00% md2
383 root 1 -58 0 3664K 472K bpf 0:01 0.00% tcpdump
728 root 1 8 0 7496K 848K wait 0:00 0.00% php
33 root 1 -16 0 0K 8K sdflus 0:00 0.00% softdepflush
32 root 1 -4 0 0K 8K vlruwt 0:00 0.00% vnlru
30 root 1 -16 0 0K 8K psleep 0:00 0.00% bufdaemon
853 proxy 1 4 0 656K 272K kqread 0:00 0.00% pftpx
867 proxy 1 4 0 656K 324K kqread 0:00 0.00% pftpx
25829 root 1 -8 20 1648K 1192K piperd 0:00 0.00% sh
937 root 1 8 0 1304K 520K nanslp 0:00 0.00% cron
1012 root 1 8 0 1568K 328K wait 0:00 0.00% login
16348 root 1 20 0 3452K 2424K pause 0:00 0.00% tcsh
1 root 1 8 0 568K 224K wait 0:00 0.00% init
384 root 1 -8 0 1196K 180K piperd 0:00 0.00% logger
16315 root 1 8 0 1648K 1236K wait 0:00 0.00% sh
9 root 1 -16 0 0K 8K psleep 0:00 0.00% pagedaemon
1013 root 1 8 0 1640K 232K wait 0:00 0.00% sh
155 root 1 96 0 488K 308K select 0:00 0.00% devd
25139 root 1 139 20 1388K 1008K select 0:00 0.00% dhclient
26298 root 1 8 0 1176K 448K nanslp 0:00 0.00% sleep
Finally, renicing lighttpd back to 0 didn't change the accessibility of the webGUI itself (it remains slow but accessible with polling on).
Conclusions:
- polling does not change throughput under overload conditions
- still no solution to keep the webGUI accessible when the firewall is on
- the priority of the firewalling task (swi1: net ?) seems very high, meaning that if it's busy fending off an attack, the rest of the system freezes… (and renice doesn't work on it, so I can't change its priority to test).
Thoughts? Hints? ???
-
Thanks for doing the comprehensive tests. Let me run it by a couple of FreeBSD developers and get their input.
-
Thanks for doing the comprehensive tests. Let me run it by a couple of FreeBSD developers and get their input.
Thanks. :) You're right, it looks like it could also affect FreeBSD in general…?! ??? …so it must certainly already have been discussed and solved. ;)
-
Solved? I doubt it. What I'd bet money people are going to say is that the box is underpowered for the task.
-
Solved? I doubt it. What I'd bet money people are going to say is that the box is underpowered for the task.
As you already understood ;) : I'm not looking at throughput performance, but at getting a system that doesn't hang under overload conditions.
They can try a 50 Mbit/s (*) load of 64-byte packets on any up-to-date multi-GHz machine and see if it stays alive ;D
Maybe then they will understand that it's not a question of horsepower but of security design. A DDoS attack can be a full line's worth of 64-byte packets… A firewall should be able to handle that without freezing, whatever the line speed and packet size.
(*) If my math is right (quick check below):
- 3000 kbytes/s of 1500-byte packets is about 2000 packets/s handled on a 266 MHz machine.
- That means 20'000 pps on a 2.6 GHz machine; say 40'000 to take caching and other effects into account. At 64 bytes per packet, that's only about a 20 Mbit/s limit for 64-byte-packet throughput before the system freezes.
[EDIT: Corrected my math, but the conclusions remain the same]
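The quick check, runnable from any shell (plain integer arithmetic, nothing pfSense-specific):

echo '3000000 / 1500' | bc            # 3000 kbytes/s of 1500-byte packets = 2000 pps
echo '40000 * 64 * 8 / 1000000' | bc  # 40'000 pps of 64-byte packets ≈ 20 Mbit/s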
-
Okay, please don't take this the wrong way, but you are preaching to the choir already. I am just telling you that I know the FreeBSD community well and know how they will react to this. The box is clearly underpowered.
-
No offense, but expecting a 266 MHz box with cheap, cheesy Realtek adapters to handle a 64-byte packet storm at 30 Mbit/s is ridiculous. The system is meant for 5 Mbit/s and less WAN use. When in doubt, overbuild, don't underbuild your firewalls. You don't have the luxury of super-optimized TCP offloading and handling by the NICs unless you're using nice Intel server-grade PCI-X gigabit adapters (which I do when I can).
The real issue with pps is the fact that you are memory/copy throttled. You can only make a dozen or so memory copies per packet so fast before you run out of internal machine bandwidth. It's not so much GHz as memory latency, bus latency, and per-packet processing overhead. If you have specialized packet processors that do nothing but offload TCP headers and minimize copy operations, you have something more in line with a Cisco / Juniper Networks box.
Just speaking from personal experience here: if you want balls-to-the-wall performance with 64-byte packets, overbuild the crap out of the box with an Opteron processor or two, really fast low-latency memory, and a PCI-X TCP-offloading NIC. I've pushed over 100 Mbit/s of 64-byte packets this way.
But yeah, the FreeBSD guys are going to laugh and say, get a bigger box.
The only way to fix this behavior at all is to study how many copies happen between packet input and output, then reduce that number. Or otherwise implement some packet-storm throttling / protection to damp down an attack.
But yeah, why would you expect a 266 MHz box to perform at 30 Mbit/s? Cisco puts 166-266 MHz processors in their T1-grade routers, with full TCP offload, and only rates them for 1-2 Mbit/s connections. Just FYI.
You won't ever reach overload conditions on a 2 Mbit/s pipe. If you're trying to push a 10 Mbit/s pipe, you need a bigger box, period.
-
Hi… I'm dealing with the same "problem", and therefore I ask:
How much hardware do you need to run 80+ Mbit/s throughput?
I'm getting max. 10 Mbit/s with my PIII 933 / 512 MB RAM and a 3Com 3C982 dual NIC.
If I try m0n0wall on the same HW, the result is 22 Mbit/s ??
-
Hi… I'm dealing with the same "problem", and therefore I ask:
How much hardware do you need to run 80+ Mbit/s throughput?
I'm getting max. 10 Mbit/s with my PIII 933 / 512 MB RAM and a 3Com 3C982 dual NIC.
If I try m0n0wall on the same HW, the result is 22 Mbit/s ??
Something is wrong here. I'm able to get about 87 Mbit/s throughput from one of my C3 1 GHz boxes, LAN->WAN, in factory default config. If I remove the pfSense box I only get about 2 Mbit/s more, and that is with crappy VIA Rhine onboard NICs. Your system should push much more.
-
Actually, throughput should not be measured in Mbit/s but in packets/s (pps), like in routers.
This puts the true performance in a better light, as the packet size usually matters less than the per-packet handling overhead.
Typically, with lots of 1500-byte packets one way and few 64-byte ACK packets the other way, if you are well below wirespeed you can take the Mbit/s figure and divide it roughly by 1500*8 to get the pps throughput, e.g. 22 Mbit/s ≈ 1'830 pps.
True throughput benchmarks are usually made with minimum-size 64-byte packets, and I would be curious to see some of those figures :)
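As a shell one-liner, for anyone redoing that conversion on their own figures:

echo '22 * 1000000 / (1500 * 8)' | bc   # 22 Mbit/s ≈ 1833 pps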
Please don't take this as criticism; pfSense is great work and runs great :)
-
Hmm, weird…
In bone-stock config I get 45 Mbit/s with m0n0 and 22 with pfSense…
Any ideas?
My HW is a P3 933 CPU on an MSI mainboard, 512 MB PC133 SDRAM, and a 3Com 3C982 dual NIC; well, it is the same with 2 Realtek 8139Cs…
-
FreeBSD 6.1 is much slower than 4.11. This is the reason m0n0wall was hesitant to switch initially. Why they want to switch now is beyond me, because it's going to slow down every installation.
I would suggest going back to m0n0wall.
-
Yea, but still, m0n0wall isn't doing the job either… I should be able to get 80+ Mbit/s throughput on my existing hardware…
And pfSense has some of the features I need.