[RESOLVED] pfctl using 100% CPU, preventing clean boot-up
-
I'm having a problem that only occurs when I have a large number of interfaces (~150) attached to a captive portal zone.
If I have 20 VLAN interfaces on a captive portal, the box boots fine (it plays the shutdown chime, and 3 minutes later it plays the boot-up chime). However, as I add more VLAN interfaces to the captive portal zone, boot-up becomes slower and slower. I've waited hours for pfctl to finish loading the rules from /tmp/rules.debug into the firewall.
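A back-of-the-envelope sketch of why this might scale so badly. Both assumptions here are mine and unconfirmed: that pfSense emits a similar block of rules per captive-portal VLAN, and that the basic optimizer does pairwise work across rules. The numbers are purely illustrative:

```python
# Toy model (all numbers hypothetical): rules grow linearly with VLAN count,
# but pairwise optimizer work grows quadratically with rule count.
def ruleset_size(vlans, rules_per_vlan=10):
    # Assumption: each captive-portal VLAN adds a fixed block of rules.
    return vlans * rules_per_vlan

def pairwise_work(n_rules):
    # Comparisons needed to check every rule against every other rule.
    return n_rules * (n_rules - 1) // 2

small, big = ruleset_size(20), ruleset_size(150)
print(small, big)                                   # 200 1500
print(pairwise_work(big) // pairwise_work(small))   # 7.5x the rules -> ~56x the work
```

If something like this is going on, it would explain why 20 VLANs load quickly while 150 appear to hang.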
When I look at top -a, this is what I see:
last pid: 3330;  load averages: 1.06, 1.08, 1.07    up 0+12:01:35  13:57:43
41 processes:  2 running, 39 sleeping
CPU: 25.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 75.0% idle
Mem: 296M Active, 100M Inact, 188M Wired, 91M Buf, 2629M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
34697 root        1 103    0   302M   292M CPU2    2   1:20 99.85% /sbin/pfctl -o basic -f /tmp/rules.debug
35164 root        1  52   20 10592K  2648K wait    0   0:18  0.00% /bin/sh /var/db/rrd/updaterrd.sh
27763 root        1  20    0 17180K 17212K select  0   0:07  0.00% /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/r
10423 root        1  52    0 35540K 22412K piperd  3   0:04  0.00% php-fpm: pool lighty (php-fpm)
50267 dhcpd       1  20    0 28636K 22968K select  1   0:04  0.00% [dhcpd]
12083 root        1  52    0 35540K 22412K lockf   1   0:02  0.00% php-fpm: pool lighty (php-fpm)
97675 root        1  20    0 10132K  1788K select  0   0:02  0.00% /usr/local/sbin/apinger -c /var/etc/apinger.conf
So the next step was to manually execute '/sbin/pfctl -o basic -f /tmp/rules.debug', adding -vvv for verbosity. This is what happens:
...
@227(0) scrub on em1_vlan227 all fragment reassemble
@228(0) scrub on em1_vlan228 all fragment reassemble
@229(0) scrub on em1_vlan229 all fragment reassemble
@230(0) scrub on em1_vlan230 all fragment reassemble
@231(0) scrub on em1_vlan231 all fragment reassemble
@232(0) scrub on em1_vlan232 all fragment reassemble
@233(0) scrub on em1_vlan233 all fragment reassemble
@234(0) scrub on em1_vlan234 all fragment reassemble
@235(0) scrub on em1_vlan235 all fragment reassemble
@236(0) scrub on em1_vlan236 all fragment reassemble
@237(0) scrub on em1_vlan237 all fragment reassemble
@238(0) scrub on em1_vlan238 all fragment reassemble
@239(0) scrub on em1_vlan239 all fragment reassemble
@240(0) scrub on em1_vlan240 all fragment reassemble
@241(0) scrub on em1_vlan241 all fragment reassemble
@242(0) scrub on em1_vlan242 all fragment reassemble
@243(0) scrub on em1_vlan243 all fragment reassemble
@244(0) scrub on em1_vlan244 all fragment reassemble
@245(0) scrub on em1_vlan245 all fragment reassemble
@246(0) scrub on em1_vlan246 all fragment reassemble
@247(0) scrub on em1_vlan247 all fragment reassemble
@248(0) scrub on em1_vlan248 all fragment reassemble
@249(0) scrub on em1_vlan249 all fragment reassemble
@250(0) scrub on em1_vlan250 all fragment reassemble
@251(0) scrub on em1_vlan251 all fragment reassemble
@252(0) scrub on em1_vlan252 all fragment reassemble
@253(0) scrub on em1_vlan253 all fragment reassemble
@254(0) scrub on em1_vlan254 all fragment reassemble
@255(0) scrub on em1_vlan255 all fragment reassemble
@256(0) no nat proto carp all
@258(0) nat-anchor "/*" all
@259(0) nat-anchor "/*" all
@260(0) nat on em0 inet from <tonatsubnets:0> to any port = isakmp -> 98.109.201.85 static-port
@261(0) nat on em0 inet from <tonatsubnets:0> to any -> 98.109.201.85 port 1024:65535
@257(0) no rdr proto carp all
@262(0) rdr-anchor "/*" all
@263(0) rdr-anchor "/*" all
@264(0) rdr-anchor "miniupnpd" all
For some reason it gets stuck there and never completes. Since this looked like a ruleset-optimization issue, I changed '-o basic' to '-o none'.
That prevents it from hanging and consuming 100% CPU. However, with optimization turned off, it takes 20 or so minutes to load the rules, which is far too long. So my two options right now are:
1. Leave optimization at basic, and it gets stuck for hours.
2. Turn optimization off, but then it takes 20 minutes to load the ruleset every time a reload takes place.

Does anybody have ideas on how I can debug this? Has anyone experienced this before? I would upload the rules.debug file, but it's huge.
Let me know if there's anything else I can provide to help debug.
Thanks!
-
What pfSense version? What hardware are you running? CPU/RAM/NICs/Drives etc.
Has this just started happening, or was it slow from the first install?

Steve
-
This is a fresh install of pfSense 2.2. I tried 2.1.5 with the same results.
The hardware should be more than adequate for this:
[2.2-RC][admin@t31.localdomain]/root: sysctl -a | egrep -i 'hw.machine|hw.model|hw.ncpu'
hw.machine: i386
hw.model: Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz
hw.ncpu: 4
hw.machine_arch: i386
[2.2-RC][admin@t31.localdomain]/root: vmstat
 procs      memory      page                    disks     faults         cpu
 r b w     avm    fre   flt  re  pi  po    fr  sr md0 ad0   in   sy   cs us sy id
 1 0 0    649M  2686M  2889   0   0   7  2829 109   0   0   21 3221 1668 27  0 73
-
Ok, I figured out why pfctl was hanging: one of the captive portal rules was too long. I'm working on a patch to break the CP rules up into smaller chunks in /etc/inc/filter.inc.
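The real patch is PHP in /etc/inc/filter.inc, so this is only a language-neutral sketch of the chunking idea in Python; the function names, chunk size, and rule syntax are invented for illustration, not taken from the actual patch:

```python
def chunk(items, max_per_rule=100):
    """Split a long list of rule members (e.g. captive-portal addresses)
    into fixed-size chunks, so each generated pf rule stays short instead
    of one enormous rule that pfctl's optimizer chokes on."""
    return [items[i:i + max_per_rule] for i in range(0, len(items), max_per_rule)]

# 350 hypothetical entries -> 4 shorter rules instead of 1 giant one.
members = ["10.0.%d.%d" % (i // 256, i % 256) for i in range(350)]
rules = ["pass from { %s }" % ", ".join(c) for c in chunk(members)]
print(len(rules))                 # 4
print(len(chunk(members)[-1]))    # 50 entries in the last, partial chunk
```

The total rule count barely grows, but no single rule exceeds the size that triggers the hang.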
Just wanted to post this in case someone else runs into this thread with a similar problem.