[RESOLVED] pfctl using 100% CPU, preventing clean boot-up



  • I'm having a problem that only shows up when I have a lot of VLAN interfaces (~150) attached to a captive portal zone.

    With 20 VLAN interfaces on a captive portal, the box boots fine (it plays the shutdown chime and, 3 minutes later, the boot-up chime). However, as I add more VLAN interfaces to the captive portal zone, boot-up gets slower and slower. I've waited hours for pfctl to finish loading the rules from /tmp/rules.debug into the firewall.
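
    For anyone trying to reproduce this without a full config, something like the loop below should stand up a batch of test VLANs from a shell. This is an untested sketch: pfSense normally creates and names these itself (e.g. em1_vlan227), and the interfaces still have to be assigned and added to the captive portal zone before the problem shows up.

    #!/bin/sh
    # Rough sketch: create VLANs 227-255 on em1 using FreeBSD's
    # parent.tag shorthand. Adjust the range to scale the test.
    for tag in $(seq 227 255); do
        ifconfig "em1.${tag}" create
        ifconfig "em1.${tag}" up
    done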

    When I look at top -a, this is what I see:

    last pid:  3330;  load averages:  1.06,  1.08,  1.07                                                up 0+12:01:35  13:57:43
    41 processes:  2 running, 39 sleeping
    CPU: 25.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 75.0% idle
    Mem: 296M Active, 100M Inact, 188M Wired, 91M Buf, 2629M Free
    Swap: 8192M Total, 8192M Free
    
      PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
    34697 root        1 103    0   302M   292M CPU2    2   1:20  99.85% /sbin/pfctl -o basic -f /tmp/rules.debug
    35164 root        1  52   20 10592K  2648K wait    0   0:18   0.00% /bin/sh /var/db/rrd/updaterrd.sh
    27763 root        1  20    0 17180K 17212K select  0   0:07   0.00% /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/r
    10423 root        1  52    0 35540K 22412K piperd  3   0:04   0.00% php-fpm: pool lighty (php-fpm)
    50267 dhcpd       1  20    0 28636K 22968K select  1   0:04   0.00% [dhcpd]
    12083 root        1  52    0 35540K 22412K lockf   1   0:02   0.00% php-fpm: pool lighty (php-fpm)
    97675 root        1  20    0 10132K  1788K select  0   0:02   0.00% /usr/local/sbin/apinger -c /var/etc/apinger.conf
    

    So the next step was to run the '/sbin/pfctl -o basic -f /tmp/rules.debug' command manually, adding -vvv for verbosity. This is what happens:

    ...
    @227(0) scrub on em1_vlan227 all fragment reassemble
    @228(0) scrub on em1_vlan228 all fragment reassemble
    @229(0) scrub on em1_vlan229 all fragment reassemble
    @230(0) scrub on em1_vlan230 all fragment reassemble
    @231(0) scrub on em1_vlan231 all fragment reassemble
    @232(0) scrub on em1_vlan232 all fragment reassemble
    @233(0) scrub on em1_vlan233 all fragment reassemble
    @234(0) scrub on em1_vlan234 all fragment reassemble
    @235(0) scrub on em1_vlan235 all fragment reassemble
    @236(0) scrub on em1_vlan236 all fragment reassemble
    @237(0) scrub on em1_vlan237 all fragment reassemble
    @238(0) scrub on em1_vlan238 all fragment reassemble
    @239(0) scrub on em1_vlan239 all fragment reassemble
    @240(0) scrub on em1_vlan240 all fragment reassemble
    @241(0) scrub on em1_vlan241 all fragment reassemble
    @242(0) scrub on em1_vlan242 all fragment reassemble
    @243(0) scrub on em1_vlan243 all fragment reassemble
    @244(0) scrub on em1_vlan244 all fragment reassemble
    @245(0) scrub on em1_vlan245 all fragment reassemble
    @246(0) scrub on em1_vlan246 all fragment reassemble
    @247(0) scrub on em1_vlan247 all fragment reassemble
    @248(0) scrub on em1_vlan248 all fragment reassemble
    @249(0) scrub on em1_vlan249 all fragment reassemble
    @250(0) scrub on em1_vlan250 all fragment reassemble
    @251(0) scrub on em1_vlan251 all fragment reassemble
    @252(0) scrub on em1_vlan252 all fragment reassemble
    @253(0) scrub on em1_vlan253 all fragment reassemble
    @254(0) scrub on em1_vlan254 all fragment reassemble
    @255(0) scrub on em1_vlan255 all fragment reassemble
    @256(0) no nat proto carp all
    @258(0) nat-anchor "/*" all
    @259(0) nat-anchor "/*" all
    @260(0) nat on em0 inet from <tonatsubnets:0> to any port = isakmp -> 98.109.201.85 static-port
    @261(0) nat on em0 inet from <tonatsubnets:0> to any -> 98.109.201.85 port 1024:65535
    @257(0) no rdr proto carp all
    @262(0) rdr-anchor "/*" all
    @263(0) rdr-anchor "/*" all
    @264(0) rdr-anchor "miniupnpd" all
    

    For some reason it gets stuck right there and never completes. It seemed like it could be a ruleset optimization issue, so I changed '-o basic' to '-o none'.

    That prevents it from hanging and pegging the CPU at 100%. However, with optimization turned off, it takes 20 or so minutes to load the rules, which is far too long. So my two options right now are:

    1. Leave optimization at basic, and pfctl hangs (gets stuck for hours and never finishes).
    2. Turn optimization off, but then loading the ruleset takes 20 minutes every time a reload takes place.
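
    In case it helps, here is a rough way to compare the optimizer levels without committing the ruleset (a sketch only; it assumes pfctl's -n flag, which parses the rules without loading them, still runs the optimizer pass):

    #!/bin/sh
    # Time each optimizer level against the generated ruleset.
    # -n parses /tmp/rules.debug without loading it into the kernel.
    for level in none basic; do
        echo "== pfctl -o ${level} =="
        time /sbin/pfctl -o "${level}" -n -f /tmp/rules.debug
    done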

    Does anybody have ideas on how I can debug this? Has anyone experienced it before? I would upload the rules.debug file, but it's huge.

    Let me know if there's anything else I can provide to help debug.

    Thanks!


  • Netgate Administrator

    What pfSense version? What hardware are you running? CPU/RAM/NICs/drives, etc.
    Has this just started happening, or has it been slow since it was first installed?

    Steve



  • This is a fresh install of pfSense 2.2. I also tried 2.1.5, with the same results.

    The hardware should be more than adequate for this:

    [2.2-RC][admin@t31.localdomain]/root: sysctl -a | egrep -i 'hw.machine|hw.model|hw.ncpu'
    hw.machine: i386
    hw.model: Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz
    hw.ncpu: 4
    hw.machine_arch: i386
    
    [2.2-RC][admin@t31.localdomain]/root: vmstat 
     procs      memory      page                    disks     faults         cpu
     r b w     avm    fre   flt  re  pi  po    fr  sr md0 ad0   in   sy   cs us sy id
     1 0 0    649M  2686M  2889   0   0   7  2829 109   0   0   21 3221 1668 27  0 73
    
    


  • OK, I figured out why pfctl was hanging: one of the captive portal rules was too long. I'm working on a patch for /etc/inc/filter.inc that breaks the CP rules up into smaller chunks.
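
    If anyone wants to check for the same thing, this should point at the offender (a rough sketch; it assumes rules.debug keeps each rule on a single line):

    # Print the line number and length of the longest line in the ruleset.
    awk 'length > max { max = length; n = NR } END { print "longest rule: line " n ", " max " chars" }' /tmp/rules.debug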

    Just wanted to post this in case someone else runs into this thread with a similar problem.

