[RESOLVED] pfctl using 100% CPU, preventing clean boot-up



  • I'm having a problem that only shows up when I have a lot of VLAN interfaces (~150) attached to a captive portal zone.

    With 20 VLAN interfaces on a captive portal, the box boots fine (it plays the shutdown chime and, 3 minutes later, the boot-up chime). However, as I add more VLAN interfaces to the captive portal zone, boot-up gets slower and slower. I've waited hours for pfctl to finish loading the rules from /tmp/rules.debug into the firewall.
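
    For anyone trying to reproduce this without a full config, something like the loop below should stand up a batch of test VLANs from a shell. This is an untested sketch: pfSense normally creates and names these itself (e.g. em1_vlan227), and the interfaces still have to be assigned and added to the captive portal zone before the problem shows up.

    #!/bin/sh
    # Rough sketch: create VLANs 227-255 on em1 using FreeBSD's
    # parent.tag shorthand. Adjust the range to scale the test.
    for tag in $(seq 227 255); do
        ifconfig "em1.${tag}" create
        ifconfig "em1.${tag}" up
    done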

    When I look at top -a, this is what I see:

    last pid:  3330;  load averages:  1.06,  1.08,  1.07                                                up 0+12:01:35  13:57:43
    41 processes:  2 running, 39 sleeping
    CPU: 25.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 75.0% idle
    Mem: 296M Active, 100M Inact, 188M Wired, 91M Buf, 2629M Free
    Swap: 8192M Total, 8192M Free
    
      PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
    34697 root        1 103    0   302M   292M CPU2    2   1:20  99.85% /sbin/pfctl -o basic -f /tmp/rules.debug
    35164 root        1  52   20 10592K  2648K wait    0   0:18   0.00% /bin/sh /var/db/rrd/updaterrd.sh
    27763 root        1  20    0 17180K 17212K select  0   0:07   0.00% /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/r
    10423 root        1  52    0 35540K 22412K piperd  3   0:04   0.00% php-fpm: pool lighty (php-fpm)
    50267 dhcpd       1  20    0 28636K 22968K select  1   0:04   0.00% [dhcpd]
    12083 root        1  52    0 35540K 22412K lockf   1   0:02   0.00% php-fpm: pool lighty (php-fpm)
    97675 root        1  20    0 10132K  1788K select  0   0:02   0.00% /usr/local/sbin/apinger -c /var/etc/apinger.conf
    

    So the next step was to run the '/sbin/pfctl -o basic -f /tmp/rules.debug' command manually, adding -vvv for verbosity. This is what happens:

    ...
    @227(0) scrub on em1_vlan227 all fragment reassemble
    @228(0) scrub on em1_vlan228 all fragment reassemble
    @229(0) scrub on em1_vlan229 all fragment reassemble
    @230(0) scrub on em1_vlan230 all fragment reassemble
    @231(0) scrub on em1_vlan231 all fragment reassemble
    @232(0) scrub on em1_vlan232 all fragment reassemble
    @233(0) scrub on em1_vlan233 all fragment reassemble
    @234(0) scrub on em1_vlan234 all fragment reassemble
    @235(0) scrub on em1_vlan235 all fragment reassemble
    @236(0) scrub on em1_vlan236 all fragment reassemble
    @237(0) scrub on em1_vlan237 all fragment reassemble
    @238(0) scrub on em1_vlan238 all fragment reassemble
    @239(0) scrub on em1_vlan239 all fragment reassemble
    @240(0) scrub on em1_vlan240 all fragment reassemble
    @241(0) scrub on em1_vlan241 all fragment reassemble
    @242(0) scrub on em1_vlan242 all fragment reassemble
    @243(0) scrub on em1_vlan243 all fragment reassemble
    @244(0) scrub on em1_vlan244 all fragment reassemble
    @245(0) scrub on em1_vlan245 all fragment reassemble
    @246(0) scrub on em1_vlan246 all fragment reassemble
    @247(0) scrub on em1_vlan247 all fragment reassemble
    @248(0) scrub on em1_vlan248 all fragment reassemble
    @249(0) scrub on em1_vlan249 all fragment reassemble
    @250(0) scrub on em1_vlan250 all fragment reassemble
    @251(0) scrub on em1_vlan251 all fragment reassemble
    @252(0) scrub on em1_vlan252 all fragment reassemble
    @253(0) scrub on em1_vlan253 all fragment reassemble
    @254(0) scrub on em1_vlan254 all fragment reassemble
    @255(0) scrub on em1_vlan255 all fragment reassemble
    @256(0) no nat proto carp all
    @258(0) nat-anchor "/*" all
    @259(0) nat-anchor "/*" all
    @260(0) nat on em0 inet from <tonatsubnets:0> to any port = isakmp -> 98.109.201.85 static-port
    @261(0) nat on em0 inet from <tonatsubnets:0> to any -> 98.109.201.85 port 1024:65535
    @257(0) no rdr proto carp all
    @262(0) rdr-anchor "/*" all
    @263(0) rdr-anchor "/*" all
    @264(0) rdr-anchor "miniupnpd" all
    

    For some reason it gets stuck right there and never completes. It seemed like it could be a ruleset optimization issue, so I changed '-o basic' to '-o none'.

    That prevents it from hanging and pegging the CPU at 100%. However, with optimization turned off, it takes 20 or so minutes to load the rules, which is far too long. So my two options right now are:

    1. Leave optimization at basic, and pfctl hangs (gets stuck for hours and never finishes).
    2. Turn optimization off, but then loading the ruleset takes 20 minutes every time a reload takes place.
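
    In case it helps, here is a rough way to compare the optimizer levels without committing the ruleset (a sketch only; it assumes pfctl's -n flag, which parses the rules without loading them, still runs the optimizer pass):

    #!/bin/sh
    # Time each optimizer level against the generated ruleset.
    # -n parses /tmp/rules.debug without loading it into the kernel.
    for level in none basic; do
        echo "== pfctl -o ${level} =="
        time /sbin/pfctl -o "${level}" -n -f /tmp/rules.debug
    done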

    Does anybody have ideas on how I can debug this? Has anyone experienced it before? I would upload the rules.debug file, but it's huge.

    Let me know if there's anything else I can provide to help debug.

    Thanks!


  • Netgate Administrator

    What pfSense version? What hardware are you running? CPU/RAM/NICs/drives, etc.
    Has this just started happening, or has it been slow since it was first installed?

    Steve



  • This is a fresh install of pfSense 2.2. I also tried 2.1.5, with the same results.

    The hardware should be more than adequate for this:

    [2.2-RC][admin@t31.localdomain]/root: sysctl -a | egrep -i 'hw.machine|hw.model|hw.ncpu'
    hw.machine: i386
    hw.model: Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz
    hw.ncpu: 4
    hw.machine_arch: i386
    
    [2.2-RC][admin@t31.localdomain]/root: vmstat 
     procs      memory      page                    disks     faults         cpu
     r b w     avm    fre   flt  re  pi  po    fr  sr md0 ad0   in   sy   cs us sy id
     1 0 0    649M  2686M  2889   0   0   7  2829 109   0   0   21 3221 1668 27  0 73
    
    


  • OK, I figured out why pfctl was hanging: one of the captive portal rules was too long. I'm working on a patch for /etc/inc/filter.inc that breaks the CP rules up into smaller chunks.
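
    If anyone wants to check for the same thing, this should point at the offender (a rough sketch; it assumes rules.debug keeps each rule on a single line):

    # Print the line number and length of the longest line in the ruleset.
    awk 'length > max { max = length; n = NR } END { print "longest rule: line " n ", " max " chars" }' /tmp/rules.debug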

    Just wanted to post this in case someone else runs into this thread with a similar problem.

