High ping times when Captive Portal is enabled.
-
Hardware information:
CPU Type: Intel(R) Xeon(R) CPU E3-1280 v3 @ 3.60GHz
Current: 3600 MHz, Max: 3601 MHz
8 CPUs: 1 package(s) x 4 core(s) x 2 SMT threads
16 GB RAM
Interfaces used: 10 Gbit dual-port Broadcom copper NIC
13 VLANs are configured on the captive portal
pfSense version: 2.3.2
First, I wanted to say thank you to anyone who can shed any light on this issue.
The issue I have: I'm running an Intel Xeon with 4 physical cores and 4 virtual SMT cores, and when I have the captive portal enabled the GUI becomes unresponsive (502 Bad Gateway error), the CPU temps go up almost 20°C, and ping times to Google go from 9 ms to over 200 ms.
The server is connected to a 1 Gb fiber circuit; the load with CP on is around 500 Mbps, and with it turned off the total traffic load is around 700 Mbps.
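For reference, I'm measuring this with a plain ping from a client behind the firewall, along the lines of the following (8.8.8.8 is just my stand-in for "Google" here):
ping -c 20 8.8.8.8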
When I look at the system activity, this is what I see with the portal turned on:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0K 128K RUN 4 580:42 99.76% [idle{idle: cpu4}]
11 root 155 ki31 0K 128K CPU5 5 581:40 99.66% [idle{idle: cpu5}]
11 root 155 ki31 0K 128K CPU6 6 581:44 94.97% [idle{idle: cpu6}]
11 root 155 ki31 0K 128K CPU7 7 582:13 92.29% [idle{idle: cpu7}]
12 root -92 - 0K 608K RUN 1 319:47 66.70% [intr{irq266: bxe0:fp0}]
12 root -92 - 0K 608K RUN 0 322:12 61.18% [intr{irq265: bxe0:fp0}]
12 root -92 - 0K 608K RUN 3 323:51 59.47% [intr{irq268: bxe0:fp0}]
12 root -92 - 0K 608K RUN 2 319:26 55.18% [intr{irq267: bxe0:fp0}]
12 root -92 - 0K 608K CPU2 2 213:22 46.58% [intr{irq272: bxe1:fp0}]
12 root -92 - 0K 608K CPU3 3 206:58 39.16% [intr{irq273: bxe1:fp0}]
12 root -92 - 0K 608K CPU1 1 204:19 39.16% [intr{irq271: bxe1:fp0}]
12 root -92 - 0K 608K CPU0 0 213:37 38.48% [intr{irq270: bxe1:fp0}]
65361 root 52 0 280M 54080K piperd 5 0:05 9.77% php-fpm: pool nginx (php-fpm)
34927 root 76 0 12276K 5392K RUN 3 0:00 7.28% /sbin/sysctl -a
35002 root 29 0 18740K 2292K piperd 7 0:00 6.98% grep temperature
62347 root 45 0 280M 52044K piperd 7 0:03 6.79% php-fpm: pool nginx (php-fpm)
34745 root 47 0 17000K 2484K wait 4 0:00 6.79% sh -c /sbin/sysctl -a | grep temperatur
11 root 155 ki31 0K 128K RUN 0 68:44 3.96% [idle{idle: cpu0}]

From this I can see four of the cores are not really doing much, so I'm not sure why it would be sluggish or cause the GUI to crash.
Any help would be much appreciated.
Thank you.
-
last pid: 37334; load averages: 0.45, 1.85, 4.22 up 1+03:39:06 23:22:33
209 processes: 9 running, 162 sleeping, 38 waiting
Mem: 75M Active, 790M Inact, 639M Wired, 3257M Buf, 14G Free
Swap: 32G Total, 32G Free

PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0K 128K CPU7 7 26.6H 100.00% [idle{idle: cpu7}]
11 root 155 ki31 0K 128K CPU6 6 26.5H 100.00% [idle{idle: cpu6}]
11 root 155 ki31 0K 128K CPU5 5 26.5H 100.00% [idle{idle: cpu5}]
11 root 155 ki31 0K 128K RUN 4 26.5H 99.37% [idle{idle: cpu4}]
11 root 155 ki31 0K 128K CPU0 0 647:14 98.29% [idle{idle: cpu0}]
11 root 155 ki31 0K 128K CPU1 1 675:11 97.36% [idle{idle: cpu1}]
11 root 155 ki31 0K 128K CPU3 3 680:18 95.90% [idle{idle: cpu3}]
11 root 155 ki31 0K 128K CPU2 2 661:39 95.75% [idle{idle: cpu2}]
30617 root 22 0 25132K 12060K select 7 37:34 4.88% /usr/local/sbin/miniupnpd -f /var/etc/m
12 root -92 - 0K 608K WAIT 1 599:09 2.49% [intr{irq266: bxe0:fp0}]
12 root -92 - 0K 608K WAIT 0 598:25 2.20% [intr{irq265: bxe0:fp0}]
12 root -92 - 0K 608K WAIT 2 603:58 1.95% [intr{irq267: bxe0:fp0}]
12 root -92 - 0K 608K WAIT 3 591:06 1.56% [intr{irq268: bxe0:fp0}]
12 root -92 - 0K 608K WAIT 0 391:23 1.27% [intr{irq270: bxe1:fp0}]
12 root -92 - 0K 608K WAIT 2 386:54 1.17% [intr{irq272: bxe1:fp0}]
12 root -92 - 0K 608K WAIT 3 381:04 1.17% [intr{irq273: bxe1:fp0}]
12 root -92 - 0K 608K WAIT 1 379:19 1.07% [intr{irq271: bxe1:fp0}]
35499 root 21 0 280M 48420K piperd 4 0:00 0.49% php-fpm: pool nginx (php-fpm)

This is with the portal off.
-
After messing around with the portal settings, it seems to be doing much better with individual bandwidth per user turned off. Has anyone else run into this?
-
Never mind, same issue with the portal enabled.
I have now changed the maximum concurrent connections to 3 and checked "Disable Concurrent user logins".
Will update if anything changes.
-
Looks like an interrupt storm to me; those are not always easy to fix.
Try the NIC tuning wiki: https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards
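If you want to confirm it first, you can watch the per-IRQ interrupt rates from a shell with the standard FreeBSD tool, e.g.:
vmstat -i | grep bxe
A storm shows up as a rate of many thousands of interrupts per second on one or two lines.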
-
Thank you very much for replying!
I am running Broadcom 10 Gb NICs, so I implemented this fix:
Broadcom bce(4) Cards
Several users have noted issues with certain Broadcom network cards, especially those built into Dell hardware. If the bce cards in the firewall are behaving erratically, dropping packets, or causing system crashes, then the following tweaks may help, especially on amd64.

In /boot/loader.conf.local, add the following (or create the file if it does not exist):
kern.ipc.nmbclusters="131072"
hw.bce.tso_enable=0
hw.pci.enable_msix=0

I did change bce to bxe, which is what my cards are labeled as.
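In case the bxe driver doesn't honor that renamed tunable (I'm not certain it does, since the wiki entry is written for bce), I can also turn TSO off per interface at runtime for testing with plain ifconfig(8):
ifconfig bxe0 -tso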
I will monitor it and see if we get any improvements. Thank you again!
-
Monitored overnight with very little change. The only issue was that I never rebooted the servers after making the change, so I have done that now and will monitor again.
-
After rebooting I got the 502 error and had to use console option 16 to restart PHP-FPM.
-
Last night went very well after the reboot: I was getting a constant ping to Google between 9-10 ms.
CPU utilization was between 15-20%.
Temperatures on the cores were between 69-71°C.
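For the record, those temperatures are the per-core sensor readings; a quick way to spot-check one from the shell, assuming the coretemp(4) module is loaded:
sysctl dev.cpu.0.temperature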
So far, what I have done for this issue:
I applied the /boot/loader.conf.local NIC fix that Heper had suggested.
In the portal I have maximum concurrent connections set to 3.
And I had also checked "Disable Concurrent user logins":
"If enabled only the most recent login per username will be active. Subsequent logins will cause machines previously logged in with the same username to be disconnected."

But I think this last one forced my users to keep re-registering, so I turned it back off.
I will be making the changes to another server that had the same issue, and again I will report back; if it solves the problem I will mark the post SOLVED!
-
Still having issues; it seems to struggle with anything over 400 Mbps. I'm going to install a second server to drop the VLANs down to 6 per server.
I keep getting these errors too: nginx: [alert] 75850#100188: send() failed (40: Message too long)
-
I guess my question would be: what is the captive portal changing that makes interrupts go crazy? Do you have it set to do anything special per user?
-
Users authenticate to a RADIUS server, and then in the portal I have bandwidth per user configured; other than that, nothing special.
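As I understand it, those per-user limits are implemented as dummynet pipes via ipfw (which would match the [kernel{dummynet}] thread in my top output); if it's useful, they can be listed from a shell with:
ipfw pipe show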
-
Try disabling bandwidth shaping just to see if it makes a difference.
-
I did that; I just disabled the bandwidth-per-user option and it was still the same.
So far I have split the site into two: one of the servers' temps are running at 60°C, and the second server is now at 80°C, so I'm really wondering if I have a device in one of the VLANs that is just hammering the portal.
Here is the CPU activity on both servers. By the way, both servers are identical; the only difference is the VLAN numbers. The hardware etc. is the exact same, and the config was a clone.
This server is running well and temps are at 60°C:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0K 128K RUN 2 75.6H 100.00% [idle{idle: cpu2}]
11 root 155 ki31 0K 128K CPU3 3 75.6H 100.00% [idle{idle: cpu3}]
11 root 155 ki31 0K 128K CPU5 5 75.5H 100.00% [idle{idle: cpu5}]
11 root 155 ki31 0K 128K CPU1 1 81.2H 99.56% [idle{idle: cpu1}]
11 root 155 ki31 0K 128K CPU4 4 75.5H 98.78% [idle{idle: cpu4}]
11 root 155 ki31 0K 128K CPU7 7 76.7H 97.27% [idle{idle: cpu7}]
11 root 155 ki31 0K 128K CPU6 6 75.5H 97.27% [idle{idle: cpu6}]
11 root 155 ki31 0K 128K CPU0 0 41.0H 53.27% [idle{idle: cpu0}]
12 root -92 - 0K 480K WAIT 0 36.9H 35.25% [intr{irq264: bxe0}]
12 root -92 - 0K 480K WAIT 0 24.5H 22.75% [intr{irq265: bxe1}]
77899 root 28 0 276M 48848K piperd 2 0:00 1.07% php-fpm: pool nginx (php-fpm)
35775 root 20 0 29228K 15472K select 6 82:21 0.39% /usr/local/sbin/miniupnpd -f /var/etc/m
0 root -92 - 0K 272K - 5 148:13 0.00% [kernel{dummynet}]
12 root -88 - 0K 480K WAIT 7 8:38 0.00% [intr{irq267: ahci0}]
6783 root 20 0 14508K 2316K select 4 3:19 0.00% /usr/sbin/syslogd -s -c -c -l /var/dhcp
12 root -60 - 0K 480K WAIT 3 2:28 0.00% [intr{swi4: clock}]
70215 unbound 20 0 239M 152M kqread 7 2:01 0.00% /usr/local/sbin/unbound -c /var/unbound
70215 unbound 20 0 239M 152M kqread 5 2:01 0.00% /usr/local/sbin/unbound -c /var/unbound

This server is running at 80°C:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0K 128K CPU7 7 31.8H 100.00% [idle{idle: cpu7}]
11 root 155 ki31 0K 128K CPU6 6 31.7H 100.00% [idle{idle: cpu6}]
11 root 155 ki31 0K 128K CPU5 5 31.6H 100.00% [idle{idle: cpu5}]
11 root 155 ki31 0K 128K RUN 4 31.6H 100.00% [idle{idle: cpu4}]
11 root 155 ki31 0K 128K CPU1 1 24.1H 71.29% [idle{idle: cpu1}]
11 root 155 ki31 0K 128K CPU0 0 23.8H 70.26% [idle{idle: cpu0}]
11 root 155 ki31 0K 128K CPU2 2 23.6H 70.07% [idle{idle: cpu2}]
11 root 155 ki31 0K 128K CPU3 3 23.1H 65.87% [idle{idle: cpu3}]
12 root -92 - 0K 608K WAIT 0 297:07 23.88% [intr{irq265: bxe0:fp0}]
12 root -92 - 0K 608K WAIT 3 361:53 20.65% [intr{irq268: bxe0:fp0}]
12 root -92 - 0K 608K WAIT 1 318:48 17.58% [intr{irq266: bxe0:fp0}]
12 root -92 - 0K 608K WAIT 2 334:15 16.36% [intr{irq267: bxe0:fp0}]
12 root -92 - 0K 608K WAIT 2 195:49 15.58% [intr{irq272: bxe1:fp0}]
12 root -92 - 0K 608K WAIT 3 197:09 15.09% [intr{irq273: bxe1:fp0}]
12 root -92 - 0K 608K WAIT 1 182:28 11.67% [intr{irq271: bxe1:fp0}]
12 root -92 - 0K 608K WAIT 0 197:09 7.86% [intr{irq270: bxe1:fp0}]
80217 root 52 0 276M 47392K piperd 5 0:01 3.66% php-fpm: pool nginx (php-fpm)
0 root -92 - 0K 368K - 0 42:01 0.00% [kernel{dummynet}]

Not sure what the two unbound entries at the bottom of the first results are.
So far ping times are way better, so splitting the VLANs across two servers definitely helped; I'm just curious as to why one is at 60°C and the other is at 80°C.
Also, the server now at 60°C is the original server that I posted about, so that's the first set of results above.
-
"Unbound" is a play on "Bind", another DNS server.
I guess I'm with you, wondering if something is hammering the server when the portal is enabled. Try a packet dump.
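Something like the following from a shell, writing to a file you can pull into Wireshark (the interface name and packet count are just examples):
tcpdump -ni bxe0 -c 5000 -w /tmp/portal.pcap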