Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working
-
I have purchased a brand-new Dell R430 server with the built-in Broadcom NetXtreme quad-port network card. bge0 and bge1 carry WAN1 and WAN2, 600 Mbps each, load balanced. bge2 is the captive portal interface. The only service running is the captive portal (voucher system), with up to 2000 users connected.
The problem is that the portal frequently stops working. Restarting services never helps; we have to restart the server, and then it works again. But then all 2000 users have to re-enter their vouchers to reconnect. Today I noticed that the interface itself is hanging: just disabling the portal interface and enabling it again makes it work.
Can someone tell me what is causing the port to hang? The server specs are given below.
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
20 CPUs: 1 package(s) x 10 core(s) x 2 hardware threads
AES-NI CPU Crypto: Yes (inactive)
RAM: 16GB
2.4.4-RELEASE-p3 (amd64)
built on Wed May 15 18:53:44 EDT 2019
FreeBSD 11.2-RELEASE-p10
-
What do you mean by "the interface itself is hanging"?
The next time you encounter this issue, could you log in to pfSense, run the command
ipfw table all list
(using Diagnostics > Command Prompt), take a screenshot of the Status > Captive Portal page, and then post the anonymized results here?
-
@free4 said in Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working:
ipfw table all list
I have figured out the issue: a cable technician created a loop in the network while installing a new switch in the building. The loop was disrupting the network and producing a large number of "Errors In" on the captive portal interface. The loop has been removed and everything is back to normal.
As you requested, I ran the command. A sample of the output is given below; there are more than 2000 entries.
--- table(cp_ifaces), set(0) ---
bge2 2100 762351592 613953125392 1563031392
--- table(campco_auth_up), set(0) ---
10.10.0.14/32 34:8b:75:46:de:75 0 146835 29536066 1563031391
10.10.0.16/32 a0:10:81:f7:53:13 0 472796 39421378 1563031392
10.10.0.18/32 c4:0b:cb:46:1f:71 0 23711 4117029 1563031390
10.10.0.19/32 58:c9:35:18:d1:fc 0 46230 8248466 1563031392
10.10.0.25/32 34:12:f9:64:37:c8 0 303492 23989321 1563031322
10.10.0.28/32 14:d1:69:39:20:26 0 230396 19411495 1563031383
10.10.0.29/32 a8:34:6a:3d:d9:9a 0 229645 18612800 1563031382
10.10.0.30/32 ac:cf:85:28:82:fd 0 282582 22431072 1563031392
10.10.0.206/32 84:55:a5:d3:13:f8 0 86282 7594169 1563030789
10.10.1.141/32 48:13:7e:a4:c2:56 0 125043 19849771 1563031282
10.10.1.228/32 28:27:bf:02:f9:61 0 36518 4524318 1563031301
10.10.2.29/32 80:4e:70:50:4d:26 0 438739 32961951 1563031390
10.10.2.43/32 a8:81:95:08:84:2c 0 935 173015 1563031392
10.10.2.59/32 88:70:8c:dd:f8:13 0 77127 9614889 1563031349
10.10.2.112/32 18:3a:2d:29:0f:eb 0 155636 16597812 1563029749
10.10.2.174/32 60:f1:89:4f:ad:e0 0 509779 42625581 1563031384
10.10.2.247/32 7c:f9:0e:fb:a8:b1 0 249171 30071088 1563031358
10.10.3.44/32 1c:23:2c:12:bb:31 0 4109 762733 1563029435
10.10.3.74/32 90:97:f3:03:8b:74 0 286687 28529815 1563031392
10.10.3.158/32 60:a4:d0:93:c7:76 0 202737 16625515 1563023094
10.10.3.163/32 80:ce:b9:7f:94:09 0 23629 3237202 1563026480
10.10.3.166/32 60:8f:5c:d6:c6:fa 0 5455 870722 1563031392
10.10.3.173/32 a8:81:95:8f:ee:32 0 150434 15159826 1563031328
10.10.3.236/32 00:12:36:55:37:41 0 70651 12760679 1563028289
10.10.4.0/32 b4:74:43:c6:37:af 0 19153 5920430 1563025061
10.10.4.22/32 bc:a5:8b:e0:f9:9a 0 208878 20037657 1563031392
10.10.4.79/32 60:af:6d:86:e7:11 0 183319 16732340 1563031351
10.10.4.144/32 44:78:3e:58:a9:ea 0 45302 4834766 1563031389
10.10.4.146/32 dc:74:a8:d4:fd:9b 0 499269 39491085 1563031389
-
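With more than 2000 entries, a listing like the one above is easier to compare between a healthy and a hung state once it is summarized. Below is a minimal Python sketch, not part of pfSense. It assumes each campco_auth_up row reads as address, MAC, value, packet count, byte count and last-seen Unix timestamp; that column reading is an assumption, not something confirmed in this thread. The function name summarize_auth_table is hypothetical.

```python
from datetime import datetime, timezone

# A few rows copied from the listing above, used as sample input.
SAMPLE = """\
--- table(cp_ifaces), set(0) ---
bge2 2100 762351592 613953125392 1563031392
--- table(campco_auth_up), set(0) ---
10.10.0.14/32 34:8b:75:46:de:75 0 146835 29536066 1563031391
10.10.0.16/32 a0:10:81:f7:53:13 0 472796 39421378 1563031392
10.10.3.158/32 60:a4:d0:93:c7:76 0 202737 16625515 1563023094
"""


def summarize_auth_table(text, table="campco_auth_up"):
    """Return (entry_count, total_bytes, oldest_timestamp) for one table."""
    header = f"--- table({table}),"
    in_table = False
    count = 0
    total_bytes = 0
    oldest = None
    for line in text.splitlines():
        if line.startswith("--- table("):
            # A new table section starts; remember whether it is ours.
            in_table = line.startswith(header)
            continue
        fields = line.split()
        if not in_table or len(fields) != 6:
            continue
        # Assumed layout: address, MAC, value, packets, bytes, timestamp.
        nbytes, stamp = int(fields[4]), int(fields[5])
        count += 1
        total_bytes += nbytes
        oldest = stamp if oldest is None else min(oldest, stamp)
    return count, total_bytes, oldest


count, total_bytes, oldest = summarize_auth_table(SAMPLE)
print(count, total_bytes, datetime.fromtimestamp(oldest, tz=timezone.utc))
```

Saving the command output to a file while the portal works and again while it hangs, then running both through the same function, gives a quick view of how many authenticated entries remain in each state.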
@free4 said in Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working:
ipfw table all list
The issue came back after 2 days, even though the loop had been removed. The network port is hanging again, and the only way to make it work is to disable and re-enable the interface. While the network is down, the "Errors In" counter keeps increasing. Once the interface is down, connecting remotely to the web GUI through the WAN interface gives "502 Bad Gateway". I captured the output of "ipfw table all list" while it was down:
ipfw table all list
--- table(cp_ifaces), set(0) ---
bge2 2100 6824547305 5704924933364 1563210003
--- table(campco_auth_up), set(0) ---
10.10.25.80/32 34:8b:75:d3:8a:5e 0 12580 3288176 1563210003
10.10.25.157/32 a0:cb:fd:0e:04:b8 0 22395 1907512 1563210003
--- table(campco_host_ips), set(0) ---
--- table(campco_pipe_mac), set(0) ---
--- table(campco_auth_down), set(0) ---
10.10.25.80/32 0 13409 3655823 1563210003
10.10.25.157/32 0 35937 45843082 1563210003
--- table(campco_allowed_up), set(0) ---
--- table(campco_allowed_down), set(0) ---
I have attached a screenshot of the interface status; you can see the errors on the portal interface.
-
"3547" => time to take a close look at the NIC, the cable, or the connected switch and/or AP, because something is faulty.
-
@Gertjan The cable has already been replaced with a 3M-branded patch cord, but the issue remains. There is a fiber link from the core Zyxel switch to another Zyxel switch; when we plug it in, the errors usually start increasing and the network goes down within a few hours. We have left that cable unplugged for up to 12 hours and it was fine, although the error count sometimes still increased by 2 or 3. We are still working out what the issue is.
-
@Gertjan Right now the network port is hung. If I try to ping any host from the portal interface, I get this error: "ping: sendto: Permission denied". If I disable and re-enable the interface, it works again.
-
Enter console mode, option 8.
Does the dmesg command (log) tell you anything?
-
@Gertjan Have a look at the dmesg logs:
sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (232 occurrences)
sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (426 occurrences)
arp: 10.10.12.254 moved from 1c:c3:eb:91:d3:f7 to 40:40:a7:5c:55:53 on bge2
sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (857 occurrences)
sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (945 occurrences)
sonewconn: pcb 0xfffff8035758b1d0: Listen queue overflow: 193 already in queue awaiting acceptance (1086 occurrences)
sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (274 occurrences)
sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (312 occurrences)
sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (649 occurrences)
sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (907 occurrences)
sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (107 occurrences)
-
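To see which socket overflows most often, the sonewconn lines above can be totalled per pcb address. This is a minimal Python sketch that assumes the dmesg output has been saved as text in exactly the format shown above; the name overflow_totals is hypothetical. As far as I know, the reported queue length of 193 is consistent with a listen backlog of 128, since FreeBSD lets the queue grow to about 1.5 times the backlog before logging this message.

```python
import re
from collections import Counter

# Matches the "Listen queue overflow" lines as they appear in the log above.
PATTERN = re.compile(
    r"sonewconn: pcb (0x[0-9a-f]+): Listen queue overflow: "
    r"(\d+) already in queue awaiting acceptance \((\d+) occurrences\)"
)

# Sample lines copied from the log above, including an unrelated arp line.
SAMPLE = """\
sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (232 occurrences)
arp: 10.10.12.254 moved from 1c:c3:eb:91:d3:f7 to 40:40:a7:5c:55:53 on bge2
sonewconn: pcb 0xfffff8035758b1d0: Listen queue overflow: 193 already in queue awaiting acceptance (1086 occurrences)
sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (426 occurrences)
"""


def overflow_totals(log_text):
    """Sum the reported occurrence counts per pcb address."""
    totals = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            pcb, _queued, occurrences = match.groups()
            totals[pcb] += int(occurrences)
    return totals


totals = overflow_totals(SAMPLE)
for pcb, n in totals.most_common():
    print(pcb, n)
```

The pcb address with the highest total is the one worth looking up first with netstat -Aan.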
Queuing happens when the receiving application cannot process incoming connections fast enough. So either you get too many connections or your application is too slow to handle them.
To see which application it is:
netstat -Aan | grep fffff80130a2f0f0
sockstat -l | grep socketname
netstat -Lan | grep 193
It will probably be the captive portal.
Basically, that error means the listening socket's accept queue is full: the application behind it can no longer keep up.
There are some tunables that you can check to solve the problem.
The NIC tuning guide also has a section for bge/bce:
https://docs.netgate.com/pfsense/en/latest/hardware/tuning-and-troubleshooting-network-cards.html
Some suggest increasing kern.ipc.soacceptqueue (previously named kern.ipc.somaxconn). Increase it gradually (the default is 128) until the problem disappears.
-
@kiokoman said in Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working:
netstat -Lan | grep 193
I followed your instructions and got output only from the first command:
netstat -Aan | grep fffff80130a2f0f0
fffff80130a2f0f0 stream 0 0 fffff80130715ce8 0 0 0 /var/run/php-fpm.socket
The other two commands produced no output.
Secondly, I have increased kern.ipc.soacceptqueue to 1024 and also applied the bge NIC tuning given in the docs by creating /boot/loader.conf.local:
kern.ipc.nmbclusters="131072"
hw.bge.tso_enable=0
hw.pci.enable_msix=0
I will keep an eye on the system to see if that makes any difference.
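A typo in a loader tunable is easy to miss, so the file contents can be checked against the intended values. This is a minimal Python sketch under the assumption that the file uses plain key=value lines with optional double quotes, as shown above; the names parse_loader_conf and missing_or_different are hypothetical. Note that kern.ipc.soacceptqueue is not covered here, because the post above does not say where it was set.

```python
# The three tunables listed in the post above.
EXPECTED = {
    "kern.ipc.nmbclusters": "131072",
    "hw.bge.tso_enable": "0",
    "hw.pci.enable_msix": "0",
}


def parse_loader_conf(text):
    """Parse key=value lines, ignoring # comments and surrounding quotes."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def missing_or_different(text, expected=EXPECTED):
    """Return the expected keys whose value is absent or differs."""
    found = parse_loader_conf(text)
    return {k: found.get(k) for k, v in expected.items() if found.get(k) != v}


SAMPLE = '''kern.ipc.nmbclusters="131072"
hw.bge.tso_enable=0
hw.pci.enable_msix=0
'''
print(missing_or_different(SAMPLE))
```

An empty result means all three tunables are present in the file with the intended values. Loader tunables only take effect after a reboot.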
-
the second command in this case is
sockstat -l | grep /var/run/php-fpm.socket
as an example for me the result is
root pfctl 51866 12 stream /var/run/php-fpm.socket
root pfctl 51866 13 stream /var/run/php-fpm.socket
root sleep 37398 12 stream /var/run/php-fpm.socket
root sleep 37398 13 stream /var/run/php-fpm.socket
dhcpd dhcpd 50501 12 stream /var/run/php-fpm.socket
dhcpd dhcpd 50501 13 stream /var/run/php-fpm.socket
root php-fpm 29258 13 stream /var/run/php-fpm.socket
root lighttpd_l 64766 12 stream /var/run/php-fpm.socket
root lighttpd_l 64766 13 stream /var/run/php-fpm.socket
dhcpd dhcpd 10592 12 stream /var/run/php-fpm.socket
dhcpd dhcpd 10592 13 stream /var/run/php-fpm.socket
root dpinger 74966 12 stream /var/run/php-fpm.socket
root dpinger 74966 13 stream /var/run/php-fpm.socket
root dpinger 74410 12 stream /var/run/php-fpm.socket
root dpinger 74410 13 stream /var/run/php-fpm.socket
root dpinger 74320 12 stream /var/run/php-fpm.socket
root dpinger 74320 13 stream /var/run/php-fpm.socket
root php-fpm 85281 13 stream /var/run/php-fpm.socket
squid squid 37275 12 stream /var/run/php-fpm.socket
squid squid 37275 13 stream /var/run/php-fpm.socket
root squid 36287 12 stream /var/run/php-fpm.socket
root squid 36287 13 stream /var/run/php-fpm.socket
root php-fpm 24514 13 stream /var/run/php-fpm.socket
root php-fpm 340 13 stream /var/run/php-fpm.socket
root php-fpm 339 13 stream /var/run/php-fpm.socket
root php-fpm 338 15 stream /var/run/php-fpm.socket
so we can confirm that the socket is used by squid dpinger dhcpd etc etc
well keep us updated
-
@kiokoman said in Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working:
so we can confirm that the socket is used by squid dpinger dhcpd etc etc
Correct.
All these programs use or call scripts that are PHP based.
-
@kiokoman said in Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working:
sockstat -l | grep /var/run/php-fpm.socket
got this output
root sleep 99245 12 stream /var/run/php-fpm.socket
root sleep 99245 13 stream /var/run/php-fpm.socket
root sleep 97728 12 stream /var/run/php-fpm.socket
root sleep 97728 13 stream /var/run/php-fpm.socket
unbound unbound 20075 12 stream /var/run/php-fpm.socket
unbound unbound 20075 13 stream /var/run/php-fpm.socket
root php-fpm 83461 13 stream /var/run/php-fpm.socket
root php-fpm 83358 13 stream /var/run/php-fpm.socket
root php-fpm 83301 13 stream /var/run/php-fpm.socket
root php-fpm 83152 15 stream /var/run/php-fpm.socket
root nginx 55388 12 stream /var/run/php-fpm.socket
root nginx 55388 13 stream /var/run/php-fpm.socket
root nginx 55234 12 stream /var/run/php-fpm.socket
root nginx 55234 13 stream /var/run/php-fpm.socket
root nginx 54950 12 stream /var/run/php-fpm.socket
root nginx 54950 13 stream /var/run/php-fpm.socket
root nginx 54735 12 stream /var/run/php-fpm.socket
root nginx 54735 13 stream /var/run/php-fpm.socket
root nginx 54498 12 stream /var/run/php-fpm.socket
root nginx 54498 13 stream /var/run/php-fpm.socket
root nginx 54344 12 stream /var/run/php-fpm.socket
root nginx 54344 13 stream /var/run/php-fpm.socket
root nginx 54164 12 stream /var/run/php-fpm.socket
root nginx 54164 13 stream /var/run/php-fpm.socket
root nginx 53913 12 stream /var/run/php-fpm.socket
root nginx 53913 13 stream /var/run/php-fpm.socket
root nginx 53902 12 stream /var/run/php-fpm.socket
root nginx 53902 13 stream /var/run/php-fpm.socket
root nginx 53685 12 stream /var/run/php-fpm.socket
root nginx 53685 13 stream /var/run/php-fpm.socket
root nginx 53620 12 stream /var/run/php-fpm.socket
root nginx 53620 13 stream /var/run/php-fpm.socket
root sh 94097 12 stream /var/run/php-fpm.socket
root sh 94097 13 stream /var/run/php-fpm.socket
root sh 93897 12 stream /var/run/php-fpm.socket
root sh 93897 13 stream /var/run/php-fpm.socket
root dpinger 83648 12 stream /var/run/php-fpm.socket
root dpinger 83648 13 stream /var/run/php-fpm.socket
root dpinger 83295 12 stream /var/run/php-fpm.socket
root dpinger 83295 13 stream /var/run/php-fpm.socket
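A long sockstat listing like the one above is easier to read as a count of distinct processes per command. This is a minimal Python sketch that assumes the usual sockstat column order (user, command, PID, file descriptor, protocol, local address) with the header line already removed by grep. It works on whitespace-separated tokens, so it also copes with output that was pasted as one long line. The name holders_by_command is hypothetical.

```python
from collections import Counter


def holders_by_command(sockstat_text):
    """Count distinct PIDs per command among rows of six fields each."""
    tokens = sockstat_text.split()
    # Ignore a trailing partial row if the paste was cut off.
    usable = len(tokens) - len(tokens) % 6
    pids = {}
    for i in range(0, usable, 6):
        _user, command, pid, _fd, _proto, _path = tokens[i:i + 6]
        pids.setdefault(command, set()).add(pid)
    return Counter({cmd: len(p) for cmd, p in pids.items()})


# A few rows copied from the output above.
SAMPLE = """\
root sleep 99245 12 stream /var/run/php-fpm.socket
root sleep 99245 13 stream /var/run/php-fpm.socket
root nginx 55388 12 stream /var/run/php-fpm.socket
root nginx 55234 12 stream /var/run/php-fpm.socket
root php-fpm 83461 13 stream /var/run/php-fpm.socket
"""
print(holders_by_command(SAMPLE).most_common())
```

Counting PIDs rather than rows avoids double counting processes that hold the socket on two file descriptors.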
-
OK, nginx is the web server for the captive portal. Check whether the problem still occurs.
-
@kiokoman The system worked without any problem for 28 days under heavy load (2000-plus users). Suddenly it started getting "Errors In" on the portal interface, and three times the system went down: the portal interface simply stops passing traffic in or out. Disabling and re-enabling it makes it work for a day or two, and then it happens again.
I am not changing any settings, yet when one service is working another goes down. Now the captive portal is stuck on the boot screen and does not let the system load the startup menu. I haven't changed any settings in the captive portal.
-
So now we have random services that go down and the captive portal is stuck? I would try fsck on the filesystem to see if it helps, and a memtest.
-
@Gertjan Is there a way to reinstall only the captive portal package? The captive portal service shows as running when started from the Dashboard, but if I try to save the captive portal configuration it keeps waiting and finally gives a "504 Gateway Time-out" error. Users are connected directly, as if on the LAN, and the portal no longer seems to be working.
-
The captive portal isn't a package.
It's using:
Another instance of nginx running on 127.0.0.1 port 80 (and 443 if you use https).
This nginx instance uses a landing (PHP) page.
A helper script, /etc/inc/captiveportal.inc.
Activating the portal also activates the firewall program called "ipfw", which can handle MAC addresses on the captive portal interface.
Actually, a captive portal as implemented by pfSense is pretty simple. No special processes.
You run the captive portal on a dedicated interface called OPTx, right?
Swap LAN and this OPTx interface and see if the issue persists.
-
You can also check Status / System Logs / Captive Portal Auth for any errors.