Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working



  • @free4 said in Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working:

    ipfw table all list

    I have figured out the issue: a cable technician created a network loop while installing a new switch in the building. The loop caused a large number of input errors (Errors In) on the captive portal interface. The loop has been removed and everything is back to normal.
    As you requested, I ran the command. A sample of the output is below; the full list has more than 2000 entries.

    --- table(cp_ifaces), set(0) ---
    bge2 2100 762351592 613953125392 1563031392
    --- table(campco_auth_up), set(0) ---
    10.10.0.14/32 34:8b:75:46:de:75 0 146835 29536066 1563031391
    10.10.0.16/32 a0:10:81:f7:53:13 0 472796 39421378 1563031392
    10.10.0.18/32 c4:0b:cb:46:1f:71 0 23711 4117029 1563031390
    10.10.0.19/32 58:c9:35:18:d1:fc 0 46230 8248466 1563031392
    10.10.0.25/32 34:12:f9:64:37:c8 0 303492 23989321 1563031322
    10.10.0.28/32 14:d1:69:39:20:26 0 230396 19411495 1563031383
    10.10.0.29/32 a8:34:6a:3d:d9:9a 0 229645 18612800 1563031382
    10.10.0.30/32 ac:cf:85:28:82:fd 0 282582 22431072 1563031392
    10.10.0.206/32 84:55:a5:d3:13:f8 0 86282 7594169 1563030789
    10.10.1.141/32 48:13:7e:a4:c2:56 0 125043 19849771 1563031282
    10.10.1.228/32 28:27:bf:02:f9:61 0 36518 4524318 1563031301
    10.10.2.29/32 80:4e:70:50:4d:26 0 438739 32961951 1563031390
    10.10.2.43/32 a8:81:95:08:84:2c 0 935 173015 1563031392
    10.10.2.59/32 88:70:8c:dd:f8:13 0 77127 9614889 1563031349
    10.10.2.112/32 18:3a:2d:29:0f:eb 0 155636 16597812 1563029749
    10.10.2.174/32 60:f1:89:4f:ad:e0 0 509779 42625581 1563031384
    10.10.2.247/32 7c:f9:0e:fb:a8:b1 0 249171 30071088 1563031358
    10.10.3.44/32 1c:23:2c:12:bb:31 0 4109 762733 1563029435
    10.10.3.74/32 90:97:f3:03:8b:74 0 286687 28529815 1563031392
    10.10.3.158/32 60:a4:d0:93:c7:76 0 202737 16625515 1563023094
    10.10.3.163/32 80:ce:b9:7f:94:09 0 23629 3237202 1563026480
    10.10.3.166/32 60:8f:5c:d6:c6:fa 0 5455 870722 1563031392
    10.10.3.173/32 a8:81:95:8f:ee:32 0 150434 15159826 1563031328
    10.10.3.236/32 00:12:36:55:37:41 0 70651 12760679 1563028289
    10.10.4.0/32 b4:74:43:c6:37:af 0 19153 5920430 1563025061
    10.10.4.22/32 bc:a5:8b:e0:f9:9a 0 208878 20037657 1563031392
    10.10.4.79/32 60:af:6d:86:e7:11 0 183319 16732340 1563031351
    10.10.4.144/32 44:78:3e:58:a9:ea 0 45302 4834766 1563031389
    10.10.4.146/32 dc:74:a8:d4:fd:9b 0 499269 39491085 1563031389
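
    With more than 2000 entries, the listing is easier to read as a summary. The following is only a sketch. It assumes each entry line holds address, MAC, a value, packet count, byte count and a last-activity Unix timestamp; that column meaning is inferred from the sample above and is not confirmed anywhere in this thread. The function name summarize_auth_table is made up for illustration.

```shell
#!/bin/sh
# Summarize one captive portal table from "ipfw table all list" output on stdin.
# Assumed column layout (inferred from the sample, not confirmed):
#   address/32  MAC  value  packets  bytes  last-activity (Unix time)
# summarize_auth_table is a hypothetical helper name.
summarize_auth_table() {
    table="$1"   # table to summarize, e.g. campco_auth_up
    awk -v want="$table" '
        /^[ \t]*--- table\(/ {
            # Header looks like: --- table(NAME), set(0) ---
            name = $2
            sub(/^table\(/, "", name)
            sub(/\),$/, "", name)
            in_table = (name == want)
            next
        }
        in_table && NF >= 6 {
            entries++
            bytes += $5
            if ($6 > newest) newest = $6
        }
        END {
            printf "entries=%d bytes=%.0f newest=%d\n", entries, bytes, newest
        }
    '
}
```

    Possible use on the firewall: ipfw table all list | summarize_auth_table campco_auth_up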



  • @free4 said in Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working:

    ipfw table all list

    The issue came back after 2 days, even though the loop has been removed. The network port hangs again, and the only way to recover is to disable and re-enable the interface. While the network is down, the input error count keeps climbing. Once the interface is down, connecting remotely to the web GUI through the WAN interface returns "502 Bad Gateway". Here is the output of "ipfw table all list" captured while it was down:

    ipfw table all list
    --- table(cp_ifaces), set(0) ---
    bge2 2100 6824547305 5704924933364 1563210003
    --- table(campco_auth_up), set(0) ---
    10.10.25.80/32 34:8b:75:d3:8a:5e 0 12580 3288176 1563210003
    10.10.25.157/32 a0:cb:fd:0e:04:b8 0 22395 1907512 1563210003
    --- table(campco_host_ips), set(0) ---
    --- table(campco_pipe_mac), set(0) ---
    --- table(campco_auth_down), set(0) ---
    10.10.25.80/32 0 13409 3655823 1563210003
    10.10.25.157/32 0 35937 45843082 1563210003
    --- table(campco_allowed_up), set(0) ---
    --- table(campco_allowed_down), set(0) ---

    I have attached a screenshot of the interface status; you can see the errors on the portal interface.
    Interface.png



  • "3547" => time to take a close look at the NIC, the cable, or the connected switch and/or AP, because something is faulting.



  • @Gertjan The cable has already been replaced with a 3M-branded patch cord, but the issue persists. There is a fiber link from the core Zyxel switch to another Zyxel switch; when we plug it in, the errors usually start increasing and the network goes down within a few hours. With that link unplugged, the network has been fine for up to 12 hours, although the error count still sometimes goes up by 2 or 3. I am still sorting out what the issue is.



  • @Gertjan Right now the network port is hung. If I try to ping any host from the portal interface, I get the error "ping: sendto: Permission denied". If I disable and re-enable the interface, it works again.



  • Enter console mode, option 8.
    Does the dmesg command (log) tell you anything?



  • @Gertjan Have a look at the dmesg logs:

    sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (232 occurrences)
    sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (426 occurrences)
    arp: 10.10.12.254 moved from 1c:c3:eb:91:d3:f7 to 40:40:a7:5c:55:53 on bge2
    sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (857 occurrences)
    sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (945 occurrences)
    sonewconn: pcb 0xfffff8035758b1d0: Listen queue overflow: 193 already in queue awaiting acceptance (1086 occurrences)
    sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (274 occurrences)
    sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (312 occurrences)
    sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (649 occurrences)
    sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (907 occurrences)
    sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (107 occurrences)
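
    To see at a glance which sockets are overflowing, the messages can be grouped by pcb address. This is a rough sketch that reads dmesg-style text on stdin; the helper name overflow_summary is invented here, and the number in parentheses is simply taken as printed by the kernel, without further interpretation.

```shell
#!/bin/sh
# Summarize "Listen queue overflow" kernel messages read from stdin.
# Prints one line per socket: "<pcb address> <message count> <largest count in parentheses>".
# overflow_summary is a hypothetical helper name.
overflow_summary() {
    awk '
        /sonewconn: pcb .*Listen queue overflow/ {
            # The pcb address is the field after "pcb", minus the trailing colon.
            for (i = 1; i < NF; i++) if ($i == "pcb") { pcb = $(i + 1); break }
            sub(/:$/, "", pcb)
            # The parenthesized count is the second-to-last field, e.g. "(232".
            n = $(NF - 1)
            sub(/^\(/, "", n)
            msgs[pcb]++
            if (n + 0 > peak[pcb]) peak[pcb] = n + 0
        }
        END {
            for (p in msgs) printf "%s %d %d\n", p, msgs[p], peak[p]
        }
    ' | sort
}
```

    Possible use on the firewall: dmesg | overflow_summary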



  • Queuing happens when the receiving application cannot process incoming connections fast enough. So either you are getting too many connections or your application is too slow to handle them.
    To see which application it is:

    netstat -Aan | grep fffff80130a2f0f0
    
    sockstat -l | grep socketname
    
    netstat -Lan | grep 193
    

    It will probably be the captive portal.
    Basically, that error means the listening application can no longer keep up with incoming connections.
    There are also some NIC tunables you can check to help solve the problem;
    there is a section for bge/bce here:
    https://docs.netgate.com/pfsense/en/latest/hardware/tuning-and-troubleshooting-network-cards.html
    Some suggest increasing kern.ipc.soacceptqueue (previously named kern.ipc.somaxconn). Increase it slowly (the default is 128) until the problem disappears.
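
    One way to "increase it slowly" is to double the value at each step up to a ceiling you pick yourself. The sketch below only computes the next value to try; the doubling step, the ceiling and the helper name next_queue_size are illustrative choices and do not come from the pfSense docs. As a side note, if FreeBSD reports an overflow once the queue exceeds 1.5 times the limit (my understanding of the kernel behaviour, not something stated in this thread), then "193 already in queue" is consistent with the default limit of 128.

```shell
#!/bin/sh
# Suggest the next value to try for kern.ipc.soacceptqueue: double the current
# value, but never exceed a chosen ceiling. The doubling step and the ceiling
# are illustrative choices, not recommendations from the pfSense docs.
# next_queue_size is a hypothetical helper name.
next_queue_size() {
    current="$1"
    ceiling="$2"
    next=$((current * 2))
    if [ "$next" -gt "$ceiling" ]; then
        next="$ceiling"
    fi
    echo "$next"
}

# Example use on the firewall itself (FreeBSD sysctl; not executed here):
#   cur=$(sysctl -n kern.ipc.soacceptqueue)
#   sysctl kern.ipc.soacceptqueue="$(next_queue_size "$cur" 1024)"
```

    Checking dmesg for new overflow messages after each step shows whether a further increase is needed.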



  • @kiokoman said in Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working:

    netstat -Lan | grep 193

    I followed the steps as suggested and got output only from the first command:

    netstat -Aan | grep fffff80130a2f0f0
    fffff80130a2f0f0 stream 0 0 fffff80130715ce8 0 0 0 /var/run/php-fpm.socket

    The other two commands produced no output.
    Secondly, I increased kern.ipc.soacceptqueue to 1024 and also applied the tuning for the bge NIC from the docs by creating /boot/loader.conf.local:

    kern.ipc.nmbclusters="131072"
    hw.bge.tso_enable=0
    hw.pci.enable_msix=0

    I will keep an eye on the system to see whether that makes any difference.



  • The second command in this case is:

    sockstat -l | grep /var/run/php-fpm.socket
    

    As an example, for me the result is:

    root     pfctl      51866 12 stream /var/run/php-fpm.socket
    root     pfctl      51866 13 stream /var/run/php-fpm.socket
    root     sleep      37398 12 stream /var/run/php-fpm.socket
    root     sleep      37398 13 stream /var/run/php-fpm.socket
    dhcpd    dhcpd      50501 12 stream /var/run/php-fpm.socket
    dhcpd    dhcpd      50501 13 stream /var/run/php-fpm.socket
    root     php-fpm    29258 13 stream /var/run/php-fpm.socket
    root     lighttpd_l 64766 12 stream /var/run/php-fpm.socket
    root     lighttpd_l 64766 13 stream /var/run/php-fpm.socket
    dhcpd    dhcpd      10592 12 stream /var/run/php-fpm.socket
    dhcpd    dhcpd      10592 13 stream /var/run/php-fpm.socket
    root     dpinger    74966 12 stream /var/run/php-fpm.socket
    root     dpinger    74966 13 stream /var/run/php-fpm.socket
    root     dpinger    74410 12 stream /var/run/php-fpm.socket
    root     dpinger    74410 13 stream /var/run/php-fpm.socket
    root     dpinger    74320 12 stream /var/run/php-fpm.socket
    root     dpinger    74320 13 stream /var/run/php-fpm.socket
    root     php-fpm    85281 13 stream /var/run/php-fpm.socket
    squid    squid      37275 12 stream /var/run/php-fpm.socket
    squid    squid      37275 13 stream /var/run/php-fpm.socket
    root     squid      36287 12 stream /var/run/php-fpm.socket
    root     squid      36287 13 stream /var/run/php-fpm.socket
    root     php-fpm    24514 13 stream /var/run/php-fpm.socket
    root     php-fpm    340   13 stream /var/run/php-fpm.socket
    root     php-fpm    339   13 stream /var/run/php-fpm.socket
    root     php-fpm    338   15 stream /var/run/php-fpm.socket
    

    So we can confirm that the socket is used by squid, dpinger, dhcpd, etc.

    Well, keep us updated.



  • @kiokoman said in Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working:

    So we can confirm that the socket is used by squid, dpinger, dhcpd, etc.

    Correct.
    All these programs use or call scripts that are PHP based.



  • @kiokoman said in Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working:

    sockstat -l | grep /var/run/php-fpm.socket

    I got this output:

    root     sleep      99245 12 stream /var/run/php-fpm.socket
    root     sleep      99245 13 stream /var/run/php-fpm.socket
    root     sleep      97728 12 stream /var/run/php-fpm.socket
    root     sleep      97728 13 stream /var/run/php-fpm.socket
    unbound  unbound    20075 12 stream /var/run/php-fpm.socket
    unbound  unbound    20075 13 stream /var/run/php-fpm.socket
    root     php-fpm    83461 13 stream /var/run/php-fpm.socket
    root     php-fpm    83358 13 stream /var/run/php-fpm.socket
    root     php-fpm    83301 13 stream /var/run/php-fpm.socket
    root     php-fpm    83152 15 stream /var/run/php-fpm.socket
    root     nginx      55388 12 stream /var/run/php-fpm.socket
    root     nginx      55388 13 stream /var/run/php-fpm.socket
    root     nginx      55234 12 stream /var/run/php-fpm.socket
    root     nginx      55234 13 stream /var/run/php-fpm.socket
    root     nginx      54950 12 stream /var/run/php-fpm.socket
    root     nginx      54950 13 stream /var/run/php-fpm.socket
    root     nginx      54735 12 stream /var/run/php-fpm.socket
    root     nginx      54735 13 stream /var/run/php-fpm.socket
    root     nginx      54498 12 stream /var/run/php-fpm.socket
    root     nginx      54498 13 stream /var/run/php-fpm.socket
    root     nginx      54344 12 stream /var/run/php-fpm.socket
    root     nginx      54344 13 stream /var/run/php-fpm.socket
    root     nginx      54164 12 stream /var/run/php-fpm.socket
    root     nginx      54164 13 stream /var/run/php-fpm.socket
    root     nginx      53913 12 stream /var/run/php-fpm.socket
    root     nginx      53913 13 stream /var/run/php-fpm.socket
    root     nginx      53902 12 stream /var/run/php-fpm.socket
    root     nginx      53902 13 stream /var/run/php-fpm.socket
    root     nginx      53685 12 stream /var/run/php-fpm.socket
    root     nginx      53685 13 stream /var/run/php-fpm.socket
    root     nginx      53620 12 stream /var/run/php-fpm.socket
    root     nginx      53620 13 stream /var/run/php-fpm.socket
    root     sh         94097 12 stream /var/run/php-fpm.socket
    root     sh         94097 13 stream /var/run/php-fpm.socket
    root     sh         93897 12 stream /var/run/php-fpm.socket
    root     sh         93897 13 stream /var/run/php-fpm.socket
    root     dpinger    83648 12 stream /var/run/php-fpm.socket
    root     dpinger    83648 13 stream /var/run/php-fpm.socket
    root     dpinger    83295 12 stream /var/run/php-fpm.socket
    root     dpinger    83295 13 stream /var/run/php-fpm.socket
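
    A listing like this is easier to compare between systems when it is reduced to a count of distinct processes per command. A small sketch, assuming the column order shown above (user, command, PID, file descriptor, protocol, address); socket_holders is a made-up helper name.

```shell
#!/bin/sh
# Count how many distinct processes of each command hold a given socket,
# reading "sockstat -l" output on stdin. Columns assumed as in the sample:
#   USER COMMAND PID FD PROTO ADDRESS
# socket_holders is a hypothetical helper name.
socket_holders() {
    sock="${1:-/var/run/php-fpm.socket}"
    awk -v sock="$sock" '
        # Count each (command, PID) pair once, even if it holds several descriptors.
        $NF == sock && !seen[$2 " " $3]++ { procs[$2]++ }
        END { for (c in procs) printf "%s %d\n", c, procs[c] }
    ' | sort
}
```

    Possible use on the firewall: sockstat -l | socket_holders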
    


  • OK, nginx is the web server for the captive portal. Check it if the problem still occurs.



  • @kiokoman The system worked without any problem for 28 days under heavy load (2000+ users). It then suddenly started showing input errors on the portal interface, and the system has gone down three times; more precisely, only the portal interface stops passing traffic in either direction. Disabling and re-enabling it gets it working again, sometimes for one day, sometimes for two, and then it happens again.

    I am not changing any settings, yet when one service works another goes down. Now the captive portal is stuck on the boot screen and does not let the system reach the startup menu. I haven't changed any settings in the captive portal.



  • So now we have random services going down and the captive portal stuck? I would try an fsck on the filesystem to see if it helps, and a memtest.



  • @Gertjan Is there a way to reinstall only the captive portal package? The captive portal service shows as running when started from the Dashboard, but if I try to save the captive portal configuration, it keeps waiting and finally returns the error "504 Gateway Time-out". Users are connected directly, as if on LAN, and the portal no longer seems to be working.



  • The captive portal isn't a package.
    It uses:
    Another instance of nginx running on 127.0.0.1 port 80 (and 443 if you use https).
    This nginx instance serves a landing (PHP) page.
    A helper script, /etc/inc/captiveportal.inc.

    Activating the portal also activates the firewall program called "ipfw", which can handle MAC addresses on the captive portal interface.

    Actually, a captive portal as implemented by pfSense is pretty simple. No special processes.

    You run the captive portal on a dedicated interface called OPTx, right?
    Swap LAN and this OPTx interface and see if the issue persists.



  • You can also check Status / System Logs / Captive Portal Auth for any errors.



  • @Gertjan Yes, it's running on the OPT1 interface. I restarted the system 3 times and that didn't fix the issue. The only thing I had changed was the system tunable mentioned by @kiokoman, kern.ipc.soacceptqueue, set to 1024. I deleted it and rebooted the system, and the captive portal is back. I'm not sure whether it's related to the captive portal.



  • @wazim4u said in Dell R430 bge (Built-in Port Hangs ) & Captive Portal Stop working:

    Listen queue overflow: 193 already in queue awaiting acceptance (232 occurrences)

    Check these.



  • I would like to share the changes I have made, in case someone else faces the same issue now or in the future. Below is the status after 2 days of the system running.

    Captive Portal Status (2000+ users)
    Captive_Portal_Users.jpg

    Network Status (so far only 1 error)
    Network_Status.jpg

    Changes I've Made
    1- In /boot/loader.conf.local (newly created file), add the following for the Broadcom NIC:

    kern.ipc.nmbclusters="1000000"
    hw.bge.tso_enable=0
    hw.pci.enable_msix=0
    

    2- In Interfaces / Portal (portal interface bge2), under Speed & Duplex, select:

    1000baseT full-duplex
    

    (The auto setting was producing some errors in the Zyxel switch logs.)

    3- In /usr/local/etc/php-fpm.d/www.conf, change

    listen.backlog = 511 ( default )
    

    to

    listen.backlog = -1
    

    The php-fpm related errors are gone after changing listen.backlog (none so far in two days).
    The php-fpm error was:

    kernel: sonewconn: pcb 0xfffff80130a2f0f0: Listen queue overflow: 193 already in queue awaiting acceptance (155 occurrences)
    

    Now there is only one issue left that I am sorting out, related to nginx:

    nginx: 2019/07/20 10:25:19 [alert] 95352#100567: send() failed (40: Message too long)
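
    For anyone repeating the listen.backlog change from step 3, it may be worth confirming that the setting is still in place after a reboot or an update, since a regenerated configuration file could revert a manual edit (that is an assumption on my part, not something confirmed in this thread). As I understand the PHP-FPM documentation, -1 asks for the maximum backlog on BSD systems. The sketch below just prints the active value from a pool file; show_listen_backlog is a made-up helper name.

```shell
#!/bin/sh
# Print the listen.backlog setting from a php-fpm pool file.
# Commented-out lines (starting with ";") are ignored; if the directive appears
# more than once, the last one is printed. Prints nothing if it is not set.
# show_listen_backlog is a hypothetical helper name.
show_listen_backlog() {
    conf="${1:-/usr/local/etc/php-fpm.d/www.conf}"
    awk -F= '
        /^[ \t]*listen\.backlog[ \t]*=/ {
            value = $2
            gsub(/[ \t]/, "", value)
        }
        END { if (value != "") print value }
    ' "$conf"
}
```

    Called without an argument, it reads the path used in step 3.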
    
