XG-7100 goes unresponsive & core dump



  • Hello All,

    School environment: 2 x Netgate XG-7100, 2 buildings.

    Since being put in place 3 months ago, the Netgate XG-7100 will randomly go 'unresponsive' at almost the same time of day. It may go 7 days with no problems or only two days, but it always happens when people are arriving at school.
    Also, this was run with captive portal disabled until the school year actually started, with no problems. Captive portal was enabled before the first school day. When I say unresponsive I mean no SSH, no web UI, no ping, etc. The only option is a power cycle. It is located in such a way that it is very difficult to get a serial cable / laptop hooked up to it when it has hit this state, to see if anything can be accessed via the serial console.
    There are about 800 users on this during the day. It always happens at about 8:00-8:30 am. It has never stopped after this time.
    Put the XG-7100 on a different battery backup just to eliminate this possibility, but there were still some 'unresponsive' days.

    Two days ago the Netgate finally did a core dump and rebooted on its own. In the saved 'info' file there is a mention of 'Panic String: spin lock held too long'. For some reason I cannot untar the actual textdump.tar.0 (File Roller reports an unsupported format) in order to post it here for examination.
    As an unscientific guess, I am going to disable the captive portal and run it for a week this way.

    We have a second Netgate XG-7100 at another school building using Captive Portal with about 600 users and this machine has been rock solid.

    Can anyone clue me in on how to untar the textdump.tar.0 file for posting here? I tried renaming it to textdump.tar.gz but it says unsupported format.

    Thank You


  • Netgate Administrator

    You should just be able to remove the .0 extension and the result is a tar ball. PM it to me if you don't want it public.
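
For anyone following along, the extract step might look like the sketch below. It assumes the dump was copied off the firewall as textdump.tar.0; the function name extract_textdump and the default output directory are hypothetical, chosen only for illustration. tar identifies a plain tar archive from its contents, so the file can also be read without renaming it.

```shell
# Extract a textdump archive that still carries its ".0" suffix.
# Written for a POSIX shell (/bin/sh).
extract_textdump() {
    src="$1"                          # e.g. textdump.tar.0
    dest="${2:-textdump_extracted}"   # output directory (hypothetical default)
    mkdir -p "$dest" || return 1
    # List the archive first; this fails early if the file is not a readable tar.
    tar -tf "$src" >/dev/null || return 1
    # Unpack into the destination directory.
    tar -xf "$src" -C "$dest"
}
```

Calling `extract_textdump textdump.tar.0` should leave plain text files in the output directory, which can then be attached or pasted.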

    Steve



  • Hi Steve,

    Thanks for offering to take a look at the core dump files. I am showing my stupidity: I don't see any place on a profile to PM or attach the core dump file. I only see "Start a chat"?

    Thanks



  • @brcisna said in XG-7100 goes unresponsive & core dump:

    I don't see any place on a profile to PM

    Do not paste the dump into the chat box.
    For example, use pastebin.com to upload the file, and PM him the link, something like https://pastebin.com/qxdg9QKX

    Btw: click on his name or avatar, and then:

    [screenshot]



  • Hi Gertjan,

    Thanks for the response and info. I thought maybe I was missing something in the user profiles. I'm used to the old-school "PM this person" in the user profile details. I'll do exactly as you suggested here.


  • Netgate Administrator

    Yes, sorry. I'm old school too, still using 'PM'. NodeBB only has chat there.

    It would probably actually be better to open a ticket with us in support for this and attach it there: https://go.netgate.com

    Much easier to point our developers at it once it's in the ticket system.

    Steve



  • Steve,

    OK, thanks. I will post the core dump files at the URL you have posted.
    This is one of those deals that is hard to pinpoint, at least for me anyway. The XG-7100 'just stopped', and at that point only a power cycle will revive it. If it weren't situated in such a rat's nest of a confined area in the server room, I would have liked to set up a serial cable/laptop on it before rebooting it. With the actual core dump someone there may see something, hopefully. We actually have an extra new XG-7100 to put into place for this very purpose, but want to see what the root problem is with this XG-7100.

    Thanks again



  • Update:

    Let the Netgate XG-7100 run for two weeks with captive portal disabled. It ran like a champ.
    This week I re-enabled Captive Portal. It ran fine the first day.
    On the second day at 8:15 am there was no internet and I was unable to SSH to it. Had to power cycle the unit to get the internet back up. There were no hints in the logs of what might have happened.
    What is so odd is that every time this has happened it is at about 8:15 am, yet it did survive the first day. This router is on a new heavy UPS backup dedicated to the router ONLY.

    Thank You


  • LAYER 8 Netgate

    I would look at any out-of-band options for access to the node. Like the console.

    You can take a status output from the command line if you can get access to the console.

    php /usr/local/www/status.php && cp /tmp/status_output.tgz /root

    Even if console access is difficult, it will probably be required.
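
If the status output is collected more than once, a timestamped copy keeps earlier archives from being overwritten. A minimal sketch, assuming status.php writes /tmp/status_output.tgz as in the command above and that it is run under a POSIX shell (/bin/sh); the function name save_status_copy is hypothetical:

```shell
# Copy the status archive to a timestamped file and print the new path.
save_status_copy() {
    src="${1:-/tmp/status_output.tgz}"   # archive path taken from the command above
    dest_dir="${2:-/root}"               # where to keep the copies
    stamp=$(date +%Y%m%d-%H%M%S)
    out="$dest_dir/status_output-$stamp.tgz"
    if [ ! -f "$src" ]; then
        echo "no status archive at $src" >&2
        return 1
    fi
    cp "$src" "$out" && echo "$out"
}
```

It could be chained after the status command, for example `php /usr/local/www/status.php && save_status_copy`.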



  • Hi Derelict,

    Thank you for the response. I am showing my stupidity. What do you mean by 'out-of-band options'?
    Once I run the php command at the console, what would I be looking for?
    I will certainly give this a try.

    Thanks again


  • LAYER 8 Netgate

    Meaning like the serial console or possibly an interface without the captive portal on it. It is unknown whether your issue has anything to do with the captive portal at this point.

