PfSense hung: where do I start to debug?



  • G'day you all fine people lovers of the best firewall in the world  ;D

    I have an angry WIFE. I don't want an angry WIFE. Especially not now as I am flooded with work and don't have time to work on pfSense. But I have to  :-[

    WIFE can get angry for a couple of reasons. Not many. But one of them is when she can't use the internets. Then she threatens to divorce me if I don't fix it right away (no, I am joking; I think my copy of WIFE is one they only used to make in the old days. She's been running steadily for 27 years now. In other words: I am very happy with WIFE  ;D).

    Anyway, the problem is: this morning she called me: internet doesn't work, pfSense can't be pinged, I can't get into the GUI, nor can I log in to the console via the keyboard and monitor that are connected to the pfSense machine. The screen appeared frozen.

    I told her to do a hard reset (power off/power on) and it worked again, but of course I need to find out why this happened. The machine had been running for I think two weeks, there were no package updates done recently, so the only thing that updates is Snort VRT during the night.

    I have no clue where I need to start troubleshooting. Would anybody perhaps tell me in which logs to look?

    Thank you in advance very much  :-X

    Bye,



  • Well I would start looking through log files. system log for sure. Anything in /var/crash/?

    It could be hardware/memory issue as well. Running a memtest on the system overnight would be a start possibly.
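
    A hard hang usually leaves no crash dump, so one thing worth looking for in the system log is the last entry written before the box went silent. The sketch below is only an illustration, not a pfSense tool: it assumes the log has already been exported as plain text in the classic BSD syslog layout ("Mar  1 03:12:45 host ..."), and the function name, the 30-minute threshold and the sample lines are all made up for the example. On pfSense versions of this era the system log is, as far as I know, a circular log that has to be dumped with clog first, so treat that step as an assumption to verify.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, year, min_gap=timedelta(minutes=30)):
    """Return (line_before_gap, line_after_gap, gap) for every silent period.

    Classic syslog timestamps carry no year, so the caller supplies one.
    Lines that do not start with a parsable timestamp are skipped.
    """
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped syslog line
        if prev_time is not None and stamp - prev_time >= min_gap:
            gaps.append((prev_line, line, stamp - prev_time))
        prev_time, prev_line = stamp, line
    return gaps


# Invented sample data, for illustration only.
sample = [
    "Mar  1 03:10:02 pfsense example[100]: nightly job finished",
    "Mar  1 03:12:45 pfsense example[100]: last entry before the hang",
    "Mar  1 08:31:10 pfsense example[1]: first entry after the hard reset",
]

gaps = find_log_gaps(sample, year=2014)
for before, after, gap in gaps:
    print(f"silent for {gap}")
    print(f"  last line before: {before}")
    print(f"  first line after: {after}")
```

    The line just before each gap is where to start reading; if the same process shows up there after every hang, that is a lead. A quiet firewall can also have harmless gaps, so the threshold needs tuning to how chatty the log normally is.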


  • Netgate Administrator

    Was it sufficiently frozen that the capslock and/or numlock keys/leds stopped working?
    Does it run hot?
    It could just be a power glitch. Does it run through a ups or surge suppressor?

    Steve



    Thank you to the both of you and my sincere apologies for the delay. I am overloaded with work, and since it happened only once I put it on my 'second most important priority list' and moved on. But this morning it happened again  :-[

    @bryan.paradis:

    Well I would start looking through log files. system log for sure. Anything in /var/crash/?

    It could be hardware/memory issue as well. Running a memtest on the system overnight would be a start possibly.

    Thank you  :) /var/crash/ is empty. I am trying to figure out how to have memtest86 run, but it appears that is not a package?

    pkg_add -r memtest86 can't find a package; pkg_add -r memtest can, but that appears to be something completely different from memtest86 (?)

    @stephenw10:

    Was it sufficiently frozen that the capslock and/or numlock keys/leds stopped working?
    Does it run hot?
    It could just be a power glitch. Does it run through a ups or surge suppressor?

    Steve

    Yes, Steve, it did nothing. No response to keyboard input whatsoever. I doubt it runs hot; I have the temperature widget in the dashboard, which stays around 38 degrees Celsius. And it is connected to an APC UPS which has 2 hours of backup power in case something happens.

    Are there perhaps other logs I should look into?

    Thank you again to the both of you very much  ;D



  • @Hollander:

    I am trying to figure out how to have memtest86 run, but it appears that is not a package?

    pkg_add -r memtest86 can't find a package; pkg_add -r memtest can, but that appears to be something completely different from memtest86 (?)

    http://www.memtest.org/

    This isn't a package in pfSense.  You will need to download the ISO or USB version and boot from it.  While you are running it your system will not be usable by pfSense.

    It has been my experience that frequent crashes (daily) which are the result of bad memory will usually show up within a few hours of running MemTest86+.  Occasional errors (one or two crashes a month on a machine used 8 hours per day), on the other hand, sometimes take a weekend.  I wouldn't call it conclusively NOT your RAM until you've run for at least 72 hours.


  • Netgate Administrator

    Yes, I'd have to agree with that. Not a power glitch and not overheating, so probably bad RAM at some high address that you don't use too often. Perhaps check the RAM usage RRD graph to see if it climbs to some point before it crashes. That would be inconclusive though. Running Memtest for a while is the only way to be sure. That or nuke it from orbit!  ;)

    Steve



  • Thank you to the both of you very much for replying  :-*

    Ok, then I might have a new problem: if I have to boot from USB and test for 72 hours I won't have internet access for 72 hours. As I explained before, in this house there is the very powerful algorithm that governs vital life functions:

    IF
      WIFE no internet
    THEN
      Hollander no food
    ELSE
      Hollander food

    ( ;D)

    No, seriously, both WIFE and I work from home, so I cannot be without internet for such a long time.

    I already have dual WAN to cover a part of that risk, and I knew that I would need to have a second pfSense machine (and a second switch, for that matter), but I postponed it because I am drowning in work. But it appears I need to do that now.

    Which poses two problems for me now:
    1. Is it possible to have one pfSense be the backup/fail over for another pfSense? My guess is 'yes of course that is possible, this is pfSense you know' ( :D). But what is that concept called within pfSense, so I know what to look for to study?
    2. New hardware. The problem is: in my sig is a nice motherboard that was advised to me by a great man who wants to remain anonymous. However, it appears that board is no longer sold. So now I am once again condemned to trying to find hardware that will be suitable.

    And then, as to this hardware, I was wondering: aren't there any ready to run 'small servers' that I could buy off the shelf that would be just as good as the hardware I have now? I mean, I recall in the past I've seen several times that people buy small 'HP xyz (can't remember the series name)'-servers, or small Dell servers, or small IBM servers.

    Would anybody happen to know off the top of their head the series/model numbers I could look into? Ideally, I would like to have at least the same power I have now, so dual Intel NIC, minimum 8GB RAM (Snort makes my current machine use 7 GB already), and at least the CPU I have now. Better, if not that much more expensive, would be even nicer of course.

    As always, I am in your debt - and very grateful - for all the help  ;D

    Thank you,

    Bye,



  • I typically don't bother with MemTest any more.  Most servers with Registered ECC (and even some workstations with just ECC) can tell you exactly which stick is failing.  For boxes without Registered or ECC RAM I typically just swap out the memory first (RAM is relatively cheap compared to downtime) to see if the problem goes away.  If that doesn't do it, the PSU goes next.  If that doesn't fix it the entire system gets replaced.

    As to high availability, yes, that's an option, but you'll need 3 static IPs on each interface to do it.  If you don't have that you can still have a cold spare where you can just restore the config from your main box and swap the wires.
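
    To make the "3 static IPs on each interface" remark concrete: in the classic CARP layout each firewall keeps its own address on the interface, and a third, shared virtual IP floats between them and is the address clients actually use as their gateway. The sketch below just lays that out with Python's ipaddress module. The function name, the role labels and the 192.0.2.0/29 documentation subnet are assumptions for illustration, not anything taken from a real pfSense config.

```python
import ipaddress


def plan_carp_addresses(subnet):
    """Pick three distinct addresses from a subnet for a two-node CARP pair.

    Returns a dict mapping a role to an address:
      carp_vip  - shared virtual IP that moves to whichever node is master
      primary   - the primary firewall's own interface address
      secondary - the backup firewall's own interface address
    """
    net = ipaddress.ip_network(subnet)
    if net.num_addresses < 5:
        # network address + broadcast + three usable hosts
        raise ValueError("subnet too small for two nodes plus a shared VIP")
    return {
        "carp_vip": net[1],
        "primary": net[2],
        "secondary": net[3],
    }


plan = plan_carp_addresses("192.0.2.0/29")
for role, address in plan.items():
    print(f"{role:10s} {address}")
```

    With a single WAN address handed out by the ISP there is nothing to give the second node and the shared VIP on that side, which is why the cold spare (restore the config, swap the wires) is the fallback mentioned above.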



  • @Jason:

    I typically don't bother with MemTest any more.  Most servers with Registered ECC (and even some workstations with just ECC) can tell you exactly which stick is failing.  For boxes without Registered or ECC RAM I typically just swap out the memory first (RAM is relatively cheap compared to downtime) to see if the problem goes away.  If that doesn't do it, the PSU goes next.  If that doesn't fix it the entire system gets replaced.

    As to high availability, yes, that's an option, but you'll need 3 static IPs on each interface to do it.  If you don't have that you can still have a cold spare where you can just restore the config from your main box and swap the wires.

    Thank you Jason  ;D

    WIFE (who built the pfSense hardware box, a long story but it boils down to: I stay away from screws and cables ever since I blew up the engine of my car a long time ago. I was young and naive and thought I could save myself 1000 dollars by fixing something myself. I've learned my lesson the hard way, and ever since I don't even touch a screwdriver. WIFE has to do that, I come in after that  ;D) told me the memory in the box is 'something special' (I doubt that, but I have been taught to never oppose WIFE), so easily swapping appears 'tricky' (duh, it is a fairly new motherboard, so how could this memory be 'something special'?).

    I did find a lead for the fail over pfSense machines (thanks to a reddit post):

    https://doc.pfsense.org/index.php/Configuring_pfSense_Hardware_Redundancy_(CARP)

    I will study it to see if I can understand, with my limited brain, your remark about the 3 IPs.

    For now that leaves me with the suitable hardware question.

    Thank you again  ;D



  • I've read the tutorial three times, but my limited brain can't understand why I would need three WAN IPs  :-[

    I did find this, which suggests that isn't necessary:

    https://forum.pfsense.org/index.php?topic=27117.0

    This would come in handy, as I am in a SOHO-environment and have only 1 WAN-IP.

    I do understand that since my modems (VDSL and Cable) are plugged into the first pfSense machine (WAN1 and WAN2), should a CRAP (small joke  ;D) CARP fail over need to occur, these cables one way or the other need to end up in the second pfSense machine.

    Now I can understand that in a normal business/corporate environment there aren't any sysadmins running around to switch cables, it should be fully automatic. As pfSense is a very professional system, I take it it can do that automatically also. So, obviously, the two modem cables don't go into the first pfSense in that setup. Which leaves, as far as I can tell, only the switch to plug them into (I don't see any other holes in my server room to plug cables into  ;D).

    But I fail to see how this will work next. I mean, if WAN1 goes into the switch instead of into the VDSL modem, how does WAN1 get an external IP? It has to get that from the modem, but it isn't plugged in there (?)

    Currently I have the most classic setup:

    WAN1/WAN2 -> pfSense -> LAN/VLAN1/VLAN2/VLAN3 -> HP switch -> LAN 'puters.

    Thank you for any help for this still Dutch and still noob  ;D



  • @Hollander:

    WIFE has to do that, I come in after that  ;D) told me the memory in the box is 'something special' (I doubt that, but I have been taught to never oppose WIFE), so easily swapping appears 'tricky' (duh, it is a fairly new motherboard, so how could this memory be 'something special'?).

    Is pfSense on the box in your sig?  Then ask your hardware vendor wife if she used DDR3 1600 MHz memory.  The MB supports that speed, but the G1610 processor in your sig does not; it only officially supports DDR3-1333 MHz.  http://ark.intel.com/products/spec/SR10K

    I did find a lead for the fail over pfSense machines (thanks to a reddit post):

    While there's no harm in doing a fully redundant fail-over system, IMHO it seems overkill compared to finding the problem with your primary machine (which you would want to fix anyway, redundant or not …)



  • @charliem:

    @Hollander:

    WIFE has to do that, I come in after that  ;D) told me the memory in the box is 'something special' (I doubt that, but I have been taught to never oppose WIFE), so easily swapping appears 'tricky' (duh, it is a fairly new motherboard, so how could this memory be 'something special'?).

    Is pfSense on the box in your sig?  Then ask your hardware vendor wife if she used DDR3 1600 MHz memory.  The MB supports that speed, but the G1610 processor in your sig does not; it only officially supports DDR3-1333 MHz.  http://ark.intel.com/products/spec/SR10K

    I did find a lead for the fail over pfSense machines (thanks to a reddit post):

    While there's no harm in doing a fully redundant fail-over system, IMHO it seems overkill compared to finding the problem with your primary machine (which you would want to fix anyway, redundant or not …)

    Thank you Charlie  :P

    I think WIFE needs an upgrade to the latest firmware. I think you have discovered a bug  ;D

    Because she checked and you are right; she did use 1600, as she looked only at the mobo.

    But: isn't it weird that this problem only manifests itself after one year?

    On another note, during the last week when logging in to the pfSense GUI, there was a message about pfSense having crashed (although this is not from when it hung; hanging occurs late at night and then all that remains is to hard reset the box, so no logging into the GUI anymore). You can then submit the crash report to pfSense, which I did, and all I could see was that pfBlocker crashed every time. But I have no clue how to find out why. I don't think it is related to this problem of hanging (but then again, what do I know in the first place), but I am mentioning it anyhow.



  • @Hollander:

    Because she checked and you are right; she did use 1600, as she looked only at the mobo.

    But: isn't it weird that this problem only manifests itself after one year?

    It still may not be the problem, but it gives you reason to run memtest.  My experience is that dodgy RAM can be found fairly quickly, usually a few hours or less.  Your usage pattern may have changed, now touching a bad cell.  Or more heat due to dust, or a new package or …

    On another note, during the last week when logging in to the pfSense GUI, there was a message about pfSense having crashed (although this is not from when it hung; hanging occurs late at night and then all that remains is to hard reset the box, so no logging into the GUI anymore). You can then submit the crash report to pfSense, which I did, and all I could see was that pfBlocker crashed every time. But I have no clue how to find out why. I don't think it is related to this problem of hanging (but then again, what do I know in the first place), but I am mentioning it anyhow.

    Both may well have the same root cause, unless you saw pfBlocker crashing before the lockups started.



    As an update: I now have my replacement Dell, so I could put my pfSense1 through the memory test. It has been running memtest86 for 24 hours with no problem whatsoever.

    I also discovered in the BIOS that the motherboard automatically scales back the frequency; the BIOS said 'RAM 1600, actual 1333'.
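
    That BIOS readout fits the usual rule of thumb that memory runs at the fastest speed every part in the chain supports, i.e. the minimum of the individual limits. A minimal sketch with the numbers from this thread (DDR3-1600 sticks, a board that supports 1600, a G1610 that officially supports 1333); the function name and the dictionary layout are just for illustration, and the rule itself ignores overclocking and manual BIOS overrides:

```python
def effective_memory_speed(limits):
    """Return (limiting_component, speed) for a dict of component -> max MT/s."""
    component = min(limits, key=limits.get)
    return component, limits[component]


limits = {
    "DIMM": 1600,
    "motherboard": 1600,
    "CPU (G1610)": 1333,
}

component, speed = effective_memory_speed(limits)
print(f"memory runs at DDR3-{speed}, limited by {component}")
```

    So the 1600 sticks were already being run at 1333, which makes the speed mismatch a less likely culprit for the hangs.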

    A wise man who knows many, many things whispered in my ear that I should try the PSU (thank you, wise man  ;D), so this is what I will do next. Otherwise I will remove pfBlocker, since that appeared to keep crashing on line 262 constantly.

    Thank you for your help  ;D

