pfSense Not Responding - What to do before power cycle?



  • My pfSense box randomly loses all network connectivity. The web interface falls off the face of the earth, and SSH stops responding along with everything else on the network. The box is in a remote data center, but before I have the data center power cycle it, should I have them console the box and see if there is any output on the screen? What can I do once the box is power cycled to determine the cause of the lockups?



  • If possible, see whether anything is on the screen first. If you can reboot from the console, that is better than power-cycling the box. Afterwards, I would check the error logs, historical traffic, state table, CPU usage, etc.



  • My provider consoled the box and here is what they said:

    The console showed an error message in Hex, and the words "BTX Halted". It appeared to be a few lines after boot, as the pfsense bootloader was still visible towards the top of the screen.

    That sounds like the box rebooted itself and then had an error at boot. I've looked at the logs in /var/log but they all seem to have reset themselves after reboot. Other than syslog, is there anything I can do to determine the cause of the reboot or why it may have halted?





  • @cmb:

    some info here:
    http://doc.pfsense.org/index.php/Unexpected_Reboot_Troubleshooting

    Thanks for that link. Installing the developers kernel will be problematic at best, but I know my data center has an IP KVM they can attach. I'll see what I can do.

    In the meantime, I remember that when I built this box a week or so ago, everything was fine when I had 2GB of RAM installed. I dropped 4GB in (the 2GB DIMM heatsinks were too tall for the 1U chassis cover to close) and upon reboot, I received some sort of ACPI crash. I disabled ACPI by using hint.acpi.0.disabled="1" in /etc/loader.conf and everything seemed fine. I pounded on the box for 48 hours or so with iperf in a loop with no issues. Does pfSense 1.2.3-RC3 have any issues with 4GB of RAM and the SMP kernel that could cause crashes?

    I'm also not ruling out that the RAM could be faulty, although it has worked for several months in another machine here at the house. I would like to rule everything else out before I pay the data center guys to swap the RAM.


  • Banned

    I am running 4gb ram on a Xseries box from IBM with RC3 release. No problems at all.



  • @Supermule:

    I am running 4gb ram on a Xseries box from IBM with RC3 release. No problems at all.

    Thanks Supermule.

    The data center hooked a KVM up for me. Unfortunately, I couldn't recreate the behavior, and the motherboard's hardware event log didn't show any events like overtemp or RAM errors; just the "no keyboard connected" event after the spontaneous reboot at 3AM on 10/5. However, I decided to change several BIOS settings that I think may have been at fault. The Supermicro board comes with several new power saving features enabled that I think may be incompatible with FreeBSD, my processor (E5200), or some combination of all three. So I turned the majority of them off, re-enabled ACPI, and so far things have stayed up with no crashing. I'll document them here just in case anyone runs into a similar issue:

    • Intel Thermal Management 2
    • C1 Enhanced Mode
    • Enhanced Intel Speedstep
    • Memory Remapping
    • High Precision Event Timer

    The first 3 options allow not only CPU clock modulation, but CPU voltage modulation based on load and temperature as well. Memory remapping on my Supermicro X7SBL-LN2 (Intel 3200 chipset) board moves the 4GB address space to the 5GB range. Finally, the High Precision Event Timer is a replacement for the 8254 Programmable Interval Timer, but it seems to benefit only desktop operating systems that play multimedia.

    I have a remote server monitoring the box via ping, so if it crashes or network connectivity drops off, I'll know immediately. My fingers are crossed, but so far so good.
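    For anyone who wants to set up similar ping monitoring, here is a minimal Python sketch. It is not the actual monitoring setup used in this thread, just one way to log a timestamp when a host stops or resumes answering pings. The function names, the 3-failure threshold, and the 30-second interval are all assumptions for illustration, and it assumes a Unix-like ping that accepts -c.

    ```python
    import subprocess
    import time
    from datetime import datetime


    def ping_once(host, timeout_s=5):
        """Return True if a single ping to host succeeds.

        Assumes a Unix-like ping that accepts -c (packet count).
        """
        try:
            result = subprocess.run(
                ["ping", "-c", "1", host],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except (subprocess.TimeoutExpired, OSError):
            return False
        return result.returncode == 0


    def next_state(is_down, failures, ping_ok, threshold=3):
        """Advance the monitor by one probe result.

        Returns (is_down, failures, event), where event is "down", "up",
        or None. The host is only declared down after `threshold`
        consecutive failures, so one lost packet does not raise an alarm.
        """
        if ping_ok:
            return False, 0, ("up" if is_down else None)
        failures += 1
        if not is_down and failures >= threshold:
            return True, failures, "down"
        return is_down, failures, None


    def monitor(host, interval_s=30):
        """Probe host forever, printing a timestamped line on each state change."""
        is_down, failures = False, 0
        while True:
            is_down, failures, event = next_state(is_down, failures, ping_once(host))
            if event:
                stamp = datetime.now().isoformat(timespec="seconds")
                print(f"{stamp} {host} is {event}")
            time.sleep(interval_s)
    ```

    Calling monitor("192.0.2.1") (a placeholder documentation address) from a machine outside the data center would record when the box dropped off and when it came back, which can then be lined up against the BIOS event log timestamps.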



  • Well, apparently the items noted above didn't help, as the server dropped offline an hour ago and it looks like my data center has to power cycle it to get it back online. If I install the developer's kernel and get the panic information, is that likely to tell me what the cause is, since the system BIOS isn't recording anything in the DMI event logs?

    All this work has a real cost ($135/hour for remote hands at the datacenter) associated with it, not to mention the cost of renting their IP KVM.

    Edit:

    The server had rebooted and was hung right after the memory count at POST, so this has got to be a hardware issue. Eff'ing A.

