PfSense on a SuperMicro Atom Server Randomly Rebooting
-
No crash reports from the IP you're posting from or anything close to it. Are you not getting any, or just not submitting them?
-
Couple of things to check: Is the system BIOS set to reboot if the system encounters an error (hardware)?
My pfSense box would randomly die, and I realized it was overheating when I looked at the system logs. I set the BIOS to reboot so I wouldn't have a dark router. A quick addition of a fan fixed that.
-
I was never notified of crash reports that could be sent to the developers. Do I need to install the developer's kernel in order for this to be an option? If that's the case, I will definitely make sure to do that when I try 2.0.2.
I do have BIOS set to reboot on a hardware problem, but nothing was ever logged in the BIOS logs. I'm 99% sure it's not a cooling issue. I've got three of these boxes running other tasks (PBX, NAS) and they run like a charm. I've also run a different UTM distro on this box, and it doesn't have any issues. I'd just much rather run pfSense, as I think it is far superior.
-
Perhaps you might have more success with pfSense 2.1 snapshot builds. They have much more up to date device drivers than the 2.0.x series of builds.
Without some sort of crash dump or crash report it is almost impossible to tell what is going on.
-
If you're not getting crash reports then it's pretty much a certainty it's a hardware problem (and likely not RAM, since RAM problems will almost always cause kernel panics). Software problems that cause a reboot will be from a kernel panic, and you'll be prompted to submit the crash report upon your next login and every login until you either choose to delete or submit it. That happens with every kernel; there's no need for the dev kernel, and you generally don't want it for that. Wouldn't be a bad idea to try 2.1 also so you're trying a newer base OS.
-
Can you ssh into the box?
If you can, go into the shell and type in clog /var/log/system.log and post the logs from just prior to the reboot and following it.
I wrote a post with more info regarding grabbing logs here. Posting log info usually gets the problem identified very quickly.
-
Thanks for the info guys. I will work on getting 2.1 installed in the next day or two and see what happens from there. I'll know relatively quickly if it is going to work or not, and will post what I find from there.
Just out of curiosity…. Any idea when 2.1 is going to move to stable?
-
Well it looks like I am in the same boat with 2.1. Here's the syslog right before and after the reboot. Sure doesn't look like anything is getting logged.
Feb 21 19:56:16 atlas check_reload_status: Syncing firewall
Feb 21 19:56:49 atlas check_reload_status: Syncing firewall
Feb 21 19:56:53 atlas php: /snort/snort_alerts.php: Checking for and disabling any rules dependent upon disabled preprocessors for WAN…
Feb 21 19:57:33 atlas check_reload_status: Syncing firewall
Feb 21 19:57:37 atlas php: /snort/snort_alerts.php: Checking for and disabling any rules dependent upon disabled preprocessors for WAN...
Feb 21 20:02:02 atlas syslogd: kernel boot file is /boot/kernel/kernel
Feb 21 20:02:02 atlas kernel: Copyright (c) 1992-2012 The FreeBSD Project.
Feb 21 20:02:02 atlas kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Feb 21 20:02:02 atlas kernel: The Regents of the University of California. All rights reserved.
Feb 21 20:02:02 atlas kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
Feb 21 20:02:02 atlas kernel: FreeBSD 8.3-RELEASE-p6 #1: Thu Feb 21 11:33:28 EST 2013
Feb 21 20:02:02 atlas kernel: root@snapshots-8_3-amd64.builders.pfsense.org:/usr/obj./usr/pfSensesrc/src/sys/pfSense_SMP.8 amd64
Feb 21 20:02:02 atlas kernel: Timecounter "i8254" frequency 1193182 Hz quality 0
Feb 21 20:02:02 atlas kernel: CPU: Intel(R) Atom(TM) CPU D525 @ 1.80GHz (1807.21-MHz K8-class CPU)
Feb 21 20:02:02 atlas kernel: Origin = "GenuineIntel" Id = 0x106ca Family = 6 Model = 1c Stepping = 10
Feb 21 20:02:02 atlas kernel: Features=0xbfebfbff <fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,htt,tm,pbe>
Feb 21 20:02:02 atlas kernel: Features2=0x40e31d <sse3,dtes64,mon,ds_cpl,tm2,ssse3,cx16,xtpr,pdcm,movbe>
Feb 21 20:02:02 atlas kernel: AMD Features=0x20100800 <syscall,nx,lm>
Feb 21 20:02:02 atlas kernel: AMD Features2=0x1 <lahf>
Feb 21 20:02:02 atlas kernel: TSC: P-state invariant
Feb 21 20:02:02 atlas kernel: real memory = 8589934592 (8192 MB)
Feb 21 20:02:02 atlas kernel: avail memory = 8244371456 (7862 MB)
Feb 21 20:02:02 atlas kernel: ACPI APIC Table: <022112 APIC1550>
Feb 21 20:02:02 atlas kernel: FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
Feb 21 20:02:02 atlas kernel: FreeBSD/SMP: 1 package(s) x 2 core(s) x 2 HTT threads
Feb 21 20:02:02 atlas kernel: cpu0 (BSP): APIC ID: 0
Feb 21 20:02:02 atlas kernel: cpu1 (AP/HT): APIC ID: 1
Feb 21 20:02:02 atlas kernel: cpu2 (AP): APIC ID: 2
Feb 21 20:02:02 atlas kernel: cpu3 (AP/HT): APIC ID: 3
Crash happened between 19:57 and 20:02
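For anyone who wants to find these gaps without eyeballing the log, here is a minimal Python sketch. It assumes the log has been dumped to a plain text file first (for example by redirecting the clog output to a file and copying it off the box), and it treats the "syslogd: kernel boot file is" line as the marker for a fresh boot, since that is the first line after the gap in the log above. The function names and the default year are made up for illustration; syslog timestamps carry no year, so one has to be supplied.

```python
from datetime import datetime

# First line syslogd writes after a boot in the log above (assumed marker).
BOOT_MARKER = "syslogd: kernel boot file is"


def parse_ts(line, year=2013):
    """Parse the 15-character syslog timestamp, e.g. 'Feb 21 19:57:37'."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_reboot_gaps(lines, year=2013):
    """Return (last_entry_before_boot, first_entry_after_boot) pairs."""
    gaps = []
    prev = None
    for line in lines:
        line = line.rstrip("\n")
        try:
            ts = parse_ts(line, year)
        except ValueError:
            # Not a timestamped syslog line (blank, wrapped, or commentary).
            continue
        if BOOT_MARKER in line and prev is not None:
            gaps.append((prev, ts))
        prev = ts
    return gaps


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python reboot_gaps.py system.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as fh:
            for before, after in find_reboot_gaps(fh):
                print(f"last entry {before}, boot at {after}, gap {after - before}")
```

This only shows when the box went down and came back. It says nothing about why, and a clean shutdown would show up the same way unless you also check for shutdown messages before the gap.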
-
I'm on a customer network at a hotel running that exact same hardware right now with 80-some active users. That platform is widely used with factory defaults. Still no crash report from the sounds of it? Definitely, without question, a hardware problem of some sort if you're still not getting a crash report.
-
Still no crash report, and still nothing logged in the BIOS log.
I've run Untangle, and most recently ClearOS, on this box for about 5 months with no crashes. I'd just rather run pfSense.
So you think it's something faulty? I suppose I could see if SuperMicro would be willing to replace the system board.
I've tried each drive individually, so I don't think it's the drives. And you said earlier you didn't think it was RAM, because there was no crash report.
In your similar setups are you using dual hard drives in a RAID array?
The only thing I have changed from defaults in BIOS is the IDE/SATA Config. I have it set as follows:
Configure Sata#1 as: RAID
ICH Raid CodeBase: Adaptec
If memory serves me right, I tried the CodeBase as Intel, and it wouldn't even see the RAID volume.
-
Ok, thought I'd provide an update…
So I stumbled upon SuperMicro's supported OSes page. Supposedly FreeBSD is supported, but not the onboard RAID.
http://www.supermicro.com/support/resources/OS/Atom.cfm
So I installed pfSense, setting up a gmirror. That didn't seem to solve it.
So I started wondering if it had something to do with ACPI. Looking in BIOS it was set to ACPI version 2.0. I switched it to 3.0. It's been up for about 24 hours now, so I'm cautiously optimistic now. It would never make it a full 24 hours before.
-
Spoke too soon, rebooted overnight. Back to the drawing board.
-
Can you bypass the RAID on the motherboard and directly connect to an IDE/SATA port?
-
Turning RAID on and off is just a BIOS setting, no jumpers or anything on the board for it. SATA ports are the same, there aren't special ones for the RAID. According to SuperMicro, AHCI mode is supported, which is what I have it on now. I'm accomplishing the RAID with gmirror now.
I just swapped the RAM out with a different brand that I happened to have, so I'm going to give this a go now and see what happens. I went from 8GB of Crucial RAM to 8GB of Hynix RAM that I had left over from a RAM upgrade on my laptop.
I just can't think of what would be physically wrong with the board to only give me grief in FreeBSD, but work fine in Linux-based distros. But if the RAM doesn't do it, I think my only other option is to see if SuperMicro will send me another board. I just don't know if they will.
-
Well it's not the RAM. It rebooted in less than three hours this time.
I've submitted an RMA to SuperMicro, hopefully they'll send a replacement.
-
SuperMicro is shipping a new system board. Hope to have it in a day or two.
-
New system board is in, so we'll soon see if this is the answer.
Interesting side note: it's been up for about 3 hours now, and the CPU is running about 10 degrees cooler than on the other board. The old board was never close to overheating. I just thought it was noteworthy that the new one is running cooler.
-
Running 2 of those same boxes here with 2.0.2 amd64 on them.
Never had any issues.. I did read however that a lot of people had temperature issues with them, and to fix them they taped the vents on the FRONT (?!) closed so it forced air in from the back, across the passive CPU cooler, and out through the power supply. Seemed weird, but they say that the temps drop more than 10 degrees C when they do that..
Mine run in the mid 50's, and the chips are specced up to 100 degrees C, so I'm gonna leave mine alone for now.
They're surprisingly fast Squid boxes when paired with an Intel or Samsung SSD.. :) Nice low budget really fast router.
-
The CPU temp on the new system board is sitting at 55-56 degrees Celsius. On the old one it was always 65-70 degrees Celsius. So it never got to the point that it was overheating, but the difference tells me that there was definitely something going on.
It's been up 1.5 days now without rebooting, so I'm optimistic.
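If anyone wants to track these temperatures over time instead of spot-checking them, here is a rough Python sketch. It assumes the readings come from FreeBSD's coretemp driver via sysctl, with lines shaped like "dev.cpu.0.temperature: 55.0C"; that output format is my assumption, so check it against what your box actually prints. The function names are made up, and the 100 degree limit is just the spec figure mentioned earlier in the thread.

```python
import re

# Assumed line shape from the coretemp sysctl: "dev.cpu.0.temperature: 55.0C"
TEMP_LINE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*([\d.]+)C\s*$")


def parse_core_temps(text):
    """Map core number to temperature in Celsius from captured sysctl output."""
    temps = {}
    for line in text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


def headroom(temps, limit_c=100.0):
    """Degrees left before each core reaches the limit (100 C per this thread)."""
    return {core: limit_c - temp for core, temp in temps.items()}
```

Feeding it saved output from before and after a board swap would make a difference like the one above easy to see at a glance.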
-
Four days and counting. I think it's solved.