Repeatable crash under load: Soekris 4801 with 1.2-RELEASE
-
After a cool 126 days uptime with 1.2-BETA-2, suddenly I have a repeatable crash under heavy load, after updating some client boxes to support faster SSH transfers. Updating to 1.2-RELEASE doesn't fix this, nor does upping the 12 volt power supply from 1 amp to 5 amps max.
The symptom is that using SCP to copy a file of several hundred megabytes between two FreeBSD 7.0-RELEASE-p1 boxes on different network interfaces causes the firewall to spontaneously reboot after a couple of minutes of copying. Same thing happens every time (but never in quite the same place).
During the SCP process, I'm seeing 3.2 MB/second, so the poor little Soekris box is shuffling roughly 51 megabits per second if you count both network interfaces - so it was doing really well right up until it crashed!
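The conversion from the observed SCP rate to traffic through the firewall can be sketched as below. This is only a rough check: it assumes 1 MB means 10^6 bytes and ignores SSH, TCP/IP and Ethernet overhead, and the function name is made up for illustration.

```python
# Rough conversion from a file-copy rate to traffic through a routing box.
# Assumptions: 1 MB = 10**6 bytes; protocol overhead is ignored.

def routed_megabits_per_second(copy_rate_mb_s: float, interfaces: int = 2) -> float:
    """Return aggregate Mbit/s summed over the interfaces the traffic crosses."""
    bits_per_byte = 8
    return copy_rate_mb_s * bits_per_byte * interfaces

# 3.2 MB/second is the SCP rate reported above.
per_interface = routed_megabits_per_second(3.2, interfaces=1)
both_interfaces = routed_megabits_per_second(3.2, interfaces=2)
print(f"{per_interface:.1f} Mbit/s per interface, {both_interfaces:.1f} Mbit/s counting both")
```

Each packet is received on one interface and sent on the other, which is why the per-interface figure is doubled when counting both.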
This is a severe test, but the pfSense box really shouldn't reboot under load. Yes, it's going to be the bottleneck, given the low power hardware (under five watts), but it shouldn't reboot.
http://www.freebsd.org/releases/6.2R/errata.html says: "[20070116, update 20070212] Systems with very heavy network activity have been observed to have some problems with the kernel memory allocator. Symptoms are processes that get stuck in zonelimit state, or system livelocks." I wonder whether this is hanging the box, and then the Soekris hardware watchdog is kicking in and causing a reboot. Does pfSense enable the Soekris watchdog hardware? I'm logging the serial port output, but it's not giving any kernel panic messages or anything.
Since we have a nice repeatable fault, it would be nice to try a newer embedded pfSense image. Happy to give a strict non-disclosure undertaking if necessary… I'm hoping the underlying fault was fixed for FreeBSD 6.3 or 7.0.
All the best,
- Martin.
-
The link to the 6.3 based testing release was posted in a couple of other threads. Some people have reported success running it with hardware that is newly supported in 6.3 and it is generally considered stable.
http://cvs.pfsense.org/~sullrich/testing_images/6/FreeBSD_RELENG_6_3/pfSense_RELENG_1_2/
-
Thanks for the fast reply! I'm downloading that image now. Will test and post results back here ASAP.
Cheers,
- Martin.
-
Sadly the crash still happens with the FreeBSD 6.3-RELEASE-p1 test build of pfSense 1.2-RELEASE at http://cvs.pfsense.org/~sullrich/testing_images/6/FreeBSD_RELENG_6_3/pfSense_RELENG_1_2/pfSense.img.gz .
I also tried 'watchdog -t 0' in case the kernel watchdog was firing: nope…
I'd like to re-test under FreeBSD 7.0, but http://cvs.pfsense.org/~sullrich/testing_images/7/RELENG_1_2/ doesn't have a pfSense.img.gz file yet. I would need that embedded image file for the CF card on my Soekris 4801 box. Is that possible?
Many thanks,
- Martin.
-
Just retested with http://snapshots.pfsense.org/FreeBSD7/RELENG_1/pfSense-20080712-2347.img.gz . In limited testing, this Alpha-Alpha image appeared to work OK (except that it did not manage to import my OpenVPN certificates etc from the exported 1.2 config). But the same problem was there, i.e. the Soekris NET4801 rebooted when copying a huge file between two network interfaces at full speed.
The same test with pfSense 1.2 on a NET5501 Soekris box works fine, returning 100 megabits per second transfer speed.
I still don't know if this is down to a bug in the 'sis' NIC driver, or whether it's a more general problem with sustained high network traffic maxing out the CPU.
I guess it's time to dump the NET4801, and use the NET5501 instead.
- Martin.
-
Did you try enabling device polling? On my Soekris net4501 running m0n0, I noticed that after I enabled device polling the CPU usage dropped from 99% to about 1% when maxing out the link it's on.
-
Thanks for the thought - nice idea, will try when I can. Got a NET5501 in there at the moment though.
-
You sure your AC adapter is functioning properly? The most common issue with Soekris and PC Engines hardware spontaneously rebooting is a flaky or weak power adapter. Alternatively it might be a flaky board that gets unstable under load.
I've seriously punished 4801s for hours straight running them all out doing various performance testing and never rebooted one. Including NAT, routing, bridging, IPsec, you name it… even under high IPsec load which really stresses such a slow proc it stays perfectly stable.
-
Thanks for the thought. Already changed the power adapter, and got the 5501 working with it.
I'm thinking the 4801 board or the NIC card must be faulty. I'll try re-cycling it as a backup server. Annoying though, I should have stress-tested the 4801 properly while it was still under warranty.
Cheers,
- Martin
-
It might be overheating as well. You didn't mention having a hard drive in the case; that's the only time I've seen 4801s have heat-related issues. Occasionally adding a hard drive will raise the temperature enough to make them unstable under high load, as the CPU generates more heat then as well.
If you've eliminated the power supply, and overheating isn't an issue, it's definitely a flaky board.