Dell R310's with 2.0-RC1 consistent kernel panic over time (only on one)

utternerd

I've got a pair of Dell R310's, with the following specifications:

1 x X3430 Xeon
4GB
SAS 6iR
1 x Broadcom 5709 Dual Port 1GbE w/TOE
2 x Broadcom 5716 1GbE (integrated)

The problem is that edge2 kernel panics after ~2 days of working perfectly fine. Both machines are identical: their service tags are one apart and their manufacture dates are the same. The "primary" CARP machine is called edge1, the "backup" is edge2.

The panic is consistent:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 06
fault virtual address = 0x5ec
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff802145a6
stack pointer = 0x28:0xffffff803c87c820
frame pointer = 0x28:0xffffff803c87c840
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 35805 (php)

jimp

Others have had similar problems; it seems to be an issue with the Broadcom NICs that Dell is putting in some of these boxes.

For giggles, try putting this into /boot/loader.conf.local (create the file if it doesn't exist):

kern.ipc.nmbclusters="98304"
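
For a sense of scale, here is a rough sketch of what that limit means in memory terms. It assumes the standard 2 KiB cluster size (MCLBYTES = 2048); the function name is made up for illustration, and the result is only a ceiling for the standard cluster zone, not memory that gets reserved up front.

```python
MCLBYTES = 2048  # assumption: standard FreeBSD mbuf cluster size, in bytes


def cluster_ceiling_mib(nmbclusters: int, cluster_bytes: int = MCLBYTES) -> float:
    """Upper bound, in MiB, on memory the standard cluster zone can hold."""
    return nmbclusters * cluster_bytes / (1024 * 1024)


print(cluster_ceiling_mib(98304))  # 192.0
```

On a 4GB box that ceiling is small enough that raising the limit is a low-risk experiment.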

utternerd

I'll give that a shot – exhausting mbufs? Great book by the way. :)

utternerd

Unfortunately, after ~7 days of uptime, it panicked once again, this time on ifconfig and at a different stop point. Screenshot attached. Is it advisable to disable these integrated Broadcom NICs and purchase another decent PCI-X based Broadcom, or is it possible that the problem isn't the integrated Broadcoms but the dual-port PCI-X based ones?

Another symptom, and I'm not entirely sure whether it's abnormal: generally a few hours before failure I begin to get messages like this (usually no more than 3 or 4 alerts in total):

"A communications error occured while attempting XMLRPC sync with username admin https://192.168.169.2:443."

That leads me to believe it's the integrated cards, since that's what the pfsync interface runs on.

Any help is greatly appreciated…

[Attachment: edge2-panic-04282011.jpg]

jimp

So applying the setting change let it stay up longer?

Did you happen to check the mbuf usage at any time during the week it was up?

utternerd

It does appear to have lived longer; usually it'd die within ~3 days.

Currently:

24580/1152/25732 mbufs in use (current/cache/total)
16346/940/17286/98304 mbuf clusters in use (current/cache/total/max)
16344/424 mbuf+clusters out of packet secondary zone in use (current/cache)
0/14/14/49152 4k (page size) jumbo clusters in use (current/cache/total/max)
8160/263/8423/24576 9k jumbo clusters in use (current/cache/total/max)
0/0/0/12288 16k jumbo clusters in use (current/cache/total/max)
118425K/4879K/123304K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Over the course of the week, usage hovered around 22500.

The other machine, the CARP master, shows slightly lower mbuf usage.
This is with a single "load balancer" setup, and almost no traffic aside from the pfsync interface.

jwelter99

I've just seen this exact same issue.

Different hardware, but it occurred on the backup with the same error.

We are using Supermicro boxes with Intel NICs (4 ports).

John

utternerd

So I've set up cron jobs to write mbuf usage from netstat to a file every 60 seconds, along with a timestamp, so events can potentially be correlated. I've also lowered the mbuf limits substantially to try to induce a panic sooner.

Load is also being thrown at them via ab (ApacheBench). I'll follow up with the results.
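
For anyone wanting to do the same, a minimal sketch of that kind of logger follows. It assumes Python 3 is available on the box (pfSense does not ship it by default), and the function names are made up for illustration; the only external command used is `netstat -m`.

```python
import subprocess
from datetime import datetime


def format_snapshot(stamp: datetime, text: str) -> str:
    """Prefix one netstat -m capture with a timestamp line and end it with a blank line."""
    return f"{stamp.isoformat(sep=' ', timespec='seconds')}\n{text.rstrip()}\n\n"


def append_snapshot(log_path: str) -> None:
    """Run `netstat -m` once and append the timestamped output to log_path."""
    result = subprocess.run(
        ["netstat", "-m"], capture_output=True, text=True, check=True
    )
    with open(log_path, "a") as log:
        log.write(format_snapshot(datetime.now(), result.stdout))
```

Calling `append_snapshot()` with a log path of your choosing from a script that cron runs once a minute gives one timestamped block per minute, which makes it easy to line the last samples up against the time of a panic.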

-Matt

the.it.dude

I have the other thread going about the problems with my R410 and R510. They both have the integrated dual Broadcom NICs. The R410 has an additional 2-port Broadcom and the R510 has 2 additional quad-port Intel NICs. Both have hung randomly, either during bootup or during the install to the hard drive. Based on this thread, I went back and attempted the install again on the R510 with the Broadcom NICs disabled in the BIOS. I still experienced the same problems.

Jeff

utternerd

Sigh.

So after almost 7 days with mbufs back at their default setting, another kernel panic has ensued. This one is certainly different, so I've attached another screenshot.

First, the output of mbuf usage from netstat for the last 5 minutes before it died:

Wed May  4 23:59:00 CDT 2011
24579/9602/34181 mbufs in use (current/cache/total)
24506/1404/25910/65536 mbuf clusters in use (current/cache/total/max)
24504/712 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

Thu May  5 00:00:01 CDT 2011
24579/9602/34181 mbufs in use (current/cache/total)
24506/1404/25910/65536 mbuf clusters in use (current/cache/total/max)
24504/712 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

Thu May  5 00:01:00 CDT 2011
24579/9602/34181 mbufs in use (current/cache/total)
24505/1405/25910/65536 mbuf clusters in use (current/cache/total/max)
24504/712 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

Thu May  5 00:02:01 CDT 2011
24579/9602/34181 mbufs in use (current/cache/total)
24506/1404/25910/65536 mbuf clusters in use (current/cache/total/max)
24504/712 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

Thu May  5 00:03:01 CDT 2011
24579/9602/34181 mbufs in use (current/cache/total)
24506/1404/25910/65536 mbuf clusters in use (current/cache/total/max)
24504/712 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

Perhaps I'm overlooking something, but I don't see any exhaustion taking place. This really has me beating my head against the wall.
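
To check a whole log rather than eyeballing the last few entries, something like the sketch below would do. It is only an illustration: the function name and result keys are made up, and it assumes the log holds plain `netstat -m` output in the format shown above.

```python
import re

CLUSTERS = re.compile(r"^(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use", re.MULTILINE)
DENIED = re.compile(r"^(\d+)/(\d+)/(\d+) requests for mbufs denied", re.MULTILINE)


def summarize(log_text: str) -> dict:
    """Report peak cluster usage and whether any snapshot shows denied requests."""
    clusters = [tuple(map(int, m.groups())) for m in CLUSTERS.finditer(log_text)]
    if not clusters:
        raise ValueError("no 'mbuf clusters in use' lines found")
    # Pick the snapshot with the highest total (current + cache) cluster count.
    peak_total, limit = max((total, maximum) for _, _, total, maximum in clusters)
    # The denied counters are cumulative, so any non-zero value is a red flag.
    denied_seen = any(int(n) for m in DENIED.finditer(log_text) for n in m.groups())
    return {
        "samples": len(clusters),
        "peak_total": peak_total,
        "limit": limit,
        "peak_fraction": peak_total / limit,
        "denied_seen": denied_seen,
    }


sample = """\
Wed May  4 23:59:00 CDT 2011
24579/9602/34181 mbufs in use (current/cache/total)
24506/1404/25910/65536 mbuf clusters in use (current/cache/total/max)
24504/712 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

Thu May  5 00:01:00 CDT 2011
24579/9602/34181 mbufs in use (current/cache/total)
24505/1405/25910/65536 mbuf clusters in use (current/cache/total/max)
24504/712 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
"""

print(summarize(sample))
```

A peak well below the limit together with no denied requests, as in these samples, points away from mbuf exhaustion as the cause.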

My next steps are going to be:

1.) Run a system stresser/burn-in
2.) Use Dell's DSET to verify there's not some obscure problem with the hardware.
3.) Flip edge1 and edge2 tasks, so the crashing device becomes the master. I doubt this will have any effect.
4.) Run screaming into the night so I don't have to tell the SVP of IT that our new High Availability Network Devices… are well... highly available devices for ~6 days at a time. :-\

(P.S. edge1 STILL hasn't had a problem, and it's been 5 weeks solid. Same hardware, of course.)

Help! Any other ideas welcome..

Thanks,
-Matt

[Attachment: edge2-panic-05052011.jpg]

the.it.dude

Just for dumb, try switching keyboards between the 2 firewalls (unless you're on a KVM, then try switching cables/ports)

I believe I discovered the issue with my R410 was the keyboard (it doesn't like 100 mA USB keyboards, but does like the 1.5 A USB keyboard).

I'm going to test this theory on the R510 when I get back into work on Monday

utternerd

I'll give that a shot on the one that crashes; they're both on Avocent DSR2035s. I've put vanilla FreeBSD 8.1-RELEASE on them for testing, not that I suspect any of the additions by pfSense, but just in case. I'm running them over the weekend with ttcp maxing out all 4 interfaces non-stop, followed by stress2 plus ttcp for 7+ days, since the second machine has never made it to day 7 without a kernel panic.

I just don't understand how our other R310, built directly after, has been running pfSense for almost a month and a half without a single hiccup… Quite maddening.

cmb

I've seen that panic in pfi_dynaddr_update on a Dell server too, but on an 1850 with Intel NICs only, so it's not Broadcom-specific. It happened only on the secondary of a CARP pair, which seems to match what others here report. I had a back trace and accidentally deleted it… It seemed to be after a config sync, though that could be a coincidence: I've synced the config about 50 times to that box in the past half hour and it hasn't had any issues.

Opened a ticket:
http://redmine.pfsense.org/issues/1511

utternerd

@cmb:

Thanks for opening a ticket. Time permitting, I'll try to find 2 more unused Dells to run 2.0-RC1 on and pull a backtrace. This probably reinforces your belief: stress2 plus ttcp on vanilla 8.1-RELEASE haven't made either box hiccup, and their loads have been in the ~500s for 3 days straight, moving terabytes of data across all 4 NICs, chewing up every bit of physical and virtual memory and every CPU cycle, and of course stressing every subsystem. If it were hardware instability, or something in the FreeBSD kernel, I'd have expected it to have fallen over by now, generally speaking.

It looks like I'll be running 1.2.3 for now, which really is unfortunate for me because once this is implemented as our core, moving to 2.0 will be yet another challenge. Less of a challenge than moving from our POS Watchguards (who bought these, I don't know, but they deserve to be shot), but still more so than if I could just run 2.0 to begin with.

-Matt

cmb

With the most recent version, the secondary panics should be fixed.

utternerd

Just thought I'd follow up. This was fixed with the build referred to above. We are currently running 2.0-RC3 in multiple production environments without issue. The Dell R310's were great choices from a pricing and feature perspective, and work great with pfSense 2.0.

Thanks.