Getting crash dump data - directions in docs not working

stilez

I just hit a Kernel Trap 12 crash running 2.4.1. I'm trying to follow the directions on the docs page and I'm not seeing what that page describes. The page could be out of date, so this post is to ask what info I should be able to get.

The system is unmodified pfSense 2.4.1, clean installed at 2.3.x a few months ago and upgraded to "stable" since then. The hard drive has a 28GB swap partition "label/swap0" which I think is ada3s1b (described by gpart show as "freebsd-swap") but I'm not sure how to confirm it. The main partition has plenty of space.

I've been trying to trace an intermittent data issue on the LAN. As part of this, I've been hammering data traffic to try and get something to break. Last night I was copying data through it, maxing out 2 ports on a normal 1G bridge (4-port 1G Intel PCIe card). I added the first port of a 10G Chelsio card (latest firmware) to the bridge and again stress tested it with a high level of traffic across the bridge (the data arrived on the 1G and left on the 10G so no buffering issues expected and not much stress on the 10G side). About an hour into this test, the router crashed, displaying the "Trap 12" message and some debug info (pointers etc). I have a photo for reference of the debug message, but I can't post it until tomorrow.

The docs page says that "After a panic/crash leading to a reboot, a box will appear on the dashboard to view and automatically submit a crash report". But after rebooting, I didn't see such a box.

I didn't see a debugger prompt ("db>") at any stage, possibly because I didn't realise it had crashed for a few minutes, and it rebooted into "normal" mode without issues.

I've tried a few things from looking online. savecore -Cv says "magic mismatch on last dump header on /dev/label/swap0", and that no dump file was found. "dumpon -l" gives "label/swap0" and swapinfo gives "/dev/label/swap0" and says it's got 28GB (58M blocks) free. That's all I know.

It is probably some hardware issue, given I was stress-testing the networking for reliability at the time. But it's very hard to reproduce (I had to hammer the networking for about 80TB of data taking a few days to get it to happen just once on any NIC, and so far it hasn't happened since recommencing after reboot), so if possible I'd like to get whatever debug info I can get, to try and track down whatever component was (slightly) under-par for reliability. Is there a way?

If there's no way to recover any crash info, what do I have to do, to ensure that another time I can get good debug info?

jimp

If you have swap space configured and active (check "swapinfo") then it should be setting up the scripts to collect crash reports.

Do you have anything in /var/crash?

What do you show in "sysctl debug.ddb debug.kdb kern.shutdown.dumpdevname" ?

If it hits a panic that triggers the textdump script, it should dump the output to the swap slice. On the next boot it picks that up and copies it to /var/crash before setting up swap for normal use.

https://github.com/pfsense/pfsense/blob/master/src/etc/pfSense-rc#L59
https://github.com/pfsense/pfsense/blob/master/src/etc/rc.dumpon
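
A quick way to answer the /var/crash question is to list everything in that directory except the "minfree" file, which savecore reads as its free-space setting and is not a dump. This is only a minimal sketch: the function name crash_files and the default path are illustrative, not part of pfSense.

```sh
#!/bin/sh
# Print the names of files in a crash directory that could be saved dumps,
# i.e. everything except savecore's "minfree" settings file.
# crash_files is a hypothetical helper name, not a pfSense command.
crash_files() {
    dir="${1:-/var/crash}"
    [ -d "$dir" ] || return 0            # no directory: nothing to list
    for f in "$dir"/*; do
        [ -e "$f" ] || continue          # empty directory: the glob did not match
        case "${f##*/}" in
            minfree) ;;                  # free-space setting, not a dump
            *) printf '%s\n' "${f##*/}" ;;
        esac
    done
}
```

Running "crash_files /var/crash" after a reboot and getting no output would mean no report was copied out of swap.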

stilez

Thanks Jim. Output from these is below; I can't see any signs of a saved crash report or dump either at the time or now.

# swapinfo
Device              1K-blocks     Used    Avail Capacity
/dev/label/swap0     29307832        0 29307832     0%

# ls -ltR /var/crash
total 4
-rw-r--r-- 1 root wheel 5 Oct 22 23:31 minfree

# sysctl debug.ddb debug.kdb kern.shutdown.dumpdevname
debug.ddb.textdump.do_version: 1
debug.ddb.textdump.do_panic: 1
debug.ddb.textdump.do_msgbuf: 1
debug.ddb.textdump.do_ddb: 1
debug.ddb.textdump.do_config: 1
debug.ddb.textdump.pending: 0
debug.ddb.scripting.unscript:
debug.ddb.scripting.scripts: lockinfo=show locks; show alllocks; show lockedvnods
kdb.enter.default=textdump set; capture on; run lockinfo; show pcpu; bt; ps; alltrace; capture off; textdump dump; reset
kdb.enter.witness=run lockinfo

debug.ddb.capture.data:
debug.ddb.capture.bufsize: 49152
debug.ddb.capture.inprogress: 0
debug.ddb.capture.maxbufsize: 5242880
debug.ddb.capture.bufoff: 0
debug.kdb.alt_break_to_debugger: 0
debug.kdb.break_to_debugger: 0
debug.kdb.trap_code: 0
debug.kdb.trap: 0
debug.kdb.panic: 0
debug.kdb.enter: 0
debug.kdb.current: ddb
debug.kdb.available: ddb
kern.shutdown.dumpdevname: label/swap0

jimp

Hmm, that all appears to be in line. I'm going by memory at the moment, since I don't have a box up to compare.

The disk and/or swap are not encrypted, are they?

If all else fails, set up a serial console, set that as primary, and then record the crash dump that way.

stilez

@Jimp - no, they aren't encrypted. But I don't seem to have any crash dump either. Are there directions for setting up a serial console, and the equipment needed? And is there a way to trace why I didn't get a crash dump the usual way (or why I don't, next time it happens)?

I should say this is rare. Once in a month or 2. So I can't say when it will next happen or what prompts it (which is why I wanted the crash dump in the first place of course).

How do I manually trigger a kernel trap issue of this kind, so I can test further whether crash dumps would be created and if not, why not?

jimp

We don't have any docs about setting up a serial console but there isn't much to it. Your hardware has to have a physical (not USB) serial port built into it. Then just go to System > Advanced, Admin tab and enable the serial console there and set it to be the primary console. Hook up a client with a null modem serial cable and use PuTTY or something similar to watch/record the console output.

Without seeing what's in the report, I can't say why it wouldn't be saved. It's possible, perhaps, that the OS loses contact with the disk, which leads to the panic. That would explain both the crash and the lack of a crash dump, but that is pure speculation until we get some hint of detail. You could maybe disable the ddb scripts (run "ddb scripts" and then "ddb unscript <name>" for every script). Then when it crashes it should land you at a "db>" prompt so you can manually run and capture a backtrace.
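
Removing the scripts one by one can be wrapped in a small loop. The sketch below assumes "ddb scripts" prints one "name=commands" entry per line, the same shape as the debug.ddb.scripting.scripts value shown earlier in the thread; the helper names ddb_script_names and unscript_all are made up for illustration.

```sh
#!/bin/sh
# Print the script names from "ddb scripts"-style output on stdin.
# Assumption: each entry is a single "name=commands" line.
ddb_script_names() {
    sed -n 's/^\([^=][^=]*\)=.*/\1/p'
}

# Remove every defined ddb script.
# Only meaningful on the FreeBSD/pfSense box itself, run as root.
unscript_all() {
    ddb scripts | ddb_script_names | while read -r name; do
        ddb unscript "$name"
    done
}
```

Calling "unscript_all" and then re-running "ddb scripts" should show an empty list if the assumption about the output format holds.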

To manually force a panic/crashdump/reboot, run this: sysctl debug.kdb.panic=1

Do NOT set that as a tunable (or you'll put yourself in a panic loop :-), just run it from an ssh shell prompt.
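
Before forcing a test panic it may be worth confirming that a dump device is set and swap is active, otherwise the test tells you nothing. A rough sketch, with the function name dump_ready being hypothetical and the real commands left as comments because they reboot the box:

```sh
#!/bin/sh
# Succeed only if a dump device name is set and at least one swap device is active.
# $1: value of kern.shutdown.dumpdevname
# $2: number of active swap devices
# dump_ready is an illustrative helper, not a pfSense or FreeBSD command.
dump_ready() {
    [ -n "$1" ] && [ "${2:-0}" -gt 0 ]
}

# On the firewall itself, from an SSH shell, something like:
#   dev=$(sysctl -n kern.shutdown.dumpdevname)
#   n=$(swapinfo | awk 'NR > 1 { c++ } END { print c + 0 }')
#   dump_ready "$dev" "$n" && sysctl debug.kdb.panic=1   # this panics and reboots the box
# After it comes back up, check /var/crash for the saved report.
```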