Help with a crash dump

nik.taylor · Jan 5, 2019, 4:21 PM

Hi,

New to the forum, have been running pfSense for a number of years.

I've been getting crash dumps every couple of days recently. I did some reading on the forums and some analysis of the dumps, and noticed that the last message before it dumps appears to be about not being able to mount the disk:

Trying to mount root from ufs:/dev/gptid/6ab9347b-10fc-11e9-a953-00012e81b0a1 [rw]...
random: unblocking device.
<118>Configuring crash dumps...

Based on feedback in another thread, I thought it might be bad sectors on the SSD, so I replaced it this morning. Unfortunately, it crashed within about 2 hours of replacing the SSD (I restored the original config onto a fresh install from a newly downloaded image).

The crash dump is attached. Would appreciate it if someone could help point me to any other issues.

Thanks.

crash_dump.txt

stephenw10 · Jan 5, 2019, 5:39 PM

It's nothing to do with mounting the disk. That's just the last thing you see on a console when the primary console is the other one (serial vs. video). If you look at the console buffer output in the crash dump, you can see it gets past that point.

The key parts of the crash are:

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer	= 0x20:0xffffffff80d0b0b2
stack pointer	        = 0x28:0xfffffe010e6e67f0
frame pointer	        = 0x28:0xfffffe010e6e6810
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 12 (swi4: clock (0))
FreeBSD 11.2-RELEASE-p4 #2 b00c407ba5d(RELENG_2_4_4): Mon Nov 26 11:41:48 EST 2018
    root@buildbot2.nyi.netgate.com:/build/ce-crossbuild-244/obj/amd64/ZfGpH5cd/build/ce-crossbuild-244/pfSense/tmp/FreeBSD-src/sys/pfSense

and

db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100026 td 0xfffff8000396b620
pfslowtimo() at pfslowtimo+0x52/frame 0xfffffe010e6e6810
softclock_call_cc() at softclock_call_cc+0x13a/frame 0xfffffe010e6e68c0
softclock() at softclock+0x79/frame 0xfffffe010e6e68e0
intr_event_execute_handlers() at intr_event_execute_handlers+0xe9/frame 0xfffffe010e6e6920
ithread_loop() at ithread_loop+0xe7/frame 0xfffffe010e6e6970
fork_exit() at fork_exit+0x83/frame 0xfffffe010e6e69b0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe010e6e69b0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:0:kdb.enter.default>  ps

So some clock/timing related issue. It's not something I'm familiar with.
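
For anyone comparing several dumps, the header fields quoted above can be pulled out with a few lines of Python. This is only a sketch: it assumes the crash dump has been saved as plain text in the format shown in this thread, and `summarize_trap` is a made-up helper name, not part of pfSense or FreeBSD.

```python
import re


def summarize_trap(dump_text):
    """Return trap number, description, and current process from dump text."""
    summary = {}
    # e.g. "Fatal trap 9: general protection fault while in kernel mode"
    trap = re.search(r"^Fatal trap (\d+): (.+)$", dump_text, re.MULTILINE)
    if trap:
        summary["trap"] = int(trap.group(1))
        summary["description"] = trap.group(2).strip()
    # e.g. "current process		= 12 (swi4: clock (0))"
    proc = re.search(
        r"^current process\s*=\s*(\d+) \((.+)\)\s*$", dump_text, re.MULTILINE
    )
    if proc:
        summary["pid"] = int(proc.group(1))
        summary["process"] = proc.group(2)
    return summary


# Sample lines copied from the dump quoted in this post.
sample = (
    "Fatal trap 9: general protection fault while in kernel mode\n"
    "cpuid = 0; apic id = 00\n"
    "current process\t\t= 12 (swi4: clock (0))\n"
)
print(summarize_trap(sample))
```

Running it over each saved dump gives a quick view of whether the trap type and the faulting thread stay the same from crash to crash.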

Has it always done this or just started after an update perhaps? Some other change?

Steve

nik.taylor · Jan 5, 2019, 5:47 PM

Thanks for getting back so quickly.

It was running fine for about 4 or 5 months. It started having issues a few months ago. I can't pinpoint the exact date unfortunately. No changes to hardware apart from the new drive I just installed. I have been keeping up with releases. No new packages installed recently. I only have Cron, nut, openvpn-client-export installed.

nik.taylor · Jan 12, 2019, 2:50 PM

Is there any way I can dig into this further? It's happened twice since I posted this.

Thanks.

stephenw10 · Jan 12, 2019, 9:20 PM

What hardware are you using?

Are you running anything unusual in the config?

Attempting to replicate that in FreeBSD 11.2 is always a good step. That would show whether it's something pfSense is doing or something in base.

Steve

nik.taylor · Jan 12, 2019, 9:28 PM

Hardware:

ZOTAC C Series ZBOX CI327 NANO, Palm-Sized Passive Cooled Mini PC, Intel N3450 Quad-Core CPU, Intel HD Graphics 500, ZBOX-CI327NANO-U
G.SKILL Ripjaws Series 4GB 204-Pin DDR3 SO-DIMM DDR3 1866 (PC3 14900) Laptop Memory Model F3-1866C11S-4GRSL
Crucial BX500 120GB 3D NAND SATA 2.5-Inch Internal SSD - CT120BX500SSD1Z

Nothing unusual in config. Can send it over if needed. I only have Cron, nut, openvpn-client-export installed as add ins.

How do I replicate in FreeBSD other than installing on the hardware and letting it run for a few weeks?

stephenw10 · Jan 15, 2019, 10:19 PM

Hmm. Well I'd disable Nut as a test just because it's the only thing doing anything active.

Pretty sure there are others running that box without issue, so I'd guess it's either a config issue or some bad component, assuming you have not added anything like a wifi card etc.

nik.taylor · Jan 20, 2019, 8:08 PM

I'll disable nut and report back.

Nothing unusual in the config. Not sure how I could tell if there is?

No other components added and no additional cards / hardware.

nik.taylor · Jan 20, 2019, 8:08 PM

I disabled nut and it crashed again this week.

Anything else I can do?

Thanks.

chrismacmahon · Jan 20, 2019, 8:16 PM

Are the crashes all the same with this error:

db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100026 td 0xfffff8000396b620
pfslowtimo() at pfslowtimo+0x52/frame 0xfffffe010e6e6810
softclock_call_cc() at softclock_call_cc+0x13a/frame 0xfffffe010e6e68c0

Or is it different?
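
One way to check this across several saved dumps is to extract the function names from each `bt` output and compare the innermost frames. The sketch below assumes the backtrace lines look exactly like the ones pasted in this thread; `backtrace_functions` and `same_crash` are hypothetical helper names.

```python
import re

# Matches ddb backtrace lines such as:
#   pfslowtimo() at pfslowtimo+0x52/frame 0xfffffe010e6e6810
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+$")


def backtrace_functions(dump_text):
    """Return function names from the first backtrace, innermost first."""
    names = []
    for line in dump_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
        elif names:
            break  # the first contiguous run of frames has ended
    return names


def same_crash(dump_a, dump_b, depth=1):
    """True if the innermost `depth` frames of both backtraces match."""
    a = backtrace_functions(dump_a)[:depth]
    b = backtrace_functions(dump_b)[:depth]
    return bool(a) and a == b


# Sample text copied from the backtrace quoted above.
first = """db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100026 td 0xfffff8000396b620
pfslowtimo() at pfslowtimo+0x52/frame 0xfffffe010e6e6810
softclock_call_cc() at softclock_call_cc+0x13a/frame 0xfffffe010e6e68c0
"""
print(backtrace_functions(first))
```

Comparing only the innermost frame is strict. Two dumps that fault in different functions but share the same callers may still point at the same subsystem, so it is worth looking at the whole list as well.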

nik.taylor · Jan 26, 2019, 2:50 PM

This is the latest error I received:

db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100026 td 0xfffff8000397d620
ipport_tick() at ipport_tick+0x4e/frame 0xfffffe010e6e6810
softclock_call_cc() at softclock_call_cc+0x13a/frame 0xfffffe010e6e68c0
softclock() at softclock+0x79/frame 0xfffffe010e6e68e0

chrismacmahon · Jan 26, 2019, 4:25 PM

It looks like that's hardware. I would potentially look at changing the on-board battery to see if that helps, but I highly doubt it would.

Warden · Jan 26, 2019, 4:28 PM

Hi,

I'm experiencing the same kind of issue: my pfSense box has been crashing on a daily basis for a couple of months. I also replaced the SSD but got the same results.

Looking at the logs I see the same kind of details as discussed above:

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 06
fault virtual address	= 0xc46b3dd0
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80d89866
stack pointer	        = 0x28:0xfffffe01188d6688
frame pointer	        = 0x28:0xfffffe01188d6688
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 55216 (darkstat)
FreeBSD 11.2-RELEASE-p6 #3 518496b29ae(RELENG_2_4_4): Wed Dec 12 07:41:44 EST 2018
    root@buildbot2.nyi.netgate.com:/build/ce-crossbuild-244/obj/amd64/ZfGpH5cd/build/ce-crossbuild-244/pfSense/tmp/FreeBSD-src/sys/pfSense
Filename: /var/crash/textdump.tar.11
db:0:kdb.enter.default>  run lockinfo
db:1:lockinfo> show locks
No such command; use "help" to list available commands
db:1:lockinfo>  show alllocks
No such command; use "help" to list available commands
db:1:lockinfo>  show lockedvnods
Locked vnodes
db:0:kdb.enter.default>  show pcpu
cpuid        = 2
dynamic pcpu = 0xfffffe018f873480
curthread    = 0xfffff80003dd0000: pid 12 "irq259: re0"
curpcb       = 0xfffffe0118646cc0
fpcurthread  = none
idlethread   = 0xfffff80003939000: tid 100005 "idle: cpu2"
curpmap      = 0xffffffff82b83898
tssp         = 0xffffffff82bb47e0
commontssp   = 0xffffffff82bb47e0
rsp0         = 0xfffffe0118646cc0
gs32p        = 0xffffffff82bbb038
ldt          = 0xffffffff82bbb078
tss          = 0xffffffff82bbb068
db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100057 td 0xfffff80003dd0000
turnstile_broadcast() at turnstile_broadcast+0x47/frame 0xfffffe0118646050
__mtx_unlock_sleep() at __mtx_unlock_sleep+0xb9/frame 0xfffffe0118646080
pf_state_insert() at pf_state_insert+0xb33/frame 0xfffffe0118646110
pf_test_rule() at pf_test_rule+0x2c7c/frame 0xfffffe01186465a0
pf_test() at pf_test+0x20e9/frame 0xfffffe0118646800
pf_check_in() at pf_check_in+0x1d/frame 0xfffffe0118646820
pfil_run_hooks() at pfil_run_hooks+0x90/frame 0xfffffe01186468b0
ip_input() at ip_input+0x441/frame 0xfffffe0118646910
netisr_dispatch_src() at netisr_dispatch_src+0xa8/frame 0xfffffe0118646960
ether_demux() at ether_demux+0x173/frame 0xfffffe0118646990
ether_nh_input() at ether_nh_input+0x32b/frame 0xfffffe01186469f0
netisr_dispatch_src() at netisr_dispatch_src+0xa8/frame 0xfffffe0118646a40
ether_input() at ether_input+0x26/frame 0xfffffe0118646a60
re_rxeof() at re_rxeof+0x601/frame 0xfffffe0118646ad0
re_intr_msi() at re_intr_msi+0xfc/frame 0xfffffe0118646b20
intr_event_execute_handlers() at intr_event_execute_handlers+0xe9/frame 0xfffffe0118646b60
ithread_loop() at ithread_loop+0xe7/frame 0xfffffe0118646bb0
fork_exit() at fork_exit+0x83/frame 0xfffffe0118646bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0118646bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

I also attached the latest crash report to this post: https://forum.netgate.com/post/819822

Thanks

chrismacmahon · Jan 26, 2019, 4:31 PM

@warden Your crash is very different from the other one. Could you open a new thread?

nik.taylor · Mar 2, 2019, 3:48 PM

It's definitely nut. I uninstalled and there were no crashes. I re-installed and it's started crashing. Is this an integration issue you can look at or should I contact the nut team?

Thanks.

stephenw10 · Mar 2, 2019, 4:03 PM

Yet it still crashed with Nut installed but disabled previously?

Were you able to replicate that? It seems hard to imagine that could happen if it really was disabled.

If it's a problem with the nut binaries in FreeBSD, that would need to be reported upstream, but there must be a lot of people running that in FreeBSD.

Do you have any additional crash reports? Anything showing the NUT package specifically?

Steve

nik.taylor · Mar 2, 2019, 4:11 PM

I'm pretty sure it did. I'm going to disable it again and see if I get a crash dump with nut installed but disabled.

Here is the latest crash:

0_1551543086047_nut dump.txt

stephenw10 · Mar 2, 2019, 5:50 PM

Mmm, well, an identical crash then. That implies it's probably software, at least.

nik.taylor · Jun 20, 2020, 6:28 PM

Bumping this thread back up. I've continued to have this problem. I disabled nut for a few months and it didn't go away. I'm seeing crashes about once or twice a week still. Any next debugging steps?

Latest crash dump attached.

Thanks in advance.

crash_dump.txt

stephenw10 · Jun 20, 2020, 9:14 PM

Hmm, well that is three almost identical crashes:

hardclock_cnt() at hardclock_cnt+0x131/frame 0xfffffe010e4d44e0
handleevents() at handleevents+0xc9/frame 0xfffffe010e4d4530
timercb() at timercb+0xad/frame 0xfffffe010e4d4580
lapic_handle_timer() at lapic_handle_timer+0xa2/frame 0xfffffe010e4d45c0
Xtimerint() at Xtimerint+0xa8/frame 0xfffffe010e4d45c0

I've got to think it's some issue with the system clock being used on that system.

I see it's loading the SpeedStep driver (est). Is powerd enabled? You might try disabling it if so. It's been a while since I've seen one, but some systems have issues with varying the CPU clock that would throw errors.

You could usually work around that by selecting a non-variable system timer instead.
For example:

[2.5.0-DEVELOPMENT][admin@apu.stevew.lan]/root: sysctl kern.timecounter.choice
kern.timecounter.choice: ACPI-fast(900) HPET(950) i8254(0) TSC(800) dummy(-1000000)
[2.5.0-DEVELOPMENT][admin@apu.stevew.lan]/root: sysctl kern.timecounter.hardware
kern.timecounter.hardware: HPET
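
To try a different timer, the `kern.timecounter.hardware` sysctl can be set to one of the names listed in `kern.timecounter.choice`. As a rough sketch, the Python below parses the choice line and picks the highest-quality counter that is not excluded. Treating TSC as the variable timer to avoid is an assumption here, not something confirmed for this box, and `parse_timecounter_choice` and `pick_timecounter` are made-up helper names.

```python
import re


def parse_timecounter_choice(sysctl_line):
    """Map each timecounter name to its quality from the sysctl output line."""
    # Entries look like "HPET(950)"; quality may be negative, e.g. "dummy(-1000000)".
    pairs = re.findall(r"(\S+?)\((-?\d+)\)", sysctl_line)
    return {name: int(quality) for name, quality in pairs}


def pick_timecounter(choices, exclude=("TSC",)):
    """Return the highest-quality counter whose name does not start with an
    excluded prefix, or None. Excluding TSC by default is an assumption."""
    candidates = {
        name: quality
        for name, quality in choices.items()
        if quality >= 0 and not name.startswith(tuple(exclude))
    }
    return max(candidates, key=candidates.get) if candidates else None


# Output line copied from the example above.
line = (
    "kern.timecounter.choice: ACPI-fast(900) HPET(950) i8254(0) "
    "TSC(800) dummy(-1000000)"
)
name = pick_timecounter(parse_timecounter_choice(line))
# Print the command to run on the firewall; nothing is changed by this script.
print(f"sysctl kern.timecounter.hardware={name}")
```

A value set with `sysctl` at the shell does not survive a reboot. On pfSense it would normally be added under System > Advanced > System Tunables to make it persistent, though it is worth testing the change from the shell first.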

Steve