Crashes on APU2

Cypher100

It seems to crash every two weeks to a month. I have no idea what I'm looking for in the error dump, and I can't work out what to even google to find an answer. I've tried reinstalling it and running a self-test on the SSD, but that hasn't fixed the issue in any way. If anyone has any ideas, please let me know.

0_1550212997635_textdump.tar.0

Cypher100

Current temperatures are 54 °C to 60 °C. It seems to crash when I'm not actively using the internet; it has never crashed on me during usage. The device is only a couple of months old. I don't think it was crashing on earlier versions of pfSense.

stephenw10

The key parts of that are:

db:0:kdb.enter.default>  show pcpu
cpuid        = 2
dynamic pcpu = 0xfffffe0197692480
curthread    = 0xfffff801034d9620: pid 4609 "sh"
curpcb       = 0xfffffe012089fb80
fpcurthread  = 0xfffff801034d9620: pid 4609 "sh"
idlethread   = 0xfffff80003975000: tid 100005 "idle: cpu2"
curpmap      = 0xfffff8007b66f138
tssp         = 0xffffffff82bb47e0
commontssp   = 0xffffffff82bb47e0
rsp0         = 0xfffffe012089fb80
gs32p        = 0xffffffff82bbb038
ldt          = 0xffffffff82bbb078
tss          = 0xffffffff82bbb068
db:0:kdb.enter.default>  bt
Tracing pid 4609 tid 100201 td 0xfffff801034d9620
pmap_remove_pages() at pmap_remove_pages+0x2d7/frame 0xfffffe012089f450
exec_new_vmspace() at exec_new_vmspace+0x1b5/frame 0xfffffe012089f4c0
exec_elf64_imgact() at exec_elf64_imgact+0x931/frame 0xfffffe012089f5b0
kern_execve() at kern_execve+0x77c/frame 0xfffffe012089f900
sys_execve() at sys_execve+0x4a/frame 0xfffffe012089f980
amd64_syscall() at amd64_syscall+0xa38/frame 0xfffffe012089fab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe012089fab0
--- syscall (59, FreeBSD ELF64, sys_execve), rip = 0x800b4664a, rsp = 0x7fffffffe218, rbp = 0x7fffffffe360 ---
db:0:kdb.enter.default>  ps

and

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address	= 0xfffff83df000e028
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff81181117
stack pointer	        = 0x28:0xfffffe012089f380
frame pointer	        = 0x28:0xfffffe012089f450
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 4609 (sh)

Unfortunately nothing super conclusive there but it does look similar to this:
https://forum.netgate.com/topic/106192/regular-crash-reports-on-my-apu2-2-3-2

I would boot memtest86+ and run that for a few loops to be sure if you can.

Steve

dugeem

@cypher100 Apart from running memtest on your APU2, you could also consider upgrading the BIOS to v4.0.24, which enables ECC on APU2 variants with 4GB RAM (e.g. APU2C4). FreeBSD supports ECC and can report errors via MCA, although ECC on the APU2 is relatively recent and so is unproven. Of course I'm not suggesting you continue using marginal HW, but it may add another data point.

It might also be worth checking the power supply. Specifically, since "It seems to crash when I'm not actively using the internet", it could well be a PS issue.

stephenw10

It would be interesting to compare it to other reports if it crashes regularly. If they are all the same that's usually a pretty big clue.
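A rough sketch of how that comparison could be automated, assuming each textdump has been unpacked and the `bt` output from its ddb section read into a string. The helper names (`backtrace_functions`, `same_crash_path`) and the choice of comparing only the top three frames are illustrative assumptions, not part of any pfSense or FreeBSD tooling. The sample strings are shortened excerpts of the two backtraces posted in this thread.

```python
import re

# Matches ddb backtrace frames such as:
#   pmap_remove_pages() at pmap_remove_pages+0x2d7/frame 0xfffffe012089f450
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+$")


def backtrace_functions(ddb_text):
    """Return the function names of all backtrace frames, innermost first."""
    names = []
    for line in ddb_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def same_crash_path(dump_a, dump_b, depth=3):
    """True if both dumps share the same innermost `depth` frames.

    `depth` is an arbitrary, hypothetical cut-off; returns False when
    either dump contains no recognisable frames.
    """
    frames_a = backtrace_functions(dump_a)[:depth]
    frames_b = backtrace_functions(dump_b)[:depth]
    return bool(frames_a) and frames_a == frames_b


# Shortened excerpts of the two backtraces from this thread.
CRASH_A = """\
Tracing pid 4609 tid 100201 td 0xfffff801034d9620
pmap_remove_pages() at pmap_remove_pages+0x2d7/frame 0xfffffe012089f450
exec_new_vmspace() at exec_new_vmspace+0x1b5/frame 0xfffffe012089f4c0
exec_elf64_imgact() at exec_elf64_imgact+0x931/frame 0xfffffe012089f5b0
"""

CRASH_B = """\
Tracing pid 0 tid 100250 td 0xfffff8001d5f0000
lz4_compress() at lz4_compress+0x761/frame 0xfffffe01205358d0
zio_compress_data() at zio_compress_data+0x8c/frame 0xfffffe0120535910
zio_write_compress() at zio_write_compress+0x21f/frame 0xfffffe0120535990
"""

if __name__ == "__main__":
    print(backtrace_functions(CRASH_A))
    print(same_crash_path(CRASH_A, CRASH_B))
```

Crashes that keep landing in the same code path tend to point at one driver or subsystem, while crashes scattered across unrelated paths are more typical of failing hardware.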

Steve

Cypher100

I changed some options around, and the crashes continue. Memtest didn't show anything wrong with the memory. I turned off PowerD to make sure it wasn't a downclocking issue, and it crashed sooner after I turned that off.

I have a universal laptop charger, so I'll test out the PSU theory and report back here.

Cypher100

I will also give v4.0.24 a shot, and update here if any crashes occur.

Cypher100

Today it crashed again. I installed the latest BIOS with ECC, and used a third-party adapter that matches the requirements for the APU2. I also reinstalled pfSense after doing all of the above. I have attached the error log. I'm out of ideas on what could be causing this.

0_1551922109339_textdump.tar.0

Cypher100

I'm updating to v4.9.0.2 to see if that solves the issue.

stephenw10

Hmm, very different crash:

db:0:kdb.enter.default>  bt
Tracing pid 0 tid 100250 td 0xfffff8001d5f0000
lz4_compress() at lz4_compress+0x761/frame 0xfffffe01205358d0
zio_compress_data() at zio_compress_data+0x8c/frame 0xfffffe0120535910
zio_write_compress() at zio_write_compress+0x21f/frame 0xfffffe0120535990
zio_execute() at zio_execute+0xac/frame 0xfffffe01205359e0
taskqueue_run_locked() at taskqueue_run_locked+0x154/frame 0xfffffe0120535a40
taskqueue_thread_loop() at taskqueue_thread_loop+0x98/frame 0xfffffe0120535a70
fork_exit() at fork_exit+0x83/frame 0xfffffe0120535ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0120535ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:0:kdb.enter.default>  ps

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address	= 0x1
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff8300ab51
stack pointer	        = 0x28:0xfffffe0120535860
frame pointer	        = 0x28:0xfffffe01205358d0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 0 (zio_write_issue_2)

With different crashes like that, it looks more like a hardware issue.

Steve

edz

I seem to be having some instability issues with my APU2C. It was running OK for over a week. This morning the orange lights on each NIC were not flashing and all connected clients were receiving a self-assigned IP address.

The only way to resolve this was to reboot. I had a look through the logs but couldn't find anything. My Grafana dashboard shows that something odd started to occur around midnight:

Screen Shot 2019-11-13 at 06.39.40.png

CPU temperature on average is about 53 degrees Celsius and it is running the latest BIOS v4.10.0.2