Some help interpreting the crash files?

mi8088

Hi all,

I've installed 2.5.0 snapshots to test on four ALIX devices, and now they all keep crashing and rebooting frequently. On one device, it has happened 14 times since yesterday afternoon.

I may have unsupported hardware, or it may be because I didn't remove extra packages before upgrading, but I can't interpret what the crash dump is trying to tell me. Can someone give me a pointer, please? If necessary, I will try a fresh reinstall.

I've uploaded one set. The dumps go up to .13, but the forum doesn't like extensions higher than .0, and the rest seem to be more of the same.

textdump.tar.0 info.0

This may be the relevant part:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address	= 0x70
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80ee68d7
stack pointer	        = 0x28:0xfffffe002d22c320
frame pointer	        = 0x28:0xfffffe002d22c360
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 40056 (unbound-anchor)

jimp

@mi8088 said in Some help interpreting the crash files?:

I've installed 2.5.0 snapshots to test on four ALIX devices, and now they all keep crashing and rebooting

I highly doubt that. ALIX devices are not capable of running 2.5.0. Perhaps you meant APU devices?

What you want from that tar file is primarily the backtrace from ddb.txt:

db:0:kdb.enter.default>  show pcpu
cpuid        = 0
dynamic pcpu = 0xb31c40
curthread    = 0xfffff80005359000: pid 40056 tid 100102 "unbound-anchor"
curpcb       = 0xfffffe002d22ccc0
fpcurthread  = 0xfffff80005359000: pid 40056 "unbound-anchor"
idlethread   = 0xfffff80004208000: tid 100003 "idle: cpu0"
curpmap      = 0xfffff80005a8c130
tssp         = 0xffffffff82db3a20
commontssp   = 0xffffffff82db3a20
rsp0         = 0xfffffe002d22ccc0
gs32p        = 0xffffffff82dba658
ldt          = 0xffffffff82dba698
tss          = 0xffffffff82dba688
curvnet      = 0xfffff8000406d640
db:0:kdb.enter.default>  bt
Tracing pid 40056 tid 100102 td 0xfffff80005359000
in_broadcast() at in_broadcast+0x27/frame 0xfffffe002d22c360
pf_test() at pf_test+0x201b/frame 0xfffffe002d22c610
pf_check_out() at pf_check_out+0x1d/frame 0xfffffe002d22c630
pfil_run_hooks() at pfil_run_hooks+0xa1/frame 0xfffffe002d22c6d0
ip_output() at ip_output+0xc85/frame 0xfffffe002d22c810
udp_send() at udp_send+0xb6e/frame 0xfffffe002d22c8e0
sosend_dgram() at sosend_dgram+0x33b/frame 0xfffffe002d22c950
sosend() at sosend+0x50/frame 0xfffffe002d22c980
kern_sendit() at kern_sendit+0x19f/frame 0xfffffe002d22ca20
sendit() at sendit+0x19e/frame 0xfffffe002d22ca70
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe002d22cac0
amd64_syscall() at amd64_syscall+0x369/frame 0xfffffe002d22cbf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe002d22cbf0

That particular backtrace doesn't look familiar, though. Are all of these identical? It's odd that it's claiming to crash in unbound-anchor which manages trust anchors for DNSSEC, and the backtrace suggests that it's crashing while checking if an IP address is a broadcast address. That's a fairly simple operation, so I would tend to think it's actually a hardware operation failing there (e.g. memory/cpu/heat). Though why it would only happen on 2.5.0 and not before is less clear, unless it's a BIOS or other similar issue leading to instability.

As for the crash dumps, you can rename them to textdump.<n>.tar instead. They are just .tar files, but FreeBSD tacks the number onto the end when it writes the crash dumps since it's easier that way.
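
If there are many dumps, the rename can be done in bulk. The following is only a minimal Python sketch, assuming the files sit together in one directory and are named textdump.tar.<n>. The function name rename_textdumps and the directory name in the usage note are illustrative choices, not something pfSense or FreeBSD provides.

```python
import re
from pathlib import Path

# Matches names such as "textdump.tar.0" and captures the trailing dump number.
_DUMP_NAME = re.compile(r"^textdump\.tar\.(\d+)$")


def rename_textdumps(directory):
    """Rename textdump.tar.<n> to textdump.<n>.tar and return the new names."""
    renamed = []
    # sorted() builds the full listing first, so renaming does not disturb iteration.
    for path in sorted(Path(directory).iterdir()):
        match = _DUMP_NAME.match(path.name)
        if match is None or not path.is_file():
            continue  # leave info.<n> and anything else untouched
        target = path.with_name(f"textdump.{match.group(1)}.tar")
        if target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

Calling rename_textdumps("crashdumps") on a hypothetical folder of downloaded dumps would leave the info.<n> files alone and only touch the tar files.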

mi8088

I'd have to check to be sure. I've taken them over from another person, who always refers to them as ALIX, but they might be APU. I can't check right now as I'm not near them at the moment. I do get a GUI and some functionality, so they are in some sense capable of running, just not very well.

Some general info: I have packages frr on all four, acme on some (not used though), and blinkled packages. These were there before upgrading to 2.5.0.

Anyway, comparing the bit you refer to in all the files shows some differences, even though most are the same. I also find, instead of "unbound-anchor", "ntpd" (#7) and "dpinger" (#12) and "ospfd" (#13) in that part.

In dump #12, the corresponding part is

db:0:kdb.enter.default>  show pcpu
cpuid        = 0
dynamic pcpu = 0xb31c40
curthread    = 0xfffff8011a72d000: pid 15283 tid 101158 "dpinger"
curpcb       = 0xfffffe002d316cc0
fpcurthread  = 0xfffff8011a72d000: pid 15283 "dpinger"
idlethread   = 0xfffff80004208000: tid 100003 "idle: cpu0"
curpmap      = 0xfffff8002097c130
tssp         = 0xffffffff82db3a20
commontssp   = 0xffffffff82db3a20
rsp0         = 0xfffffe002d316cc0
gs32p        = 0xffffffff82dba658
ldt          = 0xffffffff82dba698
tss          = 0xffffffff82dba688
curvnet      = 0xfffff8000406d640
db:0:kdb.enter.default>  bt
Tracing pid 15283 tid 101158 td 0xfffff8011a72d000
??() at 0
ip_output() at ip_output+0x13f3/frame 0xfffffe002d316810
rip_output() at rip_output+0x2c3/frame 0xfffffe002d3168a0
sosend_generic() at sosend_generic+0x586/frame 0xfffffe002d316950
sosend() at sosend+0x50/frame 0xfffffe002d316980
kern_sendit() at kern_sendit+0x19f/frame 0xfffffe002d316a20
sendit() at sendit+0x19e/frame 0xfffffe002d316a70
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe002d316ac0
amd64_syscall() at amd64_syscall+0x369/frame 0xfffffe002d316bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe002d316bf0
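
Comparing the same section across many dumps by hand gets tedious, so here is a rough Python sketch of how it could be automated. It assumes each tar file contains a member named ddb.txt laid out like the excerpts above (a curthread line with the process name in quotes, then a "Tracing pid" line followed by the frames). The function names read_ddb and summarize_ddb are made up for this example.

```python
import re
import tarfile

# A frame looks like "pf_test() at pf_test+0x201b/frame 0x..." or "??() at 0".
_FRAME = re.compile(r"^(\S+)\(\) at ")
# The process name is the quoted string at the end of the curthread line.
_CURTHREAD = re.compile(r'^curthread\s+=.*"([^"]+)"')


def read_ddb(tar_path):
    """Return the text of the first member named ddb.txt in a textdump tar."""
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if member.name.split("/")[-1] == "ddb.txt":
                handle = tar.extractfile(member)
                if handle is not None:
                    return handle.read().decode("utf-8", errors="replace")
    raise FileNotFoundError(f"no ddb.txt in {tar_path}")


def summarize_ddb(text):
    """Return (process name, function names of the first backtrace)."""
    process = None
    frames = []
    in_trace = False
    for raw in text.splitlines():
        line = raw.strip()
        if in_trace:
            match = _FRAME.match(line)
            if match is None:
                break  # the first non-frame line ends the backtrace
            frames.append(match.group(1))
            continue
        match = _CURTHREAD.match(line)
        if match and process is None:
            process = match.group(1)
        elif line.startswith("Tracing pid"):
            in_trace = True
    return process, frames
```

Running summarize_ddb(read_ddb(path)) over every dump and printing the results side by side makes it quick to see which process was current and which function sits at the top of each trace.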

It's not impossible that the hardware is the problem, but all of the devices ran for several years without problems, up to the last production version (2.4.4-p3). We then replaced them with new gear and now use them as test boxes. It would seem weird if they all developed a hardware fault simultaneously.

I've made a new tar file with the complete contents of the 14 dumps.

textdumps.tgz

They're turned off now, and tomorrow I'll try to reinstall one with 2.5.0 direct, or an older version.

jimp

If it's that random, it would have to be hardware/driver related. It's stable on my APU (first generation), but maybe if those are APU2 or some later revision it might be related to the network drivers.

mi8088

Well, I've reinstalled all from memstick, no packages, basic config (basically the wizard) and they have been running without problems for 16 hours. Either the newer snapshot solved it, it's a package causing errors, or the upgrade process went wrong and caused errors.

They're APU, not APU2, by the way. Should be these, 4 GB RAM: https://pcengines.ch/apu1d4.htm

Going to go do the upgrade to the latest snapshot on two of these, see what happens then..

mi8088

[Edited with new info]

Well, that didn't quite work the first time round.

The update on the first device was stuck for about 2 hours.

In the system log, I found loads of these lines (on all four devices):

Oct 3 09:04:59 	check_reload_status 	373 	Reloading filter
Oct 3 09:05:00 	php-fpm 	24759 	/rc.newwanipv6: rc.newwanipv6: Info: starting on re1.
Oct 3 09:05:00 	php-fpm 	24759 	/rc.newwanipv6: rc.newwanipv6: on (IP address: 2001:1680:104:1:1::580b) (interface: wan) (real interface: re1).
Oct 3 09:05:03 	php-fpm 	24759 	/rc.newwanipv6: Removing static route for monitor fe80::290:bff:fea2:b929 and adding a new route through fe80::290:bff:fea2:b929%re1
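
To put a number on "loads of these lines", a small Python sketch like the one below could tally how often each rc script shows up in an exported copy of the system log. It assumes the log text looks like the lines above, with the script path (for example /rc.newwanipv6) followed by a colon. The function name count_rc_scripts is just a label for this example.

```python
import re
from collections import Counter

# Captures a script path such as "/rc.newwanipv6" when it is followed by a colon.
_SCRIPT = re.compile(r"(/rc\.[\w.]+):")


def count_rc_scripts(log_text):
    """Count log lines per rc script, using the first script path on each line."""
    counts = Counter()
    for line in log_text.splitlines():
        match = _SCRIPT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Comparing the counts before and after changing a setting gives a rough idea of whether the log noise has actually stopped.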

I went to the WAN interface and changed the IPv6 configuration type from DHCP6 to None. After that, the log doesn't have the lines above, and the firewall GUI seems more responsive - including actually running the update as expected. The interface which these APUs connect to on WAN does have DHCP6 activated.

I'm not promising that the DHCP6 config is perfect on the external pfSense box which serves as the DHCP server for the APUs, but the APUs basically had the default settings, so something must be off somewhere.

I've updated two devices; let's see how they run now. It seems there's no hardware fault, though.

mi8088

All right, I'm now sure of when the crashes are provoked, I just have no idea what is causing them.

I have the following version installed:

2.5.0-DEVELOPMENT (amd64)
built on Mon Oct 14 00:22:51 EDT 2019
FreeBSD 12.0-RELEASE-p10

Furthermore, I have the FRR package installed, version 0.6.3_1. Each of the four test firewalls is configured to connect via IPsec to two other units, in a "circle" configuration. On top of IPsec, they are configured with Phase 2 VTI and OSPF routing.

The important setting is "IPv6 Configuration Type" for the WAN interface. If this is set to DHCP6, as it was by default, the firewalls crash regularly. If it is set to "None", there are no crashes (or at least they are so infrequent that I haven't seen them yet). Also, as described above, DHCP6 causes a lot of log entries and blocks updates.

Crash log attached:
fw3_20191014.zip

It's possible that the IPv6 config on the upstream pfSense box acting as the WAN gateway and DHCP server is not ideal, but in any case, a misconfiguration there shouldn't cause crashes IMHO.

I can share config backups if needed, since this is a test system. I'm also fine with running more tests, but I don't know which ones would help.