Some help interpreting the crash files?



  • Hi all,

    I've installed 2.5.0 snapshots to test on four ALIX devices, and now they all keep crashing and rebooting frequently; on one device, it has happened 14 times since yesterday afternoon.

    I may have unsupported hardware, or it may be because I didn't remove extra packages before upgrading, but I can't interpret what the crash dump is trying to tell me. Can someone give me a pointer, please? If necessary, I will try a fresh reinstall.

    I've uploaded one set. The dumps go up to .13, but the forum doesn't accept extensions higher than .0, and the rest seem to be more of the same.

    textdump.tar.0 info.0

    This may be the relevant part:

    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address	= 0x70
    fault code		= supervisor read data, page not present
    instruction pointer	= 0x20:0xffffffff80ee68d7
    stack pointer	        = 0x28:0xfffffe002d22c320
    frame pointer	        = 0x28:0xfffffe002d22c360
    code segment		= base 0x0, limit 0xfffff, type 0x1b
    			= DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags	= interrupt enabled, resume, IOPL = 0
    current process		= 40056 (unbound-anchor)
    

  • Rebel Alliance Developer Netgate

    @mi8088 said in Some help interpreting the crash files?:

    I've installed 2.5.0 snapshots to test on four ALIX devices, and now they all keep crashing and rebooting

    I highly doubt that. ALIX devices are not capable of running 2.5.0. Perhaps you meant APU devices?

    What you want from that tar file is primarily the backtrace from ddb.txt:

    db:0:kdb.enter.default>  show pcpu
    cpuid        = 0
    dynamic pcpu = 0xb31c40
    curthread    = 0xfffff80005359000: pid 40056 tid 100102 "unbound-anchor"
    curpcb       = 0xfffffe002d22ccc0
    fpcurthread  = 0xfffff80005359000: pid 40056 "unbound-anchor"
    idlethread   = 0xfffff80004208000: tid 100003 "idle: cpu0"
    curpmap      = 0xfffff80005a8c130
    tssp         = 0xffffffff82db3a20
    commontssp   = 0xffffffff82db3a20
    rsp0         = 0xfffffe002d22ccc0
    gs32p        = 0xffffffff82dba658
    ldt          = 0xffffffff82dba698
    tss          = 0xffffffff82dba688
    curvnet      = 0xfffff8000406d640
    db:0:kdb.enter.default>  bt
    Tracing pid 40056 tid 100102 td 0xfffff80005359000
    in_broadcast() at in_broadcast+0x27/frame 0xfffffe002d22c360
    pf_test() at pf_test+0x201b/frame 0xfffffe002d22c610
    pf_check_out() at pf_check_out+0x1d/frame 0xfffffe002d22c630
    pfil_run_hooks() at pfil_run_hooks+0xa1/frame 0xfffffe002d22c6d0
    ip_output() at ip_output+0xc85/frame 0xfffffe002d22c810
    udp_send() at udp_send+0xb6e/frame 0xfffffe002d22c8e0
    sosend_dgram() at sosend_dgram+0x33b/frame 0xfffffe002d22c950
    sosend() at sosend+0x50/frame 0xfffffe002d22c980
    kern_sendit() at kern_sendit+0x19f/frame 0xfffffe002d22ca20
    sendit() at sendit+0x19e/frame 0xfffffe002d22ca70
    sys_sendto() at sys_sendto+0x4d/frame 0xfffffe002d22cac0
    amd64_syscall() at amd64_syscall+0x369/frame 0xfffffe002d22cbf0
    fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe002d22cbf0
    

    That particular backtrace doesn't look familiar, though. Are all of these identical? It's odd that it's claiming to crash in unbound-anchor, which manages trust anchors for DNSSEC, and the backtrace suggests that it's crashing while checking whether an IP address is a broadcast address. That's a fairly simple operation, so I would tend to think it's actually a hardware problem (e.g. memory/CPU/heat). Though why it would only happen on 2.5.0 and not before is less clear, unless it's a BIOS or similar issue leading to instability.

    As for the crash dumps, you can rename them to textdump.<n>.tar instead. They are just .tar files, but FreeBSD tacks the number onto the end when it writes the crash dumps because that's easier.
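
The rename described above can be scripted. This is a minimal sketch, assuming the dumps have been copied to a working directory first; the function names `renamed_dump_name` and `rename_dumps` are made up for illustration and are not part of pfSense or FreeBSD.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Matches names such as textdump.tar.0 or textdump.tar.13.
_DUMP_RE = re.compile(r"^textdump\.tar\.(\d+)$")


def renamed_dump_name(name: str) -> Optional[str]:
    """Map textdump.tar.<n> to textdump.<n>.tar; return None for other names."""
    match = _DUMP_RE.match(name)
    if match is None:
        return None
    return f"textdump.{match.group(1)}.tar"


def rename_dumps(directory: str) -> List[Tuple[str, str]]:
    """Rename every textdump.tar.<n> in `directory`; return (old, new) pairs."""
    renamed = []
    # Build the full listing first so renames do not disturb the iteration.
    for path in sorted(Path(directory).iterdir()):
        new_name = renamed_dump_name(path.name)
        if new_name is None:
            continue  # leave info.<n> and unrelated files alone
        path.rename(path.with_name(new_name))
        renamed.append((path.name, new_name))
    return renamed
```

Calling `rename_dumps("dumps")` on a hypothetical `dumps` directory leaves the `info.<n>` files untouched and only moves the number in front of the `.tar` extension.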



    I'd have to check to be sure. I've taken them over from another person, who always refers to them as ALIX; they might be APU, though. I can't check right now as I'm not near them. I do get a GUI and some functionality, so they are in some sense capable of running, just not very well 😏

    Some general info: I have the frr package on all four, acme on some (not used, though), and blinkled. These were there before upgrading to 2.5.0.

    Anyway, comparing the bit you refer to across all the files shows some differences, even though most are the same. Instead of "unbound-anchor", I also find "ntpd" (#7), "dpinger" (#12) and "ospfd" (#13) in that part.
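
A comparison like this can be automated across many dumps. The sketch below is only an illustration: it assumes each textdump archive contains a `ddb.txt` with a `show pcpu` section and a `bt` section in the format quoted in this thread, and the names `read_ddb` and `summarize_ddb` are hypothetical helpers.

```python
import re
import tarfile
from typing import List, Optional, Tuple

# curthread    = 0x...: pid 15283 tid 101158 "dpinger"
CURTHREAD_RE = re.compile(r'curthread\s*=\s*\S+\s+pid\s+\d+\s+tid\s+\d+\s+"([^"]+)"')
# in_broadcast() at in_broadcast+0x27/frame 0x...
FRAME_RE = re.compile(r"^\s*(\S+)\(\) at ")


def read_ddb(tar_path: str) -> str:
    """Return the text of ddb.txt from a textdump tar archive."""
    with tarfile.open(tar_path) as archive:
        for member in archive.getmembers():
            if member.name.endswith("ddb.txt"):
                handle = archive.extractfile(member)
                if handle is not None:
                    return handle.read().decode("utf-8", errors="replace")
    raise FileNotFoundError(f"no ddb.txt in {tar_path}")


def summarize_ddb(text: str) -> Tuple[Optional[str], List[str]]:
    """Return (current process name, backtrace function names, innermost first)."""
    match = CURTHREAD_RE.search(text)
    process = match.group(1) if match else None
    frames: List[str] = []
    in_trace = False
    for line in text.splitlines():
        if line.lstrip().startswith("Tracing pid"):
            in_trace = True
            continue
        if in_trace:
            stripped = line.strip()
            # A blank line or the next ddb prompt ends the backtrace.
            if not stripped or stripped.startswith("db:"):
                break
            frame = FRAME_RE.match(line)
            if frame:
                frames.append(frame.group(1))
    return process, frames
```

Printing `summarize_ddb(read_ddb(path))` for each archive gives one line per crash, which makes it easier to see whether the innermost frames repeat even when the process name changes.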

    In dump #12, the corresponding part is

    db:0:kdb.enter.default>  show pcpu
    cpuid        = 0
    dynamic pcpu = 0xb31c40
    curthread    = 0xfffff8011a72d000: pid 15283 tid 101158 "dpinger"
    curpcb       = 0xfffffe002d316cc0
    fpcurthread  = 0xfffff8011a72d000: pid 15283 "dpinger"
    idlethread   = 0xfffff80004208000: tid 100003 "idle: cpu0"
    curpmap      = 0xfffff8002097c130
    tssp         = 0xffffffff82db3a20
    commontssp   = 0xffffffff82db3a20
    rsp0         = 0xfffffe002d316cc0
    gs32p        = 0xffffffff82dba658
    ldt          = 0xffffffff82dba698
    tss          = 0xffffffff82dba688
    curvnet      = 0xfffff8000406d640
    db:0:kdb.enter.default>  bt
    Tracing pid 15283 tid 101158 td 0xfffff8011a72d000
    ??() at 0
    ip_output() at ip_output+0x13f3/frame 0xfffffe002d316810
    rip_output() at rip_output+0x2c3/frame 0xfffffe002d3168a0
    sosend_generic() at sosend_generic+0x586/frame 0xfffffe002d316950
    sosend() at sosend+0x50/frame 0xfffffe002d316980
    kern_sendit() at kern_sendit+0x19f/frame 0xfffffe002d316a20
    sendit() at sendit+0x19e/frame 0xfffffe002d316a70
    sys_sendto() at sys_sendto+0x4d/frame 0xfffffe002d316ac0
    amd64_syscall() at amd64_syscall+0x369/frame 0xfffffe002d316bf0
    fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe002d316bf0
    

    It's not impossible that the hardware is the problem, but all of the devices ran for several years without problems, up to the last production version (2.4.4-p3). We then replaced them with new gear, and now use them as test boxes. It would seem weird if they all had a hardware fault simultaneously.

    I've made a new tar file with the complete contents of the 14 dumps.

    textdumps.tgz

    They're turned off now, and tomorrow I'll try to reinstall one with 2.5.0 directly, or with an older version.


  • Rebel Alliance Developer Netgate

    If it's that random, it would have to be hardware/driver related. It's stable on my APU (first generation), but maybe if those are APU2 or some later revision it might be related to the network drivers.



  • Well, I've reinstalled all from memstick, no packages, basic config (basically the wizard) and they have been running without problems for 16 hours. Either the newer snapshot solved it, it's a package causing errors, or the upgrade process went wrong and caused errors.

    They're APU, not APU2, by the way. Should be these, 4 GB RAM: https://pcengines.ch/apu1d4.htm

    7e43494e-2cac-462b-af6d-e15e7c2c689e-image.png

    I'm going to upgrade two of these to the latest snapshot and see what happens then.



  • [Edited with new info]

    Well, that didn't quite work the first time round.

    The update on the first device was stuck for about 2 hours, looking like this:

    3be97993-6452-45b8-9856-995b7db802b5-image.png

    In the system log, I found loads of these lines (on all four devices):

    Oct 3 09:04:59 	check_reload_status 	373 	Reloading filter
    Oct 3 09:05:00 	php-fpm 	24759 	/rc.newwanipv6: rc.newwanipv6: Info: starting on re1.
    Oct 3 09:05:00 	php-fpm 	24759 	/rc.newwanipv6: rc.newwanipv6: on (IP address: 2001:1680:104:1:1::580b) (interface: wan) (real interface: re1).
    Oct 3 09:05:03 	php-fpm 	24759 	/rc.newwanipv6: Removing static route for monitor fe80::290:bff:fea2:b929 and adding a new route through fe80::290:bff:fea2:b929%re1 
    

    I went to the WAN interface and changed the IPv6 configuration type from DHCP6 to None. After that, the log doesn't have the lines above, and the firewall GUI seems more responsive - including actually running the update as expected. The interface which these APUs connect to on WAN does have DHCP6 activated.
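
To put a number on how often the script is being re-triggered, a rough sketch like the one below could count the "starting on" entries per hour and interface. It assumes log lines in the format pasted above; the function name `count_restarts` is made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Oct 3 09:05:00  php-fpm  24759  /rc.newwanipv6: rc.newwanipv6: Info: starting on re1.
START_RE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}\s.*"
    r"rc\.newwanipv6: Info: starting on (\w+)"
)


def count_restarts(lines: Iterable[str]) -> Counter:
    """Count rc.newwanipv6 start events keyed by (hour bucket, interface)."""
    counts: Counter = Counter()
    for line in lines:
        match = START_RE.match(line)
        if match:
            month, day, hour, iface = match.groups()
            counts[(f"{month} {int(day)} {hour}:00", iface)] += 1
    return counts
```

Feeding it the lines of an exported system log yields keys such as `("Oct 3 09:00", "re1")`; a count that stays high hour after hour would support the idea that the DHCP6 client keeps renegotiating.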

    I'm not promising that the DHCP6 config is perfect on the external pfSense box which serves as the DHCP server for the APUs, but the APUs basically had the default settings, so something must be off somewhere.

    I've updated two devices; let's see how they run now. It seems there's no hardware fault, though.



    All right, I'm now sure when the crashes are triggered; I just have no idea what is causing them.

    I have the following version installed:

    2.5.0-DEVELOPMENT (amd64)
    built on Mon Oct 14 00:22:51 EDT 2019
    FreeBSD 12.0-RELEASE-p10
    

    Furthermore, I have the FRR package installed, version 0.6.3_1. Each of the four test firewalls is configured to connect via IPsec to two other units, in a "circle" configuration. On top of IPsec, they are configured with Phase 2 VTI and OSPF routing.

    The important setting is "IPv6 Configuration Type" for the WAN interface. If this is set to DHCP6, as it was by default, the firewalls crash regularly. If it is set to "None", there are no crashes (or at least they are so infrequent that I haven't seen them yet). Also, as described above, DHCP6 causes a lot of log entries and blocks updates.

    Crash log attached:
    fw3_20191014.zip

    It's not impossible that the IPv6 config on the upstream pfSense box acting as the WAN gateway and DHCP server is not ideal, but in any case, a misconfiguration there shouldn't cause crashes IMHO.

    I can share config backups if needed, since this is a test system. I'm also happy to run more tests, but I don't know which ones.

