Netgate Discussion Forum

    Some help interpreting the crash files?

    2.5 Development Snapshots (Retired)
    • M
      mi8088
      last edited by

      Hi all,

      I've installed 2.5.0 snapshots to test on four ALIX devices, and now they all keep crashing and rebooting frequently. On one device, it has happened 14 times since yesterday afternoon.

      I may have unsupported hardware, or it may be because I didn't remove extra packages before upgrading, but I can't interpret what the crash dump is trying to tell me. Can someone give me a pointer, please? If necessary, I will try a fresh reinstall.

      I've uploaded one set. The dumps go up to .13, but the forum doesn't accept extensions higher than .0, and the rest seem to be more of the same.

      textdump.tar.0 info.0

      This may be the relevant part:

      Fatal trap 12: page fault while in kernel mode
      cpuid = 0; apic id = 00
      fault virtual address	= 0x70
      fault code		= supervisor read data, page not present
      instruction pointer	= 0x20:0xffffffff80ee68d7
      stack pointer	        = 0x28:0xfffffe002d22c320
      frame pointer	        = 0x28:0xfffffe002d22c360
      code segment		= base 0x0, limit 0xfffff, type 0x1b
      			= DPL 0, pres 1, long 1, def32 0, gran 1
      processor eflags	= interrupt enabled, resume, IOPL = 0
      current process		= 40056 (unbound-anchor)
      
      • jimpJ
        jimp Rebel Alliance Developer Netgate
        last edited by

        @mi8088 said in Some help interpreting the crash files?:

        I've installed 2.5.0 snapshots to test on four ALIX devices, and now they all keep crashing and rebooting

        I highly doubt that. ALIX devices are not capable of running 2.5.0. Perhaps you meant APU devices?

        What you want from that tar file is primarily the backtrace from ddb.txt:

        db:0:kdb.enter.default>  show pcpu
        cpuid        = 0
        dynamic pcpu = 0xb31c40
        curthread    = 0xfffff80005359000: pid 40056 tid 100102 "unbound-anchor"
        curpcb       = 0xfffffe002d22ccc0
        fpcurthread  = 0xfffff80005359000: pid 40056 "unbound-anchor"
        idlethread   = 0xfffff80004208000: tid 100003 "idle: cpu0"
        curpmap      = 0xfffff80005a8c130
        tssp         = 0xffffffff82db3a20
        commontssp   = 0xffffffff82db3a20
        rsp0         = 0xfffffe002d22ccc0
        gs32p        = 0xffffffff82dba658
        ldt          = 0xffffffff82dba698
        tss          = 0xffffffff82dba688
        curvnet      = 0xfffff8000406d640
        db:0:kdb.enter.default>  bt
        Tracing pid 40056 tid 100102 td 0xfffff80005359000
        in_broadcast() at in_broadcast+0x27/frame 0xfffffe002d22c360
        pf_test() at pf_test+0x201b/frame 0xfffffe002d22c610
        pf_check_out() at pf_check_out+0x1d/frame 0xfffffe002d22c630
        pfil_run_hooks() at pfil_run_hooks+0xa1/frame 0xfffffe002d22c6d0
        ip_output() at ip_output+0xc85/frame 0xfffffe002d22c810
        udp_send() at udp_send+0xb6e/frame 0xfffffe002d22c8e0
        sosend_dgram() at sosend_dgram+0x33b/frame 0xfffffe002d22c950
        sosend() at sosend+0x50/frame 0xfffffe002d22c980
        kern_sendit() at kern_sendit+0x19f/frame 0xfffffe002d22ca20
        sendit() at sendit+0x19e/frame 0xfffffe002d22ca70
        sys_sendto() at sys_sendto+0x4d/frame 0xfffffe002d22cac0
        amd64_syscall() at amd64_syscall+0x369/frame 0xfffffe002d22cbf0
        fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe002d22cbf0
        

        That particular backtrace doesn't look familiar, though. Are all of these identical? It's odd that it's claiming to crash in unbound-anchor which manages trust anchors for DNSSEC, and the backtrace suggests that it's crashing while checking if an IP address is a broadcast address. That's a fairly simple operation, so I would tend to think it's actually a hardware operation failing there (e.g. memory/cpu/heat). Though why it would only happen on 2.5.0 and not before is less clear, unless it's a BIOS or other similar issue leading to instability.

        As for the crash dumps, you can rename them to textdump.<n>.tar instead. They are just .tar files but the way FreeBSD writes the crash dumps it tacks the number on the end since it's easier that way.
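A minimal sketch of pulling the backtrace out of such a dump without renaming it first. It assumes the archive member is named `ddb.txt` and that the debugger prompt is exactly `db:0:kdb.enter.default>`, as in the excerpt above; the function names `read_ddb` and `extract_backtrace` are made up for this illustration.

```python
import tarfile

# Prompt string as it appears in the ddb.txt excerpt above (assumed constant).
PROMPT = "db:0:kdb.enter.default>"


def extract_backtrace(ddb_text):
    """Return the non-empty lines printed by each 'bt' command in ddb.txt."""
    lines = []
    in_bt = False
    for line in ddb_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(PROMPT):
            # Every prompt line either starts a 'bt' section or ends one.
            command = stripped[len(PROMPT):].strip()
            in_bt = (command == "bt")
            continue
        if in_bt and stripped:
            lines.append(stripped)
    return lines


def read_ddb(path):
    """Read ddb.txt from a textdump archive such as 'textdump.tar.0'.

    tarfile detects the format from the file contents, so the odd
    extension does not matter. The member name is an assumption.
    """
    with tarfile.open(path, "r:*") as archive:
        member = archive.extractfile("ddb.txt")
        return member.read().decode("utf-8", errors="replace")
```

Something like `extract_backtrace(read_ddb("textdump.tar.0"))` would then give the frames as a list, innermost first.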

        Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

        Need help fast? Netgate Global Support!

        Do not Chat/PM for help!

        • M
          mi8088 @jimp
          last edited by

          I'd have to check to be sure. I took them over from another person, who always refers to them as ALIX, but they might be APU. I can't check right now as I'm not near them. I do get a GUI and some functionality, so they are in some sense capable of running, just not very well.

          Some general info: I have the frr package on all four, acme on some (not used, though), and blinkled. These were installed before upgrading to 2.5.0.

          Anyway, comparing the bit you refer to in all the files shows some differences, even though most are the same. I also find, instead of "unbound-anchor", "ntpd" (#7) and "dpinger" (#12) and "ospfd" (#13) in that part.

          In dump #12, the corresponding part is

          db:0:kdb.enter.default>  show pcpu
          cpuid        = 0
          dynamic pcpu = 0xb31c40
          curthread    = 0xfffff8011a72d000: pid 15283 tid 101158 "dpinger"
          curpcb       = 0xfffffe002d316cc0
          fpcurthread  = 0xfffff8011a72d000: pid 15283 "dpinger"
          idlethread   = 0xfffff80004208000: tid 100003 "idle: cpu0"
          curpmap      = 0xfffff8002097c130
          tssp         = 0xffffffff82db3a20
          commontssp   = 0xffffffff82db3a20
          rsp0         = 0xfffffe002d316cc0
          gs32p        = 0xffffffff82dba658
          ldt          = 0xffffffff82dba698
          tss          = 0xffffffff82dba688
          curvnet      = 0xfffff8000406d640
          db:0:kdb.enter.default>  bt
          Tracing pid 15283 tid 101158 td 0xfffff8011a72d000
          ??() at 0
          ip_output() at ip_output+0x13f3/frame 0xfffffe002d316810
          rip_output() at rip_output+0x2c3/frame 0xfffffe002d3168a0
          sosend_generic() at sosend_generic+0x586/frame 0xfffffe002d316950
          sosend() at sosend+0x50/frame 0xfffffe002d316980
          kern_sendit() at kern_sendit+0x19f/frame 0xfffffe002d316a20
          sendit() at sendit+0x19e/frame 0xfffffe002d316a70
          sys_sendto() at sys_sendto+0x4d/frame 0xfffffe002d316ac0
          amd64_syscall() at amd64_syscall+0x369/frame 0xfffffe002d316bf0
          fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe002d316bf0
          

          It's not impossible that the hardware is the problem, but all of the devices ran for several years without problems, up to the last production version (2.4.4-p3). We then replaced them with new gear and now use them as test boxes. It would seem weird if they all developed a hardware fault simultaneously.
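The comparison across dumps described above (which process was running, and where the trace ends) can be sketched roughly as follows. The regular expressions are written against the two `ddb.txt` excerpts in this thread only, and the names `crashing_process`, `top_frame` and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Matches e.g.: curthread = 0xfffff80005359000: pid 40056 tid 100102 "unbound-anchor"
CURTHREAD_RE = re.compile(
    r'^\s*curthread\s*=\s*0x[0-9a-f]+:\s*pid\s+\d+\s+tid\s+\d+\s+"([^"]+)"',
    re.MULTILINE,
)

# Matches the first frame printed after the "Tracing pid ..." line.
TOP_FRAME_RE = re.compile(r"^\s*Tracing pid .*\n\s*(\S+?)\(\) at", re.MULTILINE)


def crashing_process(ddb_text):
    """Name of the process that was current at the time of the panic, or None."""
    match = CURTHREAD_RE.search(ddb_text)
    return match.group(1) if match else None


def top_frame(ddb_text):
    """Innermost function in the backtrace ('??' if unresolved), or None."""
    match = TOP_FRAME_RE.search(ddb_text)
    return match.group(1) if match else None


def summarize(dumps):
    """Count (process, innermost frame) pairs over a {label: ddb_text} mapping."""
    return Counter((crashing_process(t), top_frame(t)) for t in dumps.values())
```

If every dump shows a different process and a different innermost frame, that points away from one specific code path and toward something more general, which is the question being asked here.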

          I've made a new tar file with the complete contents of the 14 dumps.

          textdumps.tgz

          They're turned off now, and tomorrow I'll try to reinstall one with 2.5.0 directly, or with an older version.

          • jimpJ
            jimp Rebel Alliance Developer Netgate
            last edited by

            If it's that random, it would have to be hardware/driver related. It's stable on my APU (first generation), but maybe if those are APU2 or some later revision it might be related to the network drivers.

            • M
              mi8088
              last edited by

              Well, I've reinstalled all from memstick, no packages, basic config (basically the wizard) and they have been running without problems for 16 hours. Either the newer snapshot solved it, it's a package causing errors, or the upgrade process went wrong and caused errors.

              They're APU, not APU2, by the way. Should be these, 4 GB RAM: https://pcengines.ch/apu1d4.htm

              7e43494e-2cac-462b-af6d-e15e7c2c689e-image.png

              Going to go do the upgrade to the latest snapshot on two of these, see what happens then..

              • M
                mi8088
                last edited by mi8088

                [Edited with new info]

                Well, that didn't quite work the first time round.

                The update on the first device was stuck for about 2 hours, looking like this:

                3be97993-6452-45b8-9856-995b7db802b5-image.png

                In the system log, I found loads of these lines (on all four devices):

                Oct 3 09:04:59 	check_reload_status 	373 	Reloading filter
                Oct 3 09:05:00 	php-fpm 	24759 	/rc.newwanipv6: rc.newwanipv6: Info: starting on re1.
                Oct 3 09:05:00 	php-fpm 	24759 	/rc.newwanipv6: rc.newwanipv6: on (IP address: 2001:1680:104:1:1::580b) (interface: wan) (real interface: re1).
                Oct 3 09:05:03 	php-fpm 	24759 	/rc.newwanipv6: Removing static route for monitor fe80::290:bff:fea2:b929 and adding a new route through fe80::290:bff:fea2:b929%re1 
                

                I went to the WAN interface and changed the IPv6 configuration type from DHCP6 to None. After that, the log doesn't have the lines above, and the firewall GUI seems more responsive - including actually running the update as expected. The interface which these APUs connect to on WAN does have DHCP6 activated.
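To get a rough idea of how much of the system log is taken up by these entries, something like the following could be used. It is only a sketch: it assumes the log was copied from the GUI as tab-separated columns (time, process, PID, message), as in the lines quoted above, and `parse_log_line` and `count_by_script` are names invented here.

```python
from collections import Counter


def parse_log_line(line):
    """Split one tab-separated log line into (time, process, pid, message)."""
    parts = [part.strip() for part in line.split("\t", 3)]
    if len(parts) < 4:
        return None  # not in the assumed four-column layout
    return tuple(parts)


def count_by_script(lines):
    """Count entries per script (for php-fpm messages) or per process."""
    counts = Counter()
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue
        _time, process, _pid, message = parsed
        if message.startswith("/"):
            # php-fpm messages begin with the script path, e.g. "/rc.newwanipv6: ..."
            key = message.split(":", 1)[0]
        else:
            key = process
        counts[key] += 1
    return counts
```

A count for `/rc.newwanipv6` that keeps growing every few seconds would match the behaviour described in this post.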

                I'm not promising that the DHCP6 config is perfect on the external pfSense box which serves as the DHCP server for the APUs, but the APUs basically had the default settings, so something must be off somewhere.

                I've updated two devices; let's see how they run now. It seems there's no hardware fault, though.

                • M
                  mi8088
                  last edited by

                  All right, I'm now sure of when the crashes are provoked, I just have no idea what is causing them.

                  I have the following version installed:

                  2.5.0-DEVELOPMENT (amd64)
                  built on Mon Oct 14 00:22:51 EDT 2019
                  FreeBSD 12.0-RELEASE-p10
                  

                  Furthermore, I have the FRR package installed, version 0.6.3_1. Each of the four test firewalls is configured to connect via IPsec to two other units, in a "circle" configuration. On top of IPsec, they are configured with Phase 2 VTI and OSPF routing.

                  The important setting is "IPv6 Configuration Type" for the WAN interface. If this is set to DHCP6, as it was by default, the firewalls crash regularly. If it is set to "None", there are no crashes (or at least they are so infrequent that I haven't seen them yet). Also, as described above, DHCP6 causes a lot of log entries and blocks updates.

                  Crash log attached:
                  fw3_20191014.zip

                  It's possible that the IPv6 config on the upstream pfSense box acting as the WAN gateway and DHCP server is not ideal, but in any case a misconfiguration there shouldn't cause crashes, IMHO.

                  I can share config backups if needed, since this is a test system. I'm also fine with doing any more tests, but I don't know what.

                  Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.