SG-4860 upgrade failed



  • Looks like the update to 2.3 on my backup CARP member failed  :(
    It was on version 2.2.6. I stripped all packages, put it in persistent maintenance mode, and took a backup. Then I hit the upgrade button, and that was about an hour ago. I still can't reach it. I'm not sure what state it is in, but this doesn't look promising. Looks like it will need a console connection.
    Going to head over to the server room later today.

    @cmb, I've downloaded the ADI memstick 2.3 installer... but could you please point me to the 2.2.6 installer as well, in case I want to revert? (I can't find it on the portal.)



  • An update: I had to reboot the thing to get some console output.
    The upgrade did happen, and it booted 2.3. After going through all the VLANs, it got to CARP, then to the point where it was syncing OpenVPN, and ended up with a nice kernel error:

    Configuring CARP settings...done.
    Configuring CARP settings...done.
    Syncing OpenVPN settings...

    Fatal trap 9: general protection fault while in kernel mode
    cpuid = 0; apic id = 00
    instruction pointer    = 0x20:0xffffffff80b82a2b
    stack pointer          = 0x28:0xfffffe001a3f59d0
    frame pointer          = 0x28:0xfffffe001a3f59f0
    code segment            = base 0x0, limit 0xfffff, type 0x1b
                            = DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags        = interrupt enabled, resume, IOPL = 0
    current process        = 12 (irq256: igb0:que 0)
    [ thread pid 12 tid 100034 ]
    Stopped at      m_tag_delete_chain+0x5b:        movq    (%rdi),%rax
    db:0:kdb.enter.default> textdump set
    textdump set
    db:0:kdb.enter.default>  capture on
    db:0:kdb.enter.default>  run lockinfo
    db:1:lockinfo> show locks
    No such command
    db:1:locks>  show alllocks
    No such command
    db:1:alllocks>  show lockedvnods
    Locked vnodes
    db:0:kdb.enter.default>  show pcpu
    cpuid        = 0
    dynamic pcpu = 0x531e00
    curthread    = 0xfffff80003567960: pid 12 "irq256: igb0:que 0"
    curpcb      = 0xfffffe001a3f5cc0
    fpcurthread  = none
    idlethread  = 0xfffff800033a0000: tid 100003 "idle: cpu0"
    curpmap      = 0xffffffff820f7ca0
    tssp        = 0xffffffff82112b90
    commontssp  = 0xffffffff82112b90
    rsp0        = 0xfffffe001a3f5cc0
    gs32p        = 0xffffffff821145e8
    ldt          = 0xffffffff82114628
    tss          = 0xffffffff82114618
    db:0:kdb.enter.default>  bt
    Tracing pid 12 tid 100034 td 0xfffff80003567960
    m_tag_delete_chain() at m_tag_delete_chain+0x5b/frame 0xfffffe001a3f59f0
    uma_zfree_arg() at uma_zfree_arg+0x3e/frame 0xfffffe001a3f5a60

    I lost the rest of the output after that, and the box resets on its own. The cycle repeats.

    However, in an attempt to get a better log (PuTTY output logged to a file), I power-cycled it again (by pulling the plug) and, strangely enough, got past the OpenVPN sync. Now it says "Generating RRD graphs"... It's not up yet, but we are making progress. Pfff... I must admit this doesn't feel good  ::)



  • Got past the RRD graphs, but that's it. It crashed again... man...

    So, finally I got through. How? I disconnected (unplugged) all interfaces except WAN, which I left connected.
    Not that it was much of a success: after some time the console became unresponsive, and connecting a LAN interface did not make the box reachable on the network.
    Yet another kernel panic somewhere?



  • Another update... I decided to take a deep breath and go have another look at it.

    Some findings:

    After a power reset with only WAN connected, it boots fine. The console is accessible and remains responsive. (There is no CARP on that interface and no VLANs; it is a dedicated igb1.)
    Connecting the sync interface (igb5) causes no issue: the console remains accessible and responsive.
    Then I connected one of my OPT interfaces (with VLANs and CARP), and that continued to work as well.
    I was also able to log into the GUI and submit a crash report (hope it helps or tells something).

    Now what is interesting: as soon as I connect LAN (igb0, which has VLANs and CARP), it hangs almost instantly and needs a hard reset... ???

    Puzzled...



  • After I moved all my VLANs to another interface (on both CARP members ::) ), I was able to plug in igb0 (one subnet is untagged on that interface). The console became unresponsive, but came back alive after 10 seconds or so. Phew. So now I have a more or less working unit.
    OpenVPN still refuses to start for the moment, on both instances, with the following entries in the log:

    Apr 13 16:34:42	openvpn	45820	Exiting due to fatal error 
    Apr 13 16:34:42	openvpn	45820	TCP/UDP: Socket bind failed on local address [AF_INET] x.x.x.x:1195: Address already in use 
    ...
    Apr 13 16:34:42	openvpn	42503	Exiting due to fatal error
    Apr 13 16:34:42	openvpn	42503	TCP/UDP: Socket bind failed on local address [AF_INET] x.x.x.x:1194: Address already in use
    So I'll need to look into that first before even thinking about upgrading the primary node...

    So much for my monologue...
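
For the "Address already in use" failures above, a quick way to see whether something is still holding the OpenVPN ports is to try binding them by hand. The following is only a minimal sketch: the helper name `port_in_use` is made up for illustration, UDP is an assumption (the log does not say which protocol the instances use), and `127.0.0.1` is a placeholder for the masked `x.x.x.x` listen address.

```python
import errno
import socket


def port_in_use(host: str, port: int, kind: int = socket.SOCK_DGRAM) -> bool:
    """Return True if binding host:port fails with EADDRINUSE.

    Hypothetical helper for illustration. `kind` defaults to UDP, which is
    an assumption; pass socket.SOCK_STREAM to test a TCP listener instead.
    """
    sock = socket.socket(socket.AF_INET, kind)
    try:
        sock.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        # Anything else (address not configured, permission denied, ...)
        # is a different problem, so let it surface.
        raise
    finally:
        sock.close()
    return False


if __name__ == "__main__":
    # Placeholder address: the real one is masked as x.x.x.x in the log.
    for port in (1194, 1195):
        state = "in use" if port_in_use("127.0.0.1", port) else "free"
        print(f"udp/{port}: {state}")
```

If a port shows as in use, `sockstat -4 -l` on the firewall lists the process that owns the listening socket, which would tell whether a leftover openvpn process is the culprit.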

  • Looks like you're missing the /boot/loader.conf.local adjustment to increase nmbclusters. On that box just set it to kern.ipc.nmbclusters="1000000"

    https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Adding_to_loader.conf.local



  • Hi Jimp,

    Thanks for looking at my issue(s). I checked that box, and I have three lines in /boot/loader.conf.local:

    ahci_load="YES"
    kern.cam.boot_delay=10000
    kern.ipc.nmbclusters="1000000"
    

    I checked it against the "master" (still on 2.2.6) and see no difference; even the file timestamp is the same.
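
For comparing the file on both nodes without eyeballing it, something like the sketch below could be used. It is a rough illustration only: `parse_loader_conf` and `nmbclusters_ok` are hypothetical helper names, the parser handles just simple `key=value` lines like the three shown above, and the sample text is copied from this post rather than read from the box. Keep in mind that the file is only read at boot, so the value actually in effect is whatever `sysctl kern.ipc.nmbclusters` reports on the running system.

```python
def parse_loader_conf(text: str) -> dict:
    """Parse simple key=value lines, stripping comments and double quotes."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue  # blank line or something this sketch does not handle
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def nmbclusters_ok(settings: dict, minimum: int = 1000000) -> bool:
    """True if kern.ipc.nmbclusters is set to at least `minimum`."""
    value = settings.get("kern.ipc.nmbclusters")
    return value is not None and value.isdigit() and int(value) >= minimum


# Sample taken from the three lines quoted in the post; on the box itself
# one would read /boot/loader.conf.local instead.
SAMPLE = '''ahci_load="YES"
kern.cam.boot_delay=10000
kern.ipc.nmbclusters="1000000"
'''

if __name__ == "__main__":
    parsed = parse_loader_conf(SAMPLE)
    print(parsed)
    print("nmbclusters ok:", nmbclusters_ok(parsed))
```

Running the same check against the file from each CARP member and comparing the resulting dictionaries would confirm that the two nodes really carry identical tunables.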