SG-4860 upgrade failed

bennyc

Looks like the update to 2.3 on my backup CARP member failed :(
It was on version 2.2.6, I stripped all packages, put it in persistent maintenance, and took a backup. Then hit the upgrade button, and that was about an hour ago. I still can't reach it, not sure what state it is in but this doesn't look promising. Looks like it will need console.
Going to head over to the server room later today.

@cmb, I've downloaded the ADI memstick 2.3 installer… but just in case, could you please point me to the 2.2.6 installer as well in case I should want to revert? (can't find it on the portal)

bennyc

An update: I had to reboot the thing to get some console output.
The upgrade did happen; it booted 2.3. After checking all VLANs, it got to CARP, then to the point where it was syncing OpenVPN, and ended up with a nice kernel error:

Configuring CARP settings…done.
Configuring CARP settings...done.
Syncing OpenVPN settings...

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xffffffff80b82a2b
stack pointer = 0x28:0xfffffe001a3f59d0
frame pointer = 0x28:0xfffffe001a3f59f0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq256: igb0:que 0)
[ thread pid 12 tid 100034 ]
Stopped at m_tag_delete_chain+0x5b: movq (%rdi),%rax
db:0:kdb.enter.default> textdump set
textdump set
db:0:kdb.enter.default> capture on
db:0:kdb.enter.default> run lockinfo
db:1:lockinfo> show locks
No such command
db:1:locks> show alllocks
No such command
db:1:alllocks> show lockedvnods
Locked vnodes
db:0:kdb.enter.default> show pcpu
cpuid = 0
dynamic pcpu = 0x531e00
curthread = 0xfffff80003567960: pid 12 "irq256: igb0:que 0"
curpcb = 0xfffffe001a3f5cc0
fpcurthread = none
idlethread = 0xfffff800033a0000: tid 100003 "idle: cpu0"
curpmap = 0xffffffff820f7ca0
tssp = 0xffffffff82112b90
commontssp = 0xffffffff82112b90
rsp0 = 0xfffffe001a3f5cc0
gs32p = 0xffffffff821145e8
ldt = 0xffffffff82114628
tss = 0xffffffff82114618
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100034 td 0xfffff80003567960
m_tag_delete_chain() at m_tag_delete_chain+0x5b/frame 0xfffffe001a3f59f0
uma_zfree_arg() at uma_zfree_arg+0x3e/frame 0xfffffe001a3f5a60

Lost more info afterwards, and it resets on its own. The cycle repeats.

However, in an attempt to get a better log (putty output to log), I cycled again (by pulling the plug) and strangely enough I got past the point of OpenVPN sync? Now it says "Generating RRD graphs"…. It's not up yet, but we are making progress. Pfff.... I must admit this doesn't feel good ::)

bennyc

got past RRD graphs, but that's it. Crashed again… man...

So, finally I got through. How? By unplugging all interfaces except WAN, which I left connected.
Not that my success was that good, because after some time the console became unresponsive, and connecting a LAN interface did not make it reachable on the network.
Yet another kernel panic somewhere?

bennyc

another update… I decided to take another deep breath, and go have a look at it.

some findings:

After a power reset, with only WAN connected, it boots fine. Console is accessible, remains responsive. (no carp on that interface, no vlans, dedicated igb1)
Connecting the sync interface (igb5), no issue, console remains accessible, responsive.
Then connected one of my opt interfaces (with vlans, and carp), and also that continued to work.
I was also able to log into the GUI and submit a crash report (hope it helps or tells you something).

Now what is interesting: as soon as I connect LAN (igb0, which has VLANs and CARP), it hangs almost instantly and needs a hard reset... ???

Puzzled...

bennyc

After I moved all my VLANs to another interface (on both CARP members ::) ), I was able to plug in igb0 (one subnet was untagged on that interface) -> console became unresponsive, but came back alive after 10 seconds or so. Phew. So now I have a more or less working unit.
OpenVPN still refuses to start for the moment, both instances, with the following entries in the log:

Apr 13 16:34:42	openvpn	45820	Exiting due to fatal error 
Apr 13 16:34:42	openvpn	45820	TCP/UDP: Socket bind failed on local address [AF_INET] x.x.x.x:1195: Address already in use 
...
Apr 13 16:34:42	openvpn	42503	Exiting due to fatal error
Apr 13 16:34:42	openvpn	42503	TCP/UDP: Socket bind failed on local address [AF_INET] x.x.x.x:1194: Address already in use
So I'll need to look into that first before even thinking about upgrading the primary node…
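
For the "Address already in use" part, one way to see what is already holding 1194/1195 is to filter the listening sockets by port. A minimal sketch, assuming shell access on the box and FreeBSD's `sockstat`; `port_holders` is just a helper name made up here, and it assumes the usual column layout (USER COMMAND PID FD PROTO LOCAL FOREIGN):

```sh
# Hypothetical helper: filter `sockstat -4 -l` output (read from stdin)
# down to the sockets whose local address ends in the given port.
# Assumes the local address is the 6th whitespace-separated column.
port_holders() {
    awk -v port="$1" '$6 ~ (":" port "$") { print $1, $2, $3, $5, $6 }'
}

# On the firewall itself (FreeBSD):
#   sockstat -4 -l | port_holders 1194
#   sockstat -4 -l | port_holders 1195
```

If an openvpn process with a different PID than the one in the log shows up, another instance already owns the port.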

So much for my monologue...

jimp

Looks like you're missing the /boot/loader.conf.local adjustment to increase nmbclusters. On that box just set it to kern.ipc.nmbclusters="1000000"

https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Adding_to_loader.conf.local
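
To double-check both the file and the running value, something like this could be used. It is only a sketch: `loader_tunable` is a hypothetical helper (not part of pfSense), and it assumes plain `name=value` or `name="value"` lines in the file.

```sh
# Hypothetical helper: print the value of a tunable from a loader.conf-style file.
# Assumes simple name=value or name="value" lines; the last match wins.
# Note: dots in the name act as regex wildcards, good enough for a quick check.
loader_tunable() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\"\{0,1\}\([^\"]*\)\"\{0,1\}[[:space:]]*\$/\1/p" "$1" | tail -n 1
}

# On the firewall (FreeBSD), compare the configured and the live value:
#   loader_tunable /boot/loader.conf.local kern.ipc.nmbclusters
#   sysctl -n kern.ipc.nmbclusters
#   netstat -m    # current mbuf cluster usage against that limit
```

If the two values differ, the loader setting was not picked up at boot.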

bennyc

Hi Jimp,

Thanks for looking at my issue(s). I checked that box, and I have 3 lines in /boot/loader.conf.local:

ahci_load="YES"
kern.cam.boot_delay=10000
kern.ipc.nmbclusters="1000000"

I checked it against the "master" (still on 2.2.6) and see no difference; even the file timestamp is the same.