Adventures upgrading 2 SG-5100 to 21.02

stephenw10

Hmmm, I have WG running on an SG-5100 and have not hit that or anything close to it.

Do you have any logs showing the actual output you saw?

Both ends were running 21.02(p1)?

Steve

gabacho4

@stephenw10 both were running 21.02 p1. I don’t have any logs as my priority was recovery. It might have been possible for me to get something off the devices after I CTRL-C’ed in console in order to get the never ending garble output to stop but I had no GUI or ssh access to either device. My Linux/BSD kung fu is not as strong. In retrospect I should have taken pictures at the least but the instinct to recover quickly lest I incur the wrath of those who need the internet was my driving force. I was willing to chalk the first device issue off as a fluke. But 2/2 of my devices barfing and both in the course of conducting wireguard configurations? Seems highly unlikely to me. I’ve done all sorts of madness with openvpn and ipsec and never had an issue. Do you have a list of commands I should/could run if something similar occurs again? My appetite, especially with the remote device being so far away, to continue to experiment is essentially null. Remote device went offline at 0100 for me and I didn’t end up getting back into bed with a working config until 0600.

gabacho4

@stephenw10 if I might ask, what does the red LED mean? If you are looking at the 5100 it was the one in the middle of the three. Also, I’ve been running mullvad wireguard client with no issues. My issues both occurred while making changes to a site to site type config.

stephenw10

It's the status LED. It starts red during boot and goes green at the end of boot.
https://docs.netgate.com/pfsense/en/latest/solutions/sg-5100/io-ports.html#other-ports-and-indicators

The fact it was red indicates it halted or rebooted. If it had crashed out completely it would have remained at its current state, green.

Just copy/pasting some of the console output might have been useful. Or it could have been meaningless garbage.

Steve

gabacho4

@stephenw10 very interesting. Despite being red it was certainly working. Once I got the garble stopped I had an ash prompt which allowed me to issue commands, such as the reboot one that I issued. I was able to watch the device boot to a certain point before the terminal would go berserk. Should it happen again I'll capture some of the output. I know I wasn't much help on the troubleshooting front. Unfortunately my need to ensure internet service was available forced me to take the most direct action to meet that demand.

gabacho4

@stephenw10 I'm back! I had yet another crash tonight. Router became unresponsive, no internet, no dhcp, dead. I was able to console in again and had the never ending line of ........................... running across the screen for eternity. I CTRL-C again and, for reasons I can't explain, while sitting there feeling dread about yet again having to install and restore, the device kicked and I got to the main console screen with interfaces and ip addresses etc. Oddly, the router had reverted to a state where my WAN was on PPPoE - I haven't had that for 6+ months now. Regardless, I logged in via the gui and had a crash notice on the dashboard. I was able to download two files. Can be found at files.

Again, I'm no guru but I did see panic in there which never seems like a good thing with *nix. Would greatly appreciate any information you all at Netgate can provide about what you see and if it is something already known or not. I was yet again making some changes to wireguard (MTU and MSS) when this issue occurred. I'm getting very paranoid about making any changes right now lest I press my luck and crash this sucker again. Cannot wait for 21.05 or 21.02 pX, whichever comes first and corrects some of the weirdness. Thanks in advance!

stephenw10

Mmm, that does look like something in WireGuard from the backtrace:

db:0:kdb.enter.default>  bt
Tracing pid 75993 tid 100526 td 0xfffff801708b0740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe002cdb4c10
vpanic() at vpanic+0x197/frame 0xfffffe002cdb4c60
panic() at panic+0x43/frame 0xfffffe002cdb4cc0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe002cdb4d20
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe002cdb4d70
trap() at trap+0x286/frame 0xfffffe002cdb4e80
calltrap() at calltrap+0x8/frame 0xfffffe002cdb4e80
--- trap 0xc, rip = 0xffffffff80d84c37, rsp = 0xfffffe002cdb4f50, rbp = 0xfffffe002cdb4fd0 ---
__mtx_lock_sleep() at __mtx_lock_sleep+0xd7/frame 0xfffffe002cdb4fd0
wg_queue_out() at wg_queue_out+0x21b/frame 0xfffffe002cdb5010
wg_transmit() at wg_transmit+0xda/frame 0xfffffe002cdb5070
pf_test() at pf_test+0x22f0/frame 0xfffffe002cdb52b0
pf_test() at pf_test+0x20f6/frame 0xfffffe002cdb54f0
pf_check_out() at pf_check_out+0x1d/frame 0xfffffe002cdb5510
pfil_run_hooks() at pfil_run_hooks+0xa1/frame 0xfffffe002cdb55b0
ip_output() at ip_output+0xb4f/frame 0xfffffe002cdb56f0
udp_send() at udp_send+0xbbe/frame 0xfffffe002cdb57f0
sosend_dgram() at sosend_dgram+0x348/frame 0xfffffe002cdb5850
sosend() at sosend+0x50/frame 0xfffffe002cdb5880
kern_sendit() at kern_sendit+0x19d/frame 0xfffffe002cdb5920
sendit() at sendit+0x19c/frame 0xfffffe002cdb5970
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe002cdb59c0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe002cdb5af0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe002cdb5af0
--- syscall (133, FreeBSD ELF64, sys_sendto), rip = 0x800c421ca, rsp = 0x7fffdfbfb3f8, rbp = 0x7fffdfbfb440 ---

Though the panic appears to be in unbound:

<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<6>wg0: link state changed to DOWN
<6>wg0: sc=0xfffff8000e00dc00
<6>wg0: link state changed to UP
<6>wg0: link state changed to DOWN
<6>wg0: sc=0xfffff8000e00dc00
<6>wg0: link state changed to UP
<6>wg1: link state changed to DOWN


Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 18
fault virtual address	= 0x410
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80d84c37
stack pointer	        = 0x28:0xfffffe002cdb4f50
frame pointer	        = 0x28:0xfffffe002cdb4fd0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 75993 (unbound)
trap number		= 12
panic: page fault
cpuid = 3
time = 1615314653
KDB: enter: panic

The console buffer shows quite a few WireGuard interface state changes. Were you making those or was it losing connection?
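
For anyone reading a dump like this for the first time, here is a rough sketch of how the backtrace above can be read programmatically. In a DDB backtrace the frames listed after the `--- trap` marker are what the kernel was executing when the fault hit, so the first frames there are the interesting ones. The function names `frames_at_fault` and `first_frame_with_prefix` are made up for this illustration, and the sample text is an abridged copy of the trace posted above.

```python
import re

# Abridged copy of the backtrace from the crash report above.
BACKTRACE = """\
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe002cdb4d70
trap() at trap+0x286/frame 0xfffffe002cdb4e80
calltrap() at calltrap+0x8/frame 0xfffffe002cdb4e80
--- trap 0xc, rip = 0xffffffff80d84c37, rsp = 0xfffffe002cdb4f50, rbp = 0xfffffe002cdb4fd0 ---
__mtx_lock_sleep() at __mtx_lock_sleep+0xd7/frame 0xfffffe002cdb4fd0
wg_queue_out() at wg_queue_out+0x21b/frame 0xfffffe002cdb5010
wg_transmit() at wg_transmit+0xda/frame 0xfffffe002cdb5070
pf_test() at pf_test+0x22f0/frame 0xfffffe002cdb52b0
udp_send() at udp_send+0xbbe/frame 0xfffffe002cdb57f0
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe002cdb59c0
--- syscall (133, FreeBSD ELF64, sys_sendto), rip = 0x800c421ca, rsp = 0x7fffdfbfb3f8, rbp = 0x7fffdfbfb440 ---
"""

# Matches lines such as "wg_transmit() at wg_transmit+0xda/frame 0x..."
FRAME_RE = re.compile(r"^(\w+)\(\) at ")


def frames_at_fault(trace):
    """Return function names between the '--- trap' and '--- syscall' markers."""
    frames = []
    seen_trap = False
    for line in trace.splitlines():
        if line.startswith("--- trap"):
            seen_trap = True
            continue
        if line.startswith("--- syscall"):
            break
        match = FRAME_RE.match(line)
        if seen_trap and match:
            frames.append(match.group(1))
    return frames


def first_frame_with_prefix(frames, prefix):
    """Return the innermost frame whose name starts with prefix, or None."""
    for name in frames:
        if name.startswith(prefix):
            return name
    return None


faulting = frames_at_fault(BACKTRACE)
print("frames at fault:", faulting)
print("innermost WireGuard frame:", first_frame_with_prefix(faulting, "wg_"))
```

This is why the trace points at WireGuard even though the current process is unbound: unbound only issued the `sendto` call, and the fault happened while that packet was being pushed through `wg_transmit` and `wg_queue_out`.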

Steve

gabacho4

@stephenw10 it was probably me. I was trying various mtu/mss combinations to see if they offered any discernible difference. Have had issues with the wireguard gateways reporting between 6 and 15 percent packet loss, as well as very slow page loading with various websites. Figured mtu/mss was a good place to start. It really seems like things just aren't 100% with wireguard. I have no issues with OpenVPN configurations, packet loss, page loading.
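
For anyone else trying MTU/MSS combinations, the arithmetic can be sketched as below rather than guessed at. This is only a rough calculation, assuming the usual WireGuard data packet overhead (outer IP header, 8-byte UDP header, 32 bytes of WireGuard header and authentication tag) and a TCP header without options. The helper names `tunnel_mtu` and `tcp_mss` are made up for this example and are not pfSense settings.

```python
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
# WireGuard data message: 4 (type) + 4 (receiver index) + 8 (counter) + 16 (auth tag)
WG_DATA_OVERHEAD = 32
TCP_HEADER = 20  # assumes no TCP options


def tunnel_mtu(link_mtu, outer_ipv6=False):
    """Largest inner packet that fits in one outer packet on the given link."""
    outer_ip = IPV6_HEADER if outer_ipv6 else IPV4_HEADER
    return link_mtu - outer_ip - UDP_HEADER - WG_DATA_OVERHEAD


def tcp_mss(mtu, inner_ipv6=False):
    """Largest TCP payload that fits in a packet of the given MTU."""
    inner_ip = IPV6_HEADER if inner_ipv6 else IPV4_HEADER
    return mtu - inner_ip - TCP_HEADER


# Plain Ethernet WAN, tunnel endpoint reached over IPv6 (worst case).
print(tunnel_mtu(1500, outer_ipv6=True))  # 1420, the common WireGuard default
# PPPoE WAN (1492), tunnel endpoint reached over IPv4.
print(tunnel_mtu(1492))                   # 1432
# MSS for IPv4 TCP inside a 1420-byte tunnel.
print(tcp_mss(1420))                      # 1380
```

If the path between the two sites has a smaller MTU than the local WAN, the smaller value is the one to plug in as `link_mtu`.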

It really seems like the router struggles to figure out how to route requests when there are multiple wireguard gateways. I've noticed weirdness even with OpenVPN where, despite having PBR, my LAN devices will have a local WAN IP. If I restart the OpenVPN service, those devices then show a VPN IP instead. 21.02 seems to have some pretty rough edges compared to 2.4.5p1. I'm not angry or anything but just hope the next update is pushed out soon to resolve the ones you all know about. I don't really want to have to rely on applying patches one at a time. Do you know if there is a release expected again for the 21.02 install that's not the 21.05 release? I'm trying to be strong but man, I've had some issues I've never experienced before and have really fought the urge to roll back.

stephenw10

Ah, looks like you're hitting this: https://redmine.pfsense.org/issues/11585

gabacho4

@stephenw10 I’d seen that one, along with another that causes a panic if there are multiple changes/saves in a short timeframe. Glad I’m not experiencing anything new. Here’s to hoping a fix comes very soon.