Adventures upgrading 2 SG-5100 to 21.02

gabacho4

Thought I'd share my experience upgrading two SG-5100s to 21.02 and attempting to run WireGuard. I'm curious whether others have encountered similar issues. Apologies in advance if this isn't the expected location for such a post; I couldn't really find a place that made more sense.

Typically I do fresh installs of pfSense whenever a new version rolls out. However, this time, I decided to just upgrade the devices in place. Both successfully updated with no problems at all. One of my 5100s is local to me while the other resides at a location about 7000 miles away. Yes, I know, don't mess with things far, far away. Yes, that is typically my rule. However, due to COVID-19 and being unable to get back to the US this past summer, my remote 5100 was still running 2.4.4 and was therefore two versions behind. As it is the edge device for my network, I didn't feel it was smart to leave it that far out of date.

While configuring WireGuard on my local 5100 one evening, I changed the port that the remote end listened on and hit save. Right after, my 5100 went unresponsive and my internet died. I could not SSH into the box. When I tried the console, it was just a steady stream of -------- and . and + across the screen forever. I had to CTRL-C it to get it to stop. So I issued a reboot command and watched the console to see if things corrected themselves. Unfortunately, I was met with the same issue. I also noticed that the 2nd LED in the row of 3 on the 5100 was red. I think I'm a smart guy but am not necessarily a FreeBSD buff, so the extent of troubleshooting I could do was limited. Rather than have no internet service in the house, I decided to do a fresh install and restore my config (always keep backups!). I've had no issues with it since then.

We all know that good things (and apparently bad things) come in twos. So, last night at about midnight, I was making some WireGuard changes on the remote peer when suddenly the GUI went unresponsive. I tried to SSH in to both the internal LAN and external WAN IPs with no success. My IPsec and OpenVPN (backup connection) tunnels would not connect. A ping to my external WAN IP (via DDNS FQDN) got no response.

I had my in-laws go over to the house and power cycle the router. No luck. My father-in-law noted that the 2nd of the 3 LEDs on the front of the 5100 was red. Gulp. I had him connect to the 5100 via console and, sure enough, there was the same steady stream of -----, ., and + that I had seen on my local device. And so we began the adventure of reinstalling pfSense and restoring the config. Some 4.5 hours later, the tunnels went up, the WAN began to respond to pings, and I could breathe much easier.

At this point, I am running my old OpenVPN and IPsec configuration and am quite hesitant to press my luck with WireGuard for the time being. Unfortunately, I have no logs to provide Netgate or anyone else. I wish I did but, again, restoring functioning internet at both locations as soon as possible was paramount. I wanted to capture the situation in the hope that it confirms things already reported by others or things that might be reported later. In BOTH instances the death of the router occurred while configuring WireGuard. I am very confident that there is something haywire going on there.

I do greatly thank Netgate for the great work on the WireGuard implementation and recognize that this was the first release, so some kinks were more likely. It's entirely possible that a fresh install might be the solution to the problem and that upgrading left behind some old setting or config that was causing a conflict.

For the record, my WireGuard configuration had Site A and Site B connect and access network resources at either end. Also, Site B had 0.0.0.0/0 as the allowed IPs for the peer so that I could policy route certain devices to the internet on the US side for Amazon Prime Video and other geoblocked services. My config mirrored the client-server config in the pfSense book, with the added outbound NAT rule so that B could access the internet through A. Everything worked great until the routers died. Will report if I encounter any further issues. Has anyone else seen or heard of the same type of experience?
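To illustrate why the 0.0.0.0/0 entry matters for the policy routing described above: WireGuard only sends a packet to a peer if the destination falls inside that peer's allowed IPs. The sketch below is a minimal illustration of that check, not the pfSense or kernel implementation. The function name and the 192.168.10.0/24 subnet are assumptions made up for the example.

```python
import ipaddress

def covered_by_allowed_ips(dest, allowed_ips):
    """Return True if dest falls inside any network in allowed_ips."""
    addr = ipaddress.ip_address(dest)
    return any(addr in ipaddress.ip_network(net) for net in allowed_ips)

# Hypothetical values: a LAN-only entry versus the catch-all entry.
lan_only = ["192.168.10.0/24"]   # assumed Site A LAN, for illustration only
catch_all = ["0.0.0.0/0"]        # what Site B used for the Site A peer

print(covered_by_allowed_ips("192.168.10.5", lan_only))  # LAN host, LAN-only entry
print(covered_by_allowed_ips("8.8.8.8", lan_only))       # internet host, LAN-only entry
print(covered_by_allowed_ips("8.8.8.8", catch_all))      # internet host, catch-all entry
```

With a LAN-only entry, internet-bound traffic would not be accepted for the tunnel; the catch-all entry is what lets policy-routed devices at Site B reach the internet through Site A.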

stephenw10

Hmmm, I have WG running on an SG-5100 and have not hit that or anything close to it.

Do you have any logs showing the actual output you saw?

Both ends were running 21.02(p1)?

Steve

gabacho4

@stephenw10 both were running 21.02 p1. I don’t have any logs as my priority was recovery. It might have been possible for me to get something off the devices after I CTRL-C’ed in the console to get the never-ending garbled output to stop, but I had no GUI or SSH access to either device. My Linux/BSD kung fu is not that strong. In retrospect I should have taken pictures at the least, but the instinct to recover quickly, lest I incur the wrath of those who need the internet, was my driving force. I was willing to chalk the first device issue up as a fluke. But 2/2 of my devices barfing, and both in the course of making WireGuard configuration changes? A coincidence seems highly unlikely to me. I’ve done all sorts of madness with OpenVPN and IPsec and never had an issue. Do you have a list of commands I should/could run if something similar occurs again? My appetite to continue experimenting, especially with the remote device being so far away, is essentially nil. The remote device went offline at 0100 for me and I didn’t end up getting back into bed with a working config until 0600.

gabacho4

@stephenw10 if I might ask, what does the red LED mean? If you are looking at the 5100, it was the one in the middle of the three. Also, I’ve been running the Mullvad WireGuard client with no issues. My issues both occurred while making changes to a site-to-site type config.

stephenw10

It's the status LED. It starts during boot and goes green at the end of boot.
https://docs.netgate.com/pfsense/en/latest/solutions/sg-5100/io-ports.html#other-ports-and-indicators

The fact it was red indicates it halted or rebooted. If it had crashed out completely it would have remained at its current state, green.

Just copy/pasting some of the console output might have been useful. Or it could have been meaningless garbage.

Steve

gabacho4

@stephenw10 very interesting. Despite being red it was certainly working. Once I got the garble stopped I had an ash prompt which allowed me to issue commands, such as the reboot one that I issued. I was able to watch the device boot to a certain point before the terminal would go berserk. Should it happen again I'll capture some of the output. I know I wasn't much help on the troubleshooting front. Unfortunately my need to ensure internet service was available forced me to take the most direct action to meet that demand.

gabacho4

@stephenw10 I'm back! I had yet another crash tonight. The router became unresponsive: no internet, no DHCP, dead. I was able to console in again and had the never-ending line of ........................... running across the screen for eternity. I hit CTRL-C again and, for reasons I can't explain, while I was sitting there dreading yet another install and restore, the device kicked and I got to the main console screen with interfaces, IP addresses, etc. Oddly, the router had reverted to a state where my WAN was on PPPoE - I haven't had that for 6+ months now. Regardless, I logged in via the GUI and had a crash notice on the dashboard. I was able to download two files. Can be found at files.

Again, I'm no guru, but I did see panic in there, which never seems like a good thing with *nix. I would greatly appreciate any information you all at Netgate can provide about what you see and whether it is something already known. I was yet again making some changes to WireGuard (MTU and MSS) when this issue occurred. I'm getting very paranoid about making any changes right now lest I press my luck and crash this sucker again. Cannot wait for 21.05 or 21.02 pX, whichever comes first and corrects some of the weirdness. Thanks in advance!

stephenw10

Mmm, that does look like something in WireGuard from the backtrace:

db:0:kdb.enter.default>  bt
Tracing pid 75993 tid 100526 td 0xfffff801708b0740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe002cdb4c10
vpanic() at vpanic+0x197/frame 0xfffffe002cdb4c60
panic() at panic+0x43/frame 0xfffffe002cdb4cc0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe002cdb4d20
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe002cdb4d70
trap() at trap+0x286/frame 0xfffffe002cdb4e80
calltrap() at calltrap+0x8/frame 0xfffffe002cdb4e80
--- trap 0xc, rip = 0xffffffff80d84c37, rsp = 0xfffffe002cdb4f50, rbp = 0xfffffe002cdb4fd0 ---
__mtx_lock_sleep() at __mtx_lock_sleep+0xd7/frame 0xfffffe002cdb4fd0
wg_queue_out() at wg_queue_out+0x21b/frame 0xfffffe002cdb5010
wg_transmit() at wg_transmit+0xda/frame 0xfffffe002cdb5070
pf_test() at pf_test+0x22f0/frame 0xfffffe002cdb52b0
pf_test() at pf_test+0x20f6/frame 0xfffffe002cdb54f0
pf_check_out() at pf_check_out+0x1d/frame 0xfffffe002cdb5510
pfil_run_hooks() at pfil_run_hooks+0xa1/frame 0xfffffe002cdb55b0
ip_output() at ip_output+0xb4f/frame 0xfffffe002cdb56f0
udp_send() at udp_send+0xbbe/frame 0xfffffe002cdb57f0
sosend_dgram() at sosend_dgram+0x348/frame 0xfffffe002cdb5850
sosend() at sosend+0x50/frame 0xfffffe002cdb5880
kern_sendit() at kern_sendit+0x19d/frame 0xfffffe002cdb5920
sendit() at sendit+0x19c/frame 0xfffffe002cdb5970
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe002cdb59c0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe002cdb5af0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe002cdb5af0
--- syscall (133, FreeBSD ELF64, sys_sendto), rip = 0x800c421ca, rsp = 0x7fffdfbfb3f8, rbp = 0x7fffdfbfb440 ---
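For anyone reading a trace like this for the first time: the frames above the "--- trap" line are the kernel handling the fault, and the frames between "--- trap" and "--- syscall" are what was executing when it happened. The sketch below is a rough illustration of pulling out that middle section; the function name is made up for the example and the sample is a few lines copied from the trace above.

```python
import re

# Matches lines such as "wg_transmit() at wg_transmit+0xda/frame 0x..."
FRAME_RE = re.compile(r"^(\w+)\(\) at ")

def frames_below_trap(bt_text):
    """Return function names listed between the '--- trap' and '--- syscall' markers."""
    frames = []
    in_faulting_section = False
    for line in bt_text.splitlines():
        line = line.strip()
        if line.startswith("--- trap"):
            in_faulting_section = True
            continue
        if line.startswith("--- syscall"):
            break
        if in_faulting_section:
            match = FRAME_RE.match(line)
            if match:
                frames.append(match.group(1))
    return frames

SAMPLE = """\
trap() at trap+0x286/frame 0xfffffe002cdb4e80
calltrap() at calltrap+0x8/frame 0xfffffe002cdb4e80
--- trap 0xc, rip = 0xffffffff80d84c37, rsp = 0xfffffe002cdb4f50, rbp = 0xfffffe002cdb4fd0 ---
__mtx_lock_sleep() at __mtx_lock_sleep+0xd7/frame 0xfffffe002cdb4fd0
wg_queue_out() at wg_queue_out+0x21b/frame 0xfffffe002cdb5010
wg_transmit() at wg_transmit+0xda/frame 0xfffffe002cdb5070
pf_test() at pf_test+0x22f0/frame 0xfffffe002cdb52b0
--- syscall (133, FreeBSD ELF64, sys_sendto), rip = 0x800c421ca, rsp = 0x7fffdfbfb3f8, rbp = 0x7fffdfbfb440 ---
"""

frames = frames_below_trap(SAMPLE)
wg_frames = [name for name in frames if name.startswith("wg_")]
print(frames)
print(wg_frames)
```

The first frame after the trap marker is where the fault occurred (a mutex lock), and the wg_ frames directly beneath it show it was called from the WireGuard transmit path.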

Though the process running at the time of the panic appears to be unbound:

<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<6>wg0: link state changed to DOWN
<6>wg0: sc=0xfffff8000e00dc00
<6>wg0: link state changed to UP
<6>wg0: link state changed to DOWN
<6>wg0: sc=0xfffff8000e00dc00
<6>wg0: link state changed to UP
<6>wg1: link state changed to DOWN


Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 18
fault virtual address	= 0x410
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80d84c37
stack pointer	        = 0x28:0xfffffe002cdb4f50
frame pointer	        = 0x28:0xfffffe002cdb4fd0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 75993 (unbound)
trap number		= 12
panic: page fault
cpuid = 3
time = 1615314653
KDB: enter: panic

The console buffer shows quite a few WireGuard interface state changes. Were you making those or was it losing connection?
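A quick way to tally those state changes from a saved console buffer is sketched below. This is only an illustration: the function name is made up, and the sample is the handful of lines quoted above rather than the full crash report.

```python
import re
from collections import Counter

# Matches lines such as "<6>wg0: link state changed to DOWN"
LINK_RE = re.compile(r"^<\d+>(\w+): link state changed to (UP|DOWN)$")

def count_link_changes(console_text):
    """Count link state change messages per interface."""
    counts = Counter()
    for line in console_text.splitlines():
        match = LINK_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

SAMPLE = """\
<4>matchaddr failed
<6>wg0: link state changed to DOWN
<6>wg0: sc=0xfffff8000e00dc00
<6>wg0: link state changed to UP
<6>wg0: link state changed to DOWN
<6>wg0: sc=0xfffff8000e00dc00
<6>wg0: link state changed to UP
<6>wg1: link state changed to DOWN
"""

counts = count_link_changes(SAMPLE)
for interface, changes in sorted(counts.items()):
    print(interface, changes)
```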

Steve

gabacho4

@stephenw10 it was probably me. I was trying various MTU/MSS combinations to see if they made any discernible difference. I have had issues with the WireGuard gateways reporting between 6 and 15 percent packet loss, as well as very slow page loading on various websites. Figured MTU/MSS was a good place to start. It really seems like things just aren't 100% with WireGuard. With my OpenVPN configurations I have no issues with packet loss or page loading.

It really seems like the router struggles to figure out how to route requests when there are multiple WireGuard gateways. I've noticed weirdness even with OpenVPN where, despite having PBR, my LAN devices will have a local WAN IP. If I restart the OpenVPN service, those devices then show a VPN IP instead. 21.02 seems to have some pretty rough edges compared to 2.4.5p1. I'm not angry or anything but just hope the next update is pushed out soon to resolve the ones you all know about. I don't really want to have to rely on applying patches one at a time. Do you know if another 21.02 point release is expected before 21.05? I'm trying to be strong but man, I've had some issues I've never experienced before and have really fought the urge to roll back.

stephenw10

Ah, looks like you're hitting this: https://redmine.pfsense.org/issues/11585

gabacho4

@stephenw10 I’d seen that one, along with another that causes a panic if there are multiple changes/saves in a short timeframe. Glad I’m not experiencing anything new. Here’s to hoping a fix comes very soon.