Adventures upgrading 2 SG-5100 to 21.02
-
Thought I'd share my experience upgrading two SG-5100s to 21.02 and attempting to run WireGuard. I'm curious whether others have encountered similar issues. I apologize in advance if this isn't the expected location for such a post; I couldn't really find a place that made more sense.
Typically I do fresh installs of pfSense whenever a new version rolls out. However, this time, I decided to just upgrade the devices in place. Both updated successfully with no problems at all. One of my 5100s is local to me while the other resides at a location about 7000 miles away. Yes, I know, don't mess with things far, far away. Yes, that is typically my rule. However, due to COVID-19 and being unable to get back to the US this past summer, my remote 5100 was still running 2.4.4 and was therefore two versions behind. As it is the edge device for my network, I didn't feel it was smart to leave it that far out of date.
While configuring WireGuard on my local 5100 one evening, I changed the port that the remote end listened on and hit save. Right after, my 5100 went unresponsive and my internet died. I could not SSH into the box. When I tried the console, it was just a steady stream of -------- and . and + across the screen forever. I had to CTRL-C it to get it to stop. So I issued a reboot command and watched the console to see if things corrected themselves. Unfortunately, I was met with the same issue. I also noticed that the 2nd LED in the row of 3 on the 5100 was red. I think I'm a smart guy, but I'm not necessarily a FreeBSD buff, so the troubleshooting I could do was limited. Rather than have no internet service in the house, I decided to do a fresh install and restore my config (always keep backups!). I've had no issues with it since then.
We all know that good things (and apparently bad things) come in twos. So, last night at about midnight, I was making some WireGuard changes on the remote peer when suddenly the GUI went unresponsive. I tried to SSH in to both the internal LAN and external WAN IPs with no success. My IPsec and OpenVPN (backup connection) tunnels would not connect. A ping to my external WAN IP (via DDNS FQDN) got no response.
I had my in-laws go over to the house and power cycle the router. No luck. My father-in-law noted that the 2nd of the 3 LEDs on the front of the 5100 was red. Gulp. I had him connect to the 5100 via console and, sure enough, there was the same steady stream of -----, ., and + that I had seen on my local device. And so we began the adventure of reinstalling pfSense and restoring the config. Some 4.5 hours later, the tunnels came up, the WAN began to respond to pings, and I could breathe much easier.
At this point, I am running my old OpenVPN and IPsec configuration and am quite hesitant to press my luck with WireGuard for the time being. Unfortunately, I have no logs to provide to Netgate or anyone else. I wish I did but, again, restoring functioning internet at both locations as soon as possible was paramount. I wanted to capture the situation in the hope that it confirms things already reported by others or things that might be reported later. In BOTH instances the death of the router occurred while configuring WireGuard. I am very confident that there is something haywire going on there.
I do greatly thank Netgate for the great work on the WireGuard implementation and recognize that this was the first release, so there was a greater likelihood that some kinks existed. It's entirely possible that a fresh install might be the solution to the problem and that upgrading left behind some old setting or config that was causing a conflict.
For the record, my WireGuard configuration had Site A and Site B connect and access network resources at either end. Also, Site B had 0.0.0.0/0 in the allowed IPs for the peer so that I could policy route certain devices to the internet on the US side for Amazon Prime Video and other geoblocked services. My config mirrored the client-server config in the pfSense book, with the added outbound NAT rule so that B could access the internet through A. Everything worked great until the routers died. I will report back if I encounter any further issues. Has anyone else seen or heard of the same type of experience?
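For readers wondering why the 0.0.0.0/0 entry matters: WireGuard only sends a packet to a peer if the destination falls inside that peer's allowed IPs, so internet-bound traffic needs a catch-all entry. Here is a minimal Python sketch of that check. The 10.0.2.0/24 subnet and the test address are made-up placeholders, not values from my actual config.

```python
import ipaddress

def peer_accepts(dest, allowed_ips):
    """Return True if dest falls inside any of the peer's allowed IP networks."""
    addr = ipaddress.ip_address(dest)
    return any(addr in ipaddress.ip_network(net) for net in allowed_ips)

# Hypothetical allowed-IP lists for the Site A peer as seen from Site B.
site_to_site_only = ["10.0.2.0/24"]
with_default_route = ["10.0.2.0/24", "0.0.0.0/0"]

internet_host = "203.0.113.10"  # documentation-range address, stands in for a public server

print(peer_accepts(internet_host, site_to_site_only))   # False
print(peer_accepts(internet_host, with_default_route))  # True
```

Policy routing and the outbound NAT rule are still needed on top of this; the sketch only covers the allowed-IPs side.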
-
Hmmm, I have WG running on an SG-5100 and have not hit that or anything close to it.
Do you have any logs showing the actual output you saw?
Both ends were running 21.02(p1)?
Steve
-
@stephenw10 both were running 21.02 p1. I don’t have any logs as my priority was recovery. It might have been possible for me to get something off the devices after I CTRL-C’ed in the console to get the never-ending garbled output to stop, but I had no GUI or SSH access to either device. My Linux/BSD kung fu is not that strong. In retrospect I should have taken pictures at the least, but the instinct to recover quickly, lest I incur the wrath of those who need the internet, was my driving force. I was willing to chalk the first device issue up to a fluke. But 2/2 of my devices barfing, and both in the course of making WireGuard configuration changes? A coincidence seems highly unlikely to me. I’ve done all sorts of madness with OpenVPN and IPsec and never had an issue. Do you have a list of commands I should/could run if something similar occurs again? My appetite to continue experimenting, especially with the remote device being so far away, is essentially nil. The remote device went offline at 0100 for me and I didn’t end up getting back into bed with a working config until 0600.
-
@stephenw10 if I might ask, what does the red LED mean? If you are looking at the 5100, it was the one in the middle of the three. Also, I’ve been running the Mullvad WireGuard client with no issues. My issues both occurred while making changes to a site-to-site config.
-
It's the status LED. It starts out red during boot and goes green at the end of boot.
https://docs.netgate.com/pfsense/en/latest/solutions/sg-5100/io-ports.html#other-ports-and-indicators
The fact it was red indicates it halted or rebooted. If it had crashed out completely it would have remained in its current state, green.
Just copy/pasting some of the console output might have been useful. Or it could have been meaningless garbage.
Steve
-
@stephenw10 very interesting. Despite being red it was certainly working. Once I got the garble stopped I had an ash prompt which allowed me to issue commands, such as the reboot one that I issued. I was able to watch the device boot to a certain point before the terminal would go berserk. Should it happen again I'll capture some of the output. I know I wasn't much help on the troubleshooting front. Unfortunately my need to ensure internet service was available forced me to take the most direct action to meet that demand.
-
@stephenw10 I'm back! I had yet another crash tonight. The router became unresponsive: no internet, no DHCP, dead. I was able to console in again and had the never-ending line of ........................... running across the screen for eternity. I hit CTRL-C again and, for reasons I can't explain, while I was sitting there dreading yet another install and restore, the device kicked and I got to the main console screen with the interfaces, IP addresses, etc. Oddly, the router had reverted to a state where my WAN was on PPPoE, which I haven't had for 6+ months now. Regardless, I logged in via the GUI and had a crash notice on the dashboard. I was able to download two files. They can be found at files.
Again, I'm no guru, but I did see panic in there, which never seems like a good thing with *nix. I would greatly appreciate any information you all at Netgate can provide about what you see and whether it is something already known. I was yet again making some changes to WireGuard (MTU and MSS) when this issue occurred. I'm getting very paranoid about making any changes right now lest I press my luck and crash this sucker again. I cannot wait for 21.05 or 21.02 pX, whichever comes first and corrects some of the weirdness. Thanks in advance!
-
Mmm, that does look like something in WireGuard from the backtrace:
db:0:kdb.enter.default> bt
Tracing pid 75993 tid 100526 td 0xfffff801708b0740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe002cdb4c10
vpanic() at vpanic+0x197/frame 0xfffffe002cdb4c60
panic() at panic+0x43/frame 0xfffffe002cdb4cc0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe002cdb4d20
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe002cdb4d70
trap() at trap+0x286/frame 0xfffffe002cdb4e80
calltrap() at calltrap+0x8/frame 0xfffffe002cdb4e80
--- trap 0xc, rip = 0xffffffff80d84c37, rsp = 0xfffffe002cdb4f50, rbp = 0xfffffe002cdb4fd0 ---
__mtx_lock_sleep() at __mtx_lock_sleep+0xd7/frame 0xfffffe002cdb4fd0
wg_queue_out() at wg_queue_out+0x21b/frame 0xfffffe002cdb5010
wg_transmit() at wg_transmit+0xda/frame 0xfffffe002cdb5070
pf_test() at pf_test+0x22f0/frame 0xfffffe002cdb52b0
pf_test() at pf_test+0x20f6/frame 0xfffffe002cdb54f0
pf_check_out() at pf_check_out+0x1d/frame 0xfffffe002cdb5510
pfil_run_hooks() at pfil_run_hooks+0xa1/frame 0xfffffe002cdb55b0
ip_output() at ip_output+0xb4f/frame 0xfffffe002cdb56f0
udp_send() at udp_send+0xbbe/frame 0xfffffe002cdb57f0
sosend_dgram() at sosend_dgram+0x348/frame 0xfffffe002cdb5850
sosend() at sosend+0x50/frame 0xfffffe002cdb5880
kern_sendit() at kern_sendit+0x19d/frame 0xfffffe002cdb5920
sendit() at sendit+0x19c/frame 0xfffffe002cdb5970
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe002cdb59c0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe002cdb5af0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe002cdb5af0
--- syscall (133, FreeBSD ELF64, sys_sendto), rip = 0x800c421ca, rsp = 0x7fffdfbfb3f8, rbp = 0x7fffdfbfb440 ---
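For anyone reading a trace like this for the first time: the frames listed after the "--- trap" marker are the code that was running when the fault hit, innermost first. A rough Python sketch for pulling those function names out of a saved trace follows. It assumes one frame per line, the helper name is made up for illustration, and the sample is just a few lines copied from the trace above.

```python
import re

# Matches lines such as "wg_transmit() at wg_transmit+0xda/frame 0xfffffe002cdb5070".
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+$")

def faulting_frames(backtrace):
    """Return function names that appear after the first '--- trap' marker."""
    names = []
    seen_trap = False
    for line in backtrace.splitlines():
        line = line.strip()
        if line.startswith("--- trap"):
            seen_trap = True
            continue
        match = FRAME_RE.match(line)
        if seen_trap and match:
            names.append(match.group(1))
    return names

sample = """\
trap() at trap+0x286/frame 0xfffffe002cdb4e80
calltrap() at calltrap+0x8/frame 0xfffffe002cdb4e80
--- trap 0xc, rip = 0xffffffff80d84c37, rsp = 0xfffffe002cdb4f50, rbp = 0xfffffe002cdb4fd0 ---
__mtx_lock_sleep() at __mtx_lock_sleep+0xd7/frame 0xfffffe002cdb4fd0
wg_queue_out() at wg_queue_out+0x21b/frame 0xfffffe002cdb5010
wg_transmit() at wg_transmit+0xda/frame 0xfffffe002cdb5070
pf_test() at pf_test+0x22f0/frame 0xfffffe002cdb52b0
"""

frames = faulting_frames(sample)
print(frames)
print([name for name in frames if name.startswith("wg_")])
```

The wg_ prefix on the frames closest to the fault is what points at WireGuard here.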
Though the panic appears to be in unbound:
<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<4>matchaddr failed
<6>wg0: link state changed to DOWN
<6>wg0: sc=0xfffff8000e00dc00
<6>wg0: link state changed to UP
<6>wg0: link state changed to DOWN
<6>wg0: sc=0xfffff8000e00dc00
<6>wg0: link state changed to UP
<6>wg1: link state changed to DOWN
Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 18
fault virtual address = 0x410
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80d84c37
stack pointer = 0x28:0xfffffe002cdb4f50
frame pointer = 0x28:0xfffffe002cdb4fd0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 75993 (unbound)
trap number = 12
panic: page fault
cpuid = 3
time = 1615314653
KDB: enter: panic
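The "time" field in that message is a Unix timestamp, and "current process" names whatever was running when the fault hit. If it helps, here is a small Python sketch that pulls both out of a saved panic message. The function name is made up, and the sample is a trimmed copy of the output above.

```python
import re
from datetime import datetime, timezone

def panic_summary(text):
    """Extract the running process and the panic time (UTC) from a panic message."""
    proc = re.search(r"current process\s*=\s*(\d+) \((\w+)\)", text)
    when = re.search(r"\btime = (\d+)", text)
    return {
        "pid": int(proc.group(1)) if proc else None,
        "process": proc.group(2) if proc else None,
        "time_utc": (
            datetime.fromtimestamp(int(when.group(1)), tz=timezone.utc)
            if when else None
        ),
    }

sample = """\
current process = 75993 (unbound)
trap number = 12
panic: page fault
cpuid = 3
time = 1615314653
"""

summary = panic_summary(sample)
print(summary["process"], summary["pid"])
print(summary["time_utc"].isoformat())  # 2021-03-09T18:30:53+00:00
```

Knowing the exact time makes it easier to line the panic up against whatever config change was being saved.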
The console buffer shows quite a few WireGuard interface state changes. Were you making those or was it losing connection?
Steve
-
@stephenw10 it was probably me. I was trying various MTU/MSS combinations to see if they made any discernible difference. I have had issues with the WireGuard gateways reporting between 6 and 15 percent packet loss, as well as very slow page loading on various websites. I figured MTU/MSS was a good place to start. It really seems like things just aren't 100% with WireGuard. I have no such issues with packet loss or page loading on my OpenVPN configurations.
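For reference, this is the arithmetic I understand to be behind the usual MTU/MSS starting points, as a small Python sketch. The header sizes are the standard ones (WireGuard adds 32 bytes per data packet on top of UDP and the outer IP header); the function names are just for illustration, and the right values for a given link may still differ.

```python
WG_HEADER = 32               # 4 type/reserved + 4 receiver index + 8 counter + 16 auth tag
UDP_HEADER = 8
IP_HEADER = {4: 20, 6: 40}   # IPv4 / IPv6 header sizes without options
TCP_HEADER = 20              # without TCP options

def tunnel_mtu(link_mtu, outer_ip_version=4):
    """Largest inner packet that fits in one outer packet on the given link."""
    return link_mtu - IP_HEADER[outer_ip_version] - UDP_HEADER - WG_HEADER

def tcp_mss(mtu, inner_ip_version=4):
    """Largest TCP payload that fits inside a packet of the given MTU."""
    return mtu - IP_HEADER[inner_ip_version] - TCP_HEADER

print(tunnel_mtu(1500, outer_ip_version=6))  # 1420, the common WireGuard default
print(tunnel_mtu(1492, outer_ip_version=4))  # 1432 on a PPPoE link with an IPv4 endpoint
print(tcp_mss(1420))                         # 1380
```

Anything larger than these on the tunnel would have to be fragmented on the way out, which is one possible cause of the kind of slow page loads described above.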
It really seems like the router struggles to figure out how to route requests when there are multiple WireGuard gateways. I've noticed weirdness even with OpenVPN where, despite having PBR, my LAN devices will have a local WAN IP. If I restart the OpenVPN service, those devices then show a VPN IP instead. 21.02 seems to have some pretty rough edges compared to 2.4.5p1. I'm not angry or anything, but I hope the next update is pushed out soon to resolve the ones you all know about. I don't really want to have to rely on applying patches one at a time. Do you know if another release is expected for 21.02 that's not the 21.05 release? I'm trying to be strong but man, I've had some issues I've never experienced before and have really fought the urge to roll back.
-
Ah, looks like you're hitting this: https://redmine.pfsense.org/issues/11585
-
@stephenw10 I’d seen that one, along with another that causes a panic if there are multiple changes/saves in a short timeframe. Glad I’m not experiencing anything new. Here’s to hoping a fix comes very soon.