2.2.6 nanobsd - crashes/reboots - have console kernel dump, what next?

ZPrime

I think the subject covers it… I'm on a Soekris net6501-70. Was running nanobsd from mSATA SSD and the router started rebooting itself (and occasionally not coming back up / hanging at BIOS looking for disk), so I thought maybe the mSATA disk was starting to go. Backed up config, created a fresh nanobsd USB stick instead, removed mSATA, and booted the machine from USB, then restored the same config into that USB install.

Problem continues (except now it reboots correctly since net6501 handles USB better than SATA ::) ).

The serial console on the Soekris outputs to a Perle ioLAN ethernet device (this is one of the biggest reasons I won't purchase an official SG-series unit from the pfSense store -- how am I supposed to get out-of-band remote console access with a damn USB port?), and I happened to be logged in when it crashed/rebooted again. Got the entire crashdump on my terminal window, and saved it to a file.

I don't have the slightest clue how to parse through this and attempt to determine where the cause lies. 2.2.5 never crashed like this, so I suspect it might be something that changed in 2.2.6. I have a single IKEv1 ipsec connection with a bunch of child/phase2 SAs, and I think this might be where the bug lies... but that's only based on knowing that strongswan was updated in 2.2.6, and it may actually have nothing to do with the problem. It would be nice to know if it's somehow hardware related as well; I do have a spare net6501 and spare PSU and I was going to try swapping those one at a time to rule out hardware.

Here's a Gist link to the full dump.

Any help would be greatly appreciated here. I have no other FreeBSD machine to debug with, although I suppose I could spin something up in Parallels if I had to (except I'd still have no clue how to deal with the dump after installing BSD!)

ZPrime

I hate to talk to myself, but I may have traced the cause down more concretely to the ipsec subsystem.

The only other change I made recently was to alter my sole ipsec Ph1 entry and disable DPD, plus alter the hash type to SHA512 from SHA1. I then altered the hashes on all of the Ph2 entries to also use SHA512. I made corresponding changes on the remote side as well so both ends of the tunnel matched (remote side being Palo Alto OS 6.x).

I reversed the hashing change (put everything back to SHA1) after seeing the crashing, thinking maybe the increased CPU usage from SHA512 was somehow overheating the Soekris box (unlikely, since loadavg barely moved and the tunnel doesn't see that much traffic).

It looks like disabling DPD may have been the problem. I've re-enabled DPD on both sides and the pfSense box has now been up for ~12 hours without issue, albeit without much ipsec traffic either. A day ago, when I was pushing more ipsec, it died after being up for only a few hours.

It seems like there's something funky in the IKEv1 re-key / phase 2 recycling process when DPD is disabled, and apparently the code path is avoided (or never reached) when DPD is on?

For reference / testing, tunnels are using the following settings now:
Phase 1:
IKEv1, aggressive mode (my side is on dynamic IP and remote end doesn't support hostnames for ipsec, so stuck with aggressive mode)
Local ID / remote id "user distinguished name" - in the form of "routername@localdomain.ext" and "routername@remotedomain.fqdn"
AES-256 encryption (no GCM, not supported by remote)
SHA1 transform
DH Group 14 (highest supported by remote side)
24 hour lifetime
DPD enabled

Phase 2 entries (there are 16 for this single Ph1 entry, required due to the "LAN bypass" option not being exposed more flexibly in pfSense):
AES-256, SHA1
DH Group 14
8 hour lifetime
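
For anyone trying to reproduce this, here is a rough sketch of the settings above written down as plain data and rendered into strongSwan `ipsec.conf`-style text. It is an illustration only: the conn name and the `render_conn` helper are made up, the keyword mapping is my assumption about how these GUI settings translate, and this is not the file pfSense actually generates. Subnets and authentication are left out because they aren't listed above.

```python
# Hypothetical sketch: the tunnel settings from this post as plain data,
# rendered into ipsec.conf-style text. Not pfSense's real generated config.

DH_GROUPS = {14: "modp2048"}  # IKE DH group 14 is the 2048-bit MODP group

phase1 = {
    "keyexchange": "ikev1",
    "aggressive": "yes",
    "leftid": "routername@localdomain.ext",
    "rightid": "routername@remotedomain.fqdn",
    "ike": f"aes256-sha1-{DH_GROUPS[14]}!",
    "ikelifetime": "24h",
    "dpdaction": "restart",  # assumption: "none" would mean DPD disabled
}

phase2 = {
    "esp": f"aes256-sha1-{DH_GROUPS[14]}!",
    "lifetime": "8h",
}


def render_conn(name, *sections):
    """Render one 'conn' stanza from one or more key/value sections."""
    lines = [f"conn {name}"]
    for section in sections:
        for key, value in section.items():
            lines.append(f"    {key} = {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    # "example-tunnel" is a placeholder name, not one from the real config.
    print(render_conn("example-tunnel", phase1, phase2))
```

Under this mapping, the crashy configuration would differ from the stable one only in the `dpdaction` line.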

Yesterday when it was still crashy (and when the above-linked dump was output) I had it configured as above except DPD was disabled on both sides of the tunnel. I need to give it another day to make sure it stays relatively stable as confirmation that DPD is the culprit.

Is it generally advised to have DPD enabled? I know that both ends of a tunnel should always match. However, I figured "less options = less complicated = better" but maybe that's not the case here.

Guest

I think the subject covers it… I'm on a Soekris net6501-70.

As far as I know, the Soekris net6501 very often has problems with pfSense.
And if 2.2.5 was running well on the net6501, why did you upgrade it?

Was running nanobsd from mSATA SSD and the router started rebooting itself (and occasionally
not coming back up / hanging at BIOS looking for disk),

NanoBSD is, in my eyes, more for a CF card or USB stick; for mSATA a full install would be better.
Why were you using NanoBSD on the mSATA? Did you make it writable before upgrading?
Did you make custom changes on your pfSense firewall without creating a loader.conf.local
file, so that all your custom settings got overwritten?

so I thought maybe the mSATA disk was starting to go.
Only based on the upgrade process?

Backed up config, created a fresh nanobsd USB stick instead, removed mSATA, and booted the machine from USB, then restored the same config into that USB install.

Was pfSense running well before you restored the config backup?
Perhaps it is something in your config backup that causes the crash.

cmb

If you're in a position to try the latest 2.3, I'm pretty confident that specific crash is fixed in FreeBSD 10.2-STABLE and hence won't be an issue there.

ZPrime

cmb, I appreciate the reply. I'd rather wait for 2.3 to be closer to release before jumping in there - I need the connection for work, plus my wife would grumble if she doesn't have her Netflix. ;)

Re-enabling DPD seems to avoid the bug, or at least lessen its frequency (it hasn't happened in close to 20 hours now, whereas before it seemed to pop up every few hours at times).

It had nothing to do with the hash algo; I went back to SHA-512 and it's not having a problem with that.

I did stumble across a minor bug related to hash algorithm choice though – the remote side was enabled for both SHA-512 and SHA-1, with SHA-512 set as highest priority. pfSense negotiated phase2 entries at SHA-1 even though the more secure algo was available. Either the webui needs an "ordering" interface so the user has control over which gets chosen, or the back-end code needs to sort the algos so the most secure is preferred automatically. For now I just set pfSense to offer only the single hash algo of SHA-512, which works fine, but with multiple options selected, my expectation was for most secure to win (or at least for the remote side's "most preferred" option to win... but maybe that "preference" means nothing).
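
To make the "sort the algos" idea concrete, here is a minimal sketch of what I mean. It ranks hash algorithms by digest size and picks the strongest one both sides offer. The function names and the strength table are my own for illustration; this is not how strongSwan or the pfSense back end actually orders proposals.

```python
# Illustrative only: rank hash algorithms by digest size so the strongest
# one is preferred. The table and helpers are assumptions for this sketch,
# not strongSwan's or pfSense's internal ordering.

HASH_BITS = {"sha1": 160, "sha256": 256, "sha384": 384, "sha512": 512}


def order_by_strength(selected):
    """Return the selected algorithms, largest digest first."""
    return sorted(selected, key=lambda name: HASH_BITS[name.lower()], reverse=True)


def pick_common(local, remote):
    """Pick the strongest locally selected algorithm the remote also offers."""
    remote_set = {name.lower() for name in remote}
    for name in order_by_strength(local):
        if name.lower() in remote_set:
            return name
    return None  # no overlap, negotiation would fail


if __name__ == "__main__":
    # Both sides offer SHA-1 and SHA-512, as in the case described above.
    print(pick_common(["sha1", "sha512"], ["sha512", "sha1"]))
```

With ordering like this, selecting both SHA-1 and SHA-512 in the webui would still end up on SHA-512 whenever the remote side supports it.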

cmb

Appreciate the feedback.

Usually strongswan picks the strongest option where multiple are chosen, like AES auto defaulting to 256 bit. racoon did the opposite there at times, with AES auto choosing 128; after the upgrade to 2.2.x the preference switched to 256. That is almost always fine, but some people using glxsb crypto accelerators, which don't work with 256 bit, had issues. I'll check into that.

@bradenmcg:

cmb, I appreciate the reply. I'd rather wait for 2.3 to be closer to release before jumping in there - I need the connection for work, plus my wife would grumble if she doesn't have her Netflix. ;)

I hear that. Though outside of packages that haven't been Bootstrap-converted yet, 2.3 is solid. That's all we use internally at home, including those who work from home.