pfSense 2.2.x panics with "Sleeping thread owns a non-sleepable lock"
-
Admins: if there is a better place to put this, please move it!
After updating to 2.2.x, I've been seeing occasional panics under load, and over the last few days (in which we've been doing some offsite backups), crashes have happened much more frequently. Yesterday, we had four crashes, a couple within less than an hour of each other.
I'm on 2.2.3 right now on this box.
While I don't have historical crash logs for the previous crashes, in the most recent cases, it's been a kernel panic with the main cause "Sleeping thread (tid 100067, pid 12) owns a non-sleepable lock".
I read in this bug report: https://redmine.pfsense.org/issues/4685 that it's a problem with the underlying FreeBSD code, and I see that a fix has been committed to base on June 17: https://reviews.freebsd.org/D2828
The pfSense bug report 4685 says that the target is 2.2.5 with the change from the initial target of 2.3 being made just a couple of days ago. I'm not versed in how the pfSense team keeps track of which FreeBSD base system is included with which pfSense release, but based on this, I figure that the fix is not in 2.2.4.
My question is: does the nightly 2.2.5 snapshot now contain this fix? And, should I update to 2.2.5 snapshot to get this fix, or wait a little, if a release 2.2.5 is imminent?
Or is there another way to skin this cat that I'm not thinking of, like taking just the kernel or some replacement object from a 2.2.5 snapshot and dropping it into my existing system?
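For anyone who wants to check whether their own crash reports show the same panic, here is a minimal Python sketch that searches a saved crash log for the message quoted above. The function name and the sample text are illustrative only; the pattern is built solely from the one panic line in this post, so other variants of the message may not match.

```python
import re

# Pattern built from the single panic line quoted in this thread.
# tid and pid differ between crashes, so they are captured as numbers.
PANIC_RE = re.compile(
    r"Sleeping thread \(tid (?P<tid>\d+), pid (?P<pid>\d+)\) "
    r"owns a non-sleepable lock"
)


def find_sleeping_thread_panics(text):
    """Return a list of (tid, pid) tuples, one per matching panic line."""
    return [
        (int(m.group("tid")), int(m.group("pid")))
        for m in PANIC_RE.finditer(text)
    ]


if __name__ == "__main__":
    # Sample input: the line from this post, not a full crash dump.
    sample = "Sleeping thread (tid 100067, pid 12) owns a non-sleepable lock"
    print(find_sleeping_thread_panics(sample))  # [(100067, 12)]
```

Feeding it the text of each saved crash report gives a quick count of how many reboots were caused by this particular panic rather than something else.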
-
As far as I can tell, some patches were merged in 2.2.4, but it appears the problem is still not completely fixed (so it probably isn't fixed in the 2.2.5 snapshots either).
Perhaps you could provide some information about your config and situation, and a way to replicate the panic, to help narrow down the causes; it seems to be hard to reproduce in a lab environment.
-
My hardware is a fitPC2i, with an Atom processor. The two built-in Gig-E ports are using the Realtek driver. Verizon FiOS on one side, my company LAN on the other. I'm not using the wireless in the fitPC, so it's a pretty basic two-interface firewall config, with a touch of NAT and about a page's worth of explicit rules. No traffic shaping, no proxy, no captive portal. I'm not running any third-party packages on this pfSense install.
I bought the hardware in 2010, and about a year ago, I replaced the spinning disk in it with an industrial SSD.
In July, I updated to pfSense 2.2 from 2.1.1. Since July, there have been 19 spontaneous reboots.
Conversely, I have another fitPC2i (identical hardware) at another location, and it's running 2.1.1. No spontaneous reboots I can recall.
The panic log is exactly the same (excepting PIDs) as the one in bug report 4685, so I'm eager to test if this patch to if_ether.c in D2828 addresses the problem.
What other intel can I provide?
-
Replacing FreeBSD 10.1 with FreeBSD 11, following the steps in the link below, is one possibility.
https://forum.pfsense.org/index.php?topic=83785.msg459222#msg459222
What Intel Atom chip is it?
There are an awful lot of complaints about Intel Atom chips randomly crashing, for any number of reasons on any number of platforms, if you search the internet.
Have you tried switching off Hyper-Threading in the BIOS? Some people report improvements (i.e. no more crashes), so it might be worth a try.
-
It's an Intel Atom Z530.
I'll try turning off HT the next time it crashes.
While there may be complaints about Atom crashes on the internet, this box was extremely stable up until the pfSense switch to FreeBSD 10.1p13.
Putting FBSD11 on this router is probably more effort than I want to put into the problem. I suspect I might downgrade to pfSense 2.1.x instead.
-
The initial fix that went in did fix a problem, but it wasn't the root problem of bug #4685. The root issue there was identified and one of our developers committed a fix into FreeBSD in the past few days. It's not in any snapshots yet.
The only trigger we're aware of is proxy ARP VIPs. Do you have any proxy ARP VIPs configured? Changing those to IP aliases instead would prevent that from occurring.
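To check a config backup for proxy ARP VIPs without clicking through the GUI, something like the following Python sketch could be used. It assumes that VIPs are stored under `virtualip/vip` in config.xml with a `mode` element set to `proxyarp`, and that `interface`, `subnet` and `subnet_bits` elements are present; that layout is an assumption here and is not confirmed anywhere in this thread, so verify it against your own backup. The sample XML and addresses are made up.

```python
import xml.etree.ElementTree as ET


def proxy_arp_vips(config_xml):
    """Return (interface, subnet, subnet_bits) for each proxy ARP VIP.

    Assumes the VIP layout <virtualip><vip><mode>proxyarp</mode>...;
    this schema is an assumption, check it against a real config.xml.
    """
    root = ET.fromstring(config_xml)
    found = []
    for vip in root.findall("./virtualip/vip"):
        mode = (vip.findtext("mode") or "").strip()
        if mode == "proxyarp":
            found.append((
                vip.findtext("interface"),
                vip.findtext("subnet"),
                vip.findtext("subnet_bits"),
            ))
    return found


# Hypothetical config fragment using documentation-range addresses.
SAMPLE = """<pfsense>
  <virtualip>
    <vip>
      <mode>proxyarp</mode>
      <interface>wan</interface>
      <subnet>198.51.100.10</subnet>
      <subnet_bits>32</subnet_bits>
    </vip>
    <vip>
      <mode>ipalias</mode>
      <interface>wan</interface>
      <subnet>198.51.100.11</subnet>
      <subnet_bits>32</subnet_bits>
    </vip>
  </virtualip>
</pfsense>"""

if __name__ == "__main__":
    for iface, subnet, bits in proxy_arp_vips(SAMPLE):
        print(f"proxy ARP VIP on {iface}: {subnet}/{bits}")
```

An empty result would suggest that no VIP of that type is configured, in which case the trigger described above would not apply.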
-
Thanks cmb.
I have a single Proxy ARP VIP configured on this machine. I'm using it to NAT to several different internal services.
Can I simply click the "IP Alias" radio button and reload the config with little other impact to my config?
I see in the chart in the docs for VIPs that IP Aliases and Proxy ARPs have similar enough features.
-
Can I simply click the "IP Alias" radio button and reload the config with little other impact to my config?
Yes, everything else will remain the same.
-
OK, I've changed it.
Should I update to 2.2.4 as well, or see how this change affects stability first?
-
Past experience would lead me to tell you "one thing at a time". :) See how it affects stability on your current release. If things improve, plan an update to 2.2.4. If you do both, you can't be sure which one fixed it.
-
Very often true. In this case, the original issue is understood well enough to know that upgrading won't change things, and that switching from proxy ARP to IP alias will prevent the issue. So I wouldn't hesitate to upgrade in this case.
-
It's been a few days since I changed the Proxy ARP to an IP Alias.
No crashes so far!
I'll do the update during the next maintenance window.
Thanks for the help!