PF was wedged/busy and has been reset.

dennypage

I have an SG-2220 that has just been upgraded from 2.2.6 to 2.3. Following the upgrade, I am receiving the following two notices each time the unit boots:


pf_busy

 PF was wedged/busy and has been reset.	@ 2016-05-03 11:58:53

Filter Reload

 There were error(s) loading the rules: pfctl: DIOCXCOMMIT: Device busy - The line in question reads [0]: @ 2016-05-03 11:58:54

The system log entries are:


May 3 11:58:53	php-fpm	272	rc.newwanip: New alert found: There were error(s) loading the rules: pfctl: DIOCXCOMMIT: Device busy - The line in question reads [0]:
May 3 11:58:53	php-fpm	272	rc.newwanip: New alert found: PF was wedged/busy and has been reset.

Both WAN addresses (IPv4 & IPv6) are DHCP. If I disable IPv6 on the WAN interface, the problem still occurs.

The only packages that are installed are AutoConfigBackup and Service_Watchdog.

Suggestions welcome.

cmb

That consistently happen on every reboot? I've seen it on rare occasion but never with enough reproducibility to track it down.

It's not harmful, but there's something that should be tracked down and fixed.

dennypage

Can't guarantee every time, but pretty close. I rebooted the unit a lot today as part of testing. I'm not sure of the exact number, but somewhere around 20 times. Only twice did it not happen.

tga

Hi Folks,
i've got the same issue here. We have 3 SG-2220 here. All three of them give this message after reboot, but as far as i can tell its also not every reboot.

@cmb: is it possible, that its just a task taking longer on these kind of machines, causing a timer to hit and causing this error without actually having a problem?

dorap

Hi Folks,

nearly*** the same here: I got these messages right after reboot - ratio: 3 reboots / 2 times (1 time without a message).

Cheers
Dora

***Nearly means: I'm running pfsense 2.3.1, but on pcengine's apu2c4

NOYB

I've seen wedged notification a few times too. But not consistently.

2.3-RELEASE (i386)
built on Mon Apr 11 18:12:06 CDT 2016
FreeBSD 10.3-RELEASE

Intel(R) Pentium(R) 4 CPU 2.66GHz, 2 GB RAM

dennypage

Anything I can offer to help track this down?

cmb

Could you email me a sanitized copy of your config, Denny? Via status.php would be fine, that'll trim out the most sensitive parts.

cmb

@tga:

@cmb: is it possible, that its just a task taking longer on these kind of machines, causing a timer to hit and causing this error without actually having a problem?

I don't think it's hardware-specific. It could be something where the slower the CPU the more likely you are to hit it, but a 2220 CPU runs circles around an ALIX and I've never seen it on an ALIX.

Anyone else that would be willing to share a config backup, please email to cmb at pfsense dot org with a link to this thread.

dennypage

Done.

cmb

Thanks Denny.

I have a system internally now that's replicating that issue, though only with ALTQ and OpenVPN enabled. I think the root cause for you and others is probably the same though. Could you try adding:

if (platform_booting()) {
   return;
}

to the top of /etc/rc.newwanip and rc.newwanipv6 under the require_once lines, and see if that stops happening?

dennypage

Yes, that appears to fix it. I tested ten times or so with and without the change. With the platform_booting() check in /etc/rc.newwanip, the problem did not occur. Without the platform_booting() check in /etc/rc.newwanip, the problem occurred each time. The change for IPv6 had no effect in my testing.

dorap

I've also inserted the conditional-reboot-return-code and tested some reboots:
The messages are gone and the problem seems to be solved. Thanks!

This is the (hopefully correct applied) patch I've used:


--- /etc/rc.newwanip.orig       2016-05-06 09:04:29.768137000 +0200
+++ /etc/rc.newwanip    2016-05-06 09:05:24.885368000 +0200
@@ -43,6 +43,11 @@
 require_once("IPv6.inc");
 require_once("rrd.inc");

+if (platform_booting()) {
+   /* see https://forum.pfsense.org/index.php?topic=111236.0 */
+   return;
+}
+
 function restart_packages() {
        global $oldip, $curwanip, $g;

--- /etc/rc.newwanipv6.orig     2016-05-06 09:05:57.001394000 +0200
+++ /etc/rc.newwanipv6  2016-05-06 09:06:12.371604000 +0200
@@ -44,6 +44,11 @@
 require_once("services.inc");
 require_once("rrd.inc");

+if (platform_booting()) {
+   /* see https://forum.pfsense.org/index.php?topic=111236.0 */
+   return;
+}
+
 function restart_packages() {
        global $oldipv6, $curwanipv6, $g;

cmb

Great, thanks for the feedback.

@dorap:

This is the (hopefully correct applied) patch I've used:


--- /etc/rc.newwanip.orig       2016-05-06 09:04:29.768137000 +0200
+++ /etc/rc.newwanip    2016-05-06 09:05:24.885368000 +0200
@@ -43,6 +43,11 @@
 require_once("IPv6.inc");
 require_once("rrd.inc");

+if (platform_booting()) {
+   /* see https://forum.pfsense.org/index.php?topic=111236.0 */
+   return;
+}
+
 function restart_packages() {
        global $oldip, $curwanip, $g;

--- /etc/rc.newwanipv6.orig     2016-05-06 09:05:57.001394000 +0200
+++ /etc/rc.newwanipv6  2016-05-06 09:06:12.371604000 +0200
@@ -44,6 +44,11 @@
 require_once("services.inc");
 require_once("rrd.inc");

+if (platform_booting()) {
+   /* see https://forum.pfsense.org/index.php?topic=111236.0 */
+   return;
+}
+
 function restart_packages() {
        global $oldipv6, $curwanipv6, $g;

That's correct, yes. It won't be what the actual fix is, but it'll workaround the issue and be fine in most circumstances.

dorap

hmm, some minutes ago I rebooted (only) the switch all Ports (and vlans) from my pfsense (2.3.1) box are connected to, and pfsense reports again (the above patch is still applied):


pf_busy

    PF was wedged/busy and has been reset. @ 2016-05-08 13:04:57
    PF was wedged/busy and has been reset. @ 2016-05-08 13:05:58

Filter Reload

    There were error(s) loading the rules: pfctl: DIOCXCOMMIT: Device busy - The line in question reads [0]: @ 2016-05-08 13:04:58
    There were error(s) loading the rules: pfctl: DIOCXCOMMIT: Device busy - The line in question reads [0]: @ 2016-05-08 13:05:59

cmb

That change only affects hitting that issue during booting, as that was the only time we seemed to see it occur.

The latest 2.3.1 has a fix for the booting issue, which is what everyone else here was hitting. That appears to be the same root cause, but upon linkup of multiple NICs at exactly the same time. There's a larger architectural issue to be addressed there in the future to fully address the root issue. In the mean time, the message is ugly, but doesn't impact functionality.