Pfsync kernel panic after 2.1.5 to 2.2 upgrade - pfsync_undefer_state

stephenw10

Yes, that patch is in 2.2.1 and has been in snaps since Feb.
If you are seeing this problem and are running 2.2.1, can you make sure you keep any crash reports and tell us what hardware you're running?
Thanks.

Steve

Mathiew

I have the same problem when I upgrade pfSense from 2.1.5 to 2.2.1…

pfSense is running on VirtualCenter 2.5. (I never had this before.)

I tried twice, but nothing changed: this message loops on the screen, and the VM uses 100% CPU. There's no crash, but pfSense is useless.

I'm not using any high availability services (only limiters)

Marlenio

Yes, it's the same error on my CARP 2.2.1 setup with limiters.

flofogl

Steve,

I would really like to help but don't know how. Have you been able to reproduce the behavior? I am back on 2.1.5 now and could repeat the update process. Apart from the nonexistent crash reports, what else would you need or find useful?

I have HA and limiters configured and pfSense runs on KVM/QEMU.

As mentioned before, I got "pfsync_undefer_state: unable to find deferred state" printed in the console after the upgrade process on the backup node (after reboot).

I got Mathiew's behavior when I tried to restore my 2.1.5 configuration from the backup node on a fresh install with 2.2.1. In both cases, I had to "physically" shut down the machine.

Florian

stephenw10

So we are able to replicate the continuous log spam but not any sort of crash. Though in our test the box remained running and accessible. We are looking for crash reports really but any info is useful.

Mathiew, you are seeing that on a single VM? No CARP/HA setup at all?
That's interesting. What NICs are you using? What limiters do you have defined?

Steve

Mathiew

One VM only. I tried to import the config file on a fresh pfSense 2.2.1 install, but I had the exact same behavior.

No CARP/HA setup. I use a lot of services: OpenVPN, IPsec, limiters.

I can send you my config file, if needed. I use an E1000 adapter.

Limiters:
00001: 700.000 Kbit/s 0 ms burst 0
q131073 50 sl. 0 flows (1 buckets) sched 65537 weight 0 lmax 0 pri 0 droptail
sched 65537 type FIFO flags 0x1 256 buckets 1 active
mask: 0x00 0x00000000/0x0000 -> 0xffffff00/0x0000
BKT Prot Source IP/port_ Dest. IP/port Tot_pkt/bytes Pkt/Byte Drp
80 ip 0.0.0.0/0 192.168.100.0/0 4 336 0 0 0
00002: 10.000 Mbit/s 0 ms burst 0
q131074 50 sl. 0 flows (1 buckets) sched 65538 weight 0 lmax 0 pri 0 droptail
sched 65538 type FIFO flags 0x1 256 buckets 1 active
mask: 0x00 0x00000000/0x0000 -> 0xffffff00/0x0000
BKT Prot Source IP/port_ Dest. IP/port Tot_pkt/bytes Pkt/Byte Drp
80 ip 0.0.0.0/0 192.168.101.0/0 4 336 0 0 0
00003: 1.000 Mbit/s 0 ms burst 0
q131075 50 sl. 0 flows (1 buckets) sched 65539 weight 0 lmax 0 pri 0 droptail
sched 65539 type FIFO flags 0x1 256 buckets 0 active
mask: 0x00 0xffffff00/0x0000 -> 0x00000000/0x0000
BKT Prot Source IP/port_ Dest. IP/port Tot_pkt/bytes Pkt/Byte Drp
00005: 2.000 Mbit/s 0 ms burst 0
q131077 50 sl. 0 flows (1 buckets) sched 65541 weight 0 lmax 0 pri 0 droptail
sched 65541 type FIFO flags 0x1 256 buckets 1 active
mask: 0x00 0xffffff00/0x0000 -> 0x00000000/0x0000
BKT Prot Source IP/port_ Dest. IP/port Tot_pkt/bytes Pkt/Byte Drp
0 ip 10.0.17.0/0 0.0.0.0/0 11056 14116413 0 0 0
00006: 3.000 Mbit/s 0 ms burst 0
q131078 50 sl. 0 flows (1 buckets) sched 65542 weight 0 lmax 0 pri 0 droptail
sched 65542 type FIFO flags 0x0 0 buckets 1 active
0 ip 0.0.0.0/0 0.0.0.0/0 19 1118 0 0 0
00007: 2.000 Mbit/s 0 ms burst 0
q131079 50 sl. 0 flows (1 buckets) sched 65543 weight 0 lmax 0 pri 0 droptail
sched 65543 type FIFO flags 0x0 0 buckets 0 active
00008: 1.000 Mbit/s 0 ms burst 0
q131080 50 sl. 0 flows (1 buckets) sched 65544 weight 0 lmax 0 pri 0 droptail
sched 65544 type FIFO flags 0x0 0 buckets 1 active
0 ip 0.0.0.0/0 0.0.0.0/0 79 40016 0 0 0
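
For anyone comparing limiter setups across affected boxes, a listing like the one above can be reduced to pipe numbers and bandwidths. This is a minimal sketch that assumes the header lines always look like `00001: 700.000 Kbit/s 0 ms burst 0`; the function name `parse_pipes` is made up for illustration, and other forms of the listing (for example a pipe with no bandwidth limit) are not handled.

```python
import re

# Matches pipe header lines such as "00001: 700.000 Kbit/s 0 ms burst 0".
PIPE_RE = re.compile(r"^(\d{5}):\s+([\d.]+)\s+([KMG]?)bit/s\s+(\d+)\s+ms")

# Multipliers for the unit prefixes seen in the listing.
UNIT = {"": 1, "K": 1_000, "M": 1_000_000, "G": 1_000_000_000}


def parse_pipes(listing):
    """Return {pipe number: bandwidth in bit/s} for each pipe header line."""
    pipes = {}
    for line in listing.splitlines():
        m = PIPE_RE.match(line.strip())
        if m:
            number, value, prefix, _delay = m.groups()
            pipes[int(number)] = round(float(value) * UNIT[prefix])
    return pipes


# A few lines copied from the listing above.
sample = """\
00001: 700.000 Kbit/s 0 ms burst 0
q131073 50 sl. 0 flows (1 buckets) sched 65537 weight 0 lmax 0 pri 0 droptail
00002: 10.000 Mbit/s 0 ms burst 0
00005: 2.000 Mbit/s 0 ms burst 0
"""

print(parse_pipes(sample))
```

Queue and scheduler lines are ignored because they do not start with a five-digit pipe number.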

tdale

@stephenw10 Is there something I can provide to help out? We are running two physical machines, Dell CS24s with dual CPUs and 16GB RAM each, and these are our front line. This is a production environment, so I won't be able to make a ton of changes, but I can give you information. Just let me know what you need me to post.

stephenw10

If you see any crash reports then we want to see those.
Other than that, it's odd that not everyone running 2.2.1 is seeing the same thing. If there's something unusual in your config then maybe we can try to see a pattern.

Steve

flofogl

I repeated the upgrade process today after having disabled my floating limiter rules and it worked.

However, as soon as I enabled any of them the console didn’t stop printing "pfsync_undefer_state: unable to find deferred state". The first time I wasn’t quick enough in disabling the rules and the machine got unresponsive with no CPU load (no crash report).

After a reboot I was able to disable the rules and the machine stayed responsive. I then deactivated “Synchronize States” and enabled the floating limiter rules. Apart from the known “Bump sched buckets to 256 (was 0)” the console remained unchanged. As soon as I activated state synchronization "pfsync_undefer_state: unable to find deferred state" was back again.

In contrast to Mathiew I can choose whether to have either HA or limiters.

This was done on a backup node, the master still runs 2.1.5.
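
To compare runs like these (state sync off versus on), one option is to save the console output for each run and count how often the message appears. The sketch below only does substring counting; the `captures` data is hypothetical and stands in for saved console text, and `count_message` is an illustrative helper, not an existing tool.

```python
MESSAGE = "pfsync_undefer_state: unable to find deferred state"


def count_message(lines, message=MESSAGE):
    """Count the lines in an iterable of strings that contain the message."""
    return sum(1 for line in lines if message in line)


# Hypothetical captures, one per test condition. Real input would be
# saved console output, read line by line.
captures = {
    "limiters on, state sync off": ["Bump sched buckets to 256 (was 0)"],
    "limiters on, state sync on": [
        "Bump sched buckets to 256 (was 0)",
        MESSAGE,
        MESSAGE,
    ],
}

for condition, lines in captures.items():
    print(condition, count_message(lines))
```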

stephenw10

Thanks for that report, flofogl. All data is helpful.

Steve

Mathiew

I removed my limiter rules from the config and it's working, no more pfsync error…

stephenw10

Mathiew,
I have seen one other instance of this on a single box (not part of an HA setup). In that case the box previously had a CARP config of some sort and had stray tags in the config file that had not been translated correctly across an update.
In that instance it was fixed by enabling HA sync, saving, and then disabling HA sync again. Limiters could then be used.

Steve
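
A quick way to check a backup copy of the config for leftover HA settings is to list the non-empty tags in the relevant section. The thread does not say which tags were stray, so this is only a sketch under assumptions: the section name `hasync` and the tag names in the sample are assumed for illustration, and `find_ha_tags` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def find_ha_tags(config_xml, section="hasync"):
    """Return {tag: text} for every non-empty child of the given section.

    The section name is an assumption about the config layout, not a
    confirmed description of where the stray tags lived.
    """
    root = ET.fromstring(config_xml)
    found = {}
    for node in root.iter(section):
        for child in node:
            if child.text and child.text.strip():
                found[child.tag] = child.text.strip()
    return found


# Hypothetical, heavily trimmed config fragment for illustration only.
sample = """
<pfsense>
  <hasync>
    <pfsyncenabled>on</pfsyncenabled>
    <pfsyncinterface></pfsyncinterface>
  </hasync>
</pfsense>
"""

print(find_ha_tags(sample))
```

An empty result on a standalone box would suggest there is nothing left over in that section; anything listed is a candidate to review before re-enabling limiters.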

flofogl

Steve,

if you say "fixed", it means that limiters could be used without HA afterwards, not together with HA. It is a solution to Mathiew's issue only. Correct?

Cheers,

Florian

Mathiew

@stephenw10:

I can try, but I never touched any HA/CARP services on this machine.

Thanks for your work.

EDIT: I reactivated limiters after doing that and have had no problems so far.

stephenw10

Yes, still not fixed (though a problem hasn't yet been found) for HA+Limiters. But we had one other case where a stray HA tag in the config was causing this on a standalone box. Which may be a useful clue in itself because the pfsync interface was not actually configured on that box.

Steve

flofogl

Steve,

I would also like to thank you for inspecting the issue and I hope Mathiew's efforts will prove valuable. However, I don't really understand what you mean by "a problem hasn't yet been found". You wrote you were able to reproduce the behavior but the machine stayed responsive. The question is for how long? Once I was on 2.2.1 (upgraded from 2.1.5 without limiters enabled), it stayed responsive in my case too after I re-enabled the limiters, but only for a couple of minutes. After that, there was nothing left to do other than "physically" shutting down the machine (no web UI, no SSH, no console). I would consider this a problem…

The upgrade process with HA and limiters never worked for me; the box didn't come back up again. As mentioned before, I can test things if needed.

Is this something specific to my setup or are limiters in combination with HA not as common as I thought they would be?

Thanks,

Florian

stephenw10

I mean we are not, yet, able to replicate the crashes that you are seeing. We tested for hours with a variety of limiter setups and just saw continuous log spamming. Which itself is not great. ;)
If you have any ability to run this and deliberately cause it to crash and get us the crash report then we have something solid to go on. Right now it looks like the crashes may be secondary to the log spamming in some way.
I appreciate all the testing that you guys are doing.

Steve

Marlenio

Hi Steve,
I ran a new test, installing 2.2.1 on my dual CARP front firewall. It's a simple configuration, with an IPsec VPN (with four phase 2 entries) and only watchdog as an installed package. It seems to be OK. But in this config I don't use any type of limiter as I do in my back firewall CARP config. Could limiters be the problem in 2.2.1?

stephenw10

This is definitely a conflict between Limiters and pfsync; removing either of those will solve it. That's not really a solution though.

Steve

Marlenio

@stephenw10:

Yes, I think so. Today I upgraded my back pfSense CARP configuration, the one with the sync problem. First I uninstalled all limiters and all packages, then installed 2.2.1. It runs without problems.