Pfsync kernel panic after 2.1.5 to 2.2 upgrade - pfsync_undefer_state



  • Hi,

    I have 2 physical machines configured in a typical failover setup, with a dedicated pfsync connection between them. After the upgrade, both machines started having kernel panics a few minutes after boot completion. They show the following panic message: "pfsync_undefer_state: unable to find deferred state". The complete crash dump is here (can't attach it, as it is too large): https://gist.github.com/anonymous/16fce0e2fa29ea6dd53a

    I figured out the kernel panics disappear if I disable the "Synchronize States" option under "High Availability Sync". Even if only one machine has the "Synchronize States" enabled it will crash (but the other won't). XMLRPC sync and CARP are still enabled, and work as expected. pfsync was working fine before the upgrade, synchronising the firewall states between the servers.

    Is anyone successfully using pfsync on a 2.2 machine? I've seen others reporting the same error on similar failover setups (their setups were virtualised, mine isn't). Guess I will have to downgrade…

    Best, Bernardo

    Update: I tried a fresh install of 2.2, uploaded the config backup, and got the same (bad) results when "Synchronize States" was enabled.



  • Hi bernardo,

    I posted a similar problem in a virtualized environment yesterday but haven't got an answer yet. Thank you for the information that disabling "Synchronize States" is enough to make the panic disappear.

    Have you tried upgrading both machines without synchronization, or was one machine always on version 2.1.5 when you re-enabled "Synchronize States"? Maybe synchronizing between two machines running version 2.2 works!? I am currently not on-site and I don't want to test it remotely, but maybe at the end of the week.

    Cheers

    Florian



  • I just set up a CARP HA setup in a VM (VirtualBox) and it works fine with 2.2 <> 2.2. I'm fairly sure that syncing between 2.1.X and 2.2 will cause trouble as the base operating system has changed from FreeBSD 8.3 to 10.1.

    Edit:
    The blog post about 2.2 has been updated with this:
    @pfSense:

    Limiters not working with High Availability

    If you’re using limiters and high availability (CARP+pfsync+config sync), do not upgrade at this time. We have an open bug on a crash in this circumstance.

    Bug: https://redmine.pfsense.org/issues/4310
    Blog: https://blog.pfsense.org/?p=1546



  • Hi fragged,

    thank you for the information. That means that in theory I could do an upgrade on both machines with state synchronization and limiter rules disabled and turn them back on afterwards, correct?

    Florian



  • @flofogl:

    Hi fragged,

    thank you for the information. That means that in theory I could do an upgrade on both machines with state synchronization and limiter rules disabled and turn them back on afterwards, correct?

    Florian

    From the blog post / bug report it looks like limiters will cause a kernel panic when CARP HA is used with 2.2. I have tested CARP without limiters.



  • @fragged:

    @flofogl:

    Hi fragged,

    thank you for the information. That means that in theory I could do an upgrade on both machines with state synchronization and limiter rules disabled and turn them back on afterwards, correct?

    Florian

    From the blog post / bug report it looks like limiters will cause a kernel panic when CARP HA is used with 2.2. I have tested CARP without limiters.

    This looks pretty much like what I experienced 2 days ago. I have another thread discussing it.
    Can you link to the bug report?

    EDIT: bug report found: https://redmine.pfsense.org/issues/4310



  • From the blog post / bug report it looks like limiters will cause a kernel panic when CARP HA is used with 2.2. I have tested CARP without limiters.

    Sorry, I didn't get it at first. I thought it was only if one node was on version 2.2 and the other one still on version 2.1.5. According to the bug report it now seems that synchronizing limiter rules is simply broken in version 2.2 and will cause a panic regardless of the version of the other nodes.



  • Hi All,

    I do have limiters enabled, thank you very much, @fragged, I will try to disable the limiters and see how it goes. Let's hope the great pfsense team is able to fix this soon.

    @flofogl: At first I upgraded one host only, but then the problem persisted after both hosts were upgraded to 2.2. One host (at 2.2) would crash even if the other was turned off.

    Disabling "Synchronize States" is only a workaround, though, as connections won't be maintained when master and backup change roles.

    Best, Bernardo



  • Following up, I disabled my limiters (didn't delete them, just disabled) and then enabled "Synchronize States", and the kernel panics stopped. Right after I re-enabled "Synchronize States" I got another panic reboot on my master, which got me worried. But after it came back, both machines have been stable for a couple of hours now, in production (routing about 50 employees to 3 WANs totalling 280Mbits of bandwidth).

    Besides HA + CARP, I use Multi-WAN with failover (3 WAN links), policy-based routing on my firewall rules, traffic shaper (with HFSC), IPsec VPN, DNS Forwarder. Everything apparently works so far. But I miss my limiters… :(

    I will report if I find anything else.

    Best, Bernardo


  • Netgate Administrator

    Have any of you tried a 2.2.1 snapshot to confirm this is fixed?
    The bug report listed above is marked resolved but it doesn't match the symptoms described here exactly.

    Steve



  • Hi Steve,

    it might be a little late now and I don't know whether it is related to the original issue, but there still seem to be issues related to CARP. I tried an upgrade from 2.1.5 to 2.2.1 (RELEASE), as described in my post here for 2.2 (RELEASE): https://forum.pfsense.org/index.php?topic=87485.msg480549#msg480549

    I get "pfsync_undefer_state: unable to find deferred state" printed in the console and then it just hangs after the upgrade process (after reboot). Since it is a virtual machine (a backup node) I can easily revert and try again if you want me to test something. I even tried to restore the configuration on a fresh install with 2.2.1. I got the same error message printed all over the screen.

    Florian



  • Same error on my CARP installation upgrade from 2.1.5 to 2.2.1. Back on 2.1.5  :(



  • @Marlenio:

    Same error on my CARP installation upgrade from 2.1.5 to 2.2.1. Back on 2.1.5  :(

    Oh, that is some very sad news. I was really looking forward to having this CARP kernel issue fixed  :(


  • Netgate Administrator

    If any of you have a chance to test this in 2.2.1-rel and submit a crash report we'd love to see it.

    Steve



  • @stephenw10:

    If any of you have a chance to test this in 2.2.1-rel and submit a crash report we'd love to see it.

    Steve

    Hi Steve, yesterday I sent 3 crash logs about this error.


  • Netgate Administrator

    Awesome, can you send me the IP they came from? Use a PM if you want.

    Steve



  • I'm sorry, I had switched back to 2.1.5. :(


  • Netgate Administrator

    Ok, so you don't know what IP they were sent from?



  • @stephenw10:

    Ok, so you don't know what IP they were sent from?

    Sure. :-) 213.215.138.68


  • Netgate Administrator

    Great. We are trying to replicate this but are just seeing continuous error messages without the crash.
    Do any of you have any special Limiter setup? Can you give any details?

    Steve


  • Netgate Administrator

    Marlenio,
    Looks like the most recent crash report we have from that IP is Mar 3rd. Could they have come from a different IP?

    Steve



  • @stephenw10:

    Marlenio,
    Looks like the most recent crash report we have from that IP is Mar 3rd. Could they have come from a different IP?

    Steve

    Hi Steve,
    213.215.138 is VIP of the first output array of pfSense (2 units HA mode). Master IP is 213.215.138.67, BACKUP 213.215.138.71. Let me know if you find it.

    Thanks in advance,


    Mario (Marlenio)


  • Netgate Administrator

    Nothing from anything in that /24 subnet since Mar 3rd.  :-\

    Steve



  • @stephenw10:

    Nothing from anything in that /24 subnet since Mar 3rd.  :-\

    Steve

    It's very strange. I'm sure they were sent at least three times.  :(



  • https://redmine.pfsense.org/issues/4310

    Can anyone tell us where that patch is? We are having the same issues when applying limiters with CARP and HA.

    Thanks,

    Tom


  • Netgate Administrator

    Yes, that patch is in 2.2.1 and has been in snaps since Feb.
    If you are seeing this problem and are running 2.2.1, can you make sure you keep any crash reports and tell us what hardware you're running?
    Thanks.

    Steve



  • I have the same problem when I upgrade pfSense from 2.1.5 to 2.2.1…

    Pfsense is running on virtualcenter 2.5. (Never had this before)

    I tried twice, but nothing changed, this message is looping on the screen, and vm is using 100% CPU. There's no crash, but pfsense is useless.

    I'm not using any high availability services (only limiters)



  • Yes, it's the same error on my carp 2.2.1 with limiters.



  • Steve,

    I would really like to help but don't know how. Have you been able to reproduce the behavior? I am back on 2.1.5 now and could repeat the update process. Apart from the non-existing crash reports what else would you need or could be useful?

    I have HA and limiters configured and pfSense runs on KVM/QEMU.

    As mentioned before, I got "pfsync_undefer_state: unable to find deferred state" printed in the console after the upgrade process on the backup node (after reboot).

    I got Mathiew's behavior when I tried to restore my 2.1.5 configuration from the backup node on a fresh install with 2.2.1. In both cases, I had to "physically" shut down the machine.

    Florian


  • Netgate Administrator

    So we are able to replicate the continuous log spam but not any sort of crash. Though in our test the box remained running and accessible. We are looking for crash reports really but any info is useful.

    Mathiew, you are seeing that on a single VM? No CARP/HA setup at all?
    That's interesting. What NICs are you using? What limiters do you have defined?

    Steve
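For anyone who wants to quantify the "continuous log spam" described above, here is a minimal sketch that counts how often the message appears in a set of log lines. The message text is taken from this thread; the sample log lines (timestamps, "kernel:" prefix) and the function name `count_message` are made up for illustration and are not pfSense output.

```python
# Message text as quoted in this thread.
MESSAGE = "pfsync_undefer_state: unable to find deferred state"

def count_message(log_lines, message=MESSAGE):
    """Return how many of the given lines contain the message."""
    return sum(1 for line in log_lines if message in line)

# Hypothetical sample lines; the real log format may differ.
sample_log = [
    "Mar 20 10:00:01 kernel: pfsync_undefer_state: unable to find deferred state",
    "Mar 20 10:00:01 kernel: pfsync_undefer_state: unable to find deferred state",
    "Mar 20 10:00:02 kernel: em0: link state changed to UP",
]

print(count_message(sample_log))
```

Reading the lines from a saved copy of the system log instead of a list works the same way, since the function only iterates over strings.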



  • One VM only, I tried to import the config file on a fresh pfsense 2.2.1 install, but I had the exact same behavior.

    No CARP/HA setup. I use a lot of services: OpenVPN, IPsec, limiters.

    I can send you my config file, if needed. I use an E1000 adapter.

    Limiters:
    00001: 700.000 Kbit/s    0 ms burst 0
    q131073  50 sl. 0 flows (1 buckets) sched 65537 weight 0 lmax 0 pri 0 droptail
    sched 65537 type FIFO flags 0x1 256 buckets 1 active
        mask:  0x00 0x00000000/0x0000 -> 0xffffff00/0x0000
    BKT Prot Source IP/port_ Dest. IP/port Tot_pkt/bytes Pkt/Byte Drp
    80 ip          0.0.0.0/0      192.168.100.0/0        4      336  0    0  0
    00002:  10.000 Mbit/s    0 ms burst 0
    q131074  50 sl. 0 flows (1 buckets) sched 65538 weight 0 lmax 0 pri 0 droptail
    sched 65538 type FIFO flags 0x1 256 buckets 1 active
        mask:  0x00 0x00000000/0x0000 -> 0xffffff00/0x0000
    BKT Prot Source IP/port_ Dest. IP/port Tot_pkt/bytes Pkt/Byte Drp
    80 ip          0.0.0.0/0      192.168.101.0/0        4      336  0    0  0
    00003:  1.000 Mbit/s    0 ms burst 0
    q131075  50 sl. 0 flows (1 buckets) sched 65539 weight 0 lmax 0 pri 0 droptail
    sched 65539 type FIFO flags 0x1 256 buckets 0 active
        mask:  0x00 0xffffff00/0x0000 -> 0x00000000/0x0000
    BKT Prot Source IP/port_ Dest. IP/port Tot_pkt/bytes Pkt/Byte Drp
    00005:  2.000 Mbit/s    0 ms burst 0
    q131077  50 sl. 0 flows (1 buckets) sched 65541 weight 0 lmax 0 pri 0 droptail
    sched 65541 type FIFO flags 0x1 256 buckets 1 active
        mask:  0x00 0xffffff00/0x0000 -> 0x00000000/0x0000
    BKT Prot Source IP/port_ Dest. IP/port Tot_pkt/bytes Pkt/Byte Drp
      0 ip        10.0.17.0/0            0.0.0.0/0    11056 14116413  0    0  0
    00006:  3.000 Mbit/s    0 ms burst 0
    q131078  50 sl. 0 flows (1 buckets) sched 65542 weight 0 lmax 0 pri 0 droptail
    sched 65542 type FIFO flags 0x0 0 buckets 1 active
      0 ip          0.0.0.0/0            0.0.0.0/0      19    1118  0    0  0
    00007:  2.000 Mbit/s    0 ms burst 0
    q131079  50 sl. 0 flows (1 buckets) sched 65543 weight 0 lmax 0 pri 0 droptail
    sched 65543 type FIFO flags 0x0 0 buckets 0 active
    00008:  1.000 Mbit/s    0 ms burst 0
    q131080  50 sl. 0 flows (1 buckets) sched 65544 weight 0 lmax 0 pri 0 droptail
    sched 65544 type FIFO flags 0x0 0 buckets 1 active
      0 ip          0.0.0.0/0            0.0.0.0/0      79    40016  0    0  0
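To compare limiter setups between posters, a listing like the one above can be reduced to pipe numbers and bandwidths. The following is a rough sketch only: it assumes each limiter starts with a header line of the form "NNNNN: <value> <unit>bit/s", as in the listing above, and ignores every other line. The function name `parse_pipes` is hypothetical.

```python
import re

# Matches header lines such as "00001: 700.000 Kbit/s    0 ms burst 0".
PIPE_HEADER = re.compile(r"^\s*(\d{5}):\s+([\d.]+)\s+([KMG]?)bit/s")

# Decimal multipliers for the unit prefix in front of "bit/s".
UNIT_FACTOR = {"": 1, "K": 1_000, "M": 1_000_000, "G": 1_000_000_000}

def parse_pipes(listing):
    """Return {pipe number: bandwidth in bit/s} for each header line found."""
    pipes = {}
    for line in listing.splitlines():
        match = PIPE_HEADER.match(line)
        if match:
            number, value, unit = match.groups()
            pipes[int(number)] = float(value) * UNIT_FACTOR[unit]
    return pipes

# Excerpt of the listing above; queue and bucket lines are skipped by the regex.
sample = """\
00001: 700.000 Kbit/s    0 ms burst 0
q131073  50 sl. 0 flows (1 buckets) sched 65537 weight 0 lmax 0 pri 0 droptail
    80 ip          0.0.0.0/0      192.168.100.0/0        4      336  0    0  0
00002:  10.000 Mbit/s    0 ms burst 0
"""

print(parse_pipes(sample))
```

The sketch deliberately drops the mask and bucket details; those would need their own parsing if they turn out to matter for the bug.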



  • @stephenw10 Is there something I can provide to help out? We are using two physical machines, Dell CS24s with dual CPUs and 16GB RAM each, and these are our front line. This is a production environment, so I won't be able to make a ton of changes, but I can give you information. Just let me know what you need me to post.


  • Netgate Administrator

    If you see any crash reports then we want to see those.
    Other than that, it's odd that not everyone is seeing the same thing. If there's something unusual in your config then maybe we can try to see a pattern.

    Steve



  • I repeated the upgrade process today after having disabled my floating limiter rules and it worked.

    However, as soon as I enabled any of them the console didn’t stop printing "pfsync_undefer_state: unable to find deferred state". The first time I wasn’t quick enough in disabling the rules and the machine got unresponsive with no CPU load (no crash report).

    After a reboot I was able to disable the rules and the machine stayed responsive. I then deactivated “Synchronize States” and enabled the floating limiter rules. Apart from the known “Bump sched buckets to 256 (was 0)” the console remained unchanged. As soon as I activated state synchronization "pfsync_undefer_state: unable to find deferred state" was back again.

    In contrast to Mathiew I can choose whether to have either HA or limiters.

    This was done on a backup node, the master still runs 2.1.5.
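The pattern reported above (the message appears only when state synchronization and limiter rules are active at the same time) could be checked against a config backup before upgrading. This is a sketch under assumptions, not a description of the pfSense config schema: the element paths `hasync/pfsyncenabled`, `filter/rule`, `dnpipe` and `disabled`, and the value "on", are guesses for illustration and should be verified against a real config.xml.

```python
import xml.etree.ElementTree as ET

def sync_and_limiters_enabled(config_xml):
    """True when state sync is on AND at least one enabled rule uses a limiter."""
    root = ET.fromstring(config_xml)
    # ASSUMPTION: the state sync flag is stored at <hasync><pfsyncenabled>on</...>.
    state_sync = (root.findtext("hasync/pfsyncenabled") or "").strip() == "on"
    # ASSUMPTION: a rule uses a limiter when <dnpipe> is non-empty,
    # and a <disabled> child marks a disabled rule.
    limiter_rules = [
        rule for rule in root.findall("filter/rule")
        if (rule.findtext("dnpipe") or "").strip()
        and rule.find("disabled") is None
    ]
    return state_sync and bool(limiter_rules)

# Hypothetical, heavily trimmed config fragments.
risky = """<pfsense>
  <hasync><pfsyncenabled>on</pfsyncenabled></hasync>
  <filter><rule><dnpipe>LimitDown</dnpipe></rule></filter>
</pfsense>"""

safe = """<pfsense>
  <hasync><pfsyncenabled>on</pfsyncenabled></hasync>
  <filter><rule><disabled/><dnpipe>LimitDown</dnpipe></rule></filter>
</pfsense>"""

print(sync_and_limiters_enabled(risky), sync_and_limiters_enabled(safe))
```

A True result would only mean the config matches the combination reported in this thread; it says nothing about whether a given build is actually affected.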


  • Netgate Administrator

    Thanks for that report, flofogl. All data is helpful.

    Steve



  • I removed my limiter rules from the config and it's working, no more pfsync error…


  • Netgate Administrator

    Mathiew,
    I have seen one other incidence of this on a single box (not part of an HA setup). In that case the box previously had a CARP config of some sort and had stray tags in the config file that had not been translated correctly across an update.
    In that instance it was fixed by enabling HA sync, saving, and then disabling HA sync again. Limiters could then be used.

    Steve



  • Steve,

    when you say "fixed", do you mean that limiters could be used without HA afterwards, not together with HA? So it is a solution to Mathiew's issue only, correct?

    Cheers,

    Florian



  • @stephenw10:

    Mathiew,
    I have seen one other incidence of this on a single box (not part of an HA setup). In that case the box previously had a CARP config of some sort and had stray tags in the config file that had not been translated correctly across an update.
    In that instance it was fixed by enabling HA sync, saving, and then disabling HA sync again. Limiters could then be used.

    Steve

    I can try, but I never touched any HA/CARP services on this machine.

    Thanks for your work.

    EDIT : I reactivated limiters after doing that and no problem so far.


  • Netgate Administrator

    Yes, still not fixed (though the cause hasn't yet been found) for HA+Limiters. But we had one other case where a stray HA tag in the config was causing this on a standalone box. That may be a useful clue in itself, because the pfsync interface was not actually configured on that box.

    Steve
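As a possible aid for spotting the "stray HA tag on a standalone box" case described above, here is a minimal sketch that lists whatever sits under an HA section of a config backup and flags the case where HA elements exist but no pfsync interface is set. The thread does not say which tags were stray, so the section name `hasync` and the tag `pfsyncinterface` are hypothetical placeholders, as are the function names.

```python
import xml.etree.ElementTree as ET

def ha_remnants(config_xml, section="hasync"):
    """Return (tag, text) pairs for every element under the given section."""
    root = ET.fromstring(config_xml)
    node = root.find(section)
    if node is None:
        return []
    return [(child.tag, (child.text or "").strip()) for child in node]

def looks_stray(remnants, interface_tag="pfsyncinterface"):
    """True when HA elements exist but no pfsync interface value is set."""
    values = dict(remnants)
    return bool(remnants) and not values.get(interface_tag)

# Hypothetical fragment: sync flag present, interface left empty.
sample = """<pfsense>
  <hasync>
    <pfsyncenabled>on</pfsyncenabled>
    <pfsyncinterface></pfsyncinterface>
  </hasync>
</pfsense>"""

found = ha_remnants(sample)
print(found, looks_stray(found))
```

This only reports what is in the file; the fix mentioned in the thread (enable HA sync, save, disable it again) is still done through the GUI.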