Netgate Discussion Forum

    2.2.2 sudden instability, TCP sessions: "Operation not permitted: write failed"

    General pfSense Questions

    • andydills

      We upgraded our cluster of 2 pfSense firewalls to 2.2.2 over the weekend. Everything was fine for about 48 hours, and then suddenly this morning TCP sessions would only work for a few seconds before locking up.

      I manually failed over to the secondary, and everything works great through the secondary.

      When I log into the console on the primary and ssh out to another server (on ANY interface: WAN, LAN, failover), the session works fine for about 5 seconds at most and then drops with the error "Operation not permitted: write failed". I also verified that telnet sessions, for example, time out and die after initially working for a second, just without a helpful error.

      We're currently running fine on the secondary thankfully, but I can't seem to find any indicators of what could be causing this.

      Suggestions?

      • stephenw10, Netgate Administrator

        What did you upgrade from?
        Are you seeing any errors in the system log?
        Check the dashboard. If you came from 2.1.X make sure it's not reporting FreeBSD 8.3 there still. If so it hasn't rebooted correctly.

        Steve

        • andydills

          Upgraded from 2.0.3.

          After the upgrade to 2.2.2, it worked for about 48 hours, until 4 AM this morning, when it suddenly stopped passing much traffic.

          Nothing notable in any of the logs that I can see, aside from "sshd[41070]: fatal: Write failed: Operation not permitted"

          The same config is currently working fine on the secondary (which also went 2.0.3->2.2.2).

          Edit: Yes, I can confirm it's fully upgraded to 2.2.2…this is a datacenter environment and the upgrade was done on-site, and also the box has been rebooted a couple of times since with no improvement.

          • mer

            du -sh /
            make sure you have diskspace.
            snort running?

            • Supermule Banned

              Do a backup of the config on 2.0.3 and reinstall a vanilla 2.2.2

              I had to do it that way since the upgrade from 2.1.5 was not working.

              • andydills

                @mer:

                du -sh /
                make sure you have diskspace.
                snort running?

                No snort, plenty of diskspace.

                I'm fairly certain the write error relates to writing to the network socket descriptor, not writing to the disk.
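
The "Operation not permitted" text in the sshd log line is the standard message for errno EPERM, whereas a full disk would normally surface as ENOSPC ("No space left on device"). A minimal sketch of that distinction follows; the function name `describe_write_error` is hypothetical, and the errors are simulated rather than taken from a real socket or disk.

```python
import errno
import os


def describe_write_error(exc: OSError) -> str:
    """Return a rough, human-readable guess at what a failed write means."""
    if exc.errno == errno.ENOSPC:
        # This is the errno a full disk or filesystem would normally produce.
        return "disk full (ENOSPC)"
    if exc.errno == errno.EPERM:
        # "Operation not permitted": the kernel refused the operation itself.
        return "operation refused by the kernel (EPERM)"
    name = errno.errorcode.get(exc.errno, "unknown")
    return f"other error ({name})"


# Simulate the two cases instead of touching a real socket or disk.
print(describe_write_error(OSError(errno.EPERM, os.strerror(errno.EPERM))))
print(describe_write_error(OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))))
```

Seeing EPERM rather than ENOSPC on a write is consistent with the guess that the failure is on the network side, though it does not by itself say which component refused the write.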

                • andydills

                  @Supermule:

                  Do a backup of the config on 2.0.3 and reinstall a vanilla 2.2.2

                  I had to do it that way since the upgrade from 2.1.5 was not working.

                  Hmm…I'm not against trying that, but why would it have worked fine on 2.2.2 for almost two full days?

                  • Supermule Banned

                    Good question, but I experienced a lot of issues when running the upgrade. Among them were missing .ko files, which made it into the full release install but not the upgrade files.

                    I had to kill the 2.1.5 since it was messed up after the upgrade.

                    • mer

                      @andydills:

                      @mer:

                      du -sh /
                      make sure you have diskspace.
                      snort running?

                      No snort, plenty of diskspace.

                      I'm fairly certain the write error relates to writing to the network socket descriptor, not writing to the disk.

                      Ok, then it would likely be a queue not draining somewhere. There should be commands that let you look at some things. There may be some information here: https://calomel.org/freebsd_network_tuning.html (look at the sysctl.conf section).

                      Have you been up 48 hours on the secondary yet? It would be an interesting data point if the secondary does not show the same issue after 48 hours.

                      • andydills

                        Figured it out.

                        I have to say, what a huge letdown from the pfsense team for not mentioning this absolutely enormous change to pfsync in 2.2:

                        https://forum.pfsense.org/index.php?topic=93132.msg519077

                        The usual reason on 2.2.x for states to not sync is that the interfaces are mismatched. States in 2.2.x are interface-bound, meaning the interface is a part of the state. For example if the primary node has igb(4) NICs and the secondary has em(4), the states can't sync.

                        That can be worked around in a silly way by adding the NICs to single interface laggs so the states would be on lagg(4) interfaces on both.

                        This is why I'm having problems. The firewalls do not have consistent interface names. Why is this NOT at the top of the upgrade guide, in bold letters?

                        Once I disabled state table sync, the behavior of the firewall returned to normal. Tonight, I'll be implementing some workarounds, but seriously…this is just sloppy. For such a tremendous change, one which causes instability to the point of uselessness (try doing something when the state table resets every 5 seconds), this needs to be well documented and made clear.
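
For anyone checking their own pair: since, per the quoted post, states in 2.2.x include the interface name, one quick sanity check is to compare which physical interface each role maps to on the two nodes. The following is only a sketch under stated assumptions. The function `mismatched_roles` and the role-to-interface mappings are hypothetical, and the real names would have to be read from each node's interface assignments.

```python
from typing import Dict, Optional, Tuple


def mismatched_roles(
    primary: Dict[str, str], secondary: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return roles whose underlying interface name differs between nodes."""
    mismatches = {}
    for role in sorted(set(primary) | set(secondary)):
        a, b = primary.get(role), secondary.get(role)
        if a != b:
            # A role missing on one node shows up as None on that side.
            mismatches[role] = (a, b)
    return mismatches


# Hypothetical assignments, echoing the igb(4) vs em(4) example above.
primary = {"wan": "igb0", "lan": "igb1", "sync": "igb2"}
secondary = {"wan": "em0", "lan": "em1", "sync": "igb2"}

for role, (a, b) in mismatched_roles(primary, secondary).items():
    print(f"{role}: primary={a} secondary={b}")
```

Any role reported here is one whose states would be bound to a different interface name on the other node, which is the situation the quoted post describes.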

                        • divsys

                          This is why I'm having problems. The firewalls do not have consistent interface names. Why is this NOT at the top of the upgrade guide, in bold letters?

                          Probably because not everyone implements pfSense in a HA or pfsync setup and there were other changes that might have been considered more pressing (that's just my guess).
                          I know there were some similar discussions in the CARP/VIPs section. You might want to review what's there for any more gotchas.

                          Glad you got it up and running.

                          -jfp

                          • andydills

                            I guess…I don't see anything else really on the upgrade guide that deals with potentially outage-causing issues like this, and they have whole sections on HA considerations.

                            You could also say that most people doing HA already have lagg groups configured. While that may be true, it doesn't excuse the omission of this critical piece of information.

                            Thanks for the followups and suggestions though, I don't mean to sound ungrateful, this is just a bit too sloppy for what I've come to expect from the pfsense team.
