Netgate Discussion Forum

    22.05 Net problems after upgrading (SG3100)

    General pfSense Questions
    12 Posts 2 Posters 1.3k Views
    • M
      michael_samer

      Hi
      I'm running a fleet of SG3100 LAN-to-LAN firewalls. Since upgrading about 15 of them I have seen a huge number of WAN port dropouts (DHCP hands out a new IP) lasting 1-2 minutes, occurring anywhere from every 20 minutes to once every 24 hours. My (DynFi) log is full of these reports. Because we use NAC I had to implement the wpa_supplicant (early shell start) mechanism, and because I have no issues on the LAN port side I suspect wpa_supplicant as the cause. So far I have not dug deep into the logs (which one?) to find out why the (Cisco) switch reports the port as down.
      I attached two screenshots: the system log on the client and the external DynFi log.
      The init command is:
      "wpa_supplicant -dd -D wired -c /etc/cert/eap_tls.conf -i mvneta2 -B & sleep 20 & ifconfig mvneta2 down & sleep 10 & ifconfig mvneta2 up"
      As I found no major reports about 22.05 on the net, I guess it's an upgrade issue from wpa V1 to V2?
      All my boxes (15) with 22.05 have this issue; the rest (10) with 22.01 work as they should, without fault. As I have one HA setup (XG7100) with no NAC on the WAN ports, I'd say the issue is convincingly connected to wpa_supplicant, don't you think?
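A note on the shell semantics of the init line quoted above: its commands are joined with a single `&`, which starts each one in the background without waiting, so the two `sleep` calls do not delay the `ifconfig` calls. If sequential execution was intended (and a `&&` was not simply lost when posting), a minimal sketch could look like the following. The interface name and config path are taken from the post; the `DRY_RUN` switch and the `run` helper are assumptions added for illustration only.

```sh
#!/bin/sh
# Sketch only: sequential variant of the init line from the post.
IFACE="${IFACE:-mvneta2}"
CONF="${CONF:-/etc/cert/eap_tls.conf}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = only print the commands

# Either print a command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Each step starts only after the previous one has finished successfully.
start_wired_8021x() {
    run wpa_supplicant -dd -D wired -c "$CONF" -i "$IFACE" -B &&
    run sleep 20 &&
    run ifconfig "$IFACE" down &&
    run sleep 10 &&
    run ifconfig "$IFACE" up
}

start_wired_8021x
```

With `DRY_RUN=1` the script only prints the five commands in order; setting `DRY_RUN=0` would execute them. Whether the down/up cycle is needed at all is not settled in this thread.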

      CATAR0017_Unreachable.png CATAR0017_NewIP.png

      Swapping switch ports, locations and cables, as well as reinstalling from a USB stick and restoring the config, did not provide a workable solution.
      My workaround so far is to fetch replacement boxes with 22.01, restore the config and swap out the "faulty" boxes. This works, but it is quite resource intensive and disrupts the rock-solid record we have had since 2017.

      Any guess?

      Cheers
      Michael

      • stephenw10S stephenw10 moved this topic from Problems Installing or Upgrading pfSense Software on
      • stephenw10S
        stephenw10 Netgate Administrator

        Do you know what happened at 11.13 in those logs? What triggered the new WAN IP?

        Are you able to test one without the NAC port restriction so you can boot without the WPA line?
        I don't think I've ever seen anyone else using that so it would be my first suspect here. It's completely untested as far as I know.

        Steve

        • M
          michael_samer @stephenw10

          @stephenw10 :
          Hi Steve
          the WAN port is reported as dropped; the same goes for the switch log, which shows the client as the initiator. With the reinit of the interface wpa_supplicant restarts (and fetches a new IP via DHCP) and everything is OK for some 30 minutes (on some machines even up to 1 day).
          I'm willing to look deeper if I knew where to look and what to look for.
          Everything is fine on my 22.01 (and older) SG3100s, so pretty strange.

          As pfSense is still missing the 802.1X function (on LAN/WAN) we asked in 2017 for a working solution and were pointed to the above script in a thread on the Netgate forum. We are aware that the function is rarely used. Back then (V2.3.x and 2.4.x) it was OK and we hoped it would be implemented sometime in the future (like Mikrotik did).

          As we have no influence on our office network (and therefore on NAC) I cannot simply disable the script for a test, but as mentioned in the first post I have one HA setup running (22.05) with static IP and no NAC, and no problem is reported there, so it seems wpa_supplicant has changed?!

          Cheers
          Michael

          • stephenw10S
            stephenw10 Netgate Administrator

            Hmm, it seems odd to me that the switch port reports the link dropped and pfSense then pulls a new IP but the pfSense logs above don't show mvneta2 losing link.
            Do you have any x86 pfSense installs using this?

            Does the switch log any sort of authentication error?

            Or the auth server?

            Steve

            • M
              michael_samer @stephenw10

              @stephenw10
              Hi Steve
              The XG7100 is x86 based, but without NAC as it would not work with the CARP function. Our original implementation was x86 based (SG2220) and it worked back then, but it was all replaced in 2018 as the SG2220 went out of support, so we used the next one: the SG3100, which is ARM based. The wpa command has been the same since then; only the device names were adapted.
              Unfortunately I do not get access to the (Cisco) switch log. I was told that no unusual behavior (wrong EAP init, ...) was recorded. My first guess was a faulty switch port, but as we upgraded more boxes the problem grew until all of them showed the same symptoms.
              The WAN port drop could be a result of the script, as it does an interface down+up to get the EAP config applied.
              I think it would be best to look into the wpa_supplicant log itself. Afaik none has been kept since the mechanism started working, but I could enable it now for further details. A different approach could be to take the wpa_supplicant binary from a 22.01 install, put it somewhere on the pfSense box and adapt the script to use it instead, to check whether the new binary or something in the environment has changed. wpa_supplicant is very widely used (WLAN), so a solid bug would be quite unusual, but we have had problems a few times (Ubuntu) because the wired function is not tested often.
              Any hints/tips?
              Cheers
              Michael

              • stephenw10S
                stephenw10 Netgate Administrator

                You could try running wpa_cli status when it fails to see if it's seeing the issue.
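A minimal sketch of how that check could be captured locally, so the state can be read back after the link recovers. It assumes `ctrl_interface` is set in the wpa_supplicant config so that `wpa_cli` can connect (the config is not shown in this thread), and that the status output contains `wpa_state` and `suppPortStatus` keys on this build; both should be verified on the box. The log path and function names are hypothetical.

```sh
#!/bin/sh
# Sketch only: log the 802.1X state of the WAN port once per run.
IFACE="${IFACE:-mvneta2}"
LOG="${LOG:-/tmp/wpa_status.log}"   # hypothetical log path

# Print the value of one key from "key=value" lines on stdin.
status_field() {
    awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Take one sample; if wpa_cli is missing or cannot connect, fields read "unknown".
sample_status() {
    status="$(wpa_cli -i "$IFACE" status 2>/dev/null || true)"
    state="$(printf '%s\n' "$status" | status_field wpa_state)"
    port="$(printf '%s\n' "$status" | status_field suppPortStatus)"
    printf '%s wpa_state=%s suppPortStatus=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "${state:-unknown}" "${port:-unknown}"
}

sample_status >> "$LOG"
```

Run from cron once a minute, this would leave a timestamped trail that can be compared against the times of the WAN dropouts.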

                It's hard to imagine the wpa auth causing a problem at any time except when it first tries to connect or at a re-auth interval which I would expect to be fixed. Neither of which seem to be what you're seeing.

                • M
                  michael_samer @stephenw10

                  @stephenw10 Hello Stephen
                  sorry for my long delay due to summer holiday.
                  Unfortunately I'm unable to do such tasks, as we administer the boxes from the "WAN" side, so when the connection breaks we have to wait the 60-120 seconds until the renew is done and a connection is possible again. The log shows what I see from the outside.
                  What I've done so far is swap in two boxes with V22.01 and import the config: same problem. So I took the two boxes I had swapped out and downgraded them to 21.05.2, and ping and ssh are stable again.
                  I've not checked whether the two versions ship a different wpa_supplicant version, but maybe wpa is not the basic problem and it is a driver AND wpa problem or ???
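Checking the supplicant version on a 21.05.2 box and on a 22.x box could be done with a short sketch like this. It assumes the first line of `wpa_supplicant -v` has the form `wpa_supplicant vX.Y`; the helper name `wpa_version` is hypothetical.

```sh
#!/bin/sh
# Sketch only: print the wpa_supplicant version of the local box.

# Extract "vX.Y" from the first line of `wpa_supplicant -v` output on stdin.
wpa_version() {
    awk 'NR == 1 && $1 == "wpa_supplicant" { print $2 }'
}

wpa_supplicant -v 2>/dev/null | wpa_version
```

Running it on one box of each release and comparing the two outputs would show whether the supplicant itself changed between the versions.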
                  I'll downgrade the boxes (about 40 so far) to 21.05 as a workaround (until IT Sec demands an upgrade) and will retest how V22.x behaves on the network side BEFORE I upgrade again.
                  I ran a test with a simple ping tool (multiping) to check for IP changes and drops. What I found so far is that the V22.x boxes do not see lost pings (while they still have the ssh tunnel open, still with the same IP), even when I log for them explicitly. Answered pings are all logged and OK. As V21.x does not lose ping packets it's not a real reference. As this is all internal LAN, usually no packets are filtered or lost due to the "network in between". I'm quite at a loss here. And 40 boxes including the exchange procedure are a few hundred hours of work; I'm not so keen on "just" downgrading. Most of my 40 machines had V21.x installed before (as we normally upgrade only twice a year), while a few already had V22.01 on them when installed this year.
                  Cheers
                  Michael

                  • stephenw10S
                    stephenw10 Netgate Administrator

                    Hmm, OK so to clarify you initially thought this was an issue going from 22.01 to 22.05? But in fact you are seeing problems in 22.01 and most of your firewalls upgraded from 21.05 directly?

                    As you say there is almost no-one running this but let me see what I can find. There may be some more general issue there.

                    Steve

                    • M
                      michael_samer @stephenw10

                      @stephenw10: Hello Steve
                      that's the right assumption: I upgraded some boxes from 22.01 to 22.05, but a larger number from 21.x to current stable. A few I had to exclude as they run 24/7 and need long preparation, so I have a few old ones to compare. With the replaced boxes I have a few more on 22.01 and 21.05.2, and the connection loss has so far only been recorded with the V22.x versions. In June we installed DynFi as well, and shortly after the updates I noticed the network problem as I couldn't reach some boxes while working (web) remotely. While digging into that I looked into the log and detected the massive loss of SSH connections from DynFi. When I did the ping tests a serious problem "painted itself on the wall" while I tried to figure out whether this depends on site, version, switch brand, NAT forwarding or whatever.
                      As one box/HA cluster is using static IP, CARP and a NAC exception (while all others run on DHCP and NAC=wpa), a connection seems obvious...
                      When I ordered a switch port analysis I just received an "everything's fine with the switch port", and we exchanged boxes (maybe one or two are faulty or react strangely to our Cisco and Juniper network switches), but the symptoms stayed with the boxes. Reinstalling (V22.01 and update to 22.05) and importing the settings showed the same effect.
                      Then I triggered this call while trying other versions in the meantime for testing.
                      And here I am.
                      Cheers
                      Michael

                      • stephenw10S
                        stephenw10 Netgate Administrator

                        Ok.
                        Are you using the Dynfi Connection Agent on all the 3100s directly?

                        I doubt it would affect WPA supplicant directly but that is another variable. And they do claim it doesn't support Plus. So maybe not ARM. I would expect it to fail completely though if that was the case. I don't see any WPA related dependencies.

                        Steve

                        • M
                          michael_samer @stephenw10

                          @stephenw10 Hi Steve
                          no, we are unable to use the Direct Connector as we use a proxy to the internet and DynFi cannot distinguish between LAN (10.x.x.x) and internet connections. DynFi is therefore limited to ssh, which is not proxied and so works.
                          wpa is just the obvious link between the effects we see, but it's only a guess. I'm using wpa_supplicant on a lot of (Linux) systems to get NAC done and never experienced any problem AFTER I figured out the right method/syntax/format.

                          Is there a logging feature just for link status or the NIC itself?
                          Usually our network uses lazy MAC/IP releases (1-2 days), so in most cases you get the same IP after restarting any OS/device, even after a whole weekend.
                          In the "drop packet/Connection loss" case I get a new IP everytime the connection is lost which is very dubious in itself. Usually we have to raise an explicit "MAC/DNS/DHCP release" ticket just to get a consistent IP-to-DNS mapping (e.g. when a device has moved from one location to the next), as not all clients send their dhcp_name.
                          Cheers
                          Michael

                          • stephenw10S
                            stephenw10 Netgate Administrator @michael_samer

                            @michael_samer said in 22.05 Net problems after upgrading (SG3100):

                            In the "drop packet/Connection loss" case I get a new IP everytime the connection is lost which is very dubious in itself.

                            Hmm, yeah that seems very odd. Like it sees a new MAC. Hard to see how that could be the case though.
                            The NIC link status is logged in the main system log only.
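A minimal sketch for pulling just the link events for one NIC out of that log. It assumes the system log is a plain-text file at `/var/log/system.log` and that the kernel messages have the usual FreeBSD form `<iface>: link state changed to UP` or `DOWN`; the function name `link_events` is hypothetical.

```sh
#!/bin/sh
# Sketch only: list link state changes for one NIC from the system log.
IFACE="${IFACE:-mvneta2}"
SYSLOG="${SYSLOG:-/var/log/system.log}"   # assumed plain-text log location

# Keep only "<iface>: link state changed to UP/DOWN" lines from stdin.
link_events() {
    grep -E "(^|[^[:alnum:]])$1: link state changed to (UP|DOWN)"
}

if [ -r "$SYSLOG" ]; then
    # "|| true": finding no events is not an error here.
    link_events "$IFACE" < "$SYSLOG" || true
fi
```

If no DOWN event shows up at the times DynFi loses its SSH session, that would match the earlier observation that the pfSense log does not show mvneta2 losing link.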

                            So DynFi, in your setup, just runs commands over SSH remotely? Not that then.

                            Steve

                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.