22.05 Net problems after upgrading (SG3100)

michael_samer

Hi
I'm running a bunch of SG3100 LAN-2-LAN firewalls here. Since I upgraded about 15 of them I have been seeing a huge number of WAN port dropouts (DHCP = new IP) lasting 1-2 min., anywhere from every 20 min. up to every 24 h. My (DynFi) log is quite full of these reports. Because we use NAC I had to implement the wpa_supplicant (early shell start) mechanism, and because I have no issues on the LAN port side I suspect wpa_supplicant as the cause. So far I have not dug deep into the logs (which one?) to find out why the port is reported as dropped (down) by the (Cisco) switch.
I attached two pics of the system log on the client and the external DynFi log.
The init command is
"wpa_supplicant -dd -D wired -c /etc/cert/eap_tls.conf -i mvneta2 -B & sleep 20 & ifconfig mvneta2 down & sleep 10 & ifconfig mvneta2 up"
As I found no big issue reported for 22.05 on the net, I guess it's an upgrade issue from wpa V1 to V2?
All my boxes (15) with 22.05 have this issue; the rest (10) with 22.01 work as they should with no fault. As my one HA solution (XG7100), which has no NAC on the WAN ports, shows no such problem, I'd say the issue is convincingly connected to wpa_supplicant, don't you think?
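
A side note on the quoted init command: in a POSIX shell a single `&` puts the preceding command in the background, so in the line as written the `sleep` calls would not delay the `ifconfig` calls and all steps start at almost the same moment. If the intent was to run the steps one after another, a sequential version could look like the sketch below. This assumes the line is executed by /bin/sh; `DRY_RUN`, `run` and `nac_init` are illustrative names, not part of pfSense, and whether this has any bearing on the dropouts is untested.

```shell
#!/bin/sh
# Sketch only: sequential variant of the init command quoted in the post.
# Interface and config path are taken from the post; DRY_RUN, run() and
# nac_init() are hypothetical helpers added for illustration.
IFACE="${IFACE:-mvneta2}"
CONF="${CONF:-/etc/cert/eap_tls.conf}"

# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# nac_init: start the supplicant (-B makes it daemonize and return),
# then bounce the port, each step waiting for the previous one.
nac_init() {
    run wpa_supplicant -dd -D wired -c "$CONF" -i "$IFACE" -B &&
    run sleep 20 &&
    run ifconfig "$IFACE" down &&
    run sleep 10 &&
    run ifconfig "$IFACE" up
}
```

Calling `nac_init` with `DRY_RUN=1` set only prints the five steps in order, which is a cheap way to check the sequence before using it on a box.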

Switching switch ports, locations and cables, and a fresh install from USB stick with the config put back, didn't provide a workable solution.
My solution so far is fetching replacement boxes with 22.01, putting the config back and replacing the "faulty boxes". This works, but is quite resource intensive and disrupts the rock-solid record we have had since 2017.

Any guess?

Cheers
Michael

stephenw10

Do you know what happened at 11.13 in those logs? What triggered the new WAN IP?

Are you able to test one without the NAC port restriction so you can boot without the WPA line?
I don't think I've ever seen anyone else using that so it would be my first suspect here. It's completely untested as far as I know.

Steve

michael_samer

@stephenw10 :
Hi Steve
the WAN port is reported as dropped; the same goes for the switch log, which shows the client as the initiator. With the reinit of the interface wpa_supplicant restarts (and fetches a new IP via DHCP) and everything is OK for some 30 min. (on some machines even up to 1 day).
I'm willing to look deeper if I knew where and what to look for.
Everything is fine on my 22.01 (and older) SG3100s, so pretty strange.

As pfSense is still missing the 802.1X function (on LAN/WAN), we asked in 2017 for a working solution and were pointed to the above script in a thread in the Netgate forum. We are aware that the function is rarely used so far. Back then (V2.3x and 2.4x) it was OK and we hoped it would be implemented sometime in the future (like Mikrotik did).

As we have no influence on our office network (and therefore NAC) I cannot simply disable the script for a test, but as mentioned in the topic I have one HA setup running (22.05) with static IP and no NAC and no problem is reported there, so it seems wpa_supplicant has changed?!

Cheers
Michael

stephenw10

Hmm, it seems odd to me that the switch port reports the link dropped and pfSense then pulls a new IP but the pfSense logs above don't show mvneta2 losing link.
Do you have any x86 pfSense installs using this?

Does the switch log any sort of authentication error?

Or the auth server?

Steve

michael_samer

@stephenw10
Hi Steve
The XG7100 is x86 based, but without NAC as it would not work with the CARP function. Our original implementation was x86 based (SG2220) and it worked back then, but it was all replaced in 2018 as the SG2220 went out of support, so we used the next one: the SG3100, which is ARM based. The wpa command has been the same since then; only the device names were adapted.
Unfortunately I do not get access to the (Cisco) switch log. I was told that no "unusual" behavior (wrong EAP init, ...) was recorded. My first guess was that the switch port is faulty, but as we upgraded further the problem grew until all boxes showed the same symptoms.
The WAN Port Drop could be a result of the script as it does an ifdown+up to get the EAP config implemented.
I think it would be best to look into the wpa_supplicant log itself. AFAIK no logging has been done since the mechanism started working, but I could enable it now for further details. A different approach could be to take the wpa binary from a 22.01 install, put it somewhere on the pfSense box and adapt the script to use it instead, to check whether the new binary or something in the environment has changed. wpa_supplicant is very widely used (WLAN), so a solid bug would be quite unusual, but we have had problems a few times (Ubuntu) as the wired function is not tested as often.
Any hints/tips?
Cheers
Michael

stephenw10

You could try running wpa_cli status when it fails to see if it's seeing the issue.

It's hard to imagine the wpa auth causing a problem at any time except when it first tries to connect or at a re-auth interval which I would expect to be fixed. Neither of which seem to be what you're seeing.
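
One way to capture `wpa_cli status` around a dropout without having to be logged in at that moment is to write it to a local file at intervals. The following is a minimal sketch, assuming the supplicant config enables `ctrl_interface` so that `wpa_cli` can connect; the log path and the function name are hypothetical.

```shell
#!/bin/sh
# Sketch: append a timestamped `wpa_cli status` snapshot to a local file.
# Assumes ctrl_interface is enabled in the supplicant config.
# LOGFILE is a hypothetical path; pick one that survives on the box.
IFACE="${IFACE:-mvneta2}"
LOGFILE="${LOGFILE:-/root/wpa_status.log}"

log_wpa_status() {
    ts="$(date '+%Y-%m-%d %H:%M:%S')"
    # Keep the output even when wpa_cli fails, so failures show up in the log.
    if status="$(wpa_cli -i "$IFACE" status 2>&1)"; then
        result="ok"
    else
        result="wpa_cli failed"
    fi
    {
        echo "=== $ts $IFACE ($result)"
        echo "$status"
    } >> "$LOGFILE"
}
```

Run from a loop such as `while true; do log_wpa_status; sleep 30; done &`, the file can be read after the connection comes back and compared against the dropout times.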

michael_samer

@stephenw10 Hello Stephen
sorry for my long delay due to summer holiday.
Unfortunately I'm unable to do such tasks, as we administer the boxes from the "WAN" side, so when the connection breaks we have to wait 60-120 sec. until the renew is done and a connection is possible again. The log shows what I see from the outside.
What I've done so far is swap two boxes for ones with V22.01 and import the config: same problem. So I took the two exchanged boxes and downgraded them to 21.05.2, and ping and ssh are stable again.
I've not checked whether wpa is a different version in the two releases, but maybe wpa is not the basic problem but rather a driver AND wpa problem or ???
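
To compare the supplicant between releases, the firmware and wpa_supplicant versions could be recorded on one box of each kind. This is a sketch under assumptions: that the firmware version string is stored in /etc/version and that `wpa_supplicant -v` prints its version on the first line; `report_versions` is an illustrative name.

```shell
#!/bin/sh
# Sketch: print firmware and supplicant version on one line for comparison.
# Assumption: /etc/version holds the firmware version string.
VERSION_FILE="${VERSION_FILE:-/etc/version}"

report_versions() {
    fw="$(cat "$VERSION_FILE" 2>/dev/null || echo unknown)"
    # Only the first line of the -v output is kept; the rest is the license text.
    wpa="$(wpa_supplicant -v 2>/dev/null | head -n 1)"
    echo "firmware=${fw} supplicant=${wpa:-unknown}"
}
```

Running `report_versions` on a 21.05.2 box and on a 22.05 box would show whether the supplicant itself changed between them.
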
I'll downgrade the boxes (about 40 so far) to 21.05 to have a workaround solution (until IT Sec demands an upgrade) and will retest how the V22.x will behave on the net side BEFORE I upgrade again.
I ran a test with a simple ping tool (multiping) to check for IP changes and drops. What I found out so far is that the (V22.x) boxes do not see lost pings (while they still have the ssh tunnel open, still with the same IP), even when I log for them explicitly. Answered pings are all logged and OK. As V21.x does not lose ping packets, it's not a real reference. As this is all internal LAN, usually no packets are filtered or lost due to the "network in between". I'm quite at a loss here. And 40 boxes including the exchange procedure mean a few hundred hours of work; I'm not so keen on "just" downgrading them. Most of my 40 machines had V21.x installed before (as we normally upgrade only twice a year), while a few already had V22.01 on them when installed this year.
Cheers
Michael

stephenw10

Hmm, OK so to clarify you initially thought this was an issue going from 22.01 to 22.05? But in fact you are seeing problems in 22.01 and most of your firewalls upgraded from 21.05 directly?

As you say there is almost no-one running this but let me see what I can find. There may be some more general issue there.

Steve

michael_samer

@stephenw10: Hello Steve
that's the right assumption: I upgraded some boxes from 22.01 to 22.05, but a bigger number from 21.x to current stable. A few I had to exclude as they run 24/7 and need long preparation, so I have a few old ones to compare. With the replaced ones I have a few more on 22.01 and 21.05.2, while the connection loss has so far only been recorded with the V22.x versions. In June we installed DynFi as well, and shortly after the updates I noticed the net problem as I couldn't reach some boxes while working (web) remotely. While digging into that I looked into the log and detected the massive loss of SSH connections from DynFi. When I did the ping tests a serious problem "painted itself on the wall" while I tried to figure out whether this depends on site, version, switch brand, NAT forwarding or whatever.
As one box/HA cluster is using static IP, CARP and a NAC exception (while all others run on DHCP and NAC = wpa), a connection seems obvious...
When I ordered a switch port analysis I just received an "everything's fine with the switch port", and we exchanged boxes (maybe one or two are faulty or react strangely to our Cisco and Juniper network switches), but the symptoms stayed with the boxes. Reinstalling (V22.01 and update to 22.05) and importing the settings showed the same effect.
Then I opened this thread while trying other versions in the meantime for testing.
And here I am.
Cheers
Michael

stephenw10

Ok.
Are you using the Dynfi Connection Agent on all the 3100s directly?

I doubt it would affect WPA supplicant directly but that is another variable. And they do claim it doesn't support Plus. So maybe not ARM. I would expect it to fail completely though if that was the case. I don't see any WPA related dependencies.

Steve

michael_samer

@stephenw10 Hi Steve
no, we are unable to use the Direct Connector as we use a proxy to the internet and DynFi cannot distinguish between the LAN (10.x.x.x) and the internet connection. DynFi is therefore limited to ssh, which is not proxied and therefore works.
wpa is just the obvious common factor in the effects we see; but it's only a guess. I'm using wpa_supplicant on a lot of systems (Linux) to get NAC done and never experienced any problem AFTER I figured out the right method/syntax/format.

Is there a logging feature just for link status or the NIC itself?
Usually our network uses lazy MAC/IP releases (1-2 d), so in most cases you get the same IP after restarting any OS/device, even after a whole weekend.
In the "drop packet/connection loss" case I get a new IP every time the connection is lost, which is very dubious in itself. Usually we have to open a "MAC/DNS/DHCP release" ticket just to get a consistent IP-to-DNS mapping (e.g. when a device moved from one location to the next), as not all clients send their dhcp_name.
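
To see whether the interface presents the same MAC address before and after a dropout, the MAC and IPv4 address could be logged locally at intervals. The following is a small sketch, assuming the usual FreeBSD `ifconfig` output with `ether` and `inet` lines; the log path and function names are hypothetical.

```shell
#!/bin/sh
# Sketch: record MAC and IPv4 address of the WAN interface with a timestamp.
# Assumes FreeBSD-style ifconfig output ("ether <mac>", "inet <ip> ...").
# IDLOG is a hypothetical path.
IFACE="${IFACE:-mvneta2}"
IDLOG="${IDLOG:-/root/wan_identity.log}"

current_mac() {
    ifconfig "$IFACE" | awk '$1 == "ether" { print $2; exit }'
}

current_ipv4() {
    ifconfig "$IFACE" | awk '$1 == "inet" { print $2; exit }'
}

log_identity() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') if=$IFACE mac=$(current_mac) ip=$(current_ipv4)" >> "$IDLOG"
}
```

If the MAC stays constant across several IP changes, the new leases would have to come from something other than a changed hardware address.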
Cheers
Michael

stephenw10

@michael_samer said in 22.05 Net problems after upgrading (SG3100):

In the "drop packet/Connection loss" case I get a new IP every time the connection is lost which is very dubious in itself.

Hmm, yeah that seems very odd. Like it sees a new MAC. Hard to see how that could be the case though.
The NIC link status is logged in the main system log only.
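
Pulling the link events for the WAN NIC out of the system log could be sketched as below. It assumes the plain-text log lives at /var/log/system.log and that the kernel logs the usual FreeBSD message "link state changed to UP/DOWN"; both function names are illustrative.

```shell
#!/bin/sh
# Sketch: list link state changes for one interface from the system log.
# Assumptions: plain-text log at /var/log/system.log, FreeBSD-style
# "<if>: link state changed to UP|DOWN" kernel messages.
SYSLOG="${SYSLOG:-/var/log/system.log}"
IFACE="${IFACE:-mvneta2}"

# link_events: print every link up/down line for $IFACE.
link_events() {
    grep -E "${IFACE}: link state changed to (UP|DOWN)" "$SYSLOG"
}

# link_down_count: number of DOWN events currently in the log.
link_down_count() {
    link_events | grep -c "DOWN"
}
```

Comparing the timestamps from `link_events` with the dropout times seen from outside would show whether pfSense itself registers the link loss that the switch reports.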

So DynFi, in your setup, just runs commands over SSH remotely? Not that then.

Steve