Intermittent loss of connectivity + SSH Sessions die
-
Firstly, thank you in advance for any help anyone may be able to provide.
Setup:
PCEngines APU2e4, connected via PPPoE to ISP on igb0 as WAN.
Lan on igb2.
Wifi is serviced by an AmplifiHD in bridge mode (routing managed by pfsense), connected directly to the LAN port.
Background:
For some time (months) I have had complaints from various family members of internet dropouts and disconnections roughly every 15-30 minutes (it's random), especially from online games (see daughter playing Minecraft), as well as my wife regularly shouting at me that the internet is not working again :)
I have got my ISP to upgrade my connection, I have run cables everywhere I can to reduce WiFi needs, and I have also set up a smokeping service on my server to monitor the connection to most of the devices as well as various internet services.
I am seeing essentially zero packet loss reported by smokeping over the last 24h to both internal and external services; the only time I have seen loss is when I have purposely disconnected the router (reboot etc).
Additionally, since I installed the router, SSH to the console has always been unreliable: sometimes it disconnects within 5s of starting a session, sometimes it lasts a few minutes, but it always gets dropped.
Thinking there may be some connection between these issues, I have dug through the log files for any errors I can find, and this looks like the main culprit:
"2021-04-08 13:15:02.653773+02:00 sshd 29331 Fssh_packet_write_poll: Connection from user admin X.X.X.X port 59064: Permission denied"
I have read every post I can find on the topic, but none are relevant. I am not authenticating wrongly, as I am using key authentication only, and the problem is not on connect: the connection is dropped some time later in the session (randomly). I considered that there might be a conflict in having the web gui and an ssh session open simultaneously, so I even tried using different users and/or closing the gui, but it has not helped.
Following some other topics, I saw advice to change the firewall optimisation to 'conservative' and to turn on 'Clear invalid DF bits instead of dropping the packets'. Doing so did have some effect: the web gui is noticeably faster to load and much more snappy; it was always very slow before.
I have also added my local LAN network to the Login Protection pass list (to whitelist it). But the issue persists.
There is nothing especially fancy with the setup on the firewall, mainly the automated rules and a couple of port forwards.
I am at a loss as to where to search next, and can find no more info, if anyone has some hints/tips on where to look next it would be much appreciated!
-
Strange.
I see the opposite of this issue : I approach a PC that I didn't use for a day or so, and find a PuTTY session minimised to the taskbar : I was logged in for day(s) ...
This connection, if PuTTY - or your ssh client - is set to maintain it, stays up until the TCP session dies.
This can happen if you rip out the cable, restart pfSense, some NIC event happens (on either side), or something like that.
So, you should tell us what the issue is :
You are using a Wifi connection, which adds a whole boatload of new possible issues, from bad radio reception to grandma starting the microwave that not only fries the neurons but also nukes the 2.4 or 5 GHz band.
Logs on the pfSense side : any message from a NIC ? If you're using Realtek (the driver is called "rex", where x is a number), you might be in for bad experiences. Realtek devices are great for 'others', but never for the stuff you use yourself.
@nova3uk said in Intermittent loss of connectivity + SSH Sessions die:
Connection from user admin X.X.X.X port 59064: Permission denied"
Someone tries to enter the SSH access - port 22 - on pfSense. You should give him/her a copy of your 'key'.
Where X.X.X.X is your LAN device ? (No need to hide this IP : we all have the same IPs between 192.168.1.2 and 192.168.1.254, where 192.168.1.1 = pfSense.)
@nova3uk said in Intermittent loss of connectivity + SSH Sessions die:
internet dropouts
That's vague.
Nothing out of the ordinary in the pfSense log files ?
Take note : there are several log files !
Check the General log, the Resolver log (how often unbound says ....... start ......), the DHCP log (is there a device that asks for an IP every 10 seconds ?)
Just a popular one : you can create the situation yourself : See this one :
Click on the square + circle button.
Now go 'surfing'..
Believe it or not, "Internet" works fine right now. But you stopped DNS. And 'humans' need DNS to work, because, if not, their comfort zone is immensely reduced.
Maybe your 'ssh ruptures' and 'all other sessions are ruptured' == 'internet is not working' are one and the same problem.
Swap wires, switches, access points, etc., and test until you find the 'thing' that creates your issue.
-
@gertjan Thanks for your reply :)
The problems with SSH happen from anywhere on the lan, be it my pc connected by cable or my laptop over wifi - but at home I use the pc on a cable 99% of the time.
I also have a server running ubuntu on the same network, and I could leave the putty open to it for a week and find it in the tray just like you, it's totally specific to the pfsense box.
The latest example, happened just now and you can see how long my session managed to stay alive (yes I have hidden my ssh key with x's ;) )
2021-04-08 16:00:04.205491+02:00 sshd 64732 Fssh_packet_write_poll: Connection from user admin 192.168.1.50 port 62729: Permission denied
2021-04-08 15:59:17.797235+02:00 sshd 64732 Accepted publickey for admin from 192.168.1.50 port 62729 ssh2: RSA SHA256:XXXXXXXXXX
It managed all of around 45 seconds to keep connected. The GUI was not open, on any machine. And I was doing nothing, the internet connection was up and no issues. In fact the ping I left running from my pc to pfsense stayed stable all the time with no losses.
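For anyone who wants to measure this across a whole log rather than by eye, here is a minimal Python sketch. It pairs each "Accepted publickey" line with the later "Fssh_packet_write_poll" line from the same sshd PID and reports the gap. It assumes the log layout of the two lines above; the names `parse_line` and `session_durations` are made up for this example.

```python
import re
from datetime import datetime

# Assumed layout: "<date> <time+offset> sshd <pid> <message>",
# taken from the two log lines quoted in this thread.
LINE_RE = re.compile(r"^(\S+ \S+) sshd (\d+) (.*)$")


def parse_line(line):
    """Return (timestamp, pid, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return datetime.fromisoformat(m.group(1)), int(m.group(2)), m.group(3)


def session_durations(lines):
    """Pair each login with the later write_poll error from the same sshd PID."""
    # The log view may be newest-first, so sort by timestamp before pairing.
    events = sorted(e for e in (parse_line(l) for l in lines) if e is not None)
    accepted = {}
    durations = []
    for ts, pid, msg in events:
        if msg.startswith("Accepted publickey"):
            accepted[pid] = ts
        elif msg.startswith("Fssh_packet_write_poll") and pid in accepted:
            durations.append((pid, (ts - accepted.pop(pid)).total_seconds()))
    return durations


SAMPLE = [
    "2021-04-08 16:00:04.205491+02:00 sshd 64732 Fssh_packet_write_poll: "
    "Connection from user admin 192.168.1.50 port 62729: Permission denied",
    "2021-04-08 15:59:17.797235+02:00 sshd 64732 Accepted publickey for admin "
    "from 192.168.1.50 port 62729 ssh2: RSA SHA256:XXXXXXXXXX",
]

if __name__ == "__main__":
    for pid, seconds in session_durations(SAMPLE):
        print(f"sshd {pid}: session lasted {seconds:.1f}s")
```

For the two lines above this works out to roughly 46 seconds, in line with the "around 45 seconds" estimate.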
I checked every log file I can see for anything around 16:00 and there is nothing. I even asked the network admin in my office to check prior to this opening post, and neither of us could find anything to give us any clue what is the issue.
The only other thing I can say is that if you get disconnected and then try to open the web gui, it hangs for a good 5-10s, and this seems to be a common occurrence. Of course, I can't prove it with a logfile; it's just an observation.
It feels like something is restarting or blocking, but I cannot find a single thing in the logs. To try to pin it down so far I have;
- Shut down OpenVPN
- Disabled DNS Resolver(unbound)
The next step may be to reinstall on different hardware, because this is highly inconvenient, or to buy something else, as the family still whinge about random disconnections :(
Prior to using pfsense I used for probably 10yrs vanilla openbsd on the same kind of device and never had a single such issue, I only moved to pfsense as I was lacking time to keep maintaining the box with updates etc and wanted a more plug n play solution which was as secure/functional and familiar with pf.
Thanks again for any tips on where to look next!
-
@nova3uk said in Intermittent loss of connectivity + SSH Sessions die:
And I was doing nothing
Which SSH client do you use ?
Does it log ? Can it tell you why the connection stops ?
SSH clients have a keepalive option :
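To illustrate what a keepalive does at the TCP level (this is not the PuTTY setting itself, only a hedged Python sketch of the same idea), the snippet below switches on keepalive probes for a socket. The helper name `enable_keepalive` and the timing values are assumptions chosen for the example, and the tuning options are platform-specific, so they are applied only where the platform offers them.

```python
import socket


def enable_keepalive(sock, idle=30, interval=10, count=3):
    """Turn on TCP keepalive probes for a socket (timings are example values)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These tuning options do not exist on every platform, so each one is
    # applied only if the socket module exposes it and the OS accepts it.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the connection is dropped
    ):
        opt = getattr(socket, name, None)
        if opt is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            pass  # option known to Python but rejected by this OS
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive on:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

With keepalives on, an otherwise idle session keeps sending small probes, so a silently dead path is noticed (and a state that would otherwise expire is refreshed) instead of the session hanging until the next keystroke.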
@nova3uk said in Intermittent loss of connectivity + SSH Sessions die:
Disabled DNS Resolver(unbound)
Shutting down DNS is a nice card to play if you want to make things worse.
But if you use an IPv4 address to connect to your pfSense, then DNS shouldn't play a role.
( but be careful, if the SSH server or clients need DNS to resolve something, then you will have an issue )
You have an internal Ubuntu server.
Set up something like this :
where 192.168.2.2 is the IP of the Ubuntu.
Select the correct interface.
Now you have dpinger making ping stats - see the Status => Monitoring => Quality page.
Also : go to the console.
run top - or install and run htop.
Are there processes jumping to the top eating processor time ?
I'm using pfSense at work (dedicated device : an old PC) and at home (running in a Hyper-V VM). I have a Favorite link in my web browser, using a URL (not an IP), and it uses https login.
After I click, I get the login page in half a second.
Another test : swap NICs on your pfSense device.
-
@gertjan Thanks again :)
I shut down the DNS Resolver as there was some suspicion unbound could be triggering reloads of the rules. Of course I set the firewall to use 1.1.1.1 in its global settings and not its own resolver, so it had access to dns no problem.
In any case it did not help.
I have now gone for the ballistic solution (thankfully I have an old edgerouter ready configured as a hot swap, so it's currently doing the routing - and surprise surprise, no connection flapping reported since).
In any case, I reinstalled the APU with opnsense and ran through all the motions, with zero issues testing it in a lab environment. I wanted to prove it was not a hardware fault.
Then reinstalled clean pfsense 2.5, again no issues... even ssh is keeping alive during all the setup, adding/changing rules in the firewall or nat etc no issues.. maintained connection 2h without a glitch.
So I thought ok.. lets reinstall clean once again, and then push my config.xml and see what happens..
After loading my backup config and restarting, the GUI ran like a pig and the ssh session dropped within 2 minutes. Clearly, something is bad in my config, as no packages were restored.
So now I am going through and adding back piece by piece what I need from the config, and testing to see which section causes it to start having issues.. Long-winded I know, but it's a needle in a haystack search now and I refuse to be beaten :)
I will check the things you mentioned and see if anything pops up.. Thanks again!
-
After much sweat, I have solved the issue. It may well be of interest to someone else in future so I will post what I found..
Full manual clean setup, piece by piece got it working with zero issues, very fast & snappy.
So I downloaded the config, got the backup, and used a merge tool to do a diff...
After going through every single line, the conclusion is... that I did not reset the 'network tweaks' after upgrading to 2.5 :(
Had I noticed this earlier, I might have saved myself a world of pain! Of course, with so many settings in so many places, it is easy to be blind to one's own mistakes. More fool me though, as I was confident that I had checked and reset them post-upgrade, after I saw that they were no longer required to maintain Gb throughput in v2.5.
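For anyone repeating the comparison without a merge tool, here is a rough Python sketch of the same idea: flatten both config.xml files into path/value pairs and list only the entries that differ. The element names in the sample are invented for illustration and are not claimed to match the real pfSense schema; `flatten` and `diff_configs` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Map every leaf element to 'path -> text' so two configs can be compared."""
    out = {}

    def walk(elem, path):
        children = list(elem)
        if not children:
            out[path] = (elem.text or "").strip()
            return
        seen = {}  # per-tag counter so repeated siblings get distinct paths
        for child in children:
            idx = seen.get(child.tag, 0)
            seen[child.tag] = idx + 1
            walk(child, f"{path}/{child.tag}[{idx}]")

    root = ET.fromstring(xml_text)
    walk(root, root.tag)
    return out


def diff_configs(old_xml, new_xml):
    """Return {path: (old_value, new_value)} for every leaf that differs."""
    old, new = flatten(old_xml), flatten(new_xml)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    }


# Invented sample data: the tag names and values are placeholders only.
BACKUP = (
    "<pfsense><sysctl><item><tunable>example.tunable</tunable>"
    "<value>1</value></item></sysctl>"
    "<system><hostname>fw</hostname></system></pfsense>"
)
CLEAN = (
    "<pfsense><sysctl><item><tunable>example.tunable</tunable>"
    "<value>0</value></item></sysctl>"
    "<system><hostname>fw</hostname></system></pfsense>"
)

if __name__ == "__main__":
    for path, (old, new) in diff_configs(BACKUP, CLEAN).items():
        print(f"{path}: {old!r} -> {new!r}")
```

Because siblings are matched by position, an entry inserted in the middle of a list will show every later entry as changed; a proper merge tool handles that case better, so treat this as a first pass only.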
Many thanks for your tips, in the end, we got there and I am very happy at last, as are the rest of the household ;)