Previously working pfSense 2.4.4 setup stops "randomly" accepting LAN traffic
-
Hello everyone,
I ran into an interesting issue that I am not really sure how to solve.
HW / VM Setup:
My pfSense is virtualized with VMware ESXi 6.7 U2 (Build 13006603) on a Dell PowerEdge T20 (E3-1225v3 / 32GB RAM / RAID1 1TB SSD / Intel X540-AT2). The VM has 4 vCPUs, 8GB RAM, and the Intel X540-AT2 passed through, running pfSense 2.4.4-RELEASE-p3.
Internet Setup:
Very simple... The ISP does not support bridge mode, so I enabled DMZ mode and route everything from the ISP "router" 192.168.1.1 -> 192.168.1.254 (pfSense VM - WAN port - ix0) and then into my two VLANs (ix1.10 and ix1.50).
Before:
Until this Sunday everything was working just fine (for at least a few months); at first I thought it was an ISP issue. I never had any issues or downtime. OpenVPN, NAT, etc. were all working flawlessly.
Now:
Since this Sunday, after a few minutes (I think it depends less on the minutes than on the amount of traffic), pfSense is no longer reachable on ix1.10 and ix1.50; both IPs are unpingable. But the VM is still working and reachable through ix0 (WAN). So I open the VM console, run "ifconfig ix1 down;ifconfig ix1 up", and everything goes back to normal for a while (as I said, I am not sure whether it is a matter of time or of traffic, because if I run a speedtest it stops working immediately). WAN is normally not affected, but from time to time (while writing this I had to run the command above 5 times) I also have to do "ifconfig ix0 down; ifconfig ix0 up".
Debugging:
Not really sure where to start, since it really was working flawlessly and no changes were made. Maybe you guys could tell me where to start looking for problems or issues.
Thanks a lot
Yves
-
Check Status / System Logs
and dmesg
on the console.
It could be the network card; check the cables.
-
Yup, that ^
If re-linking the NIC brings it back you might be hitting some buffer exhaustion. Something should be logged though.
Steve
-
@kiokoman: thanks for your quick reply. Here are my dmesg.txt and dmesg_crashed_ix1_no_longer_reachable.txt (as far as I can tell, they are identical).
Here is also the Status -> System_logs.png
I have not swapped the cables so far... I don't want to say anything wrong, but why would cables stop working when I quickly push a lot of data through, when they work just fine as long as I don't put load on them?
@stephenw10: how can I check for buffer exhaustion?
-
It should be logged.
That screenshot only shows like 2 minutes. What time did it stop responding? Before the boot?
Try getting the complete system log with:
clog /var/log/system.log > /tmp.systemlog.txt
Steve
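
A minimal sketch for narrowing that dump down to one NIC, assuming the log is piped through a plain text filter. The helper name nic_log_lines is made up for illustration and is not a pfSense command; it only keeps lines that mention the given interface as a whole word.

```shell
#!/bin/sh
# Hypothetical helper (not part of pfSense): keep only the log lines that
# mention one interface. Reads log text on stdin.
#   $1 = interface name, e.g. ix1
# The name must not be followed or preceded by a letter or digit, so
# "ix1" matches "ix1:" and "ix1.50" but not "ix10".
nic_log_lines() {
    grep -E "(^|[^[:alnum:]])$1([^[:alnum:]]|\$)"
}

# Intended use on the firewall (assumes clog as in the command above):
#   clog /var/log/system.log | nic_log_lines ix1
```

If the filtered output is empty around the time of the failure, that by itself confirms nothing was logged for that interface.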
-
What I did is: I cleared dmesg with dmesg -c, restarted the VM, exported dmesg to a text file (dmesg > dmesg.txt), triggered the crash with a speedtest on speedtest.net, logged in through OpenVPN, and created the dmesg_crashed file and the printscreens.
I really hope this log helps you more: tmp.systemlog.txt. I think it's quite strange that the system just started having these troubles...
-
Can we see those dmesg outputs and printscreens?
At what time approximately did it stop responding?
Just before you rebooted here?
Sep 3 14:32:33 fw01 php-fpm: /index.php: Successful login for user 'admin' from: 10.10.1.2 (Local Database)
Sep 3 17:44:40 fw01 php-fpm: /index.php: Successful login for user 'admin' from: 10.10.1.2 (Local Database)
Sep 3 17:46:28 fw01 sshd[44883]: user admin login class [preauth]
Sep 3 17:46:28 fw01 sshd[44883]: user admin login class [preauth]
Sep 3 17:46:28 fw01 sshd[44883]: user admin login class [preauth]
Sep 3 17:46:31 fw01 sshd[44883]: Accepted keyboard-interactive/pam for admin from 10.10.1.2 port 59170 ssh2
Sep 3 17:47:03 fw01 reboot: rebooted by admin
Sep 3 17:47:03 fw01 syslogd: exiting on signal 15
You can see there is nothing logged there at all.
Steve
-
@stephenw10: so I am back home for more testing. The dmesg outputs and printscreens are in this post: https://forum.netgate.com/topic/146231/previously-working-pfsense-2-4-4-setup-stops-randomly-accepting-lan-traffic/4
I can replicate the issue as many times as I want. I just need to reboot pfSense, connect my notebook and run speedtest.net, and it crashes instantly... but there is no bluescreen or log as far as I can tell... nothing in dmesg, nothing in /var/system.log... Do I need to switch something to verbose to get more info?
-
Ok so what time in that log did it stop passing traffic on ix1?
You might also check
netstat -m
when it fails. An mbuf exhaustion like that would normally affect all NICs though.
The output of
sysctl dev.ix.1
might show you something if it's just that interface.
Steve
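
The sysctl output for an ix port is long, so a rough sketch like the one below can help spot anything unusual right after a failure. It assumes the usual "name: value" lines and only looks at entries under mac_stats; the function name nonzero_mac_stats is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: from "name: value" lines on stdin, print only the
# mac_stats entries whose value is a non-zero integer.
nonzero_mac_stats() {
    awk -F': ' '$1 ~ /\.mac_stats\./ && $2 ~ /^[0-9]+$/ && $2 != 0 { print }'
}

# Intended use on the firewall, right after the interface stops responding:
#   sysctl dev.ix.1 | nonzero_mac_stats
#   netstat -m
```

Packet and byte counters will show up too, since they are non-zero on any port that has passed traffic; the interesting lines are the error and fault counters.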
-
It should always be the last thing / time in the logs, since I always captured the logs after the crash (except for dmesg.txt).
Here are the two requested outputs (both created within seconds after the crash).
-
It somehow got worse... To eliminate other causes, pfSense ix1 is now directly attached to my desktop, which I also put in VLAN 50 (I rechecked against the switch, because I started to worry that I was being stupid, but the desktop is in VLAN50 and talks on VLAN50). But it seems ix1.50 does not accept pings or anything else from my desktop, even after ifconfig ix1 down;ifconfig ix1 up. Or am I doing something completely wrong right now?
I can also remove the VLANs and test again without them...
-
This post is deleted!
-
Now it's official... I can't even ping pfSense anymore (10GbE cable directly from my workstation <-> pfSense ix1, without VLAN10 or VLAN50). I think I broke the internet. Joking aside, it is quite strange what's happening here...
-
@Yves_ said in Previously working pfSense 2.4.4 setup stops "randomly" accepting LAN traffic:
status: no carrier
Isn't going to work very well.
-
Mmm, bad NIC maybe? Try re-assigning ix0 and ix1, does it now fail WAN side?
Steve
-
@johnpoz said in Previously working pfSense 2.4.4 setup stops "randomly" accepting LAN traffic:
@Yves_ said in Previously working pfSense 2.4.4 setup stops "randomly" accepting LAN traffic:
status: no carrier
Isn't going to work very well.
No, that actually was my fault; that's why I deleted the post. I forgot to plug the cable back from the switch directly into ix1....
-
Not sure this looks great..
dev.ix.1.mac_stats.local_faults: 31
I see 0 faults on everything I'm checking here. What does dev.ix.0.mac_stats.local_faults show there?
You might also try disabling flow control:
sysctl dev.ix.1.fc=0
If that works you can add it as a system tunable in the gui.
Steve
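
To see whether a counter such as local_faults actually moves during a failure, rather than being left over from boot, one option is to save the sysctl output before and after a speedtest and compare the two. This is only a sketch: the function name changed_counters and the file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: compare two saved "name: value" snapshots and print
# every entry whose value differs between them.
#   $1 = snapshot taken before the test
#   $2 = snapshot taken after the test
changed_counters() {
    awk -F': ' '
        NR == FNR { before[$1] = $2; next }   # first file: remember values
        ($1 in before) && $2 != before[$1] {  # second file: report changes
            print $1 ": " before[$1] " -> " $2
        }
    ' "$1" "$2"
}

# Intended use on the firewall (file names are placeholders):
#   sysctl dev.ix.1 > /tmp/ix1_before.txt
#   ... run the speedtest until the port stops responding ...
#   sysctl dev.ix.1 > /tmp/ix1_after.txt
#   changed_counters /tmp/ix1_before.txt /tmp/ix1_after.txt
```

Running the same comparison for dev.ix.0 gives a baseline from the port that keeps working.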
-
@stephenw10 okay, so I set ix0 to LAN and ix1 to WAN, no more VLANs... and voila, ix0 is still working, now as LAN... and ix1, which is now WAN, is still dead... So either one port on my card is broken (which would be the first time I hear of something like that) or there is something else seriously wrong... Anyway, I am going to create a backup of pfSense now, kill the VM completely and install a new VM. If this does not work, I will swap the X540 tomorrow.
-
Mmm, yeah does seem like a hardware issue or maybe something in the way that port is passed through to pfSense.
Replacing the card will tell you that though.
Steve
-
I have some more feedback. After almost giving up, I thought why not just reboot the complete VMware ESXi server for once. I did, and that seems to have solved all the issues... even though the Intel X540 is completely passed through. Very, very strange. I will keep an eye on everything and keep you posted.
@stephenw10 THANK YOU SO MUCH FOR ALL YOUR EFFORT!
-
Hmm, I guess it retained some config then. We have seen NICs that require a complete power cycle to clear some issues.
Steve