Previously working pfSense 2.4.4 setup stops "randomly" accepting LAN traffic



  • Hello everyone,

    I ran into an interesting issue which I am not really sure how to solve.

    HW / VM Setup:
    My pfSense is virtualized with VMware ESXi 6.7 U2 (Build 13006603) on a Dell PowerEdge T20 (E3-1225v3 / 32GB RAM / RAID1 1TB SSD / Intel X540-AT2). The VM has 4 vCPUs, 8GB RAM, and the Intel X540-AT2 passed through, running pfSense 2.4.4-RELEASE-p3.

    Internet Setup:
    Very simple... The ISP does not support bridge mode, so I use DMZ mode and route everything from the ISP "router" 192.168.1.1 -> 192.168.1.254 (pfSense VM - WAN port - ix0) and then into my two VLANs (ix1.10 and ix1.50).

    Before:
    So until this Sunday everything was working just fine (for at least a few months); at first I thought it was an ISP issue. I never had any issues or downtime. OpenVPN, NAT, etc. all worked flawlessly.

    Now:
    Since this Sunday, after a few minutes (I think it depends less on time than on the amount of traffic), pfSense is no longer reachable on ix1.10 and ix1.50; both IPs are unpingable. But the VM is still working and reachable through ix0 (WAN). So I open the VM console and run "ifconfig ix1 down;ifconfig ix1 up", and everything goes back to normal for a while (as I said, I'm not sure whether it depends on time or on traffic, because if I run a speedtest it stops working immediately). WAN is normally not affected, but from time to time I also have to run "ifconfig ix0 down; ifconfig ix0 up" (while writing this I had to run the command above 5 times).

    Debugging:
    Not really sure where to start, since it really was working flawlessly and no changes were made.

    Maybe you guys could tell me where to start looking for problems or issues.

    Thanks a lot
    Yves


  • LAYER 8

    Check status / system logs
    and

    dmesg
    

    on console
    it could be the network card, check cables


  • Netgate Administrator

    Yup, that ^

    If re-linking the NIC brings it back you might be hitting some buffer exhaustion. Something should be logged though.

    Steve



  • @kiokoman: thanks for your quick reply. here is my dmesg.txt and dmesg_crashed_ix1_no_longer_reachable.txt (as far as I can tell, they are identical).

    Here also the Status -> System_logs.png

    I have not swapped the cables so far... I don't want to say anything wrong, but why would cables stop working when I quickly push a lot of data through, when they work just fine as long as I don't put load on them?

    @stephenw10: how can I check for buffer exhaustion?


  • Netgate Administrator

    It should be logged.

    That screenshot only shows like 2 minutes. What time did it stop responding? Before the boot?

    Try getting the complete system log with: clog /var/log/system.log > /tmp.systemlog.txt
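    Once the log is readable, it helps to narrow it to the minutes around the failure. A minimal sketch, assuming /bin/sh (not the default tcsh root shell); the function name log_window and the example timestamp prefix are only illustrative:

```shell
# Keep only log lines that begin with the given timestamp prefix.
# log_window is a hypothetical helper name, not a pfSense command.
log_window() {
    # index() == 1 means the prefix matches at the very start of the line
    awk -v p="$1" 'index($0, p) == 1'
}

# Usage on the firewall (syslog pads single-digit days with two spaces):
# clog /var/log/system.log | log_window 'Sep  3 17:4'
```

    Anything logged by the ix driver or the kernel in that window would be the first thing to look at.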

    Steve



    What I did is: I cleared dmesg with dmesg -c, restarted the VM, exported dmesg to a text file (dmesg > dmesg.txt), triggered the crash with a speedtest on speedtest.net, logged in through OpenVPN, and then created dmesg_crashed and the screenshots.

    I really hope this log helps you more: tmp.systemlog.txt. I think it's quite strange that the system just starts having these troubles...


  • Netgate Administrator

    Can we see those dmesg outputs and printscreens?

    At what time approximately did it stop responding?

    Just before you rebooted here?

    Sep  3 14:32:33 fw01 php-fpm: /index.php: Successful login for user 'admin' from: 10.10.1.2 (Local Database)
    Sep  3 17:44:40 fw01 php-fpm: /index.php: Successful login for user 'admin' from: 10.10.1.2 (Local Database)
    Sep  3 17:46:28 fw01 sshd[44883]: user admin login class  [preauth]
    Sep  3 17:46:28 fw01 sshd[44883]: user admin login class  [preauth]
    Sep  3 17:46:28 fw01 sshd[44883]: user admin login class  [preauth]
    Sep  3 17:46:31 fw01 sshd[44883]: Accepted keyboard-interactive/pam for admin from 10.10.1.2 port 59170 ssh2
    Sep  3 17:47:03 fw01 reboot: rebooted by admin
    Sep  3 17:47:03 fw01 syslogd: exiting on signal 15
    

    You can see there is nothing logged there at all.

    Steve



  • @stephenw10: so I am back home for more testing, the dmesg outputs and printscreens are in this post https://forum.netgate.com/topic/146231/previously-working-pfsense-2-4-4-setup-stops-randomly-accepting-lan-traffic/4

    I can replicate the issue as many times as I want. I just need to reboot pfSense, connect my notebook, and run speedtest.net; it crashes instantly... but there is no bluescreen or log entry as far as I can tell... nothing in dmesg, nothing in /var/log/system.log... Do I need to switch something to verbose to get more info?


  • Netgate Administrator

    Ok so what time in that log did it stop passing traffic on ix1?

    You might also check netstat -m when it fails. An mbuf exhaustion like that would normally affect all NICs though.

    The output of sysctl dev.ix.1 might show you something if it's just that interface.
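    To grab both outputs in one go right after a failure, something like the following could work. This is only a sketch, assuming /bin/sh (not the default tcsh root shell); the function name snapshot_ix, the default unit number, and the output directory are illustrative:

```shell
# Save mbuf statistics and the per-port driver counters to timestamped files.
# snapshot_ix is a hypothetical helper name; adjust unit and directory as needed.
snapshot_ix() {
    unit="${1:-1}"         # ix unit number, e.g. 1 for ix1
    outdir="${2:-/tmp}"    # directory for the snapshot files
    stamp="$(date +%Y%m%d-%H%M%S)"
    netstat -m > "${outdir}/netstat_m.${stamp}.txt" 2>&1
    sysctl "dev.ix.${unit}" > "${outdir}/dev.ix.${unit}.${stamp}.txt" 2>&1
    # Print the two file names so they are easy to find and attach
    echo "${outdir}/netstat_m.${stamp}.txt ${outdir}/dev.ix.${unit}.${stamp}.txt"
}

# Usage on the console, as soon as ix1 stops responding:
# snapshot_ix 1 /tmp
```

    Taking one snapshot while the port still works and one after it fails makes the two sets of counters easy to compare.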

    Steve



    It should always be the last entry / time in the logs, since I always captured the logs after the crash (except for dmesg.txt).

    Here are the two requested outputs (both created within seconds after the crash).

    netstat_m.txt
    dev.ix.1.txt



    It somehow got worse... To eliminate possible causes, pfSense ix1 is now directly attached to my desktop, which is also in VLAN 50 (I rechecked against the switch, because I started to worry that I was being stupid, but the desktop is in VLAN50 and talks on VLAN50). But it seems ix1.50 does not accept ping or anything else from my desktop, even after ifconfig ix1 down;ifconfig ix1 up. Or am I doing something completely wrong right now?

    I can also remove vlans and test again without them...



  • This post is deleted!


    Now it's official... I can't even ping pfSense anymore (10GbE cable directly from my workstation <-> pfSense ix1, without VLAN10 or VLAN50) 😭 I think I broke the internet 😄 Joking aside, it is quite strange what's happening here...


  • LAYER 8 Global Moderator

    @Yves_ said in Previously working pfSense 2.4.4 setup stops "randomly" accepting LAN traffic:

    status: no carrier

    Isn't going to work very well.


  • Netgate Administrator

    Mmm, bad NIC maybe? Try re-assigning ix0 and ix1, does it now fail WAN side?

    Steve



  • @johnpoz said in Previously working pfSense 2.4.4 setup stops "randomly" accepting LAN traffic:

    @Yves_ said in Previously working pfSense 2.4.4 setup stops "randomly" accepting LAN traffic:

    status: no carrier

    Isn't going to work very well.

    No, that actually was my fault. That's why I deleted the post. I forgot to plug the cable back from the switch directly into ix1....


  • Netgate Administrator

    Not sure this looks great.. dev.ix.1.mac_stats.local_faults: 31

    I see 0 faults on everything I'm checking here. What does dev.ix.0.mac_stats.local_faults show there?

    You might also try disabling flow control:
    sysctl dev.ix.1.fc=0

    If that works you can add it as a system tunable in the gui.
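    To see whether the fault counter is still climbing rather than left over from earlier, it can be read twice with a pause in between. A rough sketch, assuming /bin/sh (not the default tcsh root shell) and the counter name from the output above; the function name faults_delta and the 10 second default are illustrative:

```shell
# Read dev.ix.<unit>.mac_stats.local_faults twice and print the change.
# faults_delta is a hypothetical helper name; sysctl -n prints only the value.
faults_delta() {
    unit="$1"
    wait_s="${2:-10}"      # seconds between the two readings
    before="$(sysctl -n "dev.ix.${unit}.mac_stats.local_faults")"
    sleep "$wait_s"
    after="$(sysctl -n "dev.ix.${unit}.mac_stats.local_faults")"
    echo "ix${unit} local_faults: ${before} -> ${after} (delta $((${after} - ${before})))"
}

# Usage: run a speedtest while this waits, then compare both ports.
# faults_delta 1 30
# faults_delta 0 30
```

    A delta that grows only on ix1 under load would point at that port or its link rather than at the system as a whole.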

    Steve



    @stephenw10 okay, so I set ix0 to LAN and ix1 to WAN, no more VLANs... and voila, ix0 is still working, now as LAN, and ix1, which is now WAN, is still dead... So either one port on my card is broken (which would be the first time I've heard of something like that) or there is something else seriously wrong... Anyway, I'm going to create a backup of pfSense now, kill the VM completely, and install a new VM. If this does not work, I will swap the X540 tomorrow.


  • Netgate Administrator

    Mmm, yeah does seem like a hardware issue or maybe something in the way that port is passed through to pfSense.

    Replacing the card will tell you that though.

    Steve



    I have some more feedback. After almost giving up, I thought why not just reboot the complete VMware ESXi server for once. I did, and that seems to have solved all the issues... even though the Intel X540 is completely passed through. Very, very strange. I will keep an eye on everything and keep you posted.

    @stephenw10 THANK YOU SO MUCH FOR ALL YOUR EFFORT!


  • Netgate Administrator

    Hmm, I guess it retained some config then. We have seen NICs that require a complete power cycle to clear some issues.

    Steve

