pfSense hangs randomly every 10-20 days, please help troubleshoot



  • My pfSense had been working fine for some months, but about 4 weeks ago it started to hang at random intervals of between 10 and 20 days, so I have to hard reset it by holding the power button. Can anyone please help me with where to start troubleshooting this?

    I just updated to the latest firmware, 2.4.4-RELEASE-p2, and have always used the latest firmware.


  • LAYER 8 Global Moderator

    Are you running arpwatch? I had problems with the same sort of thing when running that package.



  • Nope, I don't use arpwatch. The packages I have installed are:
    Acme, Avahi, nmap, Notes, Openvpn-client-export, pfBlockerNG and snort.


  • Netgate Administrator

    It's completely unresponsive even at the console? Try Ctrl+T at the console if you see it again. That can sometimes be the only thing that produces any output. It should show the current process if it's still running.

    Do you see anything in the system log?

    Do you see a crash report when it reboots?

    Steve





  • @stephenw10
    Thanks for your reply!
    There is nothing saved in the log; as far as I can tell, it only shows the currently running session.
    I will look for these clues next time I encounter it, thanks.



  • I experience the same, but with HA sync configured it happens every 2-3 days. The LAN hangs and I am unable to ping any VLANs. The interesting thing is that the DMZ and NAT work just fine.



  • I had this problem, and it took many months of troubleshooting to find it, including working with pfSense support for several months. It was hard to nail down due to the infrequency of the lockups and the need to hard reboot. In the end we found it was an issue with using a NAT pool, specifically the "Round Robin with Sticky Address" or "Random with Sticky Address" options. Using "Source Hash" did not seem to have the problem. A bug report was filed here:
    https://redmine.pfsense.org/issues/8576

    Been using it in production for more than 6 months in a large network without any issues since. Upgraded to 2.4.4 but I do not know if it is fixed there or not, and am no longer in a position to test as the problem did not show up except with a fairly decent amount of traffic (something I could not easily simulate in a test environment). I note the Redmine ticket does not appear to have been touched though.

    I don't know if this is your issue but might be something to check.



  • @supertechie The problem was very weird. Only the LAN with VLANs hangs, while the DMZ servers and NAT work perfectly fine. As you said, I had sticky connections enabled because I set up multi-WAN load balancing.



  • A friend of mine had a similar problem a while back. It was caused by the fact that he was using a board with Realtek NICs.

    He switched to Intel-based NICs and it has been smooth sailing since.

    As a workaround in the meantime you could try setting up some sort of cron job that pings known servers and reboots if it doesn't get a response.

    This post is a little old, but it might help:
    https://forum.netgate.com/topic/16217/howto-ping-hosts-and-reset-reboot-on-failure

    As long as it is not a hard lock, that should help (but yes, it is a hack).
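
    A minimal sketch of the ping-and-reboot workaround described above, written in Python purely for illustration. It is not the script from the linked howto. The host list, the function names, and the reboot command are all assumptions, and it presumes the script runs as root from cron on a box where `ping` and `/sbin/shutdown` are available. It only reboots when every listed host fails to answer, so one unreachable server does not trigger a restart.

```python
#!/usr/bin/env python3
"""Ping watchdog sketch: reboot only when every known host is unreachable."""
import subprocess
import sys

# Example targets only (assumption); use hosts that should always answer.
HOSTS = ["8.8.8.8", "1.1.1.1"]


def ping(host, count=3, timeout=30):
    """Return True if `host` answers at least one of `count` echo requests."""
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat a ping that never returns as a failed check.
        return False
    return result.returncode == 0


def all_unreachable(hosts, ping_fn=ping):
    """True only if there is at least one host and none of them answer."""
    return bool(hosts) and not any(ping_fn(h) for h in hosts)


def main(hosts=HOSTS, ping_fn=ping, reboot_cmd=("/sbin/shutdown", "-r", "now")):
    """Run one check; issue the reboot command if nothing answered."""
    if all_unreachable(hosts, ping_fn):
        # Assumption: a clean reboot is acceptable and we are running as root.
        subprocess.run(list(reboot_cmd), check=False)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

    Scheduled every few minutes (for example through the Cron package), this covers the case where the box is still running but has stopped passing traffic. It cannot help with a hard lock, because the script itself would no longer be running.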

    Conclusion: The rule of thumb seems to be that for anything network related, be it a server, a pfSense box, or anything else, just do yourself a favor and avoid anything Realtek. You'll save yourself so much time and so many headaches by just going with Intel NICs.


  • Netgate Administrator

    Mmm, older Realtek NICs can behave badly and just stop passing traffic. They usually log something, though not always.

    If it's still passing traffic on one interface, that's not a complete CPU lock or a panic. That would be more like a NIC driver issue or something like Snort triggering.

    Steve



  • Since I installed the update to 2.4.4-p2 I have had problems with random restarts. The first time was a couple of days after the update, while I was in bed, so I saw a message about a crash when I woke up. Uptime was under 10 hours, so I assume the machine had restarted. The second time was the next day, and then for maybe a week there were no restarts. It takes about 5 minutes to restart, which is not good when you are in the middle of a game. I also provide internet to some people, and even though I give it to them for free for now, they are not so happy. Since I had no problems before 2.4.4-p2, I decided to change version, but I didn't want to reinstall with 2.4.4-p1, so I switched the update branch to development and installed what it offered.
    I am now on:
    2.4.5-DEVELOPMENT (amd64)
    built on Tue Jan 22 09:50:11 EST 2019
    FreeBSD 11.2-RELEASE-p8

    No problems so far. I hope the problem is solved :)



  • +1 on blaming Realtek NICs.

    I built a machine in the past two years that had Realtek NICs (horrible oversight on my part). Put pfSense on the box and it locked up randomly requiring a reboot to resolve the issue. No log entries or anything else. I also put VMware ESXi onto the same box and had it purple screen a few times (even with injecting an in-line patch to support the NICs).

    IMHO, newer Realtek NICs can hang your box w/o logging issues in the OS. Avoid at ALL costs.

