System freeze with 2.1-RELEASE



  • Hello guys,

    I am experiencing some hard locks with 2.1-RELEASE. After some time the pfSense machine hangs, requiring a press of the reset button. No crash dumps, no logs, no kernel panics, no info at all.

    I ran a thorough memory test and nothing was found. I formatted the HD and made a clean 2.1-RELEASE install. The box ran for a day or so, then hung and had to be reset with the button.

    I went back to 2.1-RC2 and the problem was gone. Then I upgraded to 2.1-RELEASE and the problem came back.

    My machine is an old AMD Duron 1GHz with 512MB of RAM and a 20 GB hard drive. It has two 100Mb NICs, an onboard SIS and a VIA PCI card. The system runs smoothly with 2.1-RC2 and earlier versions, so I think the hardware is OK.

    Can someone help me debug this? Were there significant network driver changes from 2.1-RC2 to 2.1-RELEASE?

    My system settings don't have any exotic configurations, except for IPv6 with two gateways (the machine is at home). I have PowerD enabled and the default NIC optimizations.


  • Netgate Administrator

    I would definitely try disabling powerd. I don't think the Duron has any power saving features anyway.

    Define 'hard lock'. Is there no response to Caps Lock on the keyboard?

    Steve



  • Sorry… my English is not so good... :)

    I mean a hardware hang, as opposed to a software hang, which you can reset using CTRL-ALT-DEL...

    This time the machine is totally unresponsive. It stays at the console menu and no longer answers over the network. Keystrokes on the console do nothing.

    Enabling or disabling PowerD appears to make no difference. I tried it on 2.1-RC2 too, and the only visible difference is that the HD stops spinning after some time.

    With PowerD enabled or not, 2.1-RC2 works in both tests.


  • Netgate Administrator

    Sometimes a box can appear to hang but it's not actually completely dead.
    When it freezes, does the keyboard Caps Lock or Num Lock key still work?
    Try pressing Ctrl-T. It may give you some information on what process is causing the problem.

    Steve



  • Leftover bits and pieces from the old version have been a bit of an issue. I'd say back up settings, clean install 2.1 and restore settings.

    While it's indeed great fun tracking down bugs, the fast easy way also has its merits.


  • Banned

    I think he already did that… ;)




  • Maybe you should have sent me this:


  • Banned

    :D yes…..



  • Yeah - in that case I'm not sure what's up. And with no error messages it's a bit hard to figure out, I imagine. Sucks.


  • Banned

    Seems that 2.1 is a lot more picky with hardware than 2.0.x was.

    It should actually be the other way round…



  • I didn't want to be the one to bring that up again, but yeah… That's true.
    I think it will get patched quickly.


  • Banned

    I believe so too.

    What power scheme is the OP running? It would be nice to know which scheme he boots with and whether he has tried the other options.



  • Not sure. Given the number of people having issues, I would advise waiting for the next patch and trying again then. 2.0.3 was pretty stable for me.


  • Banned

    I run 2.0.3 and don't intend to upgrade for that exact reason. Since most of my boxes are running in production, I can't afford to be the testbed for something that's not rock solid.

    I have had 2.1 running in test environments and it works reasonably well. Since you know I use PPTP, it seems a little flaky in 2.1. And then my widescreen theme didn't work and that was a major bummer…



  • I can confirm I have the exact same issue. System completely unresponsive as well. From the console, the screen just shows the login prompt, but no keyboard events get registered, not even the caps lock, etc.



  • I can confirm too. I have 2 different boxes with the same issue. An easy way to freeze the kernel in my case is using IGMPv2 on WAN + multicast.



  • I'm in the same boat! I have two almost identical hardware boxes and one of them keeps hanging randomly since the 2.1 upgrade (more than 10 times now). Before that they had been running v2.x for years without ANY problem.


  • Netgate Administrator

    A random lockup could be any number of things, so although these all appear to be the same situation you need to bear in mind that they may be completely unrelated. If we assume they do have a common cause, it must be something fairly unusual not to have shown up in beta testing. Is there some commonality between your setups? Michael Sh. mentions multicast; are any of the rest of you in a multicast environment? Are you all running older hardware like the OP?

    Steve



  • I have been observing this problem on 2.1 since May 2012, when I became interested in multicast.
    For a long time I have been trying to figure out the reason, but so far without success. I supervise a lot of pfSense installs on entirely different platforms, but for experiments I use two home boxes:
    Intel H55 - CPU G6950 2.80GHz - 2G RAM - re + em
    Zotac NVIDIA nForce MCP79 (7A?) - CPU U2300 1.20GHz - 4G RAM - em + nfe + ath
    Provider routers - Brocade Communications Systems, Inc.

    My ISP uses IGMPv2 for multicast IP-TV and I use Local Peer Discovery (Transmission daemon). The sysctl net.inet.igmp.default_version=2 does not work as expected, so I allow IGMP to pass on the WAN. The kernel then switches WAN to IGMPv2 and everything works for anywhere from 12 hours to about 5 days.
    Two conditions are enough to cause a freeze: permitting IGMP on WAN and enabling LPD in Transmission. Without them, uptime may be months.

    I was really hoping that 2.1 would be released on the basis of 8.4, as there are many changes to the locking in TCP/IP; maybe that would take away the problem. Maybe not just this one. For example, the kernel crashes when there is IPv6 traffic and an interface is reconfigured at the same time.

    Edit:
    Both boxes have run 2.1 amd64 since ALPHA. The first one has the hardware watchdog enabled. I have never found out how to use a hardware watchdog on NVIDIA nForce MCP79/7A. If someone can tell me, I will be happy.



  • Hi guys,

    I was absent for some time since I downgraded to 2.1-RC2 (fully removed RELEASE and installed a clean RC2). All is well since then. So I am wondering how to debug what the trouble is. I want to help catch this bug, but since my system crashes with no info, I don't know how to start.

    I already changed some bits of hardware (such as the NIC and other cards in PCI slots) and started with a clean pfSense config, with no success. The system hangs randomly. Sometimes it stays up for 24 hours or more, sometimes it hangs on boot when the kernel loads and the first pfSense logo is shown. I have no ideas at this time.

    My friend has a different machine and had the same issue. I took my CD to him. So where are the old snapshots? I want to burn another CD of 2.1-RC2 in case of trouble.



  • I'm in the same boat. pfSense seems to completely lock up after some time. In my case "seems" is the right word because the box is still somewhat responsive; it just takes a very long time to respond in the shell (over a minute to register a single keystroke).
    Any attempts to connect to the box remotely time out. Only the local shell works, and only if it was activated before the "hang", and then I have to be extremely patient to be able to enter anything…
    I did some troubleshooting and found that, at least in my case, it apparently has to do with traffic shaping.
    I removed package after package, retesting each time, and ruled them all out.

    Entering pfSsh.php playback removeshaper in the shell (oh man, that took a long time) brought the box back to life.
    That baffled me, so I had the machine running for over 2 days without TS: not a single "hang".
    I recreated TS with the wizard (dedicated links) and the box ran fine for about 2-3h, then the perceived "hang" came back.
    I have not yet made a fresh install of 2.1 on the affected machine. I did a fresh install on an older test machine (same main board, same NICs, slower processor, only 1GB of RAM instead of 8, HDD instead of SSD) and restored my config from the troublesome machine. I could reproduce the behavior there, although after a much longer period (the perceived hang occurred after about 8h).

    Syslog does not show any info that would shed light on the matter on either machine. What can I do to troubleshoot this further?



  • @Luzemario:

    I was absent for some time since I downgraded to 2.1-RC2 (fully removed RELEASE and installed a clean RC2). All is well since then. So I am wondering how to debug what the trouble is. I want to help catch this bug, but since my system crashes with no info, I don't know how to start.

    IIRC there were a large number of daily snapshots that were called RC1 and RC2, so that makes bisecting between a problem release and a good release much more difficult.  If you know the date of your RC2 that works, you could look through the patch activity on redmine between that date and 2.1 release date for non-trivial patches.  But without old snapshots available, it's tough to bisect.
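    If a set of dated snapshot images does turn up, the bisection described above could be organised roughly like this. It is only a sketch: `first_bad_snapshot` and the `hangs` callback are hypothetical names, and it assumes the snapshots are sorted oldest to newest, the first one is known good, the last one is known bad, and the bug stays present once introduced. Because the hang is intermittent, each step would need a long soak test before a snapshot can be called good.

```python
from datetime import date
from typing import Callable, Sequence


def first_bad_snapshot(snapshots: Sequence[date],
                       hangs: Callable[[date], bool]) -> date:
    """Return the earliest snapshot for which hangs() is True.

    Assumptions (not from the thread): snapshots are sorted oldest to
    newest, snapshots[0] is known good, snapshots[-1] is known bad.
    hangs() stands in for "install this build and soak-test it".
    """
    if len(snapshots) < 2:
        raise ValueError("need at least one good and one bad snapshot")
    good, bad = 0, len(snapshots) - 1   # indices of known-good / known-bad
    while bad - good > 1:
        mid = (good + bad) // 2
        if hangs(snapshots[mid]):
            bad = mid                   # bug already present at mid
        else:
            good = mid                  # bug introduced after mid
    return snapshots[bad]
```

    With N snapshots this needs roughly log2(N) test installs instead of N, which matters when a single test can take days.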



  • I ran into this issue too after a new 4GB nanoBSD flash install plus restoring a backup from the previous build. I had to reboot to get into the web console. I got tired of that, so I reset to 'Factory default' from the web console and reconfigured the LAN/WAN NICs after the reboot. I then re-created the NAT entries I needed. Now it has been running fine, like the previous build, without issue. I hope this fixes the hang, the unresponsiveness and the high fan speed problem I've had. I thought I'd share this in case it helps you guys.



  • I have had the same issue since I upgraded to 2.1 RELEASE. I had been using the 2.1 snapshots before that and everything was perfectly fine. Can someone refer me to a download page for the old 2.1 snapshots? I can't find them anywhere on the site.

    Last night I downgraded to

    2.1-BETA1 (i386)
    built on Mon Apr 29 20:54:19 EDT 2013
    FreeBSD 8.3-RELEASE-p8

    because it was the only thing I have on CD and I know it doesn't freeze.



  • I've been having the same issues where the box will just lock up and become unresponsive to keyboard commands. I'm using a bge interface with VLANs for both LAN and WAN connections, along with IPv6 connectivity from he.net and igmpproxy for IPTV service in the house.

    It was working great on the pfSense 2.1 snapshots before the release edition came out, so I'm not sure what is causing the issues, but I hope it gets resolved soon, as this is just too unstable to continue using like this even in a home environment.

    I have however just switched from the i386 to the amd64 arch, so I will be interested to see if it resolves itself now after a fresh install and config restore. Uptime usually lasts 5 days tops and then it dies.

    I was logging to a remote syslog to try and catch any errors before the freeze, but found nothing in the logs to pin anything down.
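    For anyone repeating the remote syslog approach, the moment of the freeze can at least be narrowed down from the gaps between received messages, even when the messages themselves say nothing useful. A minimal sketch, assuming the arrival times have already been collected as Unix timestamps on the logging host; `find_silences` and `max_gap` are hypothetical names, not part of pfSense or syslogd.

```python
from typing import List, Sequence, Tuple


def find_silences(timestamps: Sequence[float],
                  max_gap: float) -> List[Tuple[float, float]]:
    """Return (last_seen, next_seen) pairs where the box stayed silent
    for longer than max_gap seconds.

    The start of each pair is the last sign of life before a suspected
    freeze; the end is the first message after the reset.
    """
    ordered = sorted(timestamps)
    return [(earlier, later)
            for earlier, later in zip(ordered, ordered[1:])
            if later - earlier > max_gap]
```

    Comparing the start of each silence with what else was happening on the network at that time (a phone call, IPTV, a large transfer) may help spot a pattern.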

    Thanks


  • Netgate Administrator

    What does your memory usage look like? Some people have been seeing a memory leak that might exhibit like this.
    Do you have a console connected to the box? Are you seeing a kernel panic? Anything else on the console that doesn't make it into the logs?

    Steve



  • No memory leak from what I can see, and the console looks just fine, so I'm not sure what causes this freeze. It happened again, so I had to hard reset the box, and then less than an hour went by and it froze up again, so it seems like it's getting worse.

    Which snapshot from the 2.1 branch does not have this freezing in it? I would love to switch back to it so I don't have to have all this downtime.


  • Netgate Administrator

    I have no idea, all my 2.1 boxes are running fine.
    Are you running 64bit? Perhaps try 32bit. Edit: forget that, I see you already tried it.

    Steve



  • Yeah, I had 32bit running, then did a fresh install of 64bit, and both archs freeze up.

    I do note that I'm running igmpproxy for IPTV, like another fellow in this thread, so I wonder if multicasting is related to this, or perhaps UPnP.

    Being a technically oriented guy, I hate to just give up and roll back to snapshots, but this really is just not stable enough even for home usage.

    Hopefully the powers that be see this as a problem that needs swift action and release a fix sooner rather than later.



  • Here's one more odd observation I should have noted … I run FreePBX for the phone system in the house and it would appear that most of the freezing happens when using the phone.

    After this last freeze, right after I reset the box, I used the phone and it froze again in under an hour. I'm not sure how it relates, but I just thought I'd throw that out there.

    Fairly sure the PBX uses UDP so ...



  • OK, so I've just gitsync'd to the latest RELENG_2_1 branch, which from what I read contains the updates and fixes that should be present in the 2.1.1 release whenever that happens. I'm hoping this will resolve my issues, but we'll see how it goes.

    For now I'm running top in a terminal, so I will be able to tell for sure whether this is memory related, as you suggest, or not.
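    One way to turn periodic readings from top into a yes/no answer on a leak is to fit a straight line through them. A minimal sketch, assuming used memory has been noted (by hand or by script) as pairs of hours since boot and MB in use; `leak_rate_mb_per_hour` is a hypothetical helper, not something shipped with pfSense.

```python
from typing import Sequence, Tuple


def leak_rate_mb_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of used memory (MB) against time (hours).

    A slope that stays clearly positive over many hours hints at a leak;
    a slope near zero suggests memory use is flat.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    # Covariance of time and memory, and variance of time (unnormalised)
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must cover more than one point in time")
    return cov / var
```

    A steady climb would support the memory leak theory; a flat line right up to the freeze would point elsewhere.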

    Thanks



  • It froze again today while I was using my SIP phone in the house, which routes through my pfSense box. That's the 5th time so far that it has frozen while using the SIP protocol.


  • Netgate Administrator

    Are you using multicast at all?
    What NICs are you using? What sort of internet connection do you have?

    Steve



  • Yes I'm using multicast (iptv and upnp).
    NIC is bge0: <Broadcom Gigabit Ethernet Controller, ASIC rev. 0x004001>.
    25Mbit down and 6.5Mbit up ADSL.

    This is all on vlans as well.



  • I attempted to upgrade to 2.1 last week and also ran into random system hangs. Ctrl-T at the console would usually show a different process at each hang.

    Like others have reported, the system didn't panic. It just became so slow that nothing would work due to timeouts.

    I'm running multiple Supermicro dual-processor XD7CU boards with 8GB of RAM, and nothing special multicast-wise. The hangs started before I could even configure the box. The hang would sometimes happen during boot as well. I was running with the Live CD and a USB pendrive for the config file.

    I did not attempt to turn off VTx or multi-cores since I didn't see that hint until recently. I downgraded back to 2.0.3 and all is fine.



  • @fatsailor:

    Just an FYI - turning off VTx and multi-cores did solve the problem for me.



  • Updating to the most recent GitHub 2.1 release branch has solved the problems for me altogether.

