pfSense crashes every few weeks - log is blank



  • Hi folks,

    I've happily been using pfSense for home for about 6 months now. I am using the following:
    2.2-RELEASE (amd64)
    built on Thu Jan 22 14:03:54 CST 2015
    FreeBSD 10.1-RELEASE-p4

    I am running this on a Gigabyte GA-J1900N-D3V, 4GB RAM and a 64GB SSD.

    I have a problem where I would lose internet connection from some clients. Oddly, it affected most clients on my network but some were OK.
    When I tried logging in to the GUI I would get PHP errors about /tmp/session <something or other> was not found.
    I SSH'd into the box and could see the menu, but trying to select an option (8 for shell) spewed back PHP errors and I was still in the menu.

    So, I did a hard-boot on it and all is well now. It came up, nothing seems wrong.
    Having a look at the system.log in /var/log/ shows a massive blank between about the time my wife said "there is something wrong with the internet" and the reboot. It's as if the machine was off. (pasted below)

    This has happened twice when I was using the alpha builds way back when and now on the release. I said nothing back then because it was alpha.
    I'm more curious now as to why it happened and whether there is any interest from the (awesome) writers of pfSense in why it happens. Is it a hardware or software bug?

    Thanks,
    Fred

    Mar 14 09:03:07 pfSense kernel: ue0_vlan10: link state changed to UP
    Mar 14 09:03:07 pfSense check_reload_status: Linkup starting ue0_vlan10
    Mar 14 09:03:07 pfSense kernel: ue0: link state changed to DOWN
    Mar 14 09:03:07 pfSense kernel: ue0_vlan10: link state changed to DOWN
    Mar 14 09:03:07 pfSense kernel: ue0: link state changed to UP
    Mar 14 09:03:07 pfSense kernel: ue0_vlan10: link state changed to UP
    Mar 14 09:03:07 pfSense check_reload_status: Linkup starting ue0
    Mar 14 09:03:07 pfSense check_reload_status: Linkup starting ue0_vlan10
    Mar 14 09:03:07 pfSense check_reload_status: Linkup starting ue0
    Mar 14 09:03:07 pfSense check_reload_status: Linkup starting ue0_vlan10
    Mar 14 09:03:07 pfSense check_reload_status: Linkup starting ue0
    Mar 14 09:03:07 pfSense check_reload_status: Linkup starting ue0_vlan10
    Mar 14 09:03:07 pfSense check_reload_status: Linkup starting ue0
    Mar 14 09:03:07 pfSense check_reload_status: Linkup starting ue0_vlan10
    Mar 14 09:03:08 pfSense php-fpm[8651]: /rc.linkup: Linkup detected on disabled interface...Ignoring
    Mar 14 09:03:08 pfSense php-fpm[8651]: /rc.linkup: Linkup detected on disabled interface...Ignoring
    Mar 14 09:03:08 pfSense php-fpm[8651]: /rc.linkup: Linkup detected on disabled interface...Ignoring
    Mar 14 09:03:08 pfSense php-fpm[8651]: /rc.linkup: Linkup detected on disabled interface...Ignoring
    Mar 14 09:03:08 pfSense php-fpm[8651]: /rc.linkup: Linkup detected on disabled interface...Ignoring
    Mar 14 09:03:08 pfSense php-fpm[8651]: /rc.linkup: Linkup detected on disabled interface...Ignoring
    Mar 14 12:52:08 pfSense syslogd: kernel boot file is /boot/kernel/kernel
    Mar 14 12:52:08 pfSense kernel: Copyright (c) 1992-2014 The FreeBSD Project.
    Mar 14 12:52:08 pfSense kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    Mar 14 12:52:08 pfSense kernel: The Regents of the University of California. All rights reserved.
    Mar 14 12:52:08 pfSense kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
    Mar 14 12:52:08 pfSense kernel: FreeBSD 10.1-RELEASE-p4 #0 36d7dec(releng/10.1)-dirty: Thu Jan 22 15:12:35 CST 2015
    Mar 14 12:52:08 pfSense kernel: root@pfsense-22-amd64-builder:/usr/obj.amd64/usr/pfSensesrc/src/sys/pfSense_SMP.10 amd64
    Mar 14 12:52:08 pfSense kernel: FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
    Mar 14 12:52:08 pfSense kernel: CPU: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz (2000.05-MHz K8-class CPU)
    Mar 14 12:52:08 pfSense kernel: Origin = "GenuineIntel"  Id = 0x30673  Family = 0x6  Model = 0x37  Stepping = 3
    Mar 14 12:52:08 pfSense kernel: Features=0xbfebfbff <fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,htt,tm,pbe>
    Mar 14 12:52:08 pfSense kernel: Features2=0x41d8e3bf <sse3,pclmulqdq,dtes64,mon,ds_cpl,vmx,est,tm2,ssse3,cx16,xtpr,pdcm,sse4.1,sse4.2,movbe,popcnt,tscdlt,rdrand>
    Mar 14 12:52:08 pfSense kernel: AMD Features=0x28100800 <syscall,nx,rdtscp,lm>
    Mar 14 12:52:08 pfSense kernel: AMD Features2=0x101 <lahf,prefetch>
    Mar 14 12:52:08 pfSense kernel: Structured Extended Features=0x2282 <tscadj,smep,erms>
    Mar 14 12:52:08 pfSense kernel: VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
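
    A quick way to confirm the "massive blank" described above is to scan the log for jumps between consecutive timestamps. The following is only a minimal sketch: the function name `find_gaps`, the 30-minute threshold and the assumed year (syslog lines carry no year) are illustrative choices, not anything pfSense provides. The two sample lines are taken from the log pasted above.

```python
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(minutes=30), year=2015):
    """Return (previous, current, delta) for consecutive timestamped
    lines that are further apart than `threshold`."""
    gaps = []
    previous = None
    for line in lines:
        line = line.strip()
        try:
            # Classic syslog lines start with a 15-character
            # "Mon DD HH:MM:SS" stamp; the year is an assumption.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp, stamp - previous))
        previous = stamp
    return gaps

sample = [
    "Mar 14 09:03:08 pfSense php-fpm[8651]: /rc.linkup: Linkup detected on disabled interface...Ignoring",
    "Mar 14 12:52:08 pfSense syslogd: kernel boot file is /boot/kernel/kernel",
]

for before, after, delta in find_gaps(sample):
    print(f"no entries between {before:%b %d %H:%M:%S} and {after:%b %d %H:%M:%S} ({delta})")
```

    Timestamps that jump backwards are ignored here, so a log that wraps around on itself will not produce false gaps.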

  • LAYER 8 Netgate

    USB Ethernet acting up?  Try a real card.



  • Nope. Not using that. It's plugged in but not actually used for anything.


  • Banned

    @FarmerB3rd:

    Nope. Not using that. It's plugged in but not actually used for anything.

    So remove it!!!



  • No - Don't unplug the USB thingy.  Investigate it for a few weeks.  Troubleshoot for another couple of months.  Try compiling a dozen different drivers…  Don't give up on the USB NIC (that you aren't using)



  • I've had the USB NIC in there for a while. There is a problem with pfSense going belly-up. I am trying to understand why it is doing that. If it is the NIC, and the logs or something else point towards it, I will happily drive over it. However, I am more interested in understanding why it went belly-up and why the logs have a black hole in them, and if there is a bug, in logging it so it can be improved further.

    I have no doubt that the box will stay up for a few months now without a problem. It works really well, handles 5 concurrent VPNs, multiple VPN servers and moves about 50GB a day through the WAN. pfSense is very good.


  • Banned

    Yeah, as said above, don't give up. The galore of "Linkup detected on disabled interface" log entries is definitely not a good enough reason to remove crappy unused hardware!  ;D ;D ;D



  • How do you know it's not the USB port going bad or the USB device going bad?  It's every bit as likely as some other piece of hardware going bad.

    Plus they are crap…  If you don't need it, that alone would motivate me to unplug it.

    Will it cost you something to unplug it?



  • ok ok, it's out now :)



  • Cool - Now let's wait for a crash.  Might it be a pfSense 2.2 distro issue?  Sure.  Or maybe some other hardware issue?  Maybe.  We are 1 step closer to finding out.  (-:



  • Ok, happy to wait for the next crash (well, not much choice there ;) ) but what can be done now to look for why it crashed previously? Any other logs I don't know about?



  • I don't know.  Hardware doesn't always crash in a graceful way that lets you know what's going on.  Plus, I'm not super expert at finding the cause of weird crashes.

    It is good advice to keep your hardware limited to exactly what you need and to take away anything not needed, especially if support for said hardware is flaky at best.



  • Fanless computer makes me think of heat issues. Is pfSense reading your CPU temps? I might poke the heatsinks, and any chips on the board, while the system is running to see if they're heating up. If you have a cruddy power supply, now would be an excellent time to get a proper one. Look for an 80 Plus Bronze rating, and if an affordable unit claims to be C6/C7 or Haswell ready, that's really important for this application. Most systems will draw 80 watts or more at idle, so many cheap PSUs don't bother testing lower power draw (like an Atom board!).

    64GB SSD sounds old… if it's an old OCZ drive or something sketchy and you don't need the space, try cloning pfSense to a flash drive. I had a Vector Plus R2 in my pfSense box and every few weeks the partition table decided to not exist. It's pretty hard to troubleshoot a bootloader error when the machine is buried in a closet. SMART info in the GUI should be able to check for bad sectors, maybe even run a surface test. My OCZ completes an extended test in 10 seconds (64GB @ SATA II... impossibru!), and incidentally has a bad SMART checksum according to gsmartcontrol.
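
    For anyone who prefers checking SMART from a shell rather than the GUI, here is a rough sketch of screening an attribute table of the kind `smartctl -A` prints. The sample table is made up for illustration (it is not from any drive in this thread), the helper name `suspicious_attributes` is hypothetical, and SSD vendors name their attributes differently, so treat the watch list as an assumption.

```python
# Attribute names whose raw count should normally stay at zero.
# This list is an assumption; vendor naming varies, especially on SSDs.
WATCHED_RAW = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

def suspicious_attributes(smartctl_output):
    """Return (attribute_name, reason) pairs worth a closer look."""
    findings = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], fields[3], fields[5]
        when_failed, raw = fields[8], fields[9]
        if when_failed != "-":
            findings.append((name, f"failed: {when_failed}"))
        elif value.isdigit() and thresh.isdigit() and int(thresh) > 0 and int(value) <= int(thresh):
            findings.append((name, "normalized value at or below threshold"))
        elif name in WATCHED_RAW and raw.isdigit() and int(raw) > 0:
            findings.append((name, f"raw count {int(raw)}"))
    return findings

# Made-up sample in the layout of the smartctl attribute table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       35040
194 Temperature_Celsius     0x0022   056   050   000    Old_age   Always       -       44
"""

for name, reason in suspicious_attributes(SAMPLE):
    print(f"{name}: {reason}")
```

    A clean result here does not prove the drive is healthy; it only means none of these simple checks tripped.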

    If you can afford to take the machine down for a day or so, running Memtest would be a good idea.



  • @FarmerB3rd:

    ok ok, it's out now :)

    Good! :) I suspect there's a good chance that's the root of the issue, given it was triggering log noise before the reboot.

    Did you get a crash report prompt? Having the backtrace should significantly reduce the possible causes.



  • The board hovers around 44C so I don't think heat is a problem. While it is fanless it is in a very perforated case: http://linitx.com/images/products/M350_Universal_Mini-ITX_Enclosure_main_large.jpg

    Yes, the SSD is an old repurposed one but was healthy (SMART) when I took it out of the previous machine. I'll check SMART again and see what it says.

    I had another look at the log file - it's not missing sections. It has whole blocks of information out of order - as if it wrote in the middle of the file, then the end and then back in the middle. I can only assume the partition table is dodgy… Will focus on that.



  • CLOGs…  Perhaps?

    https://doc.pfsense.org/index.php/Why_can't_I_view_view_log_files_with_cat/grep/etc%3F_(clog)

    Don't break it thinking you have an issue there.  At first glance, this seems normal to me.

    I'd leave it alone and wait for more crashes since you removed that USB thingy.  Give it a chance to be stable.  Unless it's already crashing again?



  • ok, it may well be that. The "writing in the middle" continues.

    Mar 17 08:10:06 pfSense php-fpm[2384]: /index.php: Successful login for user 'admin' from: 10.10.50.X
    Mar 17 08:10:06 pfSense php-fpm[2384]: /index.php: Successful login for user 'admin' from: 10.10.50.X
    Mar 17 09:06:43 pfSense sshd[14419]: error: PAM: authentication error for root from 10.10.50.X
    Mar 17 09:06:43 pfSense sshd[14419]: error: PAM: authentication error for root from 10.10.50.X
    Mar 17 09:06:49 pfSense sshd[14419]: Accepted keyboard-interactive/pam for root from 10.10.50.X port XXXX ssh2
    ad_status: Syncing firewall
    Feb  1 12:10:30 pfSense kernel: ovpns2: link state changed to DOWN
    Feb  1 12:10:30 pfSense check_reload_status: Reloading filter
    Feb  1 12:10:30 pfSense kernel: ovpns2: changing name to 'tun2'
    Feb  1 12:10:31 pfSense check_reload_status: Syncing firewall

    I see most of my log files are exactly 500KB so it stops at that point and writes from the top again.
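
    To illustrate why a fixed-size log that "writes from the top again" looks out of order when read top to bottom, here is a toy model of a circular log. It is a deliberate simplification and not clog's actual on-disk format; the class name `CircularLog` and the 40-byte size are made up for the example.

```python
class CircularLog:
    """Toy fixed-size log that overwrites itself from the top when full."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.pos = 0          # offset where the next byte will be written
        self.wrapped = False  # True once the writer has gone past the end

    def write(self, entry):
        for byte in (entry + "\n").encode():
            self.buf[self.pos] = byte
            self.pos += 1
            if self.pos == len(self.buf):
                self.pos = 0
                self.wrapped = True

    def raw(self):
        """What a naive top-to-bottom read (cat, grep) would see."""
        return bytes(self.buf).decode(errors="replace")

    def ordered(self):
        """Oldest-to-newest view, starting at the write position once wrapped."""
        if self.wrapped:
            data = self.buf[self.pos:] + self.buf[:self.pos]
        else:
            data = self.buf[:self.pos]
        return bytes(data).decode(errors="replace")

log = CircularLog(40)
for entry in ("Feb  1 old entry A", "Feb  1 old entry B", "Mar 17 new entry"):
    log.write(entry)

print("raw view:")
print(log.raw())
print("ordered view:")
print(log.ordered())
```

    In the raw view the newest entry sits at the top, followed by the clipped tail of an older line and then older entries, which is the same pattern as the truncated "ad_status: Syncing firewall" line followed by February entries in the excerpt above.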

    Thanks for that - removes my biggest worry.

    It's not crashing and I don't expect it to crash for a long time. This is the second or third time it has crashed since early June - most of which was running Alpha nightlies.

    Also, the crash, as far as I can see, is not a panic (as I know it). The system is still up and working but just really badly.

    thanks
    FB



  • Well - with nightlies I'd be expecting some glitches anyway.  Basically you are beta testing, which is nice but it's certainly not the way I would start out.  I think stable releases are a better bet for someone just getting to know pfSense.

    Did I say beta testing…  I should have said alpha testing  :o



  • TBH, I was surprised how good the Alphas were. Only issue was this one I have now. I needed Alpha because the hardware I bought was not supported by the previous release of BSD.



  • Are the alphas using a different version of BSD than 2.2?  (I'm not sure - I haven't tried)



  • afaik the early ones used 10.0, while now they are on 10.1



  • Right, well, didn't this go badly!

    I woke up this morning at 3am (sick 18 month old). Walked downstairs and noticed text scrolling past at a rapid rate on the pfSense monitor. "vfs error <something-or-other>"

    Well, I guess this is it then. Internet is still working. Try to log into pfSense and nada. Error writing to /tmp/session bla bla.
    I can however ssh into it. Stupidly, I did not grab the config file, because I religiously back up after I make any changes. Of course I did…

    Right, reboot - cannot. Pull power - does not reboot. FreeBSD hangs just after bringing the NICs up. Oh well, we're toast.

    Grab spare SSD out the drawer (as you do)
    Swap it with the now-dead one.
    Download the latest installer (USB live image)
    Burn it
    Boot from USB
    ta-da. We're back.  ;D

    Now to restore the last backup I have... hmm, looks a bit old.  :-
    Not to worry, it will have the bulk of the config. It's out by about a month.

    Restore - what a nightmare  :'( The restore was just not happy with the NICs. Each reboot it would ask me again which is the WAN/LAN etc. Eventually it stops asking.

    None of the VPNs are working correctly, the gateways are a mess, the firewall rules and NAT rules are a shambles. I start patching them together again but cannot get them working. Dreading a redo of them all I go get a coffee. What has changed... why would the restore not work? I've done it a few times without trouble...

    Damn you kejianshi / cmb, damn you to hell  :P
    Take out the USB NIC they said. It's all the USB NIC's fault they said.
    I had forgotten about it.
    Realising that the missing NIC is the issue with the restore I put it back in, restored again and ta-daaa. Perfect restore :D Happy bunny.

    So, the disk was on its way out. I would still like to try and get the latest config off it because my backup is one set of changes short. (Bad me)
    The restore works perfectly if the hardware is identical. If not, it's a bit of a headache.
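
    One way to spot this kind of mismatch before restoring is to compare the devices a backup assigns with the NICs actually in the box. The following is a rough sketch only: the helper name `missing_interfaces` is hypothetical, and the sample XML is a made-up, heavily trimmed fragment in the shape of a pfSense config backup (the `re0`/`re1` device names are assumptions; `ue0_vlan10` is the name seen in the log earlier in the thread).

```python
import xml.etree.ElementTree as ET

def missing_interfaces(config_xml, present_nics):
    """Return {assignment: device} for assigned interfaces whose
    physical device is not in `present_nics`."""
    root = ET.fromstring(config_xml)
    interfaces = root.find("interfaces")
    if interfaces is None:
        return {}
    missing = {}
    for iface in interfaces:
        device = iface.findtext("if")
        if device is None:
            continue
        # A VLAN such as "ue0_vlan10" depends on its parent NIC "ue0".
        parent = device.split("_vlan")[0]
        if parent not in present_nics:
            missing[iface.tag] = device
    return missing

# Made-up, trimmed-down sample; a real backup contains far more.
SAMPLE_CONFIG = """\
<pfsense>
  <interfaces>
    <wan><if>re0</if></wan>
    <lan><if>re1</if></lan>
    <opt1><if>ue0_vlan10</if></opt1>
  </interfaces>
</pfsense>
"""

# With the USB NIC unplugged, only the two onboard ports are present.
print(missing_interfaces(SAMPLE_CONFIG, {"re0", "re1"}))
```

    An empty result suggests every assigned interface has hardware to land on; anything listed is a candidate for the repeated WAN/LAN prompts described above.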

    Having said all that - to rebuild a busted firewall in an hour and be back up and running is remarkable. Full credit to the devs and community of pfSense for making such an awesome bit of kit.

    Cheers,
    FB



  • haha



  • Well, this is odd. Second SSD is now complaining the same as the first.
    Both SSDs used to sit in my NAS (ZFS) as cache drives, so either they both got porked while in there, or this motherboard is killing them, or pfSense is killing them.
    Both SSDs are 4 years old (found the invoice, was hoping on warranty).

    Guess a new one is needed and I will see from there.

    pfSense is still running so no rebooting until spare drive arrives…

