Watchguard Firebox x550e hang/freeze after a couple days



  • Greetings and noob warning.

    I have been successfully running pfSense on the Alix platform for several years, including the recent update to 2.2. No problems, no issues other than it feels a bit slow.

    Recently the Alix box was replaced with a repurposed WatchGuard Firebox x550e, with much information garnered from this forum, especially the very helpful information posted by stephenw10 - THANK YOU.

    The Problem
    After a couple days of operation - say 4-6, all interfaces of the box go silent, including the console.  Unit appears to have hung.

    The Configuration
    Firebox bios flashed to b7
    IDE header added and 32GB off-brand SSD installed (no CF installed)
    Both Memory slots populated (hw.physmem=1568047104)
    pfSense v2.2-RELEASE(i386) Thu-22 Jan, pfsense mode (not nanoBSD)
    FreeBSD 10.1-RELEASE-p4
    /tmp and /var repositioned to RAM via Preferences to reduce pressure on the SSD

    Packages
    LCDProc - to manipulate the LCD panel
    WGXepc - to manage the LED and Fan Speed
    Shellcmd - to control the above, running these shell commands in this order
      /conf/WGXepc -l green  (turn LED green)
      /conf/WGXepc -f 0B
      /usr/bin/nice -20 /usr/local/sbind/LCD -r 0 -c /conf/LCD.conf > /dev/null &
      /usr/bin/nice -20 /usr/local/bin/lcdproc C T U &
    SSH is enabled and exposed to the outside (only because I forgot to turn it off)

    Generally speaking, when the system is operating it works well.  Temps hover between 40-43 deg C, fans are quiet and I don't see much alarming information being logged (to the best of my ability to interpret).

    Last Hang Details
    The last time the system went down (just now) I observed the following symptoms:
      √ Power: Amber, Storage: inactive, Arm/Disarm: Green, Expansion: not illuminated, LAN/WAN Activity: none
      √ LCD panel backlight dark, would not respond to front panel buttons.  Data on the LCD was not updating but was expected data.
      √ Box temp felt normal
      √ IPs via DHCP were not being delivered to either LAN
      √ No ping response to any interface
      √ Could not connect to admin interface via web nor SSH from LAN nor WAN side
      √ Console port not responding
    ..upon power cycle, all systems returned to normal.

    Other Observations
      …in writing this post I noticed that I may have dorked up the LCDProc shellcmds.  I had a little trouble getting this running using the forum notes and after the position of some of the dependent files changed.  In fact, there is no LCD.conf file in the /conf directory so I suspect that I may only be getting LCD action because of the latter call to LCDProc and am possibly running an errant extra process(??)
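
    Both suspicions above (a missing config file and a possible extra process) can be checked from the shell. This is only a minimal sketch: the helper names `check_conf` and `count_procs` are made up for illustration, and it assumes the daemon and client show up in the process list as `LCDd` and `lcdproc`.

```shell
#!/bin/sh
# check_conf PATH: report whether a config file named in a shellcmd exists.
check_conf() {
    if [ -f "$1" ]; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# count_procs NAME: print how many running processes have exactly this name.
# pgrep -x matches the whole process name and prints one PID per line.
count_procs() {
    pgrep -x "$1" | wc -l | tr -d ' '
}
```

    For example, `check_conf /conf/LCD.conf` followed by `count_procs LCDd` and `count_procs lcdproc`. A count above 1 for either name would point to a duplicate started by the shellcmds.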

    ...oddly, I can't seem to get the system to load the /System/Packages list via the GUI.  It just hangs and I can't get anywhere else without a new browser connection

    ...I believe because the /var and /tmp have been repositioned to RAM I've lost my ability to go back and dig the logs for possible causes.
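
    One way to keep something to dig through after a hang is to copy the RAM-disk logs to the SSD every so often. A minimal sketch, not a tested pfSense recipe: the function name `snapshot_logs` and the destination `/root/logsnap` are hypothetical, and it assumes the live logs are the `*.log` files under `/var/log`.

```shell
#!/bin/sh
# snapshot_logs SRC_DIR DEST_DIR
#   SRC_DIR  - directory holding the live logs (assumed: /var/log on the RAM disk)
#   DEST_DIR - directory on persistent storage (hypothetical: /root/logsnap)
snapshot_logs() {
    src="$1"
    dest="$2"
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    # Overwrite the previous copy so the SSD holds one snapshot per log file
    # rather than an ever-growing pile of them.
    for f in "$src"/*.log; do
        [ -f "$f" ] || continue   # glob matched nothing, or not a regular file
        cp -p "$f" "$dest"/ || return 1
    done
}
```

    Run periodically (from cron, say) as `snapshot_logs /var/log /root/logsnap`. Each run writes to the SSD, so the interval trades wear against how much is lost when the box freezes. As far as I know, pfSense of this vintage keeps its logs in a binary circular format, so the copies would need to be read with `clog` rather than `cat`.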

    Any suggestions/advice/guidance?  Any information that I have missed that needs to be provided?  My *nix skills are limited but given a pointer or direction I am happy to dig.

    Thanks in advance...any input is much appreciated.



  • I had a similar issue, though on an x1250e (same hw but with extra NICs).
    Some advice I have followed: upgraded to the 8.1 BIOS, enabled ACPI and DMA in the BIOS, and afterward did the boot mod (see the other x550 thread).
    Disabled PowerD, disabled SNMP, and I am using the LCDproc-dev package (and a few other packages).
    It survived 3 days on the test bench (without doing much). So far uptime is about 30 hours in a production environment; hoping it holds.
    See also p82 of the thread (https://forum.pfsense.org/index.php?topic=20095.msg497779#msg497779), where someone (harrowed) reported that in his case it was lcdproc that killed the box.
    I am still unsure what caused my issues (isolated cause or combination of events or …), as when it dies you are totally cut off as you say (no log no console)

    Hope you figure out your issue  ;)



  • Thanks bennyc.

    Many of the items you suggested have already been done but were not mentioned in my original post.
      ACPI, DMA, use LCDProc-dev

    Others will be tried at next convenience
      disable PowerD, snmp, boot mod (will have to find that)

    Any hints on locating the 8.1 bios?

    I will probably remove LCDProc-dev fully until I have isolated this issue, or at least until I can be more clear about the shellcmd to invoke it properly (as mentioned, I think I  have a mixed method)

    …on the x1250 of yours - I expect you have seen the fair amount in the forums pertaining to the expansion card interfaces becoming flaky after a period of time.  Don't recall if the flakiness included whole system hang or just the interfaces on the expansion module, though.

    Thanks for the tips.



  • Are you running the original Celeron inside with PowerD?  I ask because I am pretty sure I read somewhere (have no clue where to start looking…) that the Celeron and PowerD don't work well together.  I know that Celeron chips don't support Intel SpeedStep either, and SpeedStep is only supported in pfSense after changing a few things anyway.  I am not sure what kind of reaction PowerD gets on a plain x550e with no mods/config changes.

    If you haven't, you should check out this section of the doc that Stephen wrote up here:  https://doc.pfsense.org/index.php/PfSense_on_Watchguard_Firebox#X-Core-e  I quoted the relevant portion below.

    I am running an SL7EP and followed the instructions below and it seems to be rock solid.  An SL7EP can be had off of eBay for 5 to 10 dollars... I would consider such a mod to be almost mandatory.  The enhanced SpeedStep would be worth it.

    CPUs: The board in the X-Core-e can run a large number of CPUs. Any Celeron-M or Pentium-M will probably work including both Banias and Dothan core variants. The standard Celeron-M CPU is Banias Core and 400MHz FSB, if you replace it with a Dothan Core processor you must set the DIP switches correctly on the motherboard. There are two sets of DIP switches and the correct setting for both are marked on the board.

    Upgrading the CPU to a Pentium-M provides a useful increase in processing power especially the Dothan Core models with 2MB cache (4X the Celeron-M). In addition the Pentium-M supports Enhanced Intel Speedstep and will actually use less power than the standard CPU when it is enabled via powerd. Unfortunately the BIOS does not pass voltage/frequency information to the OS correctly so the only CPUs that can use speedstep under pfSense are those for which the values are already known by the driver. In practice this means only the 400MHz Pentium-M variants. E.g. Pentium-M 735.

    For a complete list see the est(4) driver source code.

    These are CPU's I have personally tested:

    • SL6N7 1.3GHz Banias Celeron-M (original cpu)

    • SL7GL 1.5GHz Dothan Pentium-M

    • SL7EP 1.7GHz Dothan Pentium-M

    Enabling Speedstep: To get it up and running you need to do a few things:

    • Set the timecounter to use the i8254 device with:
    sysctl kern.timecounter.hardware=i8254
    

    To make this setting permanent add it to the system tunables table in the webgui: System: Advanced: System Tunables:

    • Enable powerd in the webgui in System: Advanced: Miscellaneous:

    • To force it to use EST rather than throttling or p4tcc add the following lines to loader.conf.local

    hint.p4tcc.0.disabled=1
    hint.acpi_throttle.0.disabled=1
    

    ACPI throttling and p4tcc do not provide any measurable power saving.
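
    After a reboot, the result of the steps above can be checked by reading back the relevant sysctls. A minimal sketch: `list_freqs` is a made-up helper name, and it assumes `dev.cpu.0.freq_levels` reports space-separated `MHz/mW` pairs, as the FreeBSD cpufreq framework normally does.

```shell
#!/bin/sh
# list_freqs "LEVELS": print the MHz part of each "MHz/mW" pair, one per line.
# LEVELS is the value of dev.cpu.0.freq_levels, e.g. "1700/-1 600/-1"
# (hypothetical sample values, not measured on this hardware).
list_freqs() {
    for pair in $1; do
        echo "${pair%%/*}"   # strip everything from the first "/" onward
    done
}
```

    On the box itself: `list_freqs "$(sysctl -n dev.cpu.0.freq_levels)"` lists the available steps, `sysctl -n dev.cpu.0.freq` shows the current one, and `sysctl -n kern.timecounter.hardware` should print `i8254` if the tunable took effect. With powerd running and the box idle, the current frequency should drop below the maximum.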



  • @rvoelker:

    Any hints on locating the 8.1 bios?

    See here:  https://forum.pfsense.org/index.php?topic=20095.msg474892#msg474892

    If you don't have a video card with a ribbon/riser and keyboard input (something like this: http://www.ebay.com/itm/221541140503), then I would recommend using the method and CF image here:  https://doc.pfsense.org/index.php/PfSense_on_Watchguard_Firebox#Flashing_the_BIOS and just copy the BIOS that I linked above onto the CF card.


  • Netgate Administrator

    My own home box crashed out in similar fashion yesterday but I think it's a hardware issue. Do you just see the LCDd output on the display 'waiting for client' or some such?
    Running powerd on the Celeron shouldn't hurt; it just doesn't help at all, so it's further complexity that's unnecessary.

    Steve



  • I didn't have much time to work on this last night after my day job, but I am planning to set up a new baseline hardware config for foundational testing:
      BIOS firmware 8.1  (done)
      LCDProc-Dev - removed (not done)
      WGXepc - removed (not done)
      IPV6 - disabled (done)
      Deselect Powerd (not done)

    …essentially going for as stock a pfSense install as I can stand and build up from there.  Fan speed/noise is my biggest/first gripe, but that I will handle within the BIOS.

    Do you just see the LCDd output on the display 'waiting for client' or some such?

    No, but I had something like that when I was first setting LCDProc-dev up - before I had all the arguments and dependencies set properly.  LCDProc is a little funky for me, but because of me I am sure - sometimes it works, sometimes it works and shows the data I've selected from within the package configuration, sometimes it shows the data from the command line arguments, sometimes it just shows some default/meaningless data (Heartbeat, scroll speed, ??).

    In my freeze:  LCD backlight was off, control panel buttons were unresponsive (wouldn't even illuminate backlight).  The data on the LCD screen was frozen, but was one of the pages of data I had configured to be displayed (%CPU..and I think it was stuck displaying, like, 1.6%)

    Unfortunately, this particular deployment is network critical so I had to replace the Firebox with something else to ensure network continuity.  Getting good under-load data is going to be hard from here forward.

    I will update the topic as I go along if for no other reason than to make the info available to others but it may take a while due to the fact that the period of failure is as many as several days.


  • Netgate Administrator

    Ah, slightly different then. My box had time for the client to crash and the daemon to write to the display before it died.
    I will say that it was particularly hot at that time. The CPU didn't show hot, but that's the best cooled part. I turned up the fans a bit and it has been OK since.
    I was running the fans at 32; now I'm at 64. You were running 0b, quite a bit slower. I wonder if the power supply overheated and glitched? It has no fan other than the case fan.

    Steve



  • I wonder if the power supply overheated and glitched? It has no fan other than the case fan

    Reasonable consideration.  My particular box has a thermal conductive pad between the PS heat sink and the case cover - not sure if that is stock or something the previous guy added.

    Case temps, measured by hand, didn't seem alarming (but then I wasn't necessarily touching it at the time the event occurred).  I have since bumped the speed up to 0F, which is the audible threshold for me :o

    …I've never really liked this kind of switch-mode regulated power supply - seems they burn up caps pretty reliably over time.



  • For anyone that might find this thread again:
      - after a complete reload and a more careful install of the various Firebox add-ons the box appeared to have become stable again, at least with no traffic load on it.

    The only conclusion I can make, which isn't very scientific, is that during the first install I must've gotten something configured sideways, or my particular box has a hardware issue, or that the network interfaces hang after load.

    Unfortunately, I couldn't risk another freeze like that, and because I don't have enough time to troubleshoot it properly I ended up just switching to a different embedded appliance I had lying around.

    Thanks to those who offered input.

