WatchGuard Firebox: Core-e and Peak-e series



  • This is a short mission to get the Core-e and Peak-e series production ready. I'll be covering issues pertaining to pfSense 2.2 on the nanobsd platform on the x1250e and the x8500e-F because that is what I have. I have a medium-sized network with 1000+ devices including virtual infrastructure, network printers, VoIP phones, thin clients, site-to-site VPN, wireless access points, and hundreds of guest users. Some topics referenced here will be:

    • Watchdog timeouts causing sk0 (or other interfaces) to hang
    • Unable to stop transfer of Tx and Rx descriptors
    • LCDProc and WGXepc issues
    • DMA issues
    • Encryption accelerator boards
    • CPU and RAM upgrades

    I would also like to recommend the SanDisk SDCFJ CompactFlash cards. For Suricata IDS/IPS to function, a CF card of at least 2GB is required. I consider Suricata necessary if you have an internet-facing firewall.

    All testing with pfSense was done with the 8.1 BIOS flashed, which is available from HERE.

    Please follow the wiki instructions first, as they will be updated and easier to follow than a troubleshooting thread.

    If you can't boot because you're getting messages that say "READ_DMA. CAM status: Command Timeout" then here's a helpful note from Stephenw10:

    If you've already upgraded or you're booting a fresh card then you need to interrupt the boot loader when it starts counting down from 4. You'll see the OK prompt. At the prompt enter:

    set hint.ata.0.mode=PIO4
    boot
    

    That will allow the card to boot and you can then add the line to /boot/loader.conf.local
    You can create it and put the line into it by executing this in the Diagnostics > Command Prompt Execute Shell command box:

    echo 'hint.ata.0.mode=PIO4' >> /boot/loader.conf.local
    

    Also, if your /var RAM disk is running out of space you can adjust the RAM Disk Settings by going to System > Advanced > Miscellaneous



  • You got my attention here… I have an x1250e in production which, since the 2.2 upgrade, is requiring way more attention than I currently have available. (It worked flawlessly on 2.1.x.)
    Keep going... What can you share about those sk interfaces & timeouts? Eagerly trying to get it stable for production.  8)



  • I can trigger a watchdog timeout on demand now. This is on the x1250e, with sk0 as LAN, sk1 as LAN_GUEST and msk0 as WAN, set up as a basic firewall. WAN passes traffic to LAN, but when LAN_GUEST is connected to the guest network (a Zyxel GS1910-24HP PoE switch) the Firebox immediately says "sk1: watchdog timeout" followed by multiple "sk0: watchdog timeout" and "sk0: can not stop transfer of Tx descriptor" messages.

    EDIT: Looks like some misconfigured VLAN issues on these switches. Switching over to the pfSense box I'm trying to replace (a Dell PowerEdge 1950 with four quad-core Xeons and 16GB of RAM) killed the lighttpd interface. Need to look into securing lighttpd against internal attacks.

    EDIT 2: Still investigating, but so far mitigating the VLAN issue has resulted in watchdog timeouts no longer occurring on the x1250e.

    Also, the last guy to configure pfSense for the guest network had a /16 network set up in the DHCP Server, which just crashed dhcpd on the aforementioned Dell PowerEdge. Nothing like setting aside address space for roughly 65,000 users for no reason. Keep your subnets lean, people, or else your routing gear will suffer.

    EDIT 3: On a side note, it seems plausible that the drivers weren't tested with large subnets and that they might choke on them. Can someone else test this? Just create a DHCP server with a /16 or larger pool, and let it serve a few clients on your test network.
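
    For anyone sizing a guest pool, here is a minimal sketch of the arithmetic behind "keep your subnets lean". It only counts usable IPv4 addresses for a given prefix length; usable_hosts is a hypothetical helper name, and the /23 is just an example of a smaller pool, not a recommendation from this thread.

    ```shell
    # Number of usable IPv4 host addresses in a subnet with the given prefix length.
    # usable_hosts is a hypothetical helper name used only for this illustration.
    usable_hosts() {
        # $1 = prefix length, 1..30 (network and broadcast addresses are subtracted)
        echo $(( (1 << (32 - $1)) - 2 ))
    }

    usable_hosts 16   # the /16 that was configured for the guest network
    usable_hosts 23   # a /23 still leaves room for several hundred guests
    ```

    The first call prints 65534 and the second prints 510, so a few hundred guests fit comfortably in a pool far smaller than a /16.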



  • On a side note, it looks like this particular motherboard series can accept a Pentium M 780 @ 2.26GHz - has anyone snagged one of these as an upgrade to a Celeron-based system yet? The x8500e-F comes with a Pentium M 760 @ 2.0GHz, and it has tested well at raw transfers so far - benchmarks to come soon after DMA issues are handled. These Dothan CPUs have an integrated thermal monitor, which makes for a nice addition on the pfSense dashboard. There is a virtually unlimited supply of these processors on eBay.

    EDIT: An even more economical upgrade is the Pentium M 770 @ 2.13GHz, whose frequency is a matched clock multiplier to the FSB and can be obtained for less than $8 shipped on eBay



  • Found some information regarding the sk driver experiencing a watchdog timeout at freebsd.org HERE

    Then there is this thread that suggests contacting the sk driver maintainer for assistance, Pyun YongHyeon.

    Found a directory with "current" drivers for FreeBSD: http://people.freebsd.org/~yongari/msk/

    Stephenw10: Syslog gives few clues to what's happening besides the traffic issue, and I lose connection when the interface goes down. Is there some debug output that can be enabled in the driver or kernel that can be monitored from a console? BTW, you have really put a ton of work into this project! Thank you! I've been eyeballing these WatchGuard appliances since I got one back in 2007 - the progress that has been made is astounding!



  • It's only been 30 minutes, but I'm hammering all interfaces with massive packet transfers and no timeout yet. I think I may have inadvertently solved the watchdog timeout issue, and if so the fix is incredibly simple. Keep your fingers crossed!

    EDIT: I killed one of the msk interfaces (cable modem) and got a watchdog timeout error, but it was different this time. I never lost connectivity to pfSense, and I'm pretty sure it was me enabling all hardware offloading in System > Advanced > Networking that caused the hangup.

    msk1: watchdog timeout
    msk1: prefetch unit stuck?
    msk1: initialization failed: no memory for Rx buffers
    

    Before the changes were made, connecting the x1250e to the guest network (Zyxel switch) would cause sk1 watchdog timeout messages within 60 seconds. I went into the BIOS to enable ACPI so I could get a CPU temperature monitor on the pfSense dashboard, and then I decided to poke around in the PnP/PCI Configurations and changed the Maximum Payload Size from 128 bytes to 4096 bytes. Now everything seems to be running stable with hundreds of guests on the wifi. I found THIS article explaining what this BIOS setting does, and what the default should be.

    I will say that the 1300MHz Celeron is pegged occasionally at 100%, but at least this x1250e is showing signs of stability now.

    EDIT 2: Something happened with the x1250e and several dumb switches (GigaFast EZ500-S) that caused them to hang and require a power cycle to restore service. I guess it's time to bust out the Fluke network analyzer and launch Wireshark on a few machines.


  • Netgate Administrator

    Hey Fibrewire, great discovery on that BIOS setting if it's right.  :)

    A couple of things. You should change the title of this thread to specify the Core-e and Peak-e boxes because the original X-Core and X-Peak were very different beasts.
    If you are upgrading the CPU you should look for one with a 400MHz FSB if you intend to use powerd because there is no direct support for the 533MHz CPUs in est(4).

    I've never managed to get any sort of error on the sk NICs though it seems like you're giving them a far better test than I ever have.
    Following this thread with interest.

    Steve



  • Stephenw10: Thanks for the advice! I probably won't progress with powerd since with ACPI enabled in the BIOS the temp gauge is now active on the dashboard. I have confidence that the Maximum Payload Size setting has resolved one issue, but created another. The CPU is definitely more active now - is there any way to look at the BIOS configuration from before the flash for pfSense?

    EDIT: I failed to mention that Suricata was running on 4 WAN connections. Since downgrading to one protected WAN, the CPU seems to act normally.



  • @fibrewire:

    On a side note, it looks like this particular motherboard series can accept a Pentium M 780 @ 2.26GHz - has anyone snagged one of these as an upgrade to a Celeron-based system yet? The x8500e-F comes with a Pentium M 760 @ 2.0GHz, and it has tested well at raw transfers so far - benchmarks to come soon after DMA issues are handled. These Dothan CPUs have an integrated thermal monitor, which makes for a nice addition on the pfSense dashboard. There is a virtually unlimited supply of these processors on eBay.

    EDIT: An even more economical upgrade is the Pentium M 770 @ 2.13GHz, whose frequency is a matched clock multiplier to the FSB and can be obtained for less than $8 shipped on eBay

    I've run an M780 for a couple of years now. No issues other than what Steve mentioned regarding powerd.



  • The developer of the FreeBSD msk driver states "LRO is not supported" so I've disabled LRO and the issue with Hardware Prefetch Stuck? has not resurfaced. It's only been an hour but knowing that Hardware Checksum and Hardware TCP Offloading are supported makes me feel confident about the Firebox with pfSense as a solid platform. If this configuration can survive the day I will start adding internal network devices to stress-test this box.

    EDIT: Still getting the odd one-off that causes a timeout on an msk interface, but oddly the firewall never hangs on the sk interface (although I'm not really trying to break the LAN at this point). However, the firewall must be power-cycled because the msk interface becomes wedged.
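
    A minimal sketch for checking this from a shell rather than the GUI, assuming the interface reports its enabled features on the options=...<...> line of FreeBSD's ifconfig output. has_option is a hypothetical helper name, not part of pfSense.

    ```shell
    # Read `ifconfig <iface>` output on stdin and test whether a flag such as LRO
    # appears in the "options=...<A,B,C>" line. has_option is a hypothetical helper.
    has_option() {
        # $1 = option name to look for, e.g. LRO, TSO4, RXCSUM
        sed -n 's/^[[:space:]]*options=[0-9a-fA-F]*<\(.*\)>.*$/\1/p' \
            | tr ',' '\n' \
            | grep -qxF -- "$1"
    }
    ```

    For example, running ifconfig msk1 | has_option LRO && echo "LRO is still enabled on msk1" prints the message only when the flag is listed. The lasting setting is still the one in System > Advanced > Networking.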


  • Netgate Administrator

    @fibrewire:

    • is there any way to look at the bios configuration before the flash for pfSense?

    Sorry I meant to reply here earlier
    The original WatchGuard BIOS setup was just using the default values. I never changed the default values for that setting so it will be the same. The only thing that may be a factor is that each setting has two default values, one for fail-safe default and one for performance-default. Pretty sure WatchGuard just used the fail-safe values.

    Steve



  • I currently have 300+ clients streaming whatever at an average of 100MBit across three WAN connections, and everything remains solid without error during spring break. Looks like that BIOS setting did it for me.

    I have discovered a new issue that causes a network interface to hang with Suricata installed. To replicate the issue, create a Suricata monitor on one WAN interface and have it enabled and blocking. When a second Suricata interface is created, it will cause the interface it is created for to error out. This error will occur regardless of whether Suricata is enabled for the second interface. In my case anything over one Suricata interface crashes, but I also have three WAN interfaces with the same gateway (DHCP address from Time Warner) for my gateway pool.

    EDIT: 88.5 MBit is the max sustainable with the 1300MHz Celeron CPU across three WAN connections; the 2.13GHz CPU should be here any day for testing. Despite maxing out the CPU for speedtests, the x1250e is running solid with hundreds of guests actively using the network. I decided to point Suricata at the internal LAN interface to figure out what is breaking my network :P

    EDIT 2: Disabling Suricata allows me to max out all three WAN connections and only hit 46% CPU usage. Very interesting…



  • Although the x1250e is a heavy hitter with the 2.13GHz CPU and 2GB of RAM, I still get the occasional interface watchdog timeout. I'll update shortly with log info. At this point I would have to recommend against using the Core-e and Peak-e series in a production environment.



  • I wanted to try a MicroDrive in the Watchguard Firebox, and came across this link - is it really this easy to resolve the elusive "watchdog timeout" issue? I will post the results here.

    EDIT: Found the specifics of these tunable options on the "Tuning and Troubleshooting Network Cards" section of the pfSense documentation here


  • Netgate Administrator

    No it isn't, in my opinion.  ;)
    Not quite sure where that info first came from but it was in the main Xe thread for a while. Some of those settings only apply to Realtek or Broadcom cards, so they are pointless here. The others disable MSI and MSI-X globally rather than just for msk. The final setting may be worth investigating.
    However I've still not seen a timeout with the one recommended setting so I'm clearly not testing as rigorously as you.

    Steve



  • @stephenw10:

    However I've still not seen a timeout with the one recommended setting

    I've got 2 servers and one WatchGuard running pfSense, and somehow in my last reinstall I put the settings into the wrong firewall. Now the /boot/loader.conf.local on the WatchGuard Firebox reads:

    hint.ata.0.mode=PIO4
    hw.msk.msi_disable=1
    

    … and my problem hasn't resurfaced for 10 minutes or so, which is better than the 30 seconds before "watchdog timeout" that I was experiencing whenever I connect the guest wireless.

    Thank you Steve, now let's see if it stays up until Friday :D
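
    Since the settings ended up on the wrong box once, here is a minimal sketch of a check that can be run on each firewall. It assumes the two lines shown above are the ones that matter and that they must match exactly; check_tunables is a hypothetical helper name, not a pfSense command.

    ```shell
    # Report which of the two tunables from this thread are missing from a loader config file.
    # check_tunables is a hypothetical helper name, not a pfSense command.
    check_tunables() {
        # $1 = path to the file to check (defaults to /boot/loader.conf.local)
        conf="${1:-/boot/loader.conf.local}"
        missing=0
        for line in 'hint.ata.0.mode=PIO4' 'hw.msk.msi_disable=1'; do
            # -x matches the whole line, -F treats the dots literally
            if grep -qxF -- "$line" "$conf" 2>/dev/null; then
                echo "ok:      $line"
            else
                echo "MISSING: $line"
                missing=1
            fi
        done
        return "$missing"
    }
    ```

    Calling check_tunables with no argument inspects /boot/loader.conf.local and returns non-zero if either line is absent, so it also catches a missing file or a line with stray trailing whitespace.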


  • Netgate Administrator

    Easily done.  :)
    Yep, that's what my box reads as.

    Steve



  • It's been 6 hours and things are holding steady.

    EDIT: 22+ hours, still no issues. I see light at the end of the tunnel! :D



  • stephenw10: Thanks again! I think that the documentation could be modified to include those two settings as mandatory for the Core-e and Peak-e series ;) I have 5+ days of uptime with hundreds of users, load balancing 3 modems, 2 LANs, and one static modem connection carrying dedicated SIP trunks, email, webserver traffic, etc. Thank YOU!

    I deem this firewall "PRODUCTION READY!"



  • Netgate Administrator

    Nice.  :)
    Thanks for the update.
    You're right the documentation needs updating badly, it's tripping up a lot of people right now. I'll try and at least remove the parts that are actually wrong this weekend. I confess that supporting pfSense for a living has taken some of my enthusiasm for doing it in my free time!  ::)

    Steve



  • Just a quick update before I upgrade to 2.2.3: been up for over 60 days with no problems. A word of advice: make sure multiple internal networks block traffic from each other. Otherwise noisy broadcast devices cause the occasional interface to hang in only one direction (receive).

    Thanks again to everyone who made this possible. pfSense on WatchGuard - a professional combination.




  • I had numerous issues with the firewall because I mistyped a configuration option upon first setup. This setting is not included in any pfSense backup, and must be entered BEFORE the WatchGuard Firebox fully boots pfSense. When booting a fresh CF or Microdrive on a WatchGuard box you need to interrupt the boot loader when it starts counting down from 4. You'll see the OK prompt. At the prompt enter:

    set hint.ata.0.mode=PIO4
    set hw.msk.msi_disable=1
    boot
    
    

    That will allow the card to boot and you can then add the lines to /boot/loader.conf.local
    You can create it and put the lines into it by executing this in the Diagnostics > Command Prompt Execute Shell command box:

    /etc/rc.conf_mount_rw
    echo 'hint.ata.0.mode=PIO4' >> /boot/loader.conf.local
    echo 'hw.msk.msi_disable=1' >> /boot/loader.conf.local
    /etc/rc.conf_mount_ro
    
    

    The Hitachi 4GB Microdrives are much faster than any CF card that I've used so far, and don't suffer from write limitations of flash memory (I've had to replace CF several times due to logging wearing out the CF card.) Also, they are $4 apiece on eBay - an actual tiny hard drive! When using a Microdrive, one can set NanoBSD to permanent read/write mode which eliminates slowdowns that users experience with the WebGUI.
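
    Because >> appends every time it is run, repeating the commands above leaves duplicate lines in the file. A minimal sketch of a re-runnable version follows, assuming an exact-line match is good enough. add_tunable and the LOADER_CONF variable are hypothetical names introduced only for this example.

    ```shell
    # Append a line to the loader config only if that exact line is not already present.
    # LOADER_CONF is a hypothetical variable; it defaults to the path used in this thread.
    LOADER_CONF="${LOADER_CONF:-/boot/loader.conf.local}"

    add_tunable() {
        # $1 = exact line to add, e.g. hw.msk.msi_disable=1
        touch "$LOADER_CONF"
        if ! grep -qxF -- "$1" "$LOADER_CONF"; then
            printf '%s\n' "$1" >> "$LOADER_CONF"
        fi
    }
    ```

    On nanobsd the calls would still go between /etc/rc.conf_mount_rw and /etc/rc.conf_mount_ro, for example add_tunable 'hint.ata.0.mode=PIO4' followed by add_tunable 'hw.msk.msi_disable=1'.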

