New system rebooting / locking up



  • Hi all,

    I tried to deploy a new system recently….and it didn't go so well.

    Unit is a http://unixsurplus.com/product/2u-rack-server-intel-s5000psl-2x-xeon-e5345-quad-core

    pfSense installed fine, and the system sat on a bench and burned in for a week with a fresh install (installed packages, reloaded a few times). Had no problems.
    I added a 3rd M+NIC (Broadcom) to the system.

    Things went south when I plugged it into my network.

    I believe the issue may be the Intel NICs in this box.
    Initially the system would lock up hard and leave me at a "tracing command xxx pid " screen.
    So far I have been unsuccessful at getting a full capture of the errors. (tried a local install of syslog-ng and a remote syslog server)

    I thought I had it with the network card tuning parameters found here: https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards
    I thought I was on to something as the system rebooted rather than locking up hard.
    Then…it locked up hard again.

    Any advice on where to go from here?

    Any other Intel NIC tunables I should be looking at ??
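
For anyone else trying to catch the last messages before a lockup: below is a minimal sketch of a throwaway UDP syslog catcher that can run on any spare machine with Python 3. It is not what was used in this thread (that was syslog-ng plus a remote syslog server). The port 5514, the log file name, and the function names are assumptions made up for the example. A hard lockup may never get a message onto the wire, so treat this as a cheap extra chance rather than a guaranteed capture.

```python
import socket
from datetime import datetime


def open_listener(host="0.0.0.0", port=5514):
    """Bind a UDP socket. 5514 is an arbitrary unprivileged port (assumption)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_one(sock, log_file):
    """Wait for one datagram, append it with a local timestamp, flush at once."""
    data, (src_ip, _src_port) = sock.recvfrom(8192)
    text = data.decode("utf-8", errors="replace").rstrip()
    line = f"{datetime.now().isoformat()} {src_ip} {text}"
    log_file.write(line + "\n")
    # Flush every message so the final lines are on disk even if the
    # sender dies right after emitting them.
    log_file.flush()
    return line


def serve_forever(path="pfsense-remote.log"):
    """Log every received datagram to `path` until interrupted."""
    sock = open_listener()
    with open(path, "a", encoding="utf-8") as log_file:
        while True:
            print(receive_one(sock, log_file))
```

Call `serve_forever()` on the spare machine, then point the firewall's remote logging at that machine's address and the chosen port. Writing with a local timestamp means the last line in the file gives a rough time of death even if the firewall's own clock or log is lost.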


  • Netgate Administrator

    Sounds like a hardware issue. You only had any problems after you put the Broadcom card in?
    Could be a badly seated card. Perhaps it moved slightly when you plugged in the network. Perhaps you moved the box to the network location and something moved internally.

    Steve



  • No, sorry, Broadcom card has been in it from the get-go.

    I have three of these boxes….I'll try pulling the suspect off the rack tomorrow and try a clean reload on a different box, with a different Broadcom card.

    My suspicion is still the em(x) driver ...  do you happen to know if it's changed from the 2.1 release to the 2.1.4 release ??

    I was also thinking of dropping back to the 2.1 release version of pfSense


  • Netgate Administrator

    Yes it did, in 2.1.1:
    From https://doc.pfsense.org/index.php/2.1.1_New_Features_and_Changes:

    Updated em/igb/ixgb/ixgbe drivers that add support for i210 and i354 NICs and fix issues with ix* cards.

    The drivers gave a lot of trouble as well; they were backed out before being reinstated after some bugs were squashed. There were issues with AltQ. Are you doing traffic shaping?

    https://forum.pfsense.org/index.php?topic=72763.0

    Many people have been using them in production for a while though on a variety of hardware. What NICs do you have?

    Steve



  • Controller Intel® 82563EB Network Connection

    http://www.intel.com/cd/channel/reseller/apac/eng/products/server/boards/dp/s5000psl/feature/index.htm

    Looks like 82563eb according to the manual


  • Netgate Administrator

    Hmm, that chip appears to be a little unusual as it doesn't use a traditional PCIe connection. Instead it's connected directly to the ICH via some specialised interface, the Kumeran Interface. I have no idea if that's common or not. Perhaps you have an outlier device that hasn't seen much testing. More research needed.

    Steve



  • FWIW…

    I reloaded the box with 2.1 (release) and have been pounding it for the last half hour with iperf via the Intel built-in NIC(s).

    no lock up so far....

    C:\Users\sowen\Desktop>iperf -c 10.1.199.199 -t 30

    Client connecting to 10.1.199.199, TCP port 5001
    TCP window size: 8.00 KByte (default)

    [128] local 10.1.8.100 port 64084 connected with 10.1.199.199 port 5001
    [ ID] Interval      Transfer    Bandwidth
    [128]  0.0-30.0 sec  336 MBytes  94.0 Mbits/sec

    C:\Users\sowen\Desktop>iperf -c 10.1.199.199 -t 60
    ------------------------------------------------------------
    Client connecting to 10.1.199.199, TCP port 5001
    TCP window size: 8.00 KByte (default)

    [128] local 10.1.8.100 port 64114 connected with 10.1.199.199 port 5001
    [ ID] Interval      Transfer    Bandwidth
    [128]  0.0-60.0 sec  675 MBytes  94.4 Mbits/sec

    C:\Users\sowen\Desktop>iperf -c 10.1.199.199 -t 90
    ------------------------------------------------------------
    Client connecting to 10.1.199.199, TCP port 5001
    TCP window size: 8.00 KByte (default)

    [128] local 10.1.8.100 port 64163 connected with 10.1.199.199 port 5001
    [ ID] Interval      Transfer    Bandwidth
    [128]  0.0-90.0 sec  1014 MBytes  94.5 Mbits/sec

    Anything else I can do to stress test the NIC / driver?
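
If the iperf runs are going to be repeated over hours or days, it can help to collect the summary lines rather than eyeball them. The following is a rough sketch, assuming the console output is saved to a text file and that the summary lines look like the ones pasted above; `parse_runs` and `weighted_mean` are names invented for this example, not part of iperf.

```python
import re

# Matches iperf summary lines such as:
#   [128]  0.0-30.0 sec  336 MBytes  94.0 Mbits/sec
SUMMARY = re.compile(
    r"\[\s*\d+\]\s+(?P<start>[\d.]+)-\s*(?P<end>[\d.]+)\s+sec\s+"
    r"(?P<xfer>[\d.]+)\s+(?P<xunit>[KMG]?)Bytes\s+"
    r"(?P<rate>[\d.]+)\s+(?P<runit>[KMG]?)bits/sec"
)
SCALE = {"": 1.0, "K": 1e3, "M": 1e6, "G": 1e9}


def parse_runs(text):
    """Return one dict per summary line: duration in seconds, rate in Mbit/s."""
    runs = []
    for m in SUMMARY.finditer(text):
        runs.append({
            "seconds": float(m["end"]) - float(m["start"]),
            "mbits_per_sec": float(m["rate"]) * SCALE[m["runit"]] / 1e6,
        })
    return runs


def weighted_mean(runs):
    """Average rate across runs, weighted by how long each run lasted."""
    total = sum(r["seconds"] for r in runs)
    if total == 0:
        return None
    return sum(r["mbits_per_sec"] * r["seconds"] for r in runs) / total
```

Fed the three runs above, this gives a duration-weighted average of about 94.4 Mbit/s. Rates that sit around 94 Mbit/s are what a 100 Mbit/s link typically tops out at, so these particular runs may not have pushed the gigabit ports very hard.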


  • Netgate Administrator

    You can try making it do things like jumbo frames or VLANs and enabling all the hardware offload features. You could try disabling all the hardware offloading on the newer version.
    You could try a 2.2 snapshot which is using the native drivers from FreeBSD 10-rel rather than backported versions.

    Steve



  • No Joy, system locked up again overnight.

    That was with the 2.1 release install.
    I think that there is a hardware issue betwixt pfSense/FreeBSD and this motherboard's onboard NICs.

    I guess this machine will make a nice doorstop, I'm done fudging around with it….my users are getting kind of testy....

    Which is a shame, I really had high hopes for these machines (they are a short 2U, and fit nicely on a telco rack).

    Any knowledge of the ASUS KFSN4-DRE/SAS Motherboard ?
    -Dual Broadcom BCM5721 Gigabit Ethernet
    -Dual AMD Opteron Dual-Core 64 Bit 2356 2.2GHz CPU

    Otherwise it's back to HP hardware.
    DL 380 G5's are getting cheap.

    Thanks for your support and knowledge, I truly appreciate your time and effort.


  • Netgate Administrator

    If it's a driver issue you might consider running pfSense as a VM on the box.

    Steve



  • I thought about that…..

    but...

    I guess I'm just too "old school".
    It may work fine, but I have serious reservations about running my firewall/proxy as a VM.
    It just "feels wrong" to me.

    I'm not sure I'm quite ready to make that jump yet.

    (I may give it a shot just for academic purposes, but I don't think I'd go into production with it...)



  • Just an FYI on this:

    These boards (S5000PSL) appear to be buggy in several ways. I may go chuck these servers in the river.

    --- Won't run pfSense reliably on bare metal.
    System burn in on-bench is fine, sits and burns with no issues (even with me pounding on it w/iperf). But, when they are plugged into my network, they last anywhere from 2-4 days before they start going south, and hard rebooting.

    --- Entire system hard reboots under ESXi5 after a few days (running only a single pfSense virt. machine)
    --- Entire system hard reboots under ESXi4.1 (never even made it past the pfSense config restore...)

    I have had the same behavior on two completely separate machines.
    My opinion....... stay away from the S5000PSL boards.
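
For boxes that only fall over after a few days in production, one way to pin down exactly when each failure happens is to poll the firewall from another machine and log only the up/down transitions. This is a sketch under assumptions, not something used in this thread: it checks whether a TCP port on the box accepts connections, and the address, port, and interval in the usage note are placeholders.

```python
import socket
import time


def is_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(probe, interval=30.0, max_checks=None, sleep=time.sleep):
    """Call probe() repeatedly; print and collect only the state changes.

    max_checks=None runs until interrupted. Returns a list of
    (timestamp, up) tuples, one per UP/DOWN transition.
    """
    changes = []
    previous = None
    checks = 0
    while max_checks is None or checks < max_checks:
        up = probe()
        if up != previous:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, "UP" if up else "DOWN", flush=True)
            changes.append((stamp, up))
            previous = up
        checks += 1
        sleep(interval)
    return changes
```

Example use, with a placeholder address and port: `watch(lambda: is_reachable("192.0.2.1", 443))`. The timestamps of the DOWN lines can then be compared against traffic graphs or switch logs to see whether the reboots line up with anything on the network.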