Hardware fault / possibly replace

andyh

Hi,

Looks like we may be looking to replace our existing hardware platform due to reliability issues. Currently running:

Asus P5N-E SLI Motherboard
Intel Pentium Dual Core E2220 2.4GHz
2GB RAM
Antec 550W PSU
Intel PRO/1000 PT Quad Port PCIe Server Adapter
Intel PRO/1000 PT Dual Port PCIe Server Adapter
76GB SATA HDD

This is all housed in an Antec 3U case, using a non-standard power supply.

We are seeing intermittent failures of the system (generally around 18:00) with no error messages in the logs. My initial thought was a PSU fault, this was replaced in February with no subsequent failures but last night it failed again. I came in this morning and it powered up OK, then I went into the BIOS (with network cables connected) and the power failed again. I disconnected the network cables, made sure that Wake-on-LAN was disabled in the BIOS (I've got an odd feeling about a remote shutdown?!), and reconnected the network cables. The system appears to have been OK since.

Due to the importance of the system, I need to look at providing a more resilient/reliable platform to run it on. To this end I'm looking for suggestions for hardware to build on. It needs to have at least five NICs (ideally reusing the existing cards).

I also plan to look into duplicating the installation and using CARP for failover, though I'm not sure whether this can be achieved with so many interfaces.

Any advice / pointers gratefully received.

Andy

mrbostn

Well, the experts here (I am not one of them) will really want to know more details, such as:

how much bandwidth
how many users, what packages, etc.

When I saw your mobo with SLI, "desktop class" came to mind.

Get yourself a real server mobo (Supermicro/Tyan/Asus server class) and start from there.

jaime

If you disabled Wake-on-LAN (WOL) in the BIOS and the machine hasn't shut down since then, WOL may well have been the issue. If you need to have WOL enabled, the machine could be receiving instructions from another source on your network, so you may want to look into whether another machine is triggering it somehow. Or is the machine actually showing signs of hardware failure?

And as the post above me said, can you give us more specs, along with more info about what's going on and the symptoms?

andyh

We're running imspector, squid and squidguard on the firewall, with around fifty users connected through it. We have plans to offload the squid/squidguard to a different solution on another device.

The problem we see is that all connectivity fails, and when we check the physical device there is no power to the unit; everything else in the rack is fine. To our knowledge there are no users of WOL on our network. This may be unrelated, but it was something I thought of.

We are aware that it's a desktop board and we should be looking to a more suitable solution, hence the thoughts of moving platforms.

jaime

And in this case, does "unit" mean the entire box, or the rack itself (that's providing power to the computer, etc.)?

If you mean "unit" as in the computer itself (mobo, PSU, CPU, etc.) then it's very possible that the PSU is faulty; I would check that. Now, when you say in your original post "My initial thought was a PSU fault, this was replaced in February with no subsequent failures but last night it failed again", are we talking about the PSU only? Is that what failed again?

Also, besides the BIOS adjustments, what other troubleshooting steps have you done on the unit? Have you done anything recently like upgrading RAM, replacing the HDD, etc.?

andyh

Apologies for not being clearer; by the unit I am referring to the firewall itself.

With relation to upgrades, the RAM in the firewall was upgraded at the start of February and the dual-port NIC was also installed. The power outage issues appeared to start shortly after this. The additional components were then removed and the firewall was run in its original config. The outages still occurred :( This led me to suspect the PSU, resulting in the replacement; I believe the original was only 350W.

After the PSU was replaced back in February, no outages were experienced. The additional components were then re-installed a few days after the PSU replacement and ran without issue until last night.

jaime

OK, from this, one of three things (maybe more, but for now we are going to isolate these) is causing this: the PSU, the RAM, or that dual NIC. First off we need to get to a configuration that stops the outages (the original one?). Then add parts back ONE AT A TIME. Add the PSU first and test with just the PSU in, with the other add-ons/upgrades left out. If no outage occurs, add one stick of RAM at a time (this is assuming you have more than one stick of "newer" RAM), testing each stick individually until all sticks have been tested successfully, then put the sticks in together with the PSU and test for a good amount of time. If no outage occurs, move on to the next piece: take the new RAM back out (original RAM config), add the dual NIC, and test. If there is still no outage with the PSU and dual NIC in, add the RAM back one piece at a time.

The main point is to test each piece in an isolated manner to see if one of them is the issue. Since it started near the time the upgrades happened, the parts related to the upgrade are where we're going to start.
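
To make the order concrete, here is a rough sketch that just lists the builds to test in sequence. The part labels and the two-stick RAM assumption are made up for illustration, so adjust them to whatever you actually have. Each step is the list of upgrade parts fitted on top of the original build.

```python
def isolation_plan(ram_sticks):
    """Return the ordered list of builds to test, one change at a time.

    Each entry lists the upgrade parts fitted on top of the original
    config; an empty list means the original config with nothing added.
    """
    plan = [[]]                      # 1. original config, known starting point
    base = ["new PSU"]               # hypothetical label
    plan.append(list(base))          # 2. PSU only

    # Each newer RAM stick on its own, then all of them together.
    for stick in ram_sticks:
        plan.append(base + [stick])
    if len(ram_sticks) > 1:
        plan.append(base + list(ram_sticks))

    # Dual NIC with the newer RAM taken back out.
    with_nic = base + ["dual-port NIC"]
    plan.append(list(with_nic))

    # RAM added back one stick at a time alongside the NIC.
    fitted = []
    for stick in ram_sticks:
        fitted.append(stick)
        plan.append(with_nic + fitted)
    return plan


if __name__ == "__main__":
    # Assumption: two newer sticks; the names are placeholders.
    for number, parts in enumerate(isolation_plan(["RAM stick A", "RAM stick B"]), 1):
        print(number, ", ".join(parts) or "original config only")
```

Run each build for long enough to cover the time of day the outages usually happen before moving on to the next one.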

Sorry if it seems a tad rushed, I'm actually leaving work. If you are at all confused I'll try to give more details when I get home… but at least it's a place to start :)

netmethods

If you're looking for something more suited to firewall/network environments, check out Nexcom or Lanner.

andyh

I neglected to mention that when I replaced the PSU, I reinstalled the new components over the course of a number of days. This allowed me to test the stability of the new components.

Interestingly, things have seemed stable for the past few days. It may have been connected to rogue WOL packets, although I can't find much information relating to shutdown via WOL packets.

With relation to devices such as Nexcom and Lanner, I presume both of these devices will run pfSense OK?

jaime

Hmmm… OK, well, rogue WOL packets will definitely have the ability to cause an issue like this. Since it seems stable now, just keep watch on it. If the issue does happen again, check the connection logs if possible (I don't know exactly which ones to look at at this time) and see which machine's IP was connected when the issue occurred (assuming you have a general idea of the time frame).

andyh

I have a pretty good idea of timings, as I use Xymon to monitor my servers/services.

It looks like WOL packets operate at layer 2, according to Wikipedia:
"The magic packet is sent on the data link or layer 2 in the OSI model and broadcast to all NICs within the network of the broadcast address; the IP-address (layer 3 in the OSI model) is not used."

Would I be right in thinking that firewalls operate at layer 3, so disabling WOL within the BIOS is a must?
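
In case it helps anyone checking a packet capture for these, here is a minimal sketch of the payload, assuming the standard magic packet layout (six 0xFF bytes followed by the target MAC address repeated sixteen times). The function names and the example MAC are made up for illustration. As far as I can tell the layout only identifies the machine to wake; it has no field for a shutdown command.

```python
def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte magic packet payload for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Six 0xFF bytes, then the target MAC repeated sixteen times.
    return b"\xff" * 6 + mac_bytes * 16


def contains_magic_packet(payload: bytes, mac: str) -> bool:
    """True if the payload carries a magic packet for the given MAC anywhere in it."""
    return build_magic_packet(mac) in payload


if __name__ == "__main__":
    # Placeholder MAC, not a real device on this network.
    packet = build_magic_packet("00:11:22:33:44:55")
    print(len(packet), packet[:12].hex())
```

`contains_magic_packet` could be pointed at raw frame payloads from a capture to see whether anything on the LAN is sending magic packets for the firewall's NICs.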

jaime

Yep, that's correct (unless the OSI model changed without my knowledge) lol! But yeah, that is pretty much spot on, so if you disabled WOL and the issue went away then you should be good to go.

andyh

Woah… conspiracy theories abound!!! LOL!

jaime

LOL! Well, hopefully the OP will let us know if the issue has reoccurred at all after disabling WOL…