Intel Atom C2xxx LPC failures
-
True… Of course, "critical" in this particular instance is because when the router goes down, my wife/kids whine. So, probably "critical" might be too strong of a term.
When my wife is upset, it's critical to me. That's why I have a spare SG-4860.
:)
-
At least the experience has taught me how to get the router working in a VM in case I ever have issues again. I guess what I need to learn is how to set up a VM so that it can more easily take over the router functions without my having to manually edit config.xml files (to change interface names from igbX to hnX) and then manually restore them. (I'd still have to manually move the WAN ethernet cable… I think.)
CARP and the redundancy features in pfSense work wonderfully for this… provided you have at least three static IP addresses on your WAN, which you probably don't on a standard residential connection. Otherwise, a simple nightly config backup and something like sed will easily provide you with a ready-to-go config file for your backup router if the hardware is different. If the hardware is identical, the backup alone is fine.
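For anyone who wants the sed idea in script form, here is a minimal Python sketch of the same substitution. It assumes the only difference between the two boxes is the interface prefix (igbX on the hardware, hnX in the VM) and that the numbering lines up one-to-one; the function name `rename_interfaces` and the sample XML are made up for illustration, so check the result against your own config.xml before restoring it.

```python
import re

# Assumption: the primary router uses igbN interface names and the backup VM
# uses hnN names with the same numbering (igb0 -> hn0, igb1 -> hn1, ...).
# The lookarounds keep the match from firing inside a longer word or number.
IFACE_PATTERN = re.compile(r"(?<![A-Za-z0-9])igb(\d+)(?!\d)")


def rename_interfaces(config_text: str, new_prefix: str = "hn") -> str:
    """Return config_text with every igbN name rewritten to <new_prefix>N.

    Names followed by a suffix (a hypothetical igb0_vlan10, for example)
    are rewritten too, since only the igbN part is matched.
    """
    return IFACE_PATTERN.sub(lambda m: f"{new_prefix}{m.group(1)}", config_text)


if __name__ == "__main__":
    # Hypothetical fragment, not a complete pfSense config.
    sample = "<interfaces><wan><if>igb0</if></wan><lan><if>igb1</if></lan></interfaces>"
    print(rename_interfaces(sample))
```

Run it against the nightly backup and write the output to a separate file, so the original backup stays untouched for restoring onto the real hardware later.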
-
I'll be glad to get the critical router back to dedicated hardware.
If it is critical you should have at least two.
That's a perfectly valid point for an enterprise environment.
The SG-2440 is advertised and targeted at:
https://store.pfsense.org/SG-2440/
Small Businesses
Small to Medium Sized Business Networks
Small to Medium Sized Branch Office
Managed Service Provider / Managed Security Service Provider (MSP/MSSP) On Premise Appliance
Teleworkers needing an "Always-Up" network or VPN connections
I fail to see why a small branch office should have a HA Router Setup.
And then there are claims about:
"No moving parts to wear out. This system is designed for a long deployment lifetime."
That should read: "This system is designed for a long deployment lifetime with increased likelihood of failure after 18 months and almost certain failure after 3 years"
Would anyone here purchase a second car - just because the car manufacturer refuses to fix a faulty engine component???
An HA setup is system architecture.
Not fixing a known issue is neglecting customers.
If driving is critical to you - just buy two cars?
-
Yeah. critical for those of us geeks that have pfsense at home is different than critical for the enterprise :)
My backup plan for my critical piece of equipment is to pull out my old dd-wrt router that's still configured to properly work with about 2/3rds of my stuff, and stick it in place until I can fix my pfsense router :)
Sure, at that point, I'm limited to 200-300mbit/s speeds but I can live with that and I'm sure my wife and kids would never notice :)
As long as they can get to their netflix and local plex server, and I can vpn to work, we're all happy.
-
I fail to see why a small branch office should have a HA Router Setup.
Because it's an incredibly cheap way to ensure that an office full of people doesn't go idle because there's a network problem. I'd also recommend a backup internet connection if it's at all possible. I guess if salaries are really low in your area then the extra couple hundred bucks isn't worth it, but in most places nowadays a day without internet is going to cause a lot more than a couple hundred bucks worth of loss.
You seem to think that the C2xxx errata is the only thing that can possibly go wrong with a firewall, but most people would be sad if their office went offline for some period of time because (to use an example I've seen personally) their $5 wall wart failed. An HA configuration prevents multiple failure modes.
-
Of course, there are many things that possibly can and will go wrong.
That's why you'd address the obvious and known issues first.
Looks like the C3000 has been announced… which will most likely replace my C2000 kit. I'm sure the C3000 will have its own quirks, but hopefully Intel learned from the clock issue.
I wouldn't want a "reworked" C2000 board anyway (if the soldering is done by a human).
Preference would be the Intel fix - or - an entirely new CPU (i.e. C3000).
I decided that next time I will "Roll my Own" hardware, so that I can blame myself if anything goes wrong. I still like the fanless ADI/Netgate kit, but I lost confidence in the company.
The advantage of the Netgate appliances is that they have no moving parts, don't require assembly and can be deployed quickly. But they are certainly not really that cheap, considering you can get more powerful hardware for the same money.
And despite the generous warranty extension I do not like the fact that Netgate won't proactively replace the systems for affected customers.
Let alone - continues to sell faulty units.
-
Wasn't necessarily talking about HA but, since the term critical was used in the context of the unlikely event of an LPC component issue, there should be at least a spare on the shelf. If it's a critical router.
-
since the term critical was used in the context of the unlikely event of an LPC component issue, there should be at least a spare on the shelf. If it's a critical router.
That is even true if you're that afraid of your wife.. ;D
I usually just offer taking mine out to dinner and all is good! ;)
-
For those with supermicro boards who were approved for, and submitted the cross-ship agreement, have any of you had any news from supermicro since returning the cross-ship agreement?
I got the impression from that document (in particular, the part stating "*The replacement will be shipped by FedEx/UPS Standard Overnight for domestic destinations…") that they try to get those cross shipments out and delivered quickly. If that's the case, at least some of you would have received a replacement board by now...
(I sent my form back yesterday, but it's been over 24 hours and I haven't seen anything back from them. Not a tracking number, or anything else.) I'm starting to wonder if they've actually shipped out ANY replacement boards yet (that have the issue resolved.)
That leads me to wonder if the "platform level change" that Intel claims works around the problem requires manufacturing a new revision of motherboards, or if it's something they can do as a "repair." If the former, I'm wondering if Supermicro is just going to wait for a new stepping from Intel instead of bringing up new tooling/etc to produce new boards.
THAT, in turn, makes me wonder if the supermicro boards that pfsense got as "advance replacements" are actually boards with the problem resolved, or if they are just more with the same issue, but held in reserve to quickly service failed devices. (I don't think that tidbit of info was mentioned.) If the latter, it would certainly explain why pfsense hasn't commented on if "current" stock has the issue already resolved or not. The answer MIGHT be that current (and even current replacement) stock doesn't resolve the issue, because their vendor hasn't sent them anything yet with the issue resolved.
(Of course, all this is pure speculation and guessing.)
Edit: 45 minutes after posting the above, I got a tracking number from supermicro... which kind of negates this entire post.
-
For those with supermicro boards who were approved for, and submitted the cross-ship agreement, have any of you had any news from supermicro since returning the cross-ship agreement?
…
That leads me to wonder if the "platform level change" that Intel claims works around the problem requires manufacturing a new revision of motherboards, or if it's something they can do as a "repair." If the former, I'm wondering if Supermicro is just going to wait for a new stepping from Intel instead of bringing up new tooling/etc to produce new boards.
...
I received my replacement on Friday and it's already in place. The board had a Tested 2/21/17 sticker on it from QA.
I honestly have no idea if it has the platform fix or not, but supposedly the platform fix can be retro-fitted. This is what servethehome has listed for supermicro:
Supermicro: RMA for platform-level workaround available for concerned customers. We also did confirm that Supermicro has implemented the platform level workaround in products shipped from January 2017 onwards.
I suppose at this point I'll have to wait and see :)
FYI - my backup plan? Ordering a 4-port Intel NIC to stick into my virtual host and setting up a backup pfSense instance there. I'm also thinking of picking up one of those $50 EdgeRouter X devices to play with :)
-
I received my replacement on Friday and it's already in place. The board had a Tested 2/21/17 sticker on it from QA.
I honestly have no idea if it has the platform fix or not, but supposedly the platform fix can be retro-fitted. This is what servethehome has listed for supermicro:
Did you notice if it's a REV 1 board, or if they bumped the REV to 02? Is there any sign of jumper wires on the board?
-
Did you notice if it's a REV 1 board, or if they bumped the REV to 02? Is there any sign of jumper wires on the board?
I found nothing different other than the QA sticker.
I sent an email asking about any confirmation of the fix
Thanks for processing this. I received the RMA on Friday. There doesn't appear to be any distinguishing marking (a rev bump or anything like that) to note whether or not the board has the platform level workaround implemented for the Atom CPU flaw. Is there any way to get some kind of confirmation that it actually has that workaround implemented?
And this was the response
Hello
The replacement has the issue fixed.
Guess I just have to trust them :)
-
Does the new replacement board have a different stepping if you check via command line?
I got my replacement last night, and I see NOTHING different whatsoever on the board (other than it's an obvious refurb that hasn't been as gently handled as my original.) The CPU stepping is also identical: Origin="GenuineIntel" Id=0x406d8 Family=0x6 Model=0x4d Stepping=8
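For reference, the Family/Model/Stepping values on that line are just fields packed into the Id value. Below is a small Python sketch that unpacks them using the standard Intel CPUID leaf 1 (EAX) bit layout; the function name `decode_cpuid` is made up for illustration. It is handy for comparing the Id values of two boards if the boot log only shows the raw hex.

```python
def decode_cpuid(signature: int) -> dict:
    """Split a CPUID leaf 1 EAX signature into family, model and stepping.

    Bit layout (Intel): stepping [3:0], model [7:4], family [11:8],
    extended model [19:16], extended family [27:20].
    """
    stepping = signature & 0xF
    base_model = (signature >> 4) & 0xF
    base_family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


if __name__ == "__main__":
    fields = decode_cpuid(0x406D8)  # the Id value quoted above
    print({name: hex(value) for name, value in fields.items()})
```

For 0x406d8 this gives family 0x6, model 0x4d and stepping 8, matching the values in the post. Identical fields on two boards only mean the silicon revision is the same; they say nothing about changes made elsewhere on the board.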
On my replacement (which was a cross-ship), I've had a few problems already. I've had to clear CMOS a couple times to get it booting, and then it crashed (kernel crash) in the middle of booting pfsense, which then resulted in a corrupt filesystem (and we all know how poorly pfsense 2.3.x deals with that.)
All of these issues COULD be related to the CMOS being whacked out.
Since then, I pulled the CMOS battery, erased CMOS again (several times), reconfigured BIOS, completely reinstalled pfsense and restored a backup configuration. So far, it doesn't seem to be doing anything bad… but it hasn't even been 24 hours since I got it working properly.
-
Does the new replacement board have a different stepping if you check via command line?
The platform level workaround doesn't change the cpu stepping. I'm not sure if Intel is shipping any new silicon yet.
On my replacement (which was a cross-ship), I've had a few problems already. I've had to clear CMOS a couple times to get it booting, and then it crashed (kernel crash) in the middle of booting pfsense, which then resulted in a corrupt filesystem (and we all know how poorly pfsense 2.3.x deals with that.)
All of these issues COULD be related to the CMOS being whacked out.
Since then, I pulled the CMOS battery, erased CMOS again (several times), reconfigured BIOS, completely reinstalled pfsense and restored a backup configuration. So far, it doesn't seem to be doing anything bad… but it hasn't even been 24 hours since I got it working properly.
I didn't have any of those problems and mine's been in place since Friday afternoon without problems so far.
-
My replacement is getting even worse: it keeps shutting itself down for no reason within minutes of rebooting. I installed my original board again and it's working. Obviously the replacement isn't fixed at all.
-
I also got a replacement from Supermicro. So far it's been good, no issues. Compared both boards side-by-side, I see no physical differences on the board itself. I also don't see any QC sticker or anything. I did notice that they removed a sticker on the top of the LAN port (that previously had a serial number on it). I only know because I can still see some adhesive from where the old sticker used to be. They replaced it with a similarly sized sticker that has the barcode, serial, and the date (2/17). When I emailed them asking how I can tell the difference, I literally got the same "The replacement has the issue fixed." response. Very frustrating.
One thing I did notice that is very different, is now on my pfSense dashboard, under System it says:
System Super Micro C2758
Serial: zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
With my old board, it actually showed me the serial number. Now, it just shows what looks like a randomly generated UUID. I have no idea what caused that… Does anyone else have the same thing?
-
My replacement is getting even worse: it keeps shutting itself down for no reason within minutes of rebooting. I installed my original board again and it's working. Obviously the replacement isn't fixed at all.
Within 24 hours… The replacement board started to have NIC dropouts on all 3 of the i354 controllers in use. When this happens, the switch reports that the cable is unplugged (and then later plugged in.)
I called supermicro and complained. A lot. (They wanted me to send them email. I explained that email wasn't acceptable and wasn't going to make me go away.) I ended up having to fill out another RMA. I have to RMA this replacement board for another replacement board while the original RMA is left open (and the hold on my CC still in place.) Once I get a working board with the intel issue supposedly resolved, I'll send back my original board and they'll release the CC hold.
Damn annoying, but still better than having to wait until the original board completely fails before they'll replace it.
@pfcode, I'd suggest calling supermicro's RMA dept and complaining a bit...
Take care
Gary
-
yeah. so this is why I'm not rushing to replace my supermicro avoton gear…
-
Damn annoying, but still better than having to wait until the original board completely fails before they'll replace it.
All boards, Clock Component or not, might fail. Yours might never fail.