Intel Atom C2xxx LPC failures
-
I feel compelled to add my two cents here… Everyone who is talking about HA is missing the mark imo. The decision on whether or not to deploy a HA solution is based on a risk and cost analysis, it is not pulled out of a hat.
This increased failure rate does not expose poor designs (the "haha should have designed in HA solution" attitude), it potentially changes the equation a design was built on. It's easy to talk about HA when that means you only had to buy 1 more firewall, but at a certain scale you have to weigh the costs on both sides.
-
What happens when it fails and the client is down for days while a new one is shipped (hey HA people, the client didn't know they would need HA at time of purchase. If they had known that there was an expected failure, they could have used that fact while determining what to purchase)
So, why didn't they know? Someone should have told them. Hardware breaks, all the time, everywhere, suddenly. So there's always an expected failure.
They have no spare or support contract for their device, they are down. This is not specific to some Intel issue.
Of course they know hardware can break. They didn't know this specific piece of hardware will break (cue the various arguments about if it will or not - not relevant here. Cisco certainly seems to think it will.). And of course, no matter what, if something breaks, it will take them down. So when purchasing, they did the math and said "Chance of it breaking vs cost of HA - we will risk it". Now that the chance is significantly higher due to this specific Intel issue…, that math doesn't work anymore - doing that math now makes it look like "chance of it breaking vs ___ - we will buy another product".
Oh noes, I'm absolutely doomed because my oh so supercritical router's downtime is costing me gazillions $$$s per hour – but of course I did not know it could ever break so I have no backup solution anywhere. Netgate, you suck big time!!!
::) ::) ::)
(I know, don't feed the troll.) Where did I write anything about doom, super critical, or gazillions $$$s per hour. Or not having a backup. Or anyone sucking.
Have you read this post? https://blog.pfsense.org/?p=2297
The following part is the most important for our customers:
"Although most Netgate Security Gateway appliances will not experience this problem, we are committed to replacing or repairing products affected by this issue for a period of at least 3 years from date of sale, for the original purchaser."
Before the clock signal component issue was revealed, our units had a 1 year warranty. Now our units have a warranty of at least 3 years because of the clock signal component issue, which may not even occur.
As others have said, not having HA is a different issue that has nothing to do with the clock signal component.
Yes, I read that. And I came to the following conclusion… Netgate doesn't have confidence that the SG-xxxx won't be affected by this issue - because if they did they would have instead done a warranty extension for this issue for 5+ years to lifetime. After all, if you suspect that the clock signal issue won't cause an issue, what's the problem with putting your money where your mouth is?
I'm not saying offer lifetime warranties for all issues - only for clock signal issues. (And if Netgate did that, then I would consider the case solved from my end). Which is why I (and many others) are calling for a proactive replacement - because we all think that all affected products will fail, and we would rather replace them on our schedule instead of after they fail.
And I agree with you - HA is a different issue that should be left out of this conversation.
Have you read this post? https://blog.pfsense.org/?p=2297
The following part is the most important for our customers:
"Although most Netgate Security Gateway appliances will not experience this problem, we are committed to replacing or repairing products affected by this issue for a period of at least 3 years from date of sale, for the original purchaser."
Before the clock signal component issue was revealed, our units had a 1 year warranty. Now our units have a warranty of at least 3 years because of the clock signal component issue, which may not even occur.
As others have said, not having HA is a different issue that has nothing to do with the clock signal component.
ivor, I think the point that some are trying to make is that Netgate has made no comment on the status of the CURRENTLY SHIPPING units. Is the "platform level fix" already applied to new units shipping from netgate/pfsense or not?
The closest netgate has come to answering that is :
This workaround is being cut into production as soon as possible after Chinese New Year. Additionally, some of our products are able to be reworked post-production to resolve the issue.
(quoted from that same link you posted.)
That doesn't say "okay, from this point forward it's all good!" In fact, it's pretty open-ended.
And this is another valid point. How can I now knowingly buy from the store when I might be buying a problem?
I feel compelled to add my two cents here… Everyone who is talking about HA is missing the mark imo. The decision on whether or not to deploy a HA solution is based on a risk and cost analysis, it is not pulled out of a hat.
This increased failure rate does not expose poor designs (the "haha should have designed in HA solution" attitude), it potentially changes the equation a design was built on. It's easy to talk about HA when that means you only had to buy 1 more firewall, but at a certain scale you have to weigh the costs on both sides.
Thank you. I could not have said it better myself.
-
Yep, I wonder why they don't do a "pro-active" replacement. Perhaps the Supermicro style - send you a refurb god knows where it came from, with no sign of fix anywhere… Would cost just the postage. Placebo effect in action. Sigh.
-
The discussion about HA, critical uptime, cold spares, expensive Cisco Support contracts etc does nothing but distract from the real issue.
Netgate continues to sell SG series appliances in their online store, despite knowing they have a CPU with a fault.
I wonder if Netgate/pfSense understand that "word of mouth" is a major contribution to their sales and marketing efforts.
Somebody else has already mentioned it - that those Edge Routers look better by the minute.
They may not do everything pfsense can do, but hey - they're cheap.
-
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf
Conclusion: Everyone should stop selling computers immediately, because they're all selling CPUs with known faults.
::) ::) ::)
-
The discussion about HA, critical uptime, cold spares, expensive Cisco Support contracts etc does nothing but distract from the real issue.
No it doesn't. It exposes the piss poor planning on the part of the people who designed and actively maintain that particular system. My educated guess is that you have more to worry about with a close by lightning strike than you do with this particular "fault" showing up on your equipment tomorrow.
But much like a hurricane warning, you have time to plan and have a spare in hand to mitigate any "downtime" potential.
Lightning strikes (which I deal with on a reoccurring basis) tend to hit more like earthquakes or tornadoes. You better have a plan in place to get yourself or your clients back on your feet in the event that one does happen. Otherwise your clients will come running to someone else like me that does have that particular plan in place.
Quit worrying about whose fault it is right now and plan for the worst so as to keep your people happy. If you have spares on the shelf "you will be able to sleep on a windy night".
Over my office door- "Your lack of preparation is not my emergency!" ;) my 2 pennies.
-
So when purchasing, they did the math and said "Chance of it breaking vs cost of HA - we will risk it". Now that the chance is significantly higher due to this specific Intel issue…
When purchasing, they didn't "do the math," they "made a guess." Doing the math would have required MTBF projections to work with. To my knowledge, ADI has not published any MTBF projections. And anyone that would actually go to the point of analyzing MTBF and cost to the business of failure would not be asking "should we have a spare," they would be asking "how many spares should we have?"
Netgate, while saying that they believe that most customers will not encounter the issue at all, have extended their one year warranty to three years to cover this issue. That's a pretty strong commitment for a small business to make. They certainly didn't have to do that. In my view, this is more than satisfactory.
Perhaps another way to look at this might offer you some comfort: Netgate extended their warranty to three years, effectively tripling their liability. They must have very good reason to believe that the number of units that will actually fail in three years of operation will be quite low. Otherwise, the commitment would constitute corporate suicide. I'm pretty sure that was not their intent.
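To make the "how many spares should we have?" question concrete, here is a minimal sketch of how it could be worked out once an MTBF figure is available. It assumes a constant failure rate (so the number of failures in a resupply window is roughly Poisson distributed), and every number in it is made up for illustration. None of the figures come from ADI, Netgate, or Intel, and the function names are hypothetical.

```python
import math


def poisson_cdf(k, lam):
    """Probability of seeing k or fewer failures when lam are expected."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        total += term
    return total


def spares_needed(units, mtbf_hours, window_hours, confidence=0.95):
    """Smallest spare count that covers the failures expected in one
    resupply window with the given confidence.

    Assumes independent units with a constant failure rate, which is
    an illustrative simplification and not a vendor model.
    """
    expected_failures = units * window_hours / mtbf_hours
    spares = 0
    while poisson_cdf(spares, expected_failures) < confidence:
        spares += 1
    return spares


if __name__ == "__main__":
    # Hypothetical fleet: a 30-day (720 hour) resupply window and an
    # assumed MTBF of 500,000 hours.
    for fleet in (100, 1000):
        print(fleet, "units ->", spares_needed(fleet, 500_000, 720), "spares")
```

The point of the sketch is only that the answer depends on the MTBF input: without a published figure, any spare count is still a guess.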
-
So when purchasing, they did the math and said "Chance of it breaking vs cost of HA - we will risk it". Now that the chance is significantly higher due to this specific Intel issue…
When purchasing, they didn't "do the math," they "made a guess." Doing the math would have required MTBF projections to work with. To my knowledge, ADI has not published any MTBF projections. And anyone that would actually go to the point of analyzing MTBF and cost to the business of failure would not be asking "should we have a spare," they would be asking "how many spares should we have?"
Netgate, while saying that they believe that most customers will not encounter the issue at all, have extended their one year warranty to three years to cover this issue. That's a pretty strong commitment for a small business to make. They certainly didn't have to do that. In my view, this is more than satisfactory.
Perhaps another way to look at this might offer you some comfort: Netgate extended their warranty to three years, effectively tripling their liability. They must have very good reason to believe that the number of units that will actually fail in three years of operation will be quite low. Otherwise, the commitment would constitute corporate suicide. I'm pretty sure that was not their intent.
"They did the math" is an expression. Although, that doesn't matter here.
Let's say I told you the client had 1000 locations, with an SG-2440 or better at each. That's half a million spent at SG-2440 prices for HA.
And let's say we had HA (after spending half a million for that) - we would most likely use the same firewall for the HA unit…
Which means we would have 2000 devices all affected by the same bug.
And chances are we put them both in and turned them on the same day. Which means that they would probably fail at the same time (or very close to it).
They did a cost vs risk analysis. The information they used in that is now wrong - the likelihood of failure is higher now. We can debate what information they used and whether they should have used it all day, but it really is irrelevant. We can debate whether they should have HA or spares on a shelf, but that is also irrelevant to the discussion. In fact, the only things relevant to this discussion are:
- what's going to happen to the equipment in the field with regards to potential LPC failures
- what Netgate is going to do
- and the reason for the second point.
Right now we are told the equipment will work fine, netgate will extend the warranty to 3 years, and that is because the equipment will be fine.
Some of us don't believe that to be true (the equipment will be fine). As such, I have proposed another 2 options, namely a 5 year or better warranty for hardware that does have a clock signal failure (since it shouldn't happen according to netgate, this should be a no brainer), or a proactive replacement.
I'd be interested to hear your reasoning as to why you or anyone else (really I want to hear from Netgate) are opposed to this. I'm not really interested in further discussing "should someone have HA or spares on the shelf" in this topic - that's a valid topic for another thread and has nothing to do with this thread. This thread, as the title says, is about the Intel Atom C2xxx LPC failures.
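For anyone who wants to see the shape of the cost vs risk argument, here is a minimal sketch. The 1000 sites and the roughly $500 per extra unit are implied by the "half a million" figure above; the outage cost and both failure probabilities are purely hypothetical placeholders, not real data from any client, Netgate, or Intel. It also deliberately ignores the case where both HA units fail together.

```python
def expected_outage_loss(sites, p_fail, outage_cost):
    """Expected loss across all sites if each single unit fails with
    probability p_fail over the period and an outage costs outage_cost."""
    return sites * p_fail * outage_cost


def break_even_probability(unit_price, outage_cost):
    """Failure probability at which a second unit pays for itself."""
    return unit_price / outage_cost


def ha_worthwhile(sites, unit_price, p_fail, outage_cost):
    """True when the expected outage loss exceeds the cost of buying
    one extra unit per site. Simplification: HA is treated as removing
    the outage entirely, so correlated failures are not modeled."""
    ha_cost = sites * unit_price
    return expected_outage_loss(sites, p_fail, outage_cost) > ha_cost


if __name__ == "__main__":
    sites, unit_price = 1000, 500  # implied by "half a million" for 1000 sites
    outage_cost = 5000             # hypothetical cost of one site being down
    for p_fail in (0.02, 0.20):    # hypothetical "before" and "after" odds
        print(p_fail, ha_worthwhile(sites, unit_price, p_fail, outage_cost))
    print("break-even:", break_even_probability(unit_price, outage_cost))
```

With these placeholder numbers the decision flips once the failure probability crosses the break-even point, which is the sense in which new information about failure odds can change an equation a design was built on.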
-
So when purchasing, they did the math and said "Chance of it breaking vs cost of HA - we will risk it". Now that the chance is significantly higher due to this specific Intel issue…
Netgate, while saying that they believe that most customers will not encounter the issue at all, have extended their one year warranty to three years to cover this issue. That's a pretty strong commitment for a small business to make. They certainly didn't have to do that. In my view, this is more than satisfactory.
Perhaps another way to look at this might offer you some comfort: Netgate extended their warranty to three years, effectively tripling their liability. They must have very good reason to believe that the number of units that will actually fail in three years of operation will be quite low. Otherwise, the commitment would constitute corporate suicide. I'm pretty sure that was not their intent.
Correct, Denny. (And thanks!)
Exactly your last sentence above. We are a small company. We know how many systems we've sold, and we know the modeled failure rate and timeline to failure of the affected component.
No, I'm not going to let on with any numbers. Some of it is proprietary to Netgate, some of it is covered by NDA.
-
So when purchasing, they did the math and said "Chance of it breaking vs cost of HA - we will risk it". Now that the chance is significantly higher due to this specific Intel issue…
And chances are we put them both in and turned them on the same day. Which means that they would probably fail at the same time (or very close to it).
You're wrong here, and I'm bound to not explain why or how.
-
I wonder if Netgate/pfSense understand that "word of mouth" is a major contribution to their sales and marketing efforts.
Somebody else has already mentioned it - that those Edge Routers look better by the minute.
They may not do everything pfsense can do, but hey - they're cheap.
I believe we do, yes.
-
And that's all you can really say about them.
-
-
…and the thread goes on and on and on... And no one from pfSense/Netgate is commenting on if new units will have the issue resolved before shipping or not.
There's plenty of HA debate, accusations of wanting free hardware, and so on...
Look... extending the warranty to 3 years is a good deal. It's better than some of the other things I've seen out there concerning this issue, and most definitely better than what those who experienced the failure BEFORE Intel acknowledged the issue got.
Yet, there's still this nagging question about if new units being shipped are having the work-around performed before shipment or not...
Maybe netgate doesn't want to go through the expense of fixing existing stock, and they are planning the next revision of hardware instead? C3xxx based?
I don't know. In truth, the answers don't impact me whatsoever. I'm just curious.
-
@jwt:
So when purchasing, they did the math and said "Chance of it breaking vs cost of HA - we will risk it". Now that the chance is significantly higher due to this specific Intel issue…
And chances are we put them both in and turned them on the same day. Which means that they would probably fail at the same time (or very close to it).
You're wrong here, and I'm bound to not explain why or how.
Fair enough.
If you can respond to this at least….
I'm wrong on which part - the chance of it failing being significantly higher, or the chance that 2 systems put into service on the same day will fail near each other?
I understand that you may be restricted from revealing information - I don't hold it against you or Netgate. Part of the general frustration some of us have is that no one will reveal any information. In the past, when information like this has been withheld, it has usually turned out to be as bad as, if not worse than, what people were guessing. And all that does is further feed the general negative feelings.
-
They did a cost vs risk analysis. The information they used in that is now wrong - the likelihood of failure is higher now.
You're just making stuff up. (Or, if there was a risk analysis, someone made up numbers to go into it.) There is no baseline failure rate, and the delta is unknown. So the math is something like "unknown * unknown = even more unknown". You're not basing this on any kind of real analysis, you're reacting to a scare story.
Some of us don't believe that to be true (the equipment will be fine). As such, I have proposed another 2 options, namely a 5 year or better warranty for hardware that does have a clock signal failure (since it shouldn't happen according to netgate, this should be a no brainer), or a proactive replacement.
I'd be interested to hear your reasoning as to why you or anyone else (really I want to hear from Netgate) are opposed to this.
Because then the company is saddled with an ongoing responsibility to deal with incoming claims, whether valid or not. E.g., if someone static zaps their board 4 years from now, netgate is going to have to deal with the claim that the c2xxx bug was the problem. They're going to have to either maintain spares and just hand them to anyone who asks for one, or keep people around who remember how to deal with a long-obsolete board, or they'll have to just give people free new computers whenever they ask for one. For something that's unlikely (with an unknown magnitude) that's excessive for a small company to commit to.
Repeat after me: cisco is only giving out free computers to people who are paying something around the parts value of a netgate firewall every year for maintenance. If I offered to replace your netgear routers proactively if you would agree to enter a 5 year $150/yr service contract, which would also cover future failures, would you take me up on that deal? Heck, if enough people say yes I'd actually consider it–there's a decent profit to be made.
-
Let's assume for a moment that I have redundant power, UPS'es, hot spares, cold spares, followed all the best practices and have mature process in place….
All of the above considered - Netgate still sold me a unit with a component that is likely to fail prematurely.
Even worse, Netgate is continuing to sell these units.
It is not of Netgate's concern what I do with my SG series appliance - and how I use it. As far as Netgate is concerned I could use it as a coaster to put my beer on.
What IS of Netgate's concern is that they have an unhappy customer whose beer coaster has a faulty CPU. And I want it fixed.
The advertisement says "This system is designed for a long deployment lifetime." This is misleading because a key component of the system has - according to its manufacturer (intel) - a higher than projected failure rate, starting at around 18 months of use.
Therefore the SG series products are not fit for purpose and the "lemon law" should apply.
-
Netgate is continuing to sell these units.
Says you. I've seen no evidence of that either way. Remember NDAs cover a lot of ground.
Until I hear from them either way I'm going to hold any judgement. AFAIC I'm not so sure that I won't get 100 years out of them until I see differently. So what if I'm wrong. I'll deal with that road when I get to it. I'm going to continue to put equipment out knowing what I know now and choose what I put out based on the knowledge I have. If that means I put something else out from someone else that is my decision. That product might have an unknown bug which does show up and ruin my day in the summer of 2019 in which the Netgate product would have still worked flawlessly. I don't have a crystal ball and I don't pretend I could read it if I did. I wish others would follow suit but I digress.
No one is forcing you to buy or distribute something you don't trust. But just because you don't trust it doesn't mean anyone else shouldn't.
Personally I blame no one but Intel on this one. I'm of the belief that this could be an attempt to limit the life of a product in order to increase future profits, and that they screwed up the math. But that's just me.
I will hold companies like Arris and Linksys (among others) accountable for the PUMA6 debacle because that screwup was apparent from the get go, and testing should have shown it. Netgate and others could never have known this "fault" (thread subject) was possible from the documentation Intel provided, and I'm willing to take a little responsibility with that. My customers appreciate that and will not fault me for not understanding the Crystal Ball instructions. 8)
:)
-
unhappy customer whose beer coaster has a faulty CPU. And I want it fixed.
GOTO. Now, you'd better use one of these for any of your firewalls.
-
All of the above considered - Netgate still sold me a unit with a component that is likely to fail prematurely.
You have no real basis for your hysteria. Please cite a credible source for "likely to fail prematurely".
-
I noticed ADI has a new 01.00.00.12 BIOS out for the RCC-VE platform. I haven't tested it, and am not recommending you run out & flash it. Just posting this for informational purposes. The release notes can be found in this pdf on Github. But here's a nugget from the last page:
RELEASE ADI_RCCVE-01.00.00.12
Release Date: 03/01/2017
The versions of software components used in this release are:
• SageBIOS: SageBios_Mohon_Peak_292.
• FSP: RANGELEY_FSP_POSTGOLD3.
• microcode: M01406D8125 for B0 stepping.
• Descriptor: ADI unlocked
New Features
• Workaround for Intel C2000 Errata AVR.58
A software workaround for Intel C2000 Errata AVR.50 has been implemented in this release. The workaround disables SERIRQ to prevent indeterminate interrupt behavior for systems that do not have external pull up resistor on SERIRQ PIN.
-
They did a cost vs risk analysis. The information they used in that is now wrong - the likelihood of failure is higher now.
You're just making stuff up. (Or, if there was a risk analysis, someone made up numbers to go into it.) There is no baseline failure rate, and the delta is unknown. So the math is something like "unknown * unknown = even more unknown". You're not basing this on any kind of real analysis, you're reacting to a scare story.
This is veering off topic - you can speculate all you want on how we did our analysis, I won't get into that.
Some of us don't believe that to be true (the equipment will be fine). As such, I have proposed another 2 options, namely a 5 year or better warranty for hardware that does have a clock signal failure (since it shouldn't happen according to netgate, this should be a no brainer), or a proactive replacement.
I'd be interested to hear your reasoning as to why you or anyone else (really I want to hear from Netgate) are opposed to this.
Because then the company is saddled with an ongoing responsibility to deal with incoming claims, whether valid or not. E.g., if someone static zaps their board 4 years from now, netgate is going to have to deal with the claim that the c2xxx bug was the problem. They're going to have to either maintain spares and just hand them to anyone who asks for one, or keep people around who remember how to deal with a long-obsolete board, or they'll have to just give people free new computers whenever they ask for one. For something that's unlikely (with an unknown magnitude) that's excessive for a small company to commit to.
Repeat after me: cisco is only giving out free computers to people who are paying something around the parts value of a netgate firewall every year for maintenance. If I offered to replace your netgear routers proactively if you would agree to enter a 5 year $150/yr service contract, which would also cover future failures, would you take me up on that deal? Heck, if enough people say yes I'd actually consider it–there's a decent profit to be made.
Actually, if you offered me a SG-2440 for $100 a year (so $500 in 5 years) in a HaaS (Hardware as a service), I would consider it strongly. Heck, I already work with a company that does WaaS (Wireless as a Service) that does something similar.
To all those who say "well cisco is so expensive, blah blah blah… smartnet... "... well then maybe Netgate should charge more if more needs to be charged. Some of us would rather pay a premium for a premium product and not worry than not pay that premium and then have to worry. The clients I service made that call when they picked me - I'm certainly not the cheapest one around (not even close).
Netgate is continuing to sell these units.
Says you. I've seen no evidence of that either way.
Umm, you have seen no evidence that Netgate is continuing to sell these units?
May I redirect you to here? https://store.netgate.com/SG-2440.aspx
All of the above considered - Netgate still sold me a unit with a component that is likely to fail prematurely.
You have no real basis for your hysteria. Please cite a credible source for "likely to fail prematurely".
Absolutely. How about this pdf from Intel?
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-c2000-family-spec-update.pdf
And I quote:
AVR54.
System May Experience Inability to Boot or May Cease Operation
Problem:
The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.
Implication: If the LPC clock(s) stop functioning the system will no longer be able to boot
This PDF should establish the failure part. As for prematurely, well, the erratum wouldn't have been published if this was normal spec. I'm not sure where the 18 months number that I have seen flying around comes from… but I'm willing to bet its source is credible.
As for the likely part - the only thing I can point to is all the other people who are pointing to failed systems.
EDIT: More credible sources:
Intel's Robert Holmes Swan, the new CFO and executive vice president, stated:
"But secondly, and a little bit more significant, we were observing a product quality issue in the fourth quarter with slightly higher expected failure rates under certain use and time constraints…"