Intel Atom C2xxx LPC failures
-
We're still investigating internally; we'll put out an official response once we have enough information.
You can also follow some additional conversation on the topic here: https://www.reddit.com/r/PFSENSE/comments/5s8pwi/intel_c_series_processor_recalls_are_pf_official/
-
crap!!
CPU: Intel(R) Atom(TM) CPU C2558 @ 2.40GHz (2400.06-MHz K8-class CPU)
Origin="GenuineIntel" Id=0x406d8 Family=0x6 Model=0x4d Stepping=8
-
FWIW, you probably don't need to go check your stepping. By Intel's data sheet, there has only been one stepping (B0) released to date for the Atom C2000 family.
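For anyone who would rather not read dmesg by eye, here is a minimal sketch that parses the `Origin=` line quoted above. It assumes the FreeBSD-style dmesg format shown in this thread, and it assumes that Family 0x6 / Model 0x4d (the values from the C2558 output above) identify a C2000 part. The function names are made up for illustration.

```python
import re

# Values copied from the C2558 dmesg output quoted in this thread.
# Treating this family/model pair as "Atom C2000" is an assumption of the sketch.
C2000_FAMILY = 0x6
C2000_MODEL = 0x4D


def parse_cpu_id(line):
    """Extract Family, Model and Stepping from a dmesg 'Origin=' line."""
    fields = {}
    pattern = r"(Family|Model|Stepping)=(0x[0-9a-fA-F]+|\d+)"
    for key, value in re.findall(pattern, line):
        # Hex values carry a 0x prefix; plain digits are decimal.
        base = 16 if value.lower().startswith("0x") else 10
        fields[key.lower()] = int(value, base)
    return fields


def looks_like_c2000(line):
    """True if the line reports the family/model pair assumed above."""
    ids = parse_cpu_id(line)
    return ids.get("family") == C2000_FAMILY and ids.get("model") == C2000_MODEL


sample = 'Origin="GenuineIntel" Id=0x406d8 Family=0x6 Model=0x4d Stepping=8'
print(parse_cpu_id(sample))
print(looks_like_c2000(sample))
```

This only identifies the CPU family. It says nothing about whether a given unit will actually fail.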
-
And in case anyone missed it:
https://blog.pfsense.org/?p=2297
A very respectable response from Netgate.
-
@Jim:
Although most Netgate Security Gateway appliances will not experience this problem, we are committed to replacing or repairing products affected by this issue for a period of at least 3 years from date of sale, for the original purchaser.
That is a good post and I only have one related question now.
Does this mean that only devices that actually fail within the three years will be repaired/replaced or any devices with the susceptible CPU will be repaired/replaced within the three years regardless of whether they have actually suffered from the problem or not?
-
@Jim:
Although most Netgate Security Gateway appliances will not experience this problem, we are committed to replacing or repairing products affected by this issue for a period of at least 3 years from date of sale, for the original purchaser.
That is a good post and I only have one related question now.
Does this mean that only devices that actually fail within the three years will be repaired/replaced or any devices with the susceptible CPU will be repaired/replaced within the three years regardless of whether they have actually suffered from the problem or not?
For a lot of enterprise customers the replacement cost of a faulty device is insignificant. Spending 500$ on a pfSense SG Appliance or 5000$ on a Cisco Router isn't really the problem. The problem for them is unpredictability and the impact and risk a component failure may produce. That's despite redundancy and the knowledge that failures will always happen.
Consider the potential downtime (and associated loss of business), change control, travel cost, overtime etc…
Cisco have chosen to pro-actively replace affected components because they do not want to expose their customers to any additional risk. The life expectancy of enterprise kit is approximately 3-5 years because by then the technology will be technically superseded. I've certainly seen kit run 10+ years.
I'm a pfSense customer (for my employer and home) and my purchase decisions were made because of:
- quality Intel NICs in a purpose-built product
- a large community developing/supporting the software/pfSense project
- low power consumption
- a long life expectancy, certainly greater than 3 years
- no moving parts and fans
I am also keen to understand whether pfSense/Netgate will do a pro-active replacement (like Cisco) or whether this will be a "fix-on-fail" program?
"Fix-on-fail" means that pfSense is asking its customers to wear the risk mentioned above.
So, back to liontaur's question:
- will my 4 appliances be replaced within 3 years irrespective of fault?
- will my appliances only be replaced if they fail?
- what happens if my appliances fail after 3 years + 1 day?
My expectation as a consumer is that my appliances will last well beyond 3 years of operation.
-
A very respectable response from Netgate.
Opinions differ. :) 3 years is a pretty short clock for a fundamental design flaw.
^^^ When I wrote this I didn't realize the netgate warranty was only 1 year; I was thinking of supermicro's 3 year warranty and read it as a brush off. My bad.
-
My expectation as a consumer is that my appliances will last well beyond 3 years of operation.
First, you'll need to appreciate that, while I know the modeled failure rates of the component in-question, I can't release same.
Second, your appliance will, in all likelihood, last longer than three years. The majority of at-risk Netgate products will not experience this failure over their entire service lifetime.
Third, Cisco's offer isn't as "pro-active" as you suggest. A careful read of Cisco's Ts & Cs should reveal the truth.
Fourth, we feel we have a strong replacement policy, as it is not limited to the original warranty period or to systems covered by an existing support agreement, as others have announced. Considering the likelihood of the failure occurring, we feel our limited extended warranty is the best course of action, because it results in less overall inconvenience, downtime, and demands on our customers and partners.
-
I also contacted Supermicro (EU) and asked them if it only affects one specific stepping.
Apparently even they don't know if it only affects the B0 stepping, because Intel doesn't want to give out too many details.
Point in-fact, it's not that Supermicro doesn't know, as much as Supermicro can't tell you.
Big difference.
-
odd no news stories on the web about this.
-
The reality is that it doesn't matter what Netgate, Cisco, or SuperMicro post about whether they'll proactively replace all devices that contain the affected CPUs, or replace just the failed devices and try to limit their exposure to some lame 3 year limit. They'll quickly learn that a class action is going to change their position very quickly, and empty their pockets much quicker than if they just replaced all affected units from the get go. Intel is obligated to support all the costs. They've already incorporated a charge for this in their latest earnings. There are provisions within the law that don't allow companies to hide behind time limits on manufacturing defects (latent or otherwise). Just ask Apple.
My advice: if you own an affected device, notify the supplier/manufacturer respectfully in writing that you expect a fixed replacement free of the defect within 90 days. If they don't comply, and/or don't reply, the law will set them straight and then some. Document your communications. A class action will be announced at some point.
-
@jwt:
My expectation as a consumer is that my appliances will last well beyond 3 years of operation.
First, you'll need to appreciate that, while I know the modeled failure rates of the component in-question, I can't release same.
Second, your appliance will, in all likelihood, last longer than three years. The majority of at-risk Netgate products will not experience this failure over their entire service lifetime.
Third, Cisco's offer isn't as "pro-active" as you suggest. A careful read of Cisco's Ts & Cs should reveal the truth.
Fourth, we feel we have a strong replacement policy, as it is not limited to the original warranty period or to systems covered by an existing support agreement, as others have announced. Considering the likelihood of the failure occurring, we feel our limited extended warranty is the best course of action, because it results in less overall inconvenience, downtime, and demands on our customers and partners.
A trade up program would also be a good way to reduce the risk to the manufacturer as well as the customer. It is a win-win situation. Develop a system with the new InHell Pentagram processor family, give it a few fancy upgrades (RAM, interfaces, etc.) then offer the customer a pro-rated discount for their product based on service life. One thing I did find interesting about Netgate's response was their assertion that their products won't be affected by this flaw… How exactly do they know that?
-
@jwt:
My expectation as a consumer is that my appliances will last well beyond 3 years of operation.
First, you'll need to appreciate that, while I know the modeled failure rates of the component in-question, I can't release same.
Second, your appliance will, in all likelihood, last longer than three years. The majority of at-risk Netgate products will not experience this failure over their entire service lifetime.
Third, Cisco's offer isn't as "pro-active" as you suggest. A careful read of Cisco's Ts & Cs should reveal the truth.
Fourth, we feel we have a strong replacement policy, as it is not limited to the original warranty period or to systems covered by an existing support agreement, as others have announced. Considering the likelihood of the failure occurring, we feel our limited extended warranty is the best course of action, because it results in less overall inconvenience, downtime, and demands on our customers and partners.
jwt - thanks for this information.
I appreciate that pfSense offers an extended warranty to affected customers. That said, I purchased 4 pfSense/Netgate appliances and each time I paid 90$US for shipping via FedEx. Now imagine my 4 appliances die within your extended warranty: I essentially have to pay 720$US roundtrip to get all my appliances replaced / fixed.
Of course - my appliances may never fail, but why should I carry that risk?
With a replacement programme I could get all my appliances exchanged in one go.
Of course the bigger problem is being without a firewall/router while it gets replaced. I know you cannot share any NDA information, but it is fair to say that the C2000 processor experiences higher failure rates due to an inherent design flaw.
So, search your feelings (Star Wars quote): if you had a choice between two almost identical appliances to run your business, where one appliance has a known higher risk of failing and the other has a known lower risk of failing,
which one would you choose?
This is not pfSense/Netgate's fault. It is Intel's fault. My expectation, and that of many other customers, is that Netgate will work with Intel to find a workable solution for customers. Pandora's box is now open, and telling me that my appliance "might not fail" is not an excuse.
-
This is not pfSense/Netgate's fault. It is Intel's fault.
Absolutely. And we can hope that Intel will work with its customers, and not just the big ones. I mentioned in another thread that a friend had a SuperMicro board fail in a manner that is entirely consistent with this reported issue, and it took them 3 months (!) to get it back to him. The board went from California to Taiwan and back in that time. That, IMO, is unacceptable. And I'd consider SuperMicro to be, if not a "big" customer, at least one of the larger ones offering Intel's embedded hardware in what is advertised as enterprise class hardware. And I suspect, but don't know, that at least some of the Netgate/pfSense hardware is SuperMicro stuff.
This will not be easily glossed over. Intel needs to step up first, and give its customers a clear and easy path to remediation. And if that path has to trickle down through the OEMs like SuperMicro and whoever else Netgate might contract with, then those companies need to step up too.
-
https://www.crn.com.au/news/cisco-partners-pay-for-massive-product-replacement-450313
-
This is not pfSense/Netgate's fault. It is Intel's fault.
I mentioned in another thread that a friend had a SuperMicro board fail in a manner that is entirely consistent with this reported issue, and it took them 3 months (!) to get it back to him. The board went from California to Taiwan and back in that time. That, IMO, is unacceptable.
Agreed that three months is entirely too long. Supermicro is supplying us with advanced stock so we can turn an RMA for this issue around in a day, rather than the timeline experienced by your friend. That said, your friend (probably) didn't buy from us, and there isn't much I can do if someone isn't a customer.
As I said in the blog post, we're standing behind our products, and will continue to do the right thing. If, via negotiation, we can extend the warranty for this issue to 5 or even 7 years, we will. (As a reminder, I wrote "at least 3 years". This is why.)
To get a replacement from Cisco, you'll have either purchased in the last 90 days, or entered into a very expensive "extended warranty" at the time of purchase. They are NOT replacing your Cisco device outside of these two eventualities. Their announcement is wordsmithed to lead the public to the conclusion that they are.
-
@MiB:
The reality is that it doesn't matter what Netgate, Cisco, or SuperMicro post about whether they'll proactively replace all devices that contain the affected CPUs, or replace just the failed devices and try to limit their exposure to some lame 3 year limit. They'll quickly learn that a class action is going to change their position very quickly, and empty their pockets much quicker than if they just replaced all affected units from the get go. Intel is obligated to support all the costs. They've already incorporated a charge for this in their latest earnings. There are provisions within the law that don't allow companies to hide behind time limits on manufacturing defects (latent or otherwise). Just ask Apple.
My advice: if you own an affected device, notify the supplier/manufacturer respectfully in writing that you expect a fixed replacement free of the defect within 90 days. If they don't comply, and/or don't reply, the law will set them straight and then some. Document your communications. A class action will be announced at some point.
True, but the action may only affect America.
For example, the Nvidia GTX 970 scandal only gave Americans a rebate.
I also think Intel's NDA is quite possibly illegal in some countries, especially in the EU; an NDA to hide design/manufacturing defects breaches various sales laws. This is not some new tech they want to keep under wraps but a released commercial product people have purchased. E.g. in the UK it's illegal to sell something to someone with a known defect and not disclose it. The law that applies is that of the country where the sale takes place, not where the company is HQ'd.
-
NDAs are simply standard practice. If you want access to proprietary information like futures, projected failure rates, direct purchasing requirements, etc., you sign an NDA. Like most companies, Intel uses bidirectional NDAs that cover all disclosures in the relationship. You should expect that any hardware manufacturer who uses Intel chips in their designs will have an NDA with Intel. Many software companies do as well.
In other words, it's not a conspiracy. The NDAs the companies are citing are not new, nor specific to this issue. The "educated guess" in the ServeTheHome article is simply wrong. No one who has been under NDA with Intel would suggest that.
-
Law here and law there is not really what we are talking about. What matters more is whether Intel or Supermicro can give us a small program, let us say installed on a USB stick, with which we can all deactivate these registers, and then we reboot and are able to plug in a second USB pen drive with an I2C chip inside that takes over this part of the work. Then we would all be fine! If that workaround is not possible, we should wait for another path shown by Intel or another vendor (producer) that we can all follow. And if nothing helps, we all know about the problem now and can buy adequate replacements with our own money, because our networks should be safe and secure, and then we are not sitting in a really deep black hole with no way out. Most people will think this is not ideal, but if these units stop booting, the pain and stress factor is perhaps much higher than the knowledge that something must be done before the units fail!
It's really a time bomb for sure, but normally we would only be able to talk about it after a failure shows up, not months or years before the failure might occur.
-
@BlueKobold:
deactivate these registers and then we reboot
From what I've read, that could be as far as your proposal gets you. It might have been the last reboot ever initiated on that system… :)
-
I think gonzopancho mentioned at reddit that the ADI systems boot from i2c flash.
https://www.reddit.com/r/PFSENSE/comments/5s8pwi/intel_c_series_processor_recalls_are_pf_official/
So even if the LPC signal is not required during boot, the signal might still be required by other components?
Then again, depending on how the LPC component of the processor fails, it may affect other parts of the processor, too.
Reading between the lines, it appears that "usage" or heat may accelerate the deterioration of said LPC component.
I'm sure Cisco would drive CPUs very hard as that is how you get "value" from your CPU. They wouldn't overspec the CPU to run it at 20% load.
-
I agree with the above poster who said a class action is inbound pretty quick. This is going to get real ugly before it gets better.
-
@jwt:
Agreed that three months is entirely too long. Supermicro is supplying us with advanced stock so we can turn an RMA for this issue around in a day, rather than the timeline experienced by your friend. That said, your friend (probably) didn't buy from us, and there isn't much I can do if someone isn't a customer.
No, he didn't. Didn't mean to imply that he did at all. Glad to hear they're (they being SuperMicro) stepping up.
-
NDAs are simply standard practice. If you want access to proprietary information like futures, projected failure rates, direct purchasing requirements, etc., you sign an NDA. Like most companies, Intel uses bidirectional NDAs that cover all disclosures in the relationship. You should expect that any hardware manufacturer who uses Intel chips in their designs will have an NDA with Intel. Many software companies do as well.
In other words, it's not a conspiracy. The NDAs the companies are citing are not new, nor specific to this issue. The "educated guess" in the ServeTheHome article is simply wrong. No one who has been under NDA with Intel would suggest that.
Depends on what is in the NDA, but an NDA that doesn't allow companies to disclose flaws is illegal in the UK. E.g. if I sold you a product with a predicted 18 month shelf life and didn't tell you that, then I have breached the sales act. It is a clear breach of "fit for purpose" tests. Projected failure rates don't quite fall into this category.
As the old motto goes, contracts cannot override law.
-
https://www.theregister.co.uk/2017/02/06/cisco_intel_decline_to_link_product_warning_to_faulty_chip/
Updated to add at 18:23 UTC, February 8
Once again, Synology has been in touch, seemingly now able to use the I word, to say: "Intel has recently notified Synology regarding the issue of the processor’s increased degradation chance of a specific component after heavy, prolonged usage."
So what exactly does "heavy, prolonged usage" mean? Can someone at pfSense/Netgate confirm what Synology is saying?
Did Synology just let it slip that CPU usage (=heat) contributes to the issue?
Will my pfSense/Netgate ADI appliance die sooner if I have heavy CPU use?
Will my appliance last longer if I add cooling?
Will my appliance last longer if I disable CPU intensive services?
If not heat - how else could a solid state component degrade over time?
-
Thumbs up to Synology for finally naming Intel and the component.
What can Intel do as punishment? Not much, as Synology could just say they will use a competitor's CPU in future products.
-
If not heat - how else could a solid state component degrade over time ?
It sounds like a simple counter. If that is the case then the counter is incremented when the system is up and when reaching a certain threshold the device stops working.
Are we experiencing Planned obsolescence? ::)
-
The pfSense store still shows SG series products with the C2000 CPU.
Can someone at pfSense confirm whether these products contain the "platform level fix" - or is pfSense knowingly selling products affected by Intel AVR54?
That would be very disappointing!
-
That would be very disappointing!
What on earth is disappointing about that? :O As far as I can tell from what Netgate, Supermicro and others have been able to say about the problem so far, there is ATM no one that actually HAS a working fix for the problem. Intel just said they know about some silicon workaround for boards with B0 and will work on a new stepping for future boards.
Are you suggesting immediately stopping sales of all products that incorporate a C2000 SoC without actually knowing the full scale of the problem and its dependencies? Also, everything related to "heat" as a problem is speculative, as nowhere was heat or even the CPU named as the problem. So as far as I read - and that isn't meant as an insult - you are just panicking and jumping to conclusions.
Are we experiencing Planned obsolescence? ::)
As far as I know from other device manufacturers, that is not the case. Many other vendors' products have already been running for more than the mentioned 18 months with very few to zero failures, and those few not related to this issue. So I am waiting for more specifics to come to light about the circumstances under which the failure occurs in the first place.
Greets
-
Hi JeGr,
you are correct in that I'm panicking and may be jumping to conclusions. However the reason I'm panicking is not the fact that there's an inherent flaw in the C2000 CPUs. I'm very well aware that hardware components will always fail.
I've worked nearly 20 years with Enterprise IT vendors and the reason I'm panicking is because I know how such issues have been dealt with.
Cisco have chosen a proactive approach while others might take a 'fix-on-fail' approach. As far as I understand Netgate/pfSense are taking a fix-on-fail approach. You need to understand that fix-on-fail does not resonate well with the technical types, especially the security conscious technical types. What people want is peace of mind and irrespective of 'failure likelihood', fix-on-fail does not imply peace of mind.
If you ever worked in a medium sized business (and upwards) you will easily find yourself in a boardroom situation where you have to explain why pfSense/Netgate is taking a fix-on-fail approach while Cisco is being proactive.
The fact that NDAs are in place and information is sparse will not help the IT admin defend Netgate/pfSense's approach in the boardroom.
In this particular case I am 100% certain that the issue is 100% understood by Intel. The reluctant release of information is very considered.
Every word in Intel's statement (AVR54) is very very carefully chosen to protect Intel. This has a trickle down effect where component makers that use the C2000 will also very carefully choose every word in their respective statements.
In order to stop customers from panicking, vendors will deliberately steer away from too much technical information, because the real issue that needs addressing is customer confidence. Will a fix-on-fail approach retain enough customer confidence without detrimentally affecting future sales?
Company owners and shareholders that sell systems based on C2000 want to sleep at night, too. A proactive approach would most certainly imply profit erosion in the short term (balanced against long term customer loyalty).
So whenever Intel and vendors choose to share more 'technical specifics', it has very little to do with root cause analysis, because we're way past that. At this stage customers are being Risk Managed. And when reworked boards and CPUs finally become available, you may be Case Managed.
-
I view the explanation as being relatively straightforward. On one hand you have an expensive unit with an expensive service agreement, and on the other hand you have an inexpensive unit with a simple one year warranty (which the vendor has committed to extend to 3 years for this issue). Business people understand risk/reward and will appreciate the difference. If you are really concerned, create a spreadsheet with a 3 or 5 year cost analysis. It should show quite positive, even if you allow for the purchase of a number of cold standby units.
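The spreadsheet suggested above could be roughed out in a few lines. This is only a sketch: the 500$ and 5000$ unit prices are the ballpark figures mentioned earlier in this thread, while the annual contract fee and the number of cold spares are hypothetical placeholders to be replaced with real quotes.

```python
def total_cost(unit_price, units, annual_contract, years, cold_spares=0):
    """Total spend over `years`: hardware (including cold spares) plus contract fees."""
    hardware = unit_price * (units + cold_spares)
    recurring = annual_contract * units * years
    return hardware + recurring


YEARS = 3

# Expensive unit with a recurring service agreement.
# The 1000/year contract fee is a hypothetical placeholder.
expensive = total_cost(unit_price=5000, units=1, annual_contract=1000, years=YEARS)

# Inexpensive unit, no contract, two cold standby units on the shelf.
# The spare count is a hypothetical placeholder.
inexpensive = total_cost(unit_price=500, units=1, annual_contract=0,
                         years=YEARS, cold_spares=2)

print(f"{YEARS}-year cost, expensive unit + contract: {expensive}")
print(f"{YEARS}-year cost, inexpensive unit + 2 spares: {inexpensive}")
```

The sketch ignores downtime, shipping and labour, which the posts above argue are the larger costs for some businesses, so it shows only the hardware side of the comparison.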
-
Hi dennypage,
we're now comparing an expensive Cisco system with an expensive SMARTnet contract against a cost effective ADI/Netgate system.
In your words:
Expensive + Support Contract (=peace of mind) - vs - cheap with flawed CPU component & warranty (=risky)
You're reasoning that I should accept the risk because after all the ADI system is cost effective and I could afford having a cold spare sitting around.
That would be fine - if - the customer would have known about the risk - before - making that purchase decision.
Of course nobody knew about the issue until recently…. and now I'm sitting on a time bomb.
Now that we do know about the issue - it would only be fair if potential customers of the pfSense and Netgate store were informed about the processor flaw, so they can make an informed decision about the risk they're willing to take.
So going back to my original post - I would be "disappointed" if Netgate/pfSense knowingly sold me a flawed unit.
Why not put sales on hold, and wait for the reworked units?
-
You're reasoning that I should accept the risk because after all the ADI system is cost effective and I could afford having a cold spare sitting around. That would be fine - if - the customer would have known about the risk - before - making that purchase decision.
You have always known the risk. It was inherent in the original decision. Sans purchasing cold spares, the original equation was high upfront plus recurring cost giving you 4-72 hours of potential outage (service contract depending) in the event of failure, vs. low upfront with no recurring cost giving you 4-7 days of potential outage in the event of failure during the first year (with no guarantee after). This is a simple risk tradeoff that any business person should be able to get their head around.
If the unit is critical to the business, you stock spares. This is a very common risk mitigation strategy which any business should be able to understand and accept.
-
High Availability is a thing.
-
High Availability is a thing.
If I purchase a 4860 HA now from the pfSense store - will the CPU be affected by Intel AVR54? I think yes.
Running C2000 CPUs side by side in an HA config, time works against me - and instead of lowering my risk - I essentially double the exposure to AVR54.
I could follow dennypage's advice and purchase a cold spare, too.
All this logic makes a lot of sense when looking at it from a vendor's perspective.
I'm a customer - and all I want is a system not affected by AVR54. Why is that so hard to understand ?
I do understand that this issue is not caused by ADI/Netgate, but Intel. We all know that.
For all I know - the Netgate store may not even be around in 3 years' time when my "clock runs out". Is that something that I should have considered before buying, too?
So let's not worry about stuff that's beyond our control.
What IS in Netgate's direct control - are the products currently sold on the Netgate/pfSense store. Unless somebody tells me otherwise - Netgate/pfSense are knowingly selling products affected by Intel AVR54.
Why?
So far I have heard:
"Trust us we'll do the right thing…when it fails"
"What do you expect ? It's cheap"
"Buy an HA system if you need uptime"
None of the above solves my problem. What do you think the resale value of an SG Series product is right now?
-
I'm a customer - and all I want is a system not affected by AVR54. Why is that so hard to understand ?
We understand it fine, we just disagree that it's reasonable. I understand that there have been scary stories that every avoton/rangeley will magically stop working after 18 or 36 months, but they're BS.
YOU DID NOT PAY FOR A DEVICE WITH A 0% FAILURE RATE. INCREASING THE NON-ZERO FAILURE RATE TO A SOMEWHAT HIGHER NON-ZERO FAILURE RATE DOES NOT CONSTITUTE A MAJOR CHANGE IN THE FUNCTIONALITY OF THE DEVICE.
If you need a high degree of uptime, no single device is the right solution; you need to deploy something in an HA configuration. Arguing that deploying 2 Rangeleys will increase your risk is nuts; the odds that both will fail simultaneously, even with the errata, are still tremendously low. If that's not good enough, deploy 3. If that's not good enough, why on earth are you even using something like this instead of giving a truck full of money to Cisco or Juniper?
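The two sides of the HA argument can be put into numbers with a small sketch. It assumes each unit fails independently with the same probability `p` inside whatever window it takes to swap a dead unit. The value 0.05 is purely a placeholder, since the real failure rates are under NDA per the posts above.

```python
def p_all_fail(p, n):
    """Probability that all n units fail in the same window (an outage)."""
    return p ** n


def p_any_fail(p, n):
    """Probability that at least one of n units fails (an RMA to deal with)."""
    return 1 - (1 - p) ** n


p = 0.05  # hypothetical per-unit failure probability, NOT a published figure

for n in (1, 2, 3):
    print(f"{n} unit(s): outage {p_all_fail(p, n):.6f}, "
          f"at least one failure {p_any_fail(p, n):.6f}")
```

Under these assumptions, a second unit roughly doubles the chance of having to handle some failure, which is the "double the exposure" point, while the chance of losing both at once drops sharply, which is the HA point. If failures are correlated, for example because both units are the same age, the independence assumption is optimistic.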
-
I'm a customer - and all I want is a system not affected by AVR54. Why is that so hard to understand ?
We understand it fine, we just disagree that it's reasonable.
I've been kind of staying out of this aspect of the debate simply because I understand both sides of the argument. This is a case where both sides are right. However, the above quoted material looks really bad.
It's not reasonable to want a system not impacted by a known bug?
Taking that to an extreme (just to make the point):
So… should we all abandon pfsense and go use little linksys toy routers instead? They have many more "known" issues (both in hardware and software limitations) but I guess it'd be unreasonable to want a router that functions properly? Better yet, let's all run out and buy samsung galaxy note 7's and use THOSE as routers. Sure, they would be extremely slow and might burst into flames at any time, but it'd be unreasonable to expect anything else, wouldn't it?
-
I'm a customer - and all I want is a system not affected by AVR54. Why is that so hard to understand ?
We understand it fine, we just disagree that it's reasonable.
I've been kind of staying out of this aspect of the debate simply because I understand both sides of the argument. This is a case where both sides are right. However, the above quoted material looks really bad.
It's not reasonable to want a system not impacted by a known bug?
Correct. There is no such thing as a modern CPU with no known bugs. This issue is referred to as AVR54 because it is the 54th issue listed in a 37 page document concerning bugs in the C2000 processor family. Why did this sudden demand for perfection not apply to the first 53 issues? You can dig up similar errata documents for AMD CPUs or other families of Intel CPUs. The only reason you care about this one is because of sensationalized and inaccurate reporting, and expecting the entire industry to do something different on that basis is unreasonable.
Taking that to an extreme (just to make the point):
So… should we all abandon pfsense and go use little linksys toy routers instead? They have many more "known" issues (both in hardware and software limitations) but I guess it'd be unreasonable to want a router that functions properly? Better yet, let's all run out and buy samsung galaxy note 7's and use THOSE as routers. Sure, they would be extremely slow and might burst into flames at any time, but it'd be unreasonable to expect anything else, wouldn't it?
You're not making a point, you're sounding ridiculous. You seriously want to assert that an issue that slightly increases the failure rate of a family of microprocessors is the same thing as a failure mode that causes uncontrolled combustion and actually has the potential for injury or death? Dial down the rhetoric and focus on reality, please.
First question: what was the expected service life of a motherboard with an embedded C2000? (Expected answer: you have no idea.) Second question: what was the expected failure rate for that motherboard? (Expected answer: you have no idea.) Third question: how much higher than anticipated is the failure rate for that motherboard, given AVR54? (Still no idea, right?) Look, your motherboard could fail prematurely due to any number of factors: bad capacitor, cold solder joint, contaminant in the silicon, whatever: motherboards fail from this sort of thing as a matter of course. But now that AVR54 got some press, you're asserting that since your motherboard could fail due to this specific issue, ignoring the fact that it could fail due to any number of other issues, you deserve a new motherboard. That's unreasonable.
IF it were really the case that every C2000 would stop working after 3 years, then there'd be an argument that there should be a recall–but that's not the case even if some scare stories suggested that it was.
As I've said before, if you have a contract which specifies a failure rate, and this flaw means that your supplier can't meet the contractual requirement, then you've got a legitimate beef. (E.g., if you're Cisco and you contracted with Intel as a supplier...) But if you're an end user with no idea what the failure rate was before this erratum was published, you can't suddenly pretend that the failure rate is something you were always concerned about.
-
Correct. There is no such thing as a modern CPU with no known bugs. This issue is referred to as AVR54 because it is the 54th issue listed in a 37 page document concerning bugs in the C2000 processor family. Why did this sudden demand for perfection not apply to the first 53 issues?
How many of those other issues result in a dead system? That's the difference.
You're not making a point, you're sounding ridiculous. You seriously want to assert that an issue that slightly increases the failure rate of a family of microprocessors is the same thing as a failure mode that causes uncontrolled combustion and actually has the potential for injury or death? Dial down the rhetoric and focus on reality, please.
To some people, if their network goes down, it can result in injury or death. (Granted, those people really should have HA systems.) My idea was to take it to an extreme to try and show you how other people might see this… I obviously failed either because I didn't express myself well enough, or you just don't want to see both sides of the argument.
First question: what was the expected service life of a motherboard with an embedded C2000? (Expected answer: you have no idea.)
Surprise: 7 years. From Supermicro's website for their C2758 motherboard (point #10 in their key features list): http://www.supermicro.com/products/motherboard/atom/x10/a1sri-2758f.cfm
Being that you made a wrong assumption on your first question, the rest aren't really relevant, are they? You are also ignoring the difference between an unexpected failure and a known defect that can cause a failure.
IF it were really the case that every C2000 would stop working after 3 years, then there'd be an argument that there should be a recall–but that's not the case even if some scare stories suggested that it was.
Actually, we don't know the failure rate (as you pointed out). Intel has hidden that information. If it were an extremely low number, I wouldn't expect them to hide it, though… Instead, they'd advertise that "only 1 in xxx million will ever have this issue, so don't worry about it!" That leads me to believe that it's higher than most people would be comfortable with.
(Even Netgate stated that the "majority" of people won't have the issue. That could be interpreted as "49% of the people WILL have this issue.")
-
You're not making a point, you're sounding ridiculous. You seriously want to assert that an issue that slightly increases the failure rate of a family of microprocessors is the same thing as a failure mode that causes uncontrolled combustion and actually has the potential for injury or death? Dial down the rhetoric and focus on reality, please.
To some people, if their network goes down, it can result in injury or death. (Granted, those people really should have HA systems.) My idea was to take it to an extreme to try and show you how other people might see this… I obviously failed either because I didn't express myself well enough, or you just don't want to see both sides of the argument.
There is no other side of the argument where a single C2000 failing causes injury or death unless there's gross incompetence involved BECAUSE THAT WAS ALREADY A POSSIBILITY THAT SHOULD HAVE BEEN ACCOUNTED FOR. Again, we're not going from a 0% failure rate to a non-0% failure rate, we're going from non-0% to non-0%.
First question: what was the expected service life of a motherboard with an embedded C2000? (Expected answer: you have no idea.)
Surprise: 7 years. From Supermicro's website for their C2758 motherboard (point #10 in their key features list): http://www.supermicro.com/products/motherboard/atom/x10/a1sri-2758f.cfm
I don't think that means what you think it means. If you build a solution based on the a1sri-2758f you can expect to be able to get the part for a seven year period. That's a good selling point for places that want to plan the entire lifecycle of a deployment, but irrelevant to this discussion. I don't see any sign that the warranty on the C2000 motherboards is anything other than their standard 3yr/1yr (I certainly didn't get anything else in my box). And n.b. that the design service life is not the same as the warranty period. (You can simply make a good bet that the design service life is longer than the warranty period, but you'd need more information to figure out what it is. In general that's something that's only useful to someone providing contractual support, because it doesn't matter to the consumer how much longer than the warranty the design life is–once the warranty is up so is your guarantee.)
Being that you made a wrong assumption on your first question, the rest aren't really relevant, are they? You are also ignoring the difference between an unexpected failure and a known defect that can cause a failure.
No, the other points are extremely relevant, you just can't/don't want to address them.
IF it were really the case that every C2000 would stop working after 3 years, then there'd be an argument that there should be a recall–but that's not the case even if some scare stories suggested that it was.
Actually, we don't know the failure rate (as you pointed out). Intel has hidden that information. If it were an extremely low number, I wouldn't expect them to hide it, though… Instead, they'd advertise that "only 1 in xxx million will ever have this issue, so don't worry about it!" That leads me to believe that it's higher than most people would be comfortable with.
It leads me to think it requires detailed analysis of the actual deployed system and that a blanket answer isn't possible. It also seems consistent with the general level of information published about any CPU (if you want an actual failure rate, then get a contract with them; you still probably won't get their internal information, but you'll get a number you can plan for or get some kind of compensation if they miss the number–which will almost certainly be higher than what they think they'll actually achieve.) Hysterically insisting "IT MUST BE BAD" isn't useful, nor is it congruent with what we know from places which have actually deployed them at scale (they simply aren't failing in large numbers).
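To put the "non-0% to non-0%" point in numbers, here's a rough sketch of how a baseline failure probability and one extra failure mode combine. Every rate in it is a made-up placeholder (nobody in this thread has Intel's actual figures), the function name is my own, and it assumes the two causes are independent, which may not hold for a real board.

```python
def combined_failure_probability(p_baseline: float, p_extra: float) -> float:
    """Chance a board fails from either cause over some fixed period.

    Assumes the two failure causes are independent (an assumption,
    not something established by the errata document).
    """
    for p in (p_baseline, p_extra):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    # The board survives only if it survives both causes.
    return 1.0 - (1.0 - p_baseline) * (1.0 - p_extra)


# Hypothetical placeholder values, NOT real failure data.
P_BASELINE = 0.05  # capacitors, solder joints, and so on
P_EXTRA = 0.02     # one additional failure mode

if __name__ == "__main__":
    total = combined_failure_probability(P_BASELINE, P_EXTRA)
    print(f"baseline only:           {P_BASELINE:.1%}")
    print(f"with extra failure mode: {total:.1%}")
```

The only thing this shows is the shape of the argument: the result depends entirely on the size of the extra term, and that is exactly the number nobody outside Intel has.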