Intel Atom C2xxx LPC failures



  • Get ready folks… this is going to be a fun ride soon. :) Cisco has started having routers and switches fail due to an LPC clock failure. Coincidentally, Intel has updated the errata of their Atom C2000-series chips, indicating that an LPC clock failure can prevent the system from booting. Cisco didn't name-and-shame the company producing the failed part in their gear, but it's pretty coincidental that Intel happens to update their errata at the same time Cisco announces issues with their hardware indicating the same failure.

    Cisco claims that the failure can start after as little as 18 months of use.

    Intel claims to have a platform-level workaround that can be used. Of course, there are no details in the errata about the workaround.

    This could make for some fun times soon, given all of the Rangeley chips being used in systems running pfSense.

    Article on The Register; the update at the bottom indicates the Atom may be at fault in Cisco's gear.



  • http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-c2000-family-spec-update.pdf
    Not fun at all, and not only for pfSense users. But what does "A platform level change has been identified and may be implemented as a workaround for this erratum." mean?



  • Oh Dear.
    I've got an ASA5506, an SG-2220, 2 x SG-2440 and an SG-4860.

    Can't wait to hear about that "workaround".

    Would love to learn how pfSense/ADI are planning to handle this issue?

    Would prefer a software/firmware patch, and hope there won't be a need for Hardware replacement.

    I think Cisco are doing HW replacements.



  • if it turns out to be true it's gonna really chap a lot of people who were convinced they really needed to buy "server grade" components. :D





  • More Info here:

    https://www.reddit.com/r/homelab/comments/5sb89p/psa_so_it_seems_that_intels_c2xxx_series_of_cpus/

    Anybody from pfSense/Netgate/Adi care to comment ?



  • Has anyone contacted Supermicro about this?



  • All the information available online leads me to believe that the issue is not limited to Cisco: Intel is very clear about AVR54 and the affected processors. It's safe to assume that the ADI/Netgate/pfSense boxes are affected, too (as well as some Synology NAS units).

    Unless there's some magic "firmware bullet", I'm convinced that most vendors will just ride this issue out, or at best manage it on a case-by-case basis. If Cisco could've pushed out a firmware fix, they would've done it in a heartbeat.

    I'd be more than happy to receive replacement boards from Netgate/pfSense (for my 2240, 2440 and 4860's), but my hopes are indeed very slim.
    The pfSense store shows a one-year manufacturer's warranty, and my boxes are already 1+ year old. That absolves pfSense of the responsibility to replace my SG Series appliances, which by the way have not even failed yet.

    Do I sleep comfortably knowing that my network sits on a time bomb? No.
    Do I sleep comfortably knowing that my employer's network sits on a time bomb? No.

    Perhaps ADI/Netgate/pfSense do not have the same level of clout with Intel as Cisco does, but I'd surely hope ADI/Netgate/pfSense will work out some sort of arrangement with Intel to reduce the impact on existing customers.

    I would suggest that affected Rangeley/Avoton customers should receive a heavy discount when buying from the pfSense Store again.



  • This issue has nothing to do with warranty since it's a design flaw… a replacement or upgrade program is needed.
    Personally I have my core network based on C2000, pfSense and Synology units.
    There is an open thread on Synology. I also wrote to Supermicro in order to get a solution before failure occurs.



  • It's eerily quiet here ….  :-\



  • iXsystems FreeNAS Mini at risk too.



  • @Wolf666:

    I also wrote to Supermicro in order to get a solution before failure occurs.

    Please let us know what they say… there are plenty here that have SuperMicro boards with the Atom C2xxx processors on them.



  • I also contacted Supermicro yesterday; it seems that at least the European support did not know about the issue yet. I sent them an explanation and some links. Now they are checking with the PM of the motherboard.



  • Hmm, the Intel doc says stepping B0 is affected. My Supermicro board says:

    CPU: Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz (2400.07-MHz K8-class CPU)
      Origin="GenuineIntel"  Id=0x406d8  Family=0x6  Model=0x4d  Stepping=8
    
    

    From what I have gathered about steppings, the version normally consists of a letter followed by a number. So what could "8" mean?
    I'd love to think that Cisco got the whole B0 stepping, but then again my (and all the googled dmesg) results are missing the letter…

    Any experts on this?



  • CPU: Intel(R) Atom(TM) CPU  C2558  @ 2.40GHz (2400.06-MHz K8-class CPU)
      Origin="GenuineIntel"  Id=0x406d8  Family=0x6  Model=0x4d  Stepping=8
    

    Well, yeah, that is not very helpful at all.



  • I also contacted Supermicro (EU) and asked them if it only affects one specific stepping.

    Apparently even they don't know if it only affects the B0 stepping, because Intel doesn't want to give out too many details.

    The hardware update Supermicro has in place (or will have in place) is for all A1 motherboards though.



  • @Creep89:

    CPU: Intel(R) Atom(TM) CPU  C2558  @ 2.40GHz (2400.06-MHz K8-class CPU)
      Origin="GenuineIntel"  Id=0x406d8  Family=0x6  Model=0x4d  Stepping=8
    

    Well, yeah, that is not very helpful at all.

    Have a look here:

    https://www-ssl.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-c2000-family-spec-update.pdf

    Page 15 Table 9

    CPUID: 406D8
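
    For anyone puzzled by the bare "Stepping=8" in dmesg, here is a minimal sketch of how the Id value splits into its fields. It assumes the standard Intel CPUID signature layout (stepping in bits 3:0, model in bits 7:4, family in bits 11:8, extended model in bits 19:16, extended family in bits 27:20). The function names are made up for illustration. The letter-number name such as B0 is not encoded in the signature; it is something you look up against the signature in Intel's spec update table.

```python
import re


def decode_cpuid_signature(sig: int) -> dict:
    """Split a CPUID signature (the Id= value in dmesg) into its fields."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only prepended for base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


def signature_from_dmesg(line: str) -> int:
    """Pull the hexadecimal Id= value out of a FreeBSD dmesg CPU line."""
    match = re.search(r"\bId=(0x[0-9a-fA-F]+)", line)
    if match is None:
        raise ValueError("no Id= field found")
    return int(match.group(1), 16)


# The line posted in this thread.
line = 'Origin="GenuineIntel"  Id=0x406d8  Family=0x6  Model=0x4d  Stepping=8'
fields = decode_cpuid_signature(signature_from_dmesg(line))
print({name: hex(value) for name, value in fields.items()})
```

    Run against the dmesg line quoted above, this prints family 0x6, model 0x4d and stepping 0x8, i.e. the same numbers FreeBSD already reports. The "8" is just the low four bits of 0x406d8.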



  • Oh, well. Thanks!  :-X

    Guess I will buy/build a new pfSense appliance and then RMA my board. Not cool at all.



  • Supermicro support told me to RMA my board to get a "reworked" one. They do not handle RMAs directly with the customer, though. I bought it from Amazon.de. So much for my reworked version…


  • Rebel Alliance Developer Netgate

    We're still investigating internally, we'll put out an official response once we have enough information.

    You can also follow some additional conversation on the topic here: https://www.reddit.com/r/PFSENSE/comments/5s8pwi/intel_c_series_processor_recalls_are_pf_official/



  • crap!!

    CPU: Intel(R) Atom(TM) CPU  C2558  @ 2.40GHz (2400.06-MHz K8-class CPU)
      Origin="GenuineIntel"  Id=0x406d8  Family=0x6  Model=0x4d  Stepping=8



  • FWIW, you probably don't need to go check your stepping. By Intel's data sheet, there has only been one stepping (B0) released to date for the Atom C2000 family.



  • And in case anyone missed it:

    https://blog.pfsense.org/?p=2297

    A very respectable response from Netgate.



  • @Jim:

    Although most Netgate Security Gateway appliances will not experience this problem, we are committed to replacing or repairing products affected by this issue for a period of at least 3 years from date of sale, for the original purchaser.

    That is a good post and I only have one related question now.

    Does this mean that only devices that actually fail within the three years will be repaired/replaced or any devices with the susceptible CPU will be repaired/replaced within the three years regardless of whether they have actually suffered from the problem or not?



  • @liontaur:

    @Jim:

    Although most Netgate Security Gateway appliances will not experience this problem, we are committed to replacing or repairing products affected by this issue for a period of at least 3 years from date of sale, for the original purchaser.

    That is a good post and I only have one related question now.

    Does this mean that only devices that actually fail within the three years will be repaired/replaced or any devices with the susceptible CPU will be repaired/replaced within the three years regardless of whether they have actually suffered from the problem or not?

    For a lot of enterprise customers the replacement cost of a faulty device is insignificant. Spending 500$ on a pfSense SG appliance or 5000$ on a Cisco router isn't really the problem. The problem for them is unpredictability and the impact and risk a component failure may produce. That's despite redundancy and the knowledge that failures will always happen.
    Consider the potential downtime (and associated loss of business), change control, travel cost, overtime etc…
    Cisco have chosen to pro-actively replace affected components because they do not want to expose their customers to any additional risk. The life expectancy of enterprise kit is approximately 3-5 years because by then the technology will be technically superseded. I've certainly seen kit run 10+ years.

    I'm a pfSense customer (for my employer and home) and my purchase decisions were made because of:

    • quality Intel NICs in a purpose-built product
    • a large community developing/supporting the software/pfSense project
    • low power consumption
    • a long life expectancy, certainly greater than 3 years
    • no moving parts and fans

    I am also keen to understand whether pfSense/Netgate will do a pro-active replacement (like Cisco) or whether this will be a "fix-on-fail" program?

    "Fix-on-fail" means that pfSense is asking its customers to wear the risk mentioned above.

    So back to liontaur's question:

    • will my 4 appliances be replaced within 3 years irrespective of fault?
    • will my appliances only be replaced if they fail?
    • what happens if my appliances fail after 3 years + 1 day?

    My expectation as a consumer is that my appliances will last well beyond 3 years of operation.



  • @dennypage:

    A very respectable response from Netgate.

    Opinions differ. :) 3 years is a pretty short clock for a fundamental design flaw.

    ^^^ When I wrote this I didn't realize the netgate warranty was only 1 year; I was thinking of supermicro's 3 year warranty and read it as a brush off. My bad.


  • Netgate

    @gcu_greyarea:

    My expectation as a consumer is that my appliances will last well beyond 3 years of operation.

    First, you'll need to appreciate that, while I know the modeled failure rates of the component in-question, I can't release same.

    Second, your appliance will, in all likelihood, last  longer than three years.  The majority of at-risk Netgate products will not experience this failure over their entire service lifetime.

    Third, Cisco's offer isn't as "pro-active" as you suggest.  A careful read of Cisco's Ts & Cs should reveal the truth.

    Fourth, we feel we have a strong replacement policy, as it is not limited to the original warranty period or to systems covered by an existing support agreement, as others have announced.  Considering the likelihood of the failure occurring, we feel our limited extended warranty is the best course of action, because it results in less overall inconvenience, downtime, and demands on our customers and partners.


  • Netgate

    @nifoc:

    I also contacted Supermicro (EU) and asked them if it only affects one specific stepping.

    Apparently even they don't know if it only affects the B0 stepping, because Intel doesn't want to give out too many details.

    Point in-fact, it's not that Supermicro doesn't know, as much as Supermicro can't tell you.

    Big difference.



  • Odd, there are no news stories on the web about this.



  • The reality is that no matter what Netgate, Cisco, SuperMicro post about whether they'll proactively replace all devices that contain the affected CPUs, or just failed devices and try to limit their exposure to some lame 3 year limit, they'll quickly learn that a class action is going to change their position very quickly, and empty their pockets much quicker than if they just replaced all affected units from the get go. Intel is obligated to support all the costs. They've already incorporated a charge for this in their latest earnings. There are provisions within the law that don't allow companies to hide behind time limits on manufacturing defects (latent or otherwise). Just ask Apple.

    My advice, if you own an affected device, notify the supplier/manufacturer respectfully in writing that you expect a fixed replacement free of the defect within 90 days. If they don't comply, and/or don't reply, the law will set them straight and then some. Document your communications. A class action will be announced at some point.



  • @jwt:

    @gcu_greyarea:

    My expectation as a consumer is that my appliances will last well beyond 3 years of operation.

    First, you'll need to appreciate that, while I know the modeled failure rates of the component in-question, I can't release same.

    Second, your appliance will, in all likelihood, last  longer than three years.  The majority of at-risk Netgate products will not experience this failure over their entire service lifetime.

    Third, Cisco's offer isn't as "pro-active" as you suggest.  A careful read of Cisco's Ts & Cs should reveal the truth.

    Fourth, we feel we have a strong replacement policy, as it is not limited to the original warranty period or to systems covered by an existing support agreement, as others have announced.  Considering the likelihood of the failure occurring, we feel our limited extended warranty is the best course of action, because it results in less overall inconvenience, downtime, and demands on our customers and partners.

    A trade up program would also be a good way to reduce the risk to the manufacturer as well as the customer. It is a win-win situation. Develop a system with the new InHell Pentagram processor family, give it a few fancy upgrades (RAM, interfaces, etc.), then offer the customer a pro-rated discount for their product based on service life. One thing I did find interesting about Netgate's response was their assertion that their products won't be affected by this flaw… How exactly do they know that?



  • @jwt:

    @gcu_greyarea:

    My expectation as a consumer is that my appliances will last well beyond 3 years of operation.

    First, you'll need to appreciate that, while I know the modeled failure rates of the component in-question, I can't release same.

    Second, your appliance will, in all likelihood, last  longer than three years.  The majority of at-risk Netgate products will not experience this failure over their entire service lifetime.

    Third, Cisco's offer isn't as "pro-active" as you suggest.  A careful read of Cisco's Ts & Cs should reveal the truth.

    Fourth, we feel we have a strong replacement policy, as it is not limited to the original warranty period or to systems covered by an existing support agreement, as others have announced.  Considering the likelihood of the failure occurring, we feel our limited extended warranty is the best course of action, because it results in less overall inconvenience, downtime, and demands on our customers and partners.

    jwt - thanks for this information.

    I appreciate that pfSense offers an extended warranty to affected customers. That said, I purchased 4 pfSense/Netgate appliances and each time I paid 90$US for shipping via FedEx. Now imagine my 4 appliances die within your extended warranty: I essentially have to pay 720$US roundtrip to get all my appliances replaced / fixed.
    Of course, my appliances may never fail, but why should I carry that risk?
    With a replacement programme I could get all my appliances exchanged in one go.
    Of course the bigger problem is being without a firewall/router while it gets replaced. I know you cannot share any NDA information, but it is fair to say that the C2000 processor experiences higher failure rates due to an inherent design flaw.

    So, search your feelings (Star Wars quote): say you had a choice between two almost identical appliances to run your business. One appliance has a known higher risk of failing, the other has a known lower risk of failing.

    Which one would you choose?

    This is not pfSense/Netgate's fault. It is Intel's fault. My expectation, and that of many other customers, is that Netgate will work with Intel to find a workable solution for customers. Pandora's box is now open, and telling me that my appliance "might not fail" is not an excuse.



  • @gcu_greyarea:

    This is not pfSense/Netgate's fault. It is Intel's fault.

    Absolutely.  And we can hope that Intel will work with its customers, and not just the big ones.  I mentioned in another thread that a friend had a SuperMicro board fail in a manner that is entirely consistent with this reported issue, and it took them 3 months (!) to get it back to him.  The board went from California to Taiwan and back in that time.  That, IMO, is unacceptable.  And I'd consider SuperMicro to be, if not a "big" customer, at least one of the larger ones offering Intel's embedded hardware in what is advertised as enterprise class hardware.  And I suspect, but don't know, that at least some of the Netgate/pfSense hardware is SuperMicro stuff.

    This will not be easily glossed over.  Intel needs to step up first, and give its customers a clear and easy path to remediation.  And if that path has to trickle down through the OEMs like SuperMicro and whoever else Netgate might contract with, then those companies need to step up too.




  • Netgate

    @whosmatt:

    @gcu_greyarea:

    This is not pfSense/Netgate's fault. It is Intel's fault.

    I mentioned in another thread that a friend had a SuperMicro board fail in a manner that is entirely consistent with this reported issue, and it took them 3 months (!) to get it back to him.  The board went from California to Taiwan and back in that time.  That, IMO, is unacceptable.

    Agreed that three months is entirely too long. Supermicro is supplying us with advanced stock so we can turn an RMA for this issue around in a day, rather than the timeline experienced by your friend.  That said, your friend (probably) didn't buy from us, and there isn't much I can do if someone isn't a customer.

    As I said in the blog post, we're standing behind our products, and will continue to do the right thing. If, via negotiation, we can extend the warranty for this issue to 5 or even 7 years, we will. (As a reminder, I wrote "at least 3 years".  This is why.)

    To get a replacement from Cisco, you'll have either purchased in the last 90 days, or entered into a very expensive "extended warranty" at the time of purchase. They are NOT replacing your Cisco device outside of these two eventualities.  Their announcement is wordsmithed to lead the public to the conclusion that they are.



  • @MiB:

    The reality is that no matter what Netgate, Cisco, SuperMicro post about whether they'll proactively replace all devices that contain the affected CPUs, or just failed devices and try to limit their exposure to some lame 3 year limit, they'll quickly learn that a class action is going to change their position very quickly, and empty their pockets much quicker than if they just replaced all affected units from the get go. Intel is obligated to support all the costs. They've already incorporated a charge for this in their latest earnings. There are provisions within the law that don't allow companies to hide behind time limits on manufacturing defects (latent or otherwise). Just ask Apple.

    My advice, if you own an affected device, notify the supplier/manufacturer respectfully in writing that you expect a fixed replacement free of the defect within 90 days. If they don't comply, and/or don't reply, the law will set them straight and then some. Document your communications. A class action will be announced at some point.

    True, but the action may only affect America.

    E.g. the Nvidia GTX 970 scandal only gave Americans a rebate.

    I also think Intel's NDA is quite possibly illegal in some countries, especially in the EU; an NDA to hide design/manufacturing defects breaches various sales laws. This is not some new tech they want to keep under wraps but a released commercial product people have purchased. E.g. in the UK it's illegal to sell something to someone with a known defect and not disclose it. The law that applies is that of the country where the sale takes place, not where the company is HQ'd.



  • NDAs are simply standard practice. If you want access to proprietary information like futures, projected failure rates, direct purchasing requirements, etc., you sign an NDA. Like most companies, Intel uses bidirectional NDAs that cover all disclosures in the relationship. You should expect that any hardware manufacturer who uses Intel chips in their designs will have an NDA with Intel. Many software companies do as well.

    In other words, it's not a conspiracy. The NDAs the companies are citing are not new, nor specific to this issue. The "educated guess" in the ServeTheHome article is simply wrong. No one who has been under NDA with Intel would suggest that.



  • The law here or there is not really the point. What matters more is whether Intel or Supermicro can give us a small program, say installed on a USB stick, with which we can deactivate these registers and then reboot, and then plug in a second USB pen drive with a built-in i2C chip that takes over this part of the work; then we would all be fine. If that cannot work, we should wait for another path to be shown by Intel or another vendor that we can all follow. And if nothing helps, at least we all know about the problem now and can buy adequate replacements with our own money, because our networks should be safe and secure, and then we are not stuck in a deep black hole with no way out. Sure, most people will think this is not ideal, but if these units stop booting, the pain and stress will probably be much greater than knowing that something must be done before the units fail.

    It really is a time bomb, but we will only really be able to talk about it after a failure shows up, not months or years before the failure might occur.



  • @BlueKobold:

    deactivate these registers and then reboot

    From what I've read, that could be as far as your proposal gets you. It might have been the last reboot ever initiated on that system… :)




  • I think gonzopancho mentioned on Reddit that the ADI systems boot from i2c flash.

    https://www.reddit.com/r/PFSENSE/comments/5s8pwi/intel_c_series_processor_recalls_are_pf_official/

    So even if the LPC signal is not required during boot, the signal might still be required by other components?

    Then again, depending on how the LPC component of the processor fails, it may affect other parts of the processor, too.

    Reading between the lines, it appears that "usage" or heat may accelerate the deterioration of said LPC component.
    I'm sure Cisco would drive CPUs very hard, as that is how you get "value" from your CPU. They wouldn't overspec the CPU to run it at 20% load.

