Netgate Discussion Forum

    Intel Atom C2xxx LPC failures

    Hardware
    168 Posts 39 Posters 57.2k Views
      garyd9

      @VAMike:

      Correct. There is no such thing as a modern CPU with no known bugs. This issue is referred to as AVR54 because it is the 54th issue listed in a 37 page document concerning bugs in the C2000 processor family. Why did this sudden demand for perfection not apply to the first 53 issues?

      How many of those other issues result in a dead system? That's the difference.

      @VAMike:

      You're not making a point, you're sounding ridiculous. You seriously want to assert that an issue that slightly increases the failure rate of a family of microprocessors is the same thing as a failure mode that causes uncontrolled combustion and actually has the potential for injury or death? Dial down the rhetoric and focus on reality, please.

      To some people, if their network goes down, it can result in injury or death.  (Granted, those people really should have HA systems.)  My idea was to take it to an extreme to try and show you how other people might see this…  I obviously failed either because I didn't express myself well enough, or you just don't want to see both sides of the argument.

      @VAMike:

      First question: what was the expected service life of a motherboard with an embedded C2000? (Expected answer: you have no idea.)

      Surprise: 7 years. From Supermicro's website for their C2758 motherboard (point #10 in their key features list): http://www.supermicro.com/products/motherboard/atom/x10/a1sri-2758f.cfm

      Being that you made a wrong assumption on your first question, the rest aren't really relevant, are they?  You are also ignoring the difference between an unexpected failure and a known defect that can cause a failure.

      @VAMike:

      IF it were really the case that every C2000 would stop working after 3 years, then there'd be an argument that there should be a recall–but that's not the case even if some scare stories suggested that it was.

      Actually, we don't know the failure rate (as you pointed out). Intel has hidden that information. If it were an extremely low number, I wouldn't expect them to hide it, though. Instead, they'd advertise that "only 1 in xxx million will ever have this issue, so don't worry about it!" That leads me to believe that it's higher than most people would be comfortable with.

      (Even netgate stated that the "majority" of people won't have the issue.  That could be interpreted as "49% of the people WILL have this issue.")

        VAMike

        @garyd9:

        @VAMike:

        You're not making a point, you're sounding ridiculous. You seriously want to assert that an issue that slightly increases the failure rate of a family of microprocessors is the same thing as a failure mode that causes uncontrolled combustion and actually has the potential for injury or death? Dial down the rhetoric and focus on reality, please.

        To some people, if their network goes down, it can result in injury or death.  (Granted, those people really should have HA systems.)  My idea was to take it to an extreme to try and show you how other people might see this…  I obviously failed either because I didn't express myself well enough, or you just don't want to see both sides of the argument.

        There is no other side of the argument where a single C2000 failing causes injury or death unless there's gross incompetence involved BECAUSE THAT WAS ALREADY A POSSIBILITY THAT SHOULD HAVE BEEN ACCOUNTED FOR. Again, we're not going from a 0% failure rate to a non-0% failure rate, we're going from non-0% to non-0%.

        @VAMike:

        First question: what was the expected service life of a motherboard with an embedded C2000? (Expected answer: you have no idea.)

        Surprise: 7 years. From Supermicro's website for their C2758 motherboard (point #10 in their key features list): http://www.supermicro.com/products/motherboard/atom/x10/a1sri-2758f.cfm

        I don't think that means what you think it means. If you build a solution based on the a1sri-2758f you can expect to be able to get the part for a seven year period. That's a good selling point for places that want to plan the entire lifecycle of a deployment, but irrelevant to this discussion. I don't see any sign that the warranty on the C2000 motherboards is anything other than their standard 3yr/1yr (I certainly didn't get anything else in my box). And n.b. that the design service life is not the same as the warranty period. (You can simply make a good bet that the design service life is longer than the warranty period, but you'd need more information to figure out what it is. In general that's something that's only useful to someone providing contractual support, because it doesn't matter to the consumer how much longer than the warranty the design life is–once the warranty is up so is your guarantee.)

        Being that you made a wrong assumption on your first question, the rest aren't really relevant, are they?  You are also ignoring the difference between an unexpected failure and a known defect that can cause a failure.

        No, the other points are extremely relevant, you just can't/don't want to address them.

        @VAMike:

        IF it were really the case that every C2000 would stop working after 3 years, then there'd be an argument that there should be a recall–but that's not the case even if some scare stories suggested that it was.

        Actually, we don't know the failure rate (as you pointed out). Intel has hidden that information. If it were an extremely low number, I wouldn't expect them to hide it, though. Instead, they'd advertise that "only 1 in xxx million will ever have this issue, so don't worry about it!" That leads me to believe that it's higher than most people would be comfortable with.

        It leads me to think it requires detailed analysis of the actual deployed system and that a blanket answer isn't possible. It also seems consistent with the general level of information published about any CPU (if you want an actual failure rate, then get a contract with them; you still probably won't get their internal information, but you'll get a number you can plan for or get some kind of compensation if they miss the number–which will almost certainly be higher than what they think they'll actually achieve.) Hysterically insisting "IT MUST BE BAD" isn't useful, nor is it congruent with what we know from places which have actually deployed them at scale (they simply aren't failing in large numbers).

          gcu_greyarea

          Reposting:

          Cisco have had a fix in place since the beginning of Dec 2016. They knew well before everybody else.

          http://www.cisco.com/c/en/us/support/web/clock-signal.html#~faqs

          If Cisco could rework the systems as early as 3 Dec 2016, they must have known about the issue well before then, considering it takes time to adjust production for the required fix on all affected platforms.

          Cisco have the expertise and resources to do their own independent research into C2000 failure rates and do not have to rely on Intel to feed them BS. Cisco would also have plenty of failed units to back up their research and their claims when making a case against Intel.

          It is very likely that Cisco discovered the flaw first and confronted Intel with their findings. Cisco don't just peddle C2000 CPUs, their UCS systems also use Intel - so Cisco have enough weight to throw around.

          Eventually Intel had to admit the flaw - and since board members high up need to have their arses covered, they had to announce the issue in the recent earnings call. Then they worded AVR54 so innocuously, hoping it would slip under the radar.

            gcu_greyarea

            I don't think anybody but Cisco actually has the resources to do their own RCA into C2000 failures. So you just have to believe what Intel will tell you. Perhaps Cisco created "alternative facts" - but they thought it was serious enough to not expose their customers to that risk.
            (Please spare me the 'Cisco=expensive' talk)

            And while it is speculation - I still believe the CPU utilisation and associated heat will deteriorate the CPU faster.

            So if you do not have access to technology and resources, you could simply "cook" a CPU and see which component goes first. It'll be that clock generator.

            When these reworked (fixed) ADI/Netgate systems finally arrive, I'm pretty sure the first ones will go into the homes of Netgate/pfSense employees. That's what I would do...

              VAMike

              @gcu_greyarea:

              I don't think anybody but Cisco actually has the resources to do their own RCA into C2000 failures. So you just have to believe what Intel will tell you. Perhaps Cisco created "alternative facts" - but they thought it was serious enough to not expose their customers to that risk.
              (Please spare me the 'Cisco=expensive' talk)

              This isn't about the customers, this is about Cisco's own costs associated with fulfilling its support contracts. They likely have a contract with Intel that stipulates failure rates (which they use when planning their own fee structure), and if the erratum causes Cisco's costs to be higher than expected they can go back to Intel for money.

                gcu_greyarea

                This is NOT just about "Cisco's own costs associated with fulfilling its support contracts".

                As of Dec 2016 Cisco will sell you a unit with a C2000 Processor which is NOT affected by the clock issue.

                http://www.cisco.com/c/en/us/support/web/clock-signal.html#~faqs

                That is irrespective of whether I am going to purchase a Cisco SmartNet Contract or not.

                My question - which I believe is justified - was:

                How come pfSense/Netgate are knowingly selling affected units in their online store?

                Nobody has even come close to answering that question.
                Instead I was given a lesson in High Availability  :)

                  VAMike

                  @gcu_greyarea:

                  This is NOT just about "Cisco's own costs associated with fulfilling its support contracts".

                  Cisco's costs are what gives them leverage over Intel, who is paying for at least a significant portion of this. Conversation goes like this:
                  CSCO: "You sold us some stuff with a projected failure rate of x%, but it's actually failing at x+y%, this will cost us $z in increased support costs over the life of the products. Would you like to write a check now or waste a lot of money on a legal fight first?"
                  INTC: "How big a check was that?"

                  An alternate conversation:
                  BRANDX: "We'd like to give a bunch of people new computers."
                  INTC: "That's very nice of you. Why should we care?"
                  BRANDX: "We think you should pay for it."
                  INTC: "We might help pay for actual failures, as a gesture of goodwill, but we're not going to pay for everyone to get a new computer."
                  BRANDX: "We'll sue you!"
                  INTC: "Do you even have standing? How much is this costing you?"
                  BRANDX: "Well, nothing–the systems are out of warranty--but some guy on the internet says everyone should get a new computer!"
                  INTC: "Have you met our lawyers?"

                  I get that everybody wants free stuff, but in the real world everyone can't have free stuff.

                  I guess if enough guys on the internet get pissed after reading sensationalist articles, there might be a class action. You'll get a $5 coupon off your next intel purchase of $1000 or more, some lawyers will make a ton of money, you still won't get a new computer, and pretty much everyone will forget all this because their C2000s will just keep working until they're tossed as obsolete.

                  Nobody has even come close to answering that question.

                  No, you just didn't like the answer: "because it isn't that big a deal".

                    gcu_greyarea

                    Here’s another story:

                    INTEL - a company with 50 years of experience in producing CPUs and selling millions of units says they have discovered an above-projected failure rate in one of their products (C2000). They deem it important enough to inform their shareholders in their earnings forecast and release AVR54.

                    CISCO - a company with 30 years of experience with using CPUs in their networking equipment and selling millions of units says there is an issue in several of their products using the C2000. They deem it important enough to stop-ship any affected products, put a workaround in place - and notify their customers.

                    Netgate/pfSense - a company with 2 years of experience in the hardware market, selling hundreds to thousands of units says - nope - Intel, the manufacturer of the product (C2000), and Cisco simply got it wrong.
                    We - pfSense/Netgate - know better. There is no issue, it is unlikely to ever occur, and we have made the conscious decision to not inform potential customers so that we can continue selling those faulty products in our online store.

                    Please do yourself a favour - and stop-ship - until you have reworked products.
                    You are gambling the value of a few units in your warehouse against the future of your company.
                    Your customers will remember - and you risk dragging down the reputation of the pfSense software, too.

                      VAMike

                      @gcu_greyarea:

                      CISCO - a company with 30 years of experience with using CPUs in their networking equipment and selling millions of units says there is an issue in several of their products using the C2000. They deem it important enough to stop-ship any affected products, put a workaround in place - and notify their customers.

                      Let's see what, exactly, cisco said:

                      Q: Is this a product recall?
                      No, this is not a product recall. Although the Cisco products with this component are currently performing normally, we expect product failures to increase over the years, beginning after the unit has been in operation for approximately 18 months. Although the issue may begin to occur around 18 months in operation, we don’t expect a noticeable increase in failures until year three of runtime. For customers that determine proactive replacement is required, Cisco is offering to provide replacement products for those products under warranty or covered by any valid services contract dated as of November 16, 2016, which have this component.

                      So if you didn't have a service contract, you only get a new computer if it's still under warranty. How long is cisco's warranty?

                      All Cisco hardware and software products are covered by warranty for a minimum of 90 days

                      Yeah. It's all about the customers, not at all about limiting cisco's ongoing costs.

                      So why is a company that made $10bn in profits on $50bn revenue last year not replacing everything, the way you want netgate to? They certainly have the money to do it. Maybe because your analysis and conclusions are just wrong?

                      A really dedicated conspiracy theorist could even wonder if this is really all a tremendously clever way to put cisco's services competitors out of business. (Cisco has very publicly announced that everyone gets free hardware, but they're only paying for the installation work if you're on a cisco hands-on service contract. If you contract with someone else to provide the services, cisco just dumped a huge expense on your contractor after setting the expectation that all the work of replacing the hardware should be "free". And by prioritizing shipments, they can pretty much guarantee that those providers won't just hit a customer once and replace everything: they'll be tied up for months if not years going back again and again to replace a few units at a time.)

                        gcu_greyarea

                        VAMike, I appreciate your opinion, however - with all due respect - I think you are missing the point of my initial question. We are talking about two different things:

                        VAMike:
                        You are talking about Cisco's approach (proactive) to pull faulty units out of the field.
                        Netgate use a fix-on-fail approach (reactive), with a generous warranty extension. I don't necessarily like fix-on-fail and would prefer a proactive replacement - or as you call it, "free stuff".
                        Basically this discussion is about dealing with faulty units in the field - AFTER - the units have been sold.

                        Me:
                        I am asking why Netgate/pfSense knowingly continue selling products which are affected by AVR54.
                        This question is about how vendors choose to deal with the issue - BEFORE - customers purchase a product.

                          VAMike

                          @gcu_greyarea:

                          I am asking why Netgate/pfSense knowingly continue selling products which are affected by AVR54.

                          @VAMike:

                          @gcu_greyarea:

                          Nobody has even come close to answering that question.

                          No, you just didn't like the answer: "because it isn't that big a deal".

                            gcu_greyarea

                            @VAMike:

                            @gcu_greyarea:

                            I am asking why Netgate/pfSense knowingly continue selling products which are affected by AVR54.

                            @VAMike:

                            @gcu_greyarea:

                            Nobody has even come close to answering that question.

                            No, you just didn't like the answer: "because it isn't that big a deal".

                            I think we'll have to agree to disagree here …

                            We've had a very insightful discussion and I believe customers and potential future customers will be able to form their own opinions and make informed purchase decisions.

                              garyd9

                              @gcu_greyarea:

                              I think we'll have to agree to disagree here …

                              I came to that conclusion trying to talk to him a while ago (which is why I stopped wasting my time trying to get him to see more than one side of the argument.)

                              If it matters, I understand your point of view on this.  From a customer point of view, it sucks being told that you have to experience a failure and have your network go down for a few days (or weeks) before any action is taken on an issue that was already known about.

                              I can also understand why a smaller company, such as Netgate, might not be willing or able to take on the massive expense of replacing hardware that hasn't yet failed. Even if Intel gives them advance replacements for CPUs, and their m/b vendor does the same, they'd still lose a significant chunk of money.

                              Of course, more directly to your point, the complete silence from netgate in regards to their existing "new" stock is concerning.  I'd imagine that their legal folks are weighing in on silence as well (even if I don't understand why.)  I suppose there's a legal can of worms involved…

                                a-a-ron

                                I have an A1SRi-2558F and contacted SM directly yesterday. They told me to open a support ticket and then request an RMA. I did both, and they are replacing the board, no questions asked.

                                  apollo17

                                  I did the same for an a1sri-2758f. I RMA'd the board yesterday; they told me to make sure I included "Intel Atom C2000 problem, need ECO update" on the RMA form. Not sure how long it'll take; I'm in the UK and had to send it to the Netherlands.

                                    JeGr LAYER 8 Moderator

                                    Nice for both of you. If you get new boards from Supermicro, I'd check again whether they already include a fix. I phoned our SM distributor in Germany yesterday: "no HW with fixes/workarounds available yet". So I'd be curious whether a replacement that arrives within a few days really is a fixed board.

                                    Don't forget to upvote 👍 those who kindly offered their time and brainpower to help you!

                                    If you're interested, I'm available to discuss details of German-speaking paid support (for companies) if needed.

                                      apollo17

                                      I was very specific in my email. I asked directly whether there was a fix or not, so if it is not a fix then they lied to me, which I obviously won't be happy about considering the shipping costs. After all, it was a simple yes/no question, and answering it wouldn't have given any information away. But they deserve the benefit of the doubt. Not that I'd know one way or the other.

                                        Pippin

                                        Found this:
                                        https://forums.freenas.org/index.php?threads/freenas-mini-motherboard-clock-signal-issue.50582/#post-349195
                                        https://support.ixsystems.com/index.php?/Knowledgebase/Article/View/289

                                        Not sure what to think of this:

                                        any FreeNAS Mini or FreeNAS Mini XL manufactured on or after February 2017 will not have this issue

                                        Do they have a workaround?

                                        I gloomily came to the ironic conclusion that if you take a highly intelligent person and give them the best possible, elite education, then you will most likely wind up with an academic who is completely impervious to reality.
                                        Halton Arp

                                          gcu_greyarea

                                          @Pippin:

                                          Found this:
                                          https://forums.freenas.org/index.php?threads/freenas-mini-motherboard-clock-signal-issue.50582/#post-349195
                                          https://support.ixsystems.com/index.php?/Knowledgebase/Article/View/289

                                          Not sure what to think of this:

                                          any FreeNAS Mini or FreeNAS Mini XL manufactured on or after February 2017 will not have this issue

                                          Do they have a workaround?

                                          It sounds to me like a company whose actions I respect, whose products I would buy and recommend to others.

                                            garyd9

                                            @Pippin:

                                            Do they have a workaround?

                                            Yes, there's a platform-level workaround for the issue. It's described in Intel's original documentation of the issue.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.