Intel Atom C2xxx LPC failures
- 
 BTW, my first C2758 board had to be exchanged because of heat problems … … and recycled for another happy customer 2 years later. ;D :P 
- 
 The effect of a pervasive component failure and the manufacturer's reaction has a strong effect on their reputation, if nothing else with their current customers. Two personal examples. On the car analogy front. 
 The BMW Mini 2002-2005 has a problem with its power steering pump, which intermittently fails, making the car very hard to control at any speed due to the effort needed to turn the steering wheel. At normal road speeds you suddenly lose the ability to take corners. In 2015 in the USA and Canada the regulators forced a recall. http://www.autonews.com/article/20151028/OEM11/151029806/bmw-recalls-86018-mini-cooper-models-for-power-steering-glitch In the UK, the regulator accepted BMW's statement that the problem did not affect safety. Hah! BMW would only replace the pump at a low cost if the Mini had been serviced throughout by a BMW dealer and you could deliver it to a BMW dealer with the fault present. Cost me £850 for a new pump. I will never buy a BMW car or bike again.
 On the Atom clock problem on FreeNAS Minis: https://support.ixsystems.com/index.php?/Knowledgebase/Article/View/289
 Will iXsystems replace my FreeNAS Mini motherboard under warranty if I experience this issue? What if my warranty is expired?
 iXsystems is proud to stand behind its products. We’re extending the warranty on all second generation FreeNAS Mini motherboards shipped before February 2017 to a total of three (3) years. Any FreeNAS Minis shipped in February and after will have our standard one year warranty and are completely free of this issue.
 So, my 18 month old FreeNAS Mini will be fixed if it develops this problem in the next 18 months, and my new FreeNAS Mini XL shipped in February should not suffer from it. iXsystems are taking the same stance with a recently announced problem with the BMC component (a firmware bug is wearing out the BMC flash). https://support.ixsystems.com/index.php?/Knowledgebase/Article/View/287/60/asrock-rack-c2750d4i-bmc–watchdog-issue. I am happy with iXsystems as a supplier.
 Moving on to Netgate: https://blog.pfsense.org/?p=2297
 A board level workaround has been identified for the existing production stepping of the component which resolves the issue. This workaround is being cut into production as soon as possible after Chinese New Year. Additionally, some of our products are able to be reworked post-production to resolve the issue.
 I recently bought a SG-2220 for my mother's home, and am keeping the ALIX box I used before as a backup. So I'll cross my fingers and hope that it doesn't die. I was considering replacing the APU box I use at home with a SG-2440 or SG-2860. Now, until Netgate confirm that shipping models will not suffer from the problem, I will not buy; and may well go the much cheaper barebone or self build route.
- 
 VW have a problem with their Direct Shift Gearbox (DSG) where clutch discs wear out and the ride becomes very jerky. Related to this issue is a mechatronic box that overheats, breaks and leaves the car without power. My car developed the dreaded gearbox shudder and jerkiness.
- VW tells me there is no issue and I'm imagining it
- VW installs a software update - problem goes away, but comes back soon
- VW installs a new mechatronic box/controller - doesn't fix it
- VW replaces the clutch as "good will" - problem is fixed
- a woman dies as her VW loses power in Australia and becomes sandwiched in between two trucks - VW acknowledge there's a problem - recall for mechatronic controller box
- clutch issues remain
- 20.000 km later my clutch dies again
- another VW service station wants to charge me for the required software update
- I explain that I already had the software update - no charges
- VW confirms clutch out of tolerance and replaces clutch pack
- 20.000 km later my clutch dies again
- VW dealership blames clutch issue on driving behaviour
- I explain that I know about the issue, clutch gets replaced under warranty
- car has 60.000 km, now third clutch
 I will not buy a VW again! The point is that customers value after sales care. It is equally if not more important than product features.
 At least VW didn't point the finger at the clutch manufacturer. I bought the car from VW and I do not care who they source their parts from. I bought my Netgate Appliance from the Netgate store. That's why Netgate has skin in the game. I bought from Netgate - not Intel.
- 
- 
 - 20.000 km later my clutch dies again
 I think I see your problem. If you had driven 12,427 MILES instead of messing with those silly km's, you'd have had better luck with your clutch. :P (I wonder if this relates to the thread titled "How is throughput measured?") 
- 
 I have been watching this thread closely since the bug was disclosed (and the thread at synology forums). Today - I'll throw my 2 cents in. 
 I have skin in the game on both these vendors (as well as with Cisco) and was hoping to see more Cisco solutions than Netgate/Synology solutions. (Proactive replacement vs Reactive replacement)
 I do IT for a lot of businesses. Understand, then, that when I recommend something I do it because I think this is the best solution for the client based on their needs. And they trust me to do so - and to arm them with the facts so they can make the best decisions.
 I have been using pfSense for 5+ years now (started in the 1.x series). I have pfSense at home (on a NUC), at multiple client sites (on official SG-2xxx, on other custom builds, even fireboxes, etc). I probably have a 50/50 split on the hardware between store and custom. And the store hardware is more expensive - always. But I buy from the store in order to support the project. The same reason I have a gold subscription.
 However, I can't jeopardize / destroy my reputation to support Netgate. When I said to the client "even though it is more money, buying from the store is better since you will support the devs, + get quality purpose built hardware that should last longer than custom builds" they made that decision to do so because I recommended it. And I recommended it because I trusted Netgate to provide what they advertised and make it right if they didn't. A $500 firewall should last more than 18 months (or 36, or even 60 - 5 years on a firewall or switch is normal in this industry. 5 years on a PC with moving parts is even normal.).
 I understand this isn't Netgate's fault. Netgate did what Netgate was supposed to do and Intel did not. They are at fault. But the relationship is between Intel and Netgate, not Intel and me. Just like the relationship between me and my client is mine, not Netgate's. And that is why I am holding Netgate responsible to make it right - and I hope they are doing the same with Intel (asking Intel to make it right to them).
 I'm going to have to make it right with my client - I doubt they will see the labor associated with replacement as something they should pay for. So Netgate, please make this right. A 3 year warranty is nice, sure - but what happens in year 4? What happens when it fails and the client is down for days while a new one is shipped (hey HA people, the client didn't know they would need HA at time of purchase. If they had known that there was an expected failure, they could have used that fact while determining what to purchase). And all the other scenarios that can and will play out. Who pays for my time to troubleshoot, install, configure, etc? Who gets punished for the unexpected schedule change needed to replace a failed unit?
 You can have the best product in the world and if your post sales team sucks, your product will not sell well. And with a worse product, if your post sales team makes it right, customers will come back (at least the first few times). So Netgate, please do a proactive replacement. Make Intel pay for the cost to do so (you can bet Cisco is doing that). Because if you don't, I know that I will be forced to make a choice between a product that makes me look bad when it fails and another product. As I will also have to do with my NAS recommendations. And my reputation is more important to me than lining Netgate's (or anyone else's) pockets - as much as I love the project. /ball is in your court now
- 
 What happens when it fails and the client is down for days while a new one is shipped (hey HA people, the client didn't know they would need HA at time of purchase. If they had known that there was an expected failure, they could have used that fact while determining what to purchase)
 So, why didn't they know? Someone should have told them. Hardware breaks, all the time, everywhere, suddenly. So there's always an expected failure.
 They have no spare or support contract for their device, they are down. This is not specific to some Intel issue.
- 
 Oh noes, I'm absolutely doomed because my oh so supercritical router's downtime is costing me gazillions $$$s per hour – but of course I did not know it could ever break so I have no backup solution anywhere. Netgate, you suck big time!!! ::) ::) ::) 
- 
 Have you read this post? https://blog.pfsense.org/?p=2297 The following part is the most important for our customers: "Although most Netgate Security Gateway appliances will not experience this problem, we are committed to replacing or repairing products affected by this issue for a period of at least 3 years from date of sale, for the original purchaser." Before the clock signal component issue was revealed, our units had a 1 year warranty.
 Now our units have a warranty of at least 3 years because of the clock signal component issue, which may not even occur. As others have said, not having HA is a different issue that has nothing to do with the clock signal component.
- 
 ivor, I think the point that some are trying to make is that Netgate has made no comment on the status of the CURRENTLY SHIPPING units. Is the "platform level fix" already applied to new units shipping from netgate/pfsense or not? The closest netgate has come to answering that is:
 This workaround is being cut into production as soon as possible after Chinese New Year. Additionally, some of our products are able to be reworked post-production to resolve the issue.
 (quoted from that same link you posted.) That doesn't say "okay, from this point forward it's all good!" In fact, it's pretty open-ended.
- 
 I feel compelled to add my two cents here… Everyone who is talking about HA is missing the mark imo. The decision on whether or not to deploy an HA solution is based on a risk and cost analysis, it is not pulled out of a hat. This increased failure rate does not expose poor designs (the "haha should have designed in HA solution" attitude), it potentially changes the equation a design was built on. It's easy to talk about HA when that means you only had to buy 1 more firewall, but at a certain scale you have to weigh the costs on both sides.
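 To make the "equation a design was built on" concrete, here is a minimal sketch in Python of that kind of risk and cost comparison. Every number and name in it is hypothetical and chosen only for illustration; none of them come from Netgate, Intel, or anyone in this thread. It compares the expected yearly loss from an outage with the yearly cost of running HA, and shows how a higher failure probability can flip the decision.

```python
def expected_outage_loss(p_fail_per_year, outage_cost):
    """Expected yearly loss from running a single unit without HA."""
    return p_fail_per_year * outage_cost


def ha_is_justified(p_fail_per_year, outage_cost, ha_cost_per_year):
    """True when the expected yearly loss exceeds the yearly cost of HA."""
    return expected_outage_loss(p_fail_per_year, outage_cost) > ha_cost_per_year


# Hypothetical inputs, for illustration only.
OUTAGE_COST = 5000.0       # assumed cost of one outage (downtime plus labor)
HA_COST_PER_YEAR = 250.0   # assumed yearly cost of a second unit

# With an assumed 2% yearly failure chance, skipping HA is the cheaper bet.
print(ha_is_justified(0.02, OUTAGE_COST, HA_COST_PER_YEAR))  # False (about 100 < 250)
# If the assumed chance rises to 10%, the same inputs favor HA.
print(ha_is_justified(0.10, OUTAGE_COST, HA_COST_PER_YEAR))  # True (about 500 > 250)
```

 The sketch is deliberately crude: it treats an outage as a single fixed cost and ignores the possibility that both units of an HA pair share the same defect.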
- 
 What happens when it fails and the client is down for days while a new one is shipped (hey HA people, the client didn't know they would need HA at time of purchase. If they had known that there was an expected failure, they could have used that fact while determining what to purchase)
 So, why didn't they know? Someone should have told them. Hardware breaks, all the time, everywhere, suddenly. So there's always an expected failure.
 They have no spare or support contract for their device, they are down. This is not specific to some Intel issue.
 Of course they know hardware can break. They didn't know this specific piece of hardware will break (cue the various arguments about whether it will or not - not relevant here. Cisco certainly seems to think it will.). And of course, no matter what, if something breaks, it will take them down. So when purchasing, they did the math and said "Chance of it breaking vs cost of HA - we will risk it". Now that the chance is significantly higher due to this specific Intel issue…, that math doesn't work anymore - doing that math now makes it look like "chance of it breaking vs ___ - we will buy another product".
 Oh noes, I'm absolutely doomed because my oh so supercritical router's downtime is costing me gazillions $$$s per hour – but of course I did not know it could ever break so I have no backup solution anywhere. Netgate, you suck big time!!! ::) ::) ::)
 (I know, don't feed the troll.) Where did I write anything about doom, super critical, or gazillions $$$s per hour? Or not having a backup? Or anyone sucking?
 Have you read this post? https://blog.pfsense.org/?p=2297 The following part is the most important for our customers: "Although most Netgate Security Gateway appliances will not experience this problem, we are committed to replacing or repairing products affected by this issue for a period of at least 3 years from date of sale, for the original purchaser." Before the clock signal component issue was revealed, our units had a 1 year warranty. Now our units have a warranty of at least 3 years because of the clock signal component issue, which may not even occur. As others have said, not having HA is a different issue that has nothing to do with the clock signal component.
 Yes, I read that. And I came to the following conclusion… Netgate doesn't have confidence that the SG-xxxx won't be affected by this issue - because if they did they would have instead done a warranty extension for this issue for 5+ years to lifetime. After all, if you suspect that the clock signal issue won't cause an issue, what's the problem with putting your money where your mouth is?
 I'm not saying offer lifetime warranties for all issues - only for clock signal issues. (And if Netgate did that, then I would consider the case solved from my end). Which is why I (and many others) are calling for a proactive replacement - because we all think that all affected products will fail, and we would rather replace them on our schedule instead of after they fail.
 And I agree with you - HA is a different issue that should be left out of this conversation.
 ivor, I think the point that some are trying to make is that Netgate has made no comment on the status of the CURRENTLY SHIPPING units. Is the "platform level fix" already applied to new units shipping from netgate/pfsense or not? The closest netgate has come to answering that is: This workaround is being cut into production as soon as possible after Chinese New Year. Additionally, some of our products are able to be reworked post-production to resolve the issue. (quoted from that same link you posted.) That doesn't say "okay, from this point forward it's all good!" In fact, it's pretty open-ended.
 And this is another valid point. How can I now knowingly buy from the store when I might be buying a problem?
 I feel compelled to add my two cents here… Everyone who is talking about HA is missing the mark imo. The decision on whether or not to deploy an HA solution is based on a risk and cost analysis, it is not pulled out of a hat. This increased failure rate does not expose poor designs (the "haha should have designed in HA solution" attitude), it potentially changes the equation a design was built on. It's easy to talk about HA when that means you only had to buy 1 more firewall, but at a certain scale you have to weigh the costs on both sides.
 Thank you. I could not have said it better myself.
- 
 Yep, I wonder why they don't do a "pro-active" replacement. Perhaps the Supermicro style - send you a refurb god knows where it came from, with no sign of fix anywhere… Would cost just the postage. Placebo effect in action. Sigh. 
- 
 The discussion about HA, critical uptime, cold spares, expensive Cisco Support contracts etc does nothing but distract from the real issue. Netgate continues to sell SG series appliances in their online store, despite knowing they have a CPU with a fault. I wonder if Netgate/pfSense understand that "word of mouth" is a major contribution to their sales and marketing efforts. Somebody else has already mentioned it - that those Edge Routers look better by the minute. 
 They may not do everything pfsense can do, but hey - they're cheap.
- 
 http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf Conclusion: Everyone should stop selling computers immediately, because they're all selling CPUs with known faults. ::) ::) ::) 
- 
 The discussion about HA, critical uptime, cold spares, expensive Cisco Support contracts etc does nothing but distract from the real issue.
 No it doesn't. It exposes the piss poor planning on the part of the people who designed and actively maintain that particular system. My educated guess is that you have more to worry about with a close by lightning strike than you do with this particular "fault" showing up on your equipment tomorrow. But much like a hurricane warning, you have time to plan and have a spare in hand to mitigate any "downtime" potential. Lightning strikes (which I deal with on a recurring basis) tend to hit more like earthquakes or tornadoes. You better have a plan in place to get yourself or your clients back on your feet in the event that one does happen. Otherwise your clients will come running to someone else like me that does have that particular plan in place.
 Quit worrying about whose fault it is right now and plan for the worst so as to keep your people happy. If you have spares on the shelf "you will be able to sleep on a windy night". Over my office door - "Your lack of preparation is not my emergency!" ;) my 2 pennies.
- 
 So when purchasing, they did the math and said "Chance of it breaking vs cost of HA - we will risk it". Now that the chance is significantly higher due to this specific Intel issue…
 When purchasing, they didn't "do the math," they "made a guess." Doing the math would have required MTBF projections to work with. To my knowledge, ADI has not published any MTBF projections. And anyone that would actually go to the point of analyzing MTBF and cost to the business of failure would not be asking "should we have a spare," they would be asking "how many spares should we have?"
 Netgate, while saying that they believe that most customers will not encounter the issue at all, have extended their one year warranty to three years to cover this issue. That's a pretty strong commitment for a small business to make. They certainly didn't have to do that. In my view, this is more than satisfactory.
 Perhaps another way to look at this might offer you some comfort: Netgate extended their warranty to three years, effectively tripling their liability. They must have very good reason to believe that the number of units that will actually fail in three years of operation will be quite low. Otherwise, the commitment would constitute corporate suicide. I'm pretty sure that was not their intent.
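 For what "doing the math" with an MTBF figure could look like, a minimal Python sketch follows. It assumes a constant failure rate (the exponential model), which is a simplifying assumption and may not describe a wear-out defect like this one. The MTBF value and fleet size are hypothetical, since no real MTBF projections are cited in this thread.

```python
import math


def failure_probability(mtbf_hours, period_hours):
    """Chance that one unit fails within the period, assuming a constant
    failure rate (exponential model)."""
    return 1.0 - math.exp(-period_hours / mtbf_hours)


def expected_failures(units, mtbf_hours, period_hours):
    """Expected number of failed units in a fleet over the period."""
    return units * failure_probability(mtbf_hours, period_hours)


HOURS_PER_YEAR = 8760
# Hypothetical inputs: 1000 units, an assumed MTBF of 200,000 hours, 3 years.
failures = expected_failures(1000, 200_000, 3 * HOURS_PER_YEAR)
print(round(failures))  # roughly 123 with these assumed inputs
```

 The expected number of failures is only a starting point for the "how many spares" question; a real analysis would also need the spread around that figure and the replacement lead time.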
- 
 "They did the math" is an expression. Although, that doesn't matter here. Let's say I told you the client had 1000 locations. And had a SG-2440 or better at each. That's a half million spent at SG-2440 prices for HA.
 And let's say we had HA (after spending half a million for that) - we would use the same firewall most likely for the HA unit…
 Which means we would have 2000 devices all affected by the same bug.
 And chances are we put them both in and turned them on the same day. Which means that they would probably fail at the same time (or very close to it).
 They did a cost vs risk analysis. The information they used in that is now wrong - the likelihood of failure is higher now. We can debate what information they used and should they have used that information all day, but it really is irrelevant. We can debate should they have HA or spares on a shelf, but that is also irrelevant to the discussion. In fact, the only things relevant to this discussion are:
 1. what's going to happen to the equipment in the field with regards to potential LPC failures
 2. what's netgate going to do
 3. and the reason for 2.
 Right now we are told the equipment will work fine, netgate will extend the warranty to 3 years, and that is because the equipment will be fine. Some of us don't believe that to be true (the equipment will be fine). As such, I have proposed another 2 options, namely a 5 year or better warranty for hardware that does have a clock signal failure (since it shouldn't happen according to netgate, this should be a no brainer), or a proactive replacement. I'd be interested to hear your reasoning as to why you or anyone else (really I want to hear from Netgate) are opposed to this.
 I'm not really interested in further discussing "should someone have HA or spares on the shelf" in this topic - that's a valid topic for another thread and has nothing to do with this thread. This thread, as the title says, is about the Intel Atom C2xxx LPC failures.
- 
 So when purchasing, they did the math and said "Chance of it breaking vs cost of HA - we will risk it". Now that the chance is significantly higher due to this specific Intel issue…
 Netgate, while saying that they believe that most customers will not encounter the issue at all, have extended their one year warranty to three years to cover this issue. That's a pretty strong commitment for a small business to make. They certainly didn't have to do that. In my view, this is more than satisfactory. Perhaps another way to look at this might offer you some comfort: Netgate extended their warranty to three years, effectively tripling their liability. They must have very good reason to believe that the number of units that will actually fail in three years of operation will be quite low. Otherwise, the commitment would constitute corporate suicide. I'm pretty sure that was not their intent.
 Correct, Denny. (And thanks!) Exactly your last sentence above. We are a small company. We know how many systems we've sold, and we know the modeled failure rate and timeline to failure of the affected component. No, I'm not going to let on with any numbers. Some of it is proprietary to Netgate, some of it is covered by NDA.
- 
 So when purchasing, they did the math and said "Chance of it breaking vs cost of HA - we will risk it". Now that the chance is significantly higher due to this specific Intel issue…
 And chances are we put them both in and turned them on the same day. Which means that they would probably fail at the same time (or very close to it).
 You're wrong here, and I'm bound to not explain why or how.
- 
 I wonder if Netgate/pfSense understand that "word of mouth" is a major contribution to their sales and marketing efforts.
 I believe we do, yes.
 Somebody else has already mentioned it - that those Edge Routers look better by the minute. They may not do everything pfsense can do, but hey - they're cheap.
 And that's all you can really say about them.
 
- 


