Netgate RCC-VE 8860 and 23.01 Hardware Errata
-
@kesawi some info in https://forum.netgate.com/topic/177880/to-23-01-or-not-that-is-the-question/15
Sounds like a constant stream of errors. You can put it in the file before upgrading.
I hadn’t heard about disabling the LEDs, good to know before I saw them off!
-
EHCI is a much different case. Disabling that would affect USB, and would also make it fail to boot from eMMC. If you have an SSD installed then you could do that and be OK.
-
@jimp said in Netgate RCC-VE 8860 and 23.01 Hardware Errata:
fail to boot from eMMC
Since the 4860 does have an eMMC can I suggest updating the release notes to say not to disable ehci0 on that and/or other models?
-
It doesn't recommend disabling EHCI as it is, only the SMB interface which is OK to disable as it only affects the LEDs.
-
@jimp said in Netgate RCC-VE 8860 and 23.01 Hardware Errata:
It doesn't recommend disabling EHCI as it is, only the SMB interface which is OK to disable as it only affects the LEDs.
The way the errata is worded it mentions issues with both devices, and recommends disabling the device causing the problem (i.e. if the ehci0 is causing the issue, disable it). If both @SteveITS and I have interpreted it in this way, then it's likely others will.
Devices based on “ADI” or “RCC” hardware, such as the 4860, 8860, and potentially other similar models, may have issues with the ichsmb0 and/or ehci0 devices encountering an interrupt loop, leading to higher than usual CPU usage (NG 8916). This can typically be worked around by disabling the affected device.
A warning or note of the impacts of disabling either device (or a reference to your post above) would be prudent, and avoid potential frantic support requests or forum posts.
-
Have upgraded to 23.01 and encountered the "ichsmb0: interrupt loop, status=0x60" error message in the console on reboot. CPU usage jumped to sitting between 30-60%.
Have implemented the fix in the hardware errata notes by disabling ichsmb0. The status indicator light is still green, and I assume disabling the device means it just won't switch its colour to red if there is an issue?
Also notice after applying the fix that CPU usage is sitting marginally higher than it previously was, 15%-40%.
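For anyone wanting to put a number on how often the message shows up, here is a rough sketch that counts lines shaped like the one quoted above in a saved copy of the console or dmesg output. The regular expression is an assumption generalized from that single example message, and the sample text is made up apart from the ichsmb0 line, so treat it as illustrative only.

```python
import re
from collections import Counter

# Pattern generalized from the one message quoted above:
#   "ichsmb0: interrupt loop, status=0x60"
# Other devices or kernel versions may word the message differently.
LOOP_RE = re.compile(
    r"^(?P<dev>[a-z]+\d+): interrupt loop, status=(?P<status>0x[0-9a-fA-F]+)"
)


def count_interrupt_loops(log_text):
    """Return a Counter mapping (device, status) to the number of matching lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOOP_RE.match(line.strip())
        if match:
            counts[(match.group("dev"), match.group("status"))] += 1
    return counts


# Hypothetical sample text: only the ichsmb0 line comes from the thread,
# the igb0 line is invented to show that unrelated lines are ignored.
sample = """\
ichsmb0: interrupt loop, status=0x60
ichsmb0: interrupt loop, status=0x60
igb0: link state changed to UP
ichsmb0: interrupt loop, status=0x60
"""

if __name__ == "__main__":
    for (dev, status), n in count_interrupt_loops(sample).items():
        print(f"{dev} status={status}: {n} occurrence(s)")
```

Running it against the text captured before and after disabling a device would show whether the messages actually stopped, rather than relying on CPU usage alone.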
Have disabled ehci0 since I don't use the eMMC and CPU usage appears to have reduced to the lower range that I previously had under 22.05.
-
@kesawi Thanks for all the detailed follow-up, which certainly presents some additional, elevated concerns that seem to have been previously glossed over. I'll give folks the benefit of the doubt that it was an honest oversight and not intentional subterfuge.
When @jimp replied to my initial inquiry in the other thread I had considered asking for confirmation about the USB piece, but decided against doing so because:
A. The given errata note is quite vague with words like "may" and "ichsmb0 and/or ehci0" - With it not including an accompanying config entry for disabling ehci0 this led me to consider that the ehci0 reference could be from an earlier time in Netgate's internal investigation and simply draft text that was mistakenly left in the release text despite no longer being applicable to the interrupt loop workaround.
B. Jim specifically cited the LED status indicator disablement and the "not critical" nature of workaround, which I assumed meant don't worry about anything else, including the USB piece you specifically asked about, as it will all continue to work as before minus the LED.
Of course, the lack of a public facing Redmine entry means much less visibility into the underlying details and I think this current situation illustrates why this can be a problem.
What has been uncovered here is a much bigger issue and I would not qualify this as "not critical" for anyone using eMMC, which is the base config of a 4860. I personally would hit up against this with some systems and am glad I have refrained from touching 23.01 thus far, despite it supposedly containing a number of fixes I've been waiting on.
@SteveITS The details about this should absolutely be in the hardware errata notes as you suggest. Essentially, unless one has an SSD installed in the 4860 and can fully implement both aspects of the workaround (which need to be equally documented), don't install 23.01!
Note: SSD aside, I assume the external USB ports would not function with the full workaround in place. Needless to say, this would be a major problem for some setups. Say goodbye to UPS signaling, for example.
-
I've run with only the ichsmb0 device disabled for a while now and haven't seen any issues. I haven't noticed significantly higher CPU usage either, though it could be marginally higher.
Also worth noting is that in fact the status LED is driven directly by GPIOs on the RCC-VE platform and disabling the ichsmb has no effect on it.
Steve
-
Disabling ehci0 isn't an option for me. I'm not using eMMC, but I am using a UPS and I will not run something like pfSense without the ability to safely power down after an extended power outage.
Seems like the results are mixed on ehci0 either causing a small increase in CPU, vs. no measurable (noticed) increase.
I use telegraf to send my CPU stats to influxDB and then I can visualize in Grafana. So I'll know for certain once I upgrade to 23.01, but I'm holding off for the time being.
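As a rough illustration of the kind of before/after comparison that would settle it, the sketch below turns two samples of the cumulative tick counters that FreeBSD reports via `sysctl kern.cp_time` (user, nice, system, interrupt, idle) into percentages. The function name and the sample numbers are hypothetical and not measurements from any real box. The interrupt share is the one to watch if a device is stuck in an interrupt loop.

```python
# Order of the counters reported by `sysctl kern.cp_time` on FreeBSD.
STATES = ("user", "nice", "system", "interrupt", "idle")


def cpu_shares(before, after):
    """Percentage of CPU time spent in each state between two samples.

    Both arguments are sequences of five cumulative tick counters,
    in the order given by STATES.
    """
    if len(before) != len(STATES) or len(after) != len(STATES):
        raise ValueError("each sample must contain five counters")
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken in order and must differ")
    return {name: 100.0 * d / total for name, d in zip(STATES, deltas)}


# Hypothetical tick counts, chosen only to make the arithmetic easy to follow.
sample_before = (1000, 0, 500, 200, 8300)
sample_after = (1100, 0, 600, 500, 8800)

shares = cpu_shares(sample_before, sample_after)

if __name__ == "__main__":
    for name in STATES:
        print(f"{name:>9}: {shares[name]:5.1f}%")
```

Comparing the interrupt percentage under 22.05 with the same figure after upgrading, with and without each device disabled, would separate a genuine interrupt problem from ordinary load variation.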
-
@stephenw10 Thanks, Steve. That would align with what @kesawi reported about the status indicator still being green. You might want to confer internally with others on this because it appears folks doubled down on a blanket statement for the LED piece with revised errata text that now says “To disable the ichsmb0 device, which will disable the LED status indicators, place the following in /boot/loader.conf.local…” I assume this revision is per Todo #14023, which while well intentioned I think missed the mark a bit and could have benefited from some additional Netgate triage prior to revised errata publication.
Varying LED behavior from one model to another aside, I still think it would be best to have a standing, public Redmine entry to track the overall progress and details of investigation, hardware specifics, temporary workarounds, and ultimate resolution of any interrupt loop(s). That way the errata text doesn’t necessarily need to balloon or be wordsmithed to death for folks that are simply looking for more technical details. And I definitely don’t want to lose track of the ehci0 piece until it has been confirmed to be of no performance consequence (there isn’t some interrupt still kicking off), which again, investigation notes in a Redmine entry could capture.
-
The initial bug report was on our internal redmine (NG 8916), which is why it was replicated in that todo. I agree though that the LED statement needs to be updated.
-
@stephenw10 said in Netgate RCC-VE 8860 and 23.01 Hardware Errata:
The initial bug report was on our internal redmine (NG 8916), which is why it was replicated in that todo. I agree though that the LED statement needs to be updated.
The Todo I pointed to, which is now closed, was created by Offstage Roller on 2/23/2023 merely as a request to clarify/clean up the already published errata text, not as a replication of a full bug entry for ongoing status, details, etc.