New 2.7.2CE Install on AMD/Realtek Hardware Intermittent kernel panic on boot but not while running/booted up.
-
This has happened only once so far, and only on boot.
I have had the box routing traffic without issue for a few days.
Frustrating that I cannot get it to reproduce yet.
The hardware and RAM were tested thoroughly before installing pfSense on it.
I am using an Intel dual NIC and not even using the Realtek Interface.
This is an AMD CPU (Ryzen 5 4600G) and "high quality" ASUS B450 motherboard, 16GB of DDR4 RAM
And 500GB NvMe SSD.
Bios settings are set for stability in general, no overclock and stock RAM speeds etc.
Boot is GPT/UEFI.
Before that, the system was checked for full stability with the auto-overclock max performance settings applied and RAM/D.O.C.P at full speed.
It performed perfectly through all stress tests for several days.
So I am confident in the hardware itself.
This only happened once, on bootup after a full system halt and poweroff.
Right now I have not been able to get the problem to reproduce.
I also reassigned the LAN interface to the Realtek in an attempt to see if I could induce any problems
by passing 1Gbps traffic in and out of it with some full bandwidth speedtests.
And that is not causing any issues or failures.
While testing this, I would also like to ask: are Realtek driver patches still a requirement in 2.7.2, or are they now included?
Might have nothing to do with the Realtek interface but naturally I immediately suspect this first.
So far I've not been able to get it to fail again.
I will continue this post if I can get the problem to happen again.
Meanwhile, should the out-of-box Realtek drivers already be functional, or is a patch of any kind still required or recommended?
There are many older posts about this, but I have not found anything newer relating to 2.7.2 and the specific issues around Realtek.
Lines in DMESG related to the Realtek interface are:
re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xd000-0xd0ff mem 0xf6604000-0xf6604fff,0xf6600000-0xf6603fff at device 0.0 on pci7
re0: Using 1 MSI-X message
re0: Chip rev. 0x54000000
re0: MAC rev. 0x00100000
miibus0: <MII bus> on re0
rgephy0: <RTL8251/8153 1000BASE-T media interface> PHY 1 on miibus0
rgephy0: none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX, 100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow, 1000baseT-FDX-flow-master, auto, auto-flow
re0: Using defaults for TSO: 65518/35/2048
re0: Ethernet address: 10:7c:61:XX:XX:XX
re0: netmap queues/slots: TX 1/256, RX 1/256 -
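For anyone wanting to pull the same lines out of their own boot log, here is a minimal sketch. The function name `dev_lines` is made up for illustration; it only filters text on stdin by driver prefix and does not touch the system.

```shell
#!/bin/sh
# Hypothetical helper: keep only boot-log lines that start with a given
# driver name followed by a unit number, e.g. "re0:" or "hdac1:".
# Reads log text on stdin; $1 is the driver name without the unit number.
dev_lines() {
    grep -E "^$1[0-9]+:"
}

# Example use on a running box (assumes dmesg is available):
#   dmesg | dev_lines re
```

Note that the PHY and MII bus lines (`rgephy0`, `miibus0`) use their own driver names, so they need separate calls such as `dmesg | dev_lines rgephy`.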
Seems to be doing this in about 1 of every 10 or so boots (coldstart).
I've not yet figured out whether this is logged to disk.
So I am posting a photo. -
@N8LBV Meanwhile I am searching for clues.
It works flawlessly otherwise.
Boots most of the time and never has any problems if it boots. -
@N8LBV memory itself and CPU test fine outside of PFSense.
-
@N8LBV This is pretty much a basic default install, I changed very little.
It did this after the 2nd or 3rd reboot initially.
Most of the time it boots fine.
And happens fairly early in the boot process.
Single Nvme 500GB SSD ZFS. GPT/UEFI. -
I had an appliance with a damaged Intel NIC port that would fail to boot 1 out of 10 times. I disabled the defective NIC
- pkg install nano
- nano /boot/loader.conf.local
- hint.igb.3.disabled=1
If you suspect it has something to do with the Realtek NIC, you could do something similar
- disable NIC in BIOS
- add hint.re.0.disabled=1 to /boot/loader.conf.local
- or add it to a System -> Advanced -> System Tunable
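As a rough sketch of the hint format used above, the helper below turns a device name as it appears in dmesg (for example `igb3` or `re0`) into the matching hint line. The function name `hint_for` is hypothetical, and the sketch assumes the usual `hint.<driver>.<unit>.disabled=1` pattern shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: build a loader hint that disables one device.
# $1 is a device name such as "re0" or "igb3" (driver name + unit number).
hint_for() {
    dev="$1"
    drv=$(printf '%s\n' "$dev" | sed 's/[0-9]*$//')    # strip trailing unit: "igb3" -> "igb"
    unit=$(printf '%s\n' "$dev" | sed 's/^[^0-9]*//')  # strip leading name:   "igb3" -> "3"
    printf 'hint.%s.%s.disabled=1\n' "$drv" "$unit"
}

hint_for igb3
hint_for re0
```

To apply a hint, the output could be appended to the file, e.g. `hint_for re0 >> /boot/loader.conf.local`, followed by a reboot. Check the device name in dmesg first, since unit numbers depend on the hardware.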
-
@elvisimprsntr It does not appear to be the NIC, but I could pull the Intel server NIC out
and disable the onboard NIC.
All 3 NICs are working perfectly when the system is booted up.
And they worked solidly for a week in Windows with heavy CPU and network loads.
I also have run memtest for hours and OCCT in Windows for hours on various memory and CPU tests.
I'm very confident the hardware is not failing in any way.
It appears to crash right after the hard disk driver changes hands from UEFI to the kernel.
The last thing shown is the "hdac1: <AMD Raven HDA Controller>" line, then it fails. -
@N8LBV said in New 2.7.2CE Install on AMD/Realtek Hardware Intermittent kernel panic on boot but not while running/booted up.:
AMD Raven HDA Controller
At the moment I'm chasing this lead: LINK
-
Are you sure that's a drive controller and not an HD audio device you can just disable?
-
@N8LBV It looks like this was fixed in a FreeBSD 14 pre-release,
but somehow not fixed in pfSense 2.7.2.
Per the link above.
Disabling onboard sound chip in the BIOS may have fixed this for me.
Being tested now. -
@stephenw10 I was not sure of anything at the time I posted the image.
But yes, it is the audio controller, which I now have disabled for further testing, and I expect this is a fix.
It looks like this issue was fixed in a FreeBSD 14 pre-release in 2023.
So I'm not understanding why what appears to be the same issue is back.
Or have we just discovered a new variant or very similar issue that has not been patched? -
It looks like it wasn't committed until after 2.7.2 was branched:
https://cgit.freebsd.org/src/commit/?id=901d81c3e0f43cb0e4e10bb42ab9f0a71cfcda0a
It's in devel now: https://github.com/pfsense/FreeBSD-src/commit/015daf5221f7588b9258fe0242cee09bde39fe21
So it will be in the next release. It's in Plus 24.03.
But you should disable any audio devices in a pfSense install anyway. They can only cause problems!
-
@stephenw10 Yeah agreed and this is a good fix for now.
At least I fully know what is going on now as well.
It's a bit sloppy for me to go and leave an audio interface enabled.
But I figured it didn't matter, as pfSense shouldn't load any drivers for the audio interface as far as I know.
This problem could come back if the BIOS is ever reset or defaulted for any reason, but I at least will know what to do in that rare event.
ASUS consumer motherboards like to mildly overclock some settings by default.
This occasionally results in a failed boot and a "hit F1" to load defaults message, which would re-enable the audio.
I'm not worried about it, and it will obviously get fixed in a future update.
I already knew what I was getting into building a box on a consumer AMD light gaming motherboard LOL.
Incidentally the Realtek NIC is doing great!
A little slower than the dual Intel server NIC (as expected) that is also installed.
And the Realtek is only going to be used for very rare management of the web interface and SSH.
I did put it through a series of long multi-hour high-bandwidth tests to see if I could make it fail,
and it did just fine before I took it out of the routing mix.
The plan was to disable it and install another NIC if it gave any problems.
The hit was minimal and could have been far worse!
Thanks!!