Found a bugfix, how do I get it added to the wiki?
-
It was on the wiki page as a crutch to help with mbuf issues. Those have been fixed. It wasn't really a fix, just a workaround. The instances linked in the book as being related to queues have been fixed as well.
If you still have a crash with igb that is helped by reducing the queues, it's better to find out why than to just reduce the queue count. Dig deeper in the crash dumps/back trace and see where the crash is happening. It may not even really be from igb, but reducing the queue counts may hide the problem.
We could add that back into the wiki but without any specific guidance as to when/why someone might want to try it, I'm hesitant to do so. The fact that it panics isn't enough, we need to know more about what is actually causing that panic.
-
Great response - thanks Jimp
-
It was on the wiki page as a crutch to help with mbuf issues. Those have been fixed. It wasn't really a fix, just a workaround. The instances linked in the book as being related to queues have been fixed as well.
Maybe the bug has reappeared, or the fixes aren't working any more? A regression?
If you still have a crash with igb that is helped by reducing the queues, it's better to find out why than to just reduce the queue count. Dig deeper in the crash dumps/back trace and see where the crash is happening. It may not even really be from igb, but reducing the queue counts may hide the problem.
We could add that back into the wiki but without any specific guidance as to when/why someone might want to try it, I'm hesitant to do so. The fact that it panics isn't enough, we need to know more about what is actually causing that panic.
I don't really have the knowledge, time, or inclination to dig deeper, I just want the box to work. It's a cheap Chinese Qotom machine which came with pfSense preinstalled (yes, I've wiped it), so it could easily be something to do with their implementation.
It was rebooting every day. I tried some stuff, looked on the wiki, didn't find anything useful, tried a bunch more stuff, then found an old cached version of the wiki that included this information. Bang, no more reboots.
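For anyone else searching for this: the tweak being discussed amounts to limiting the NIC driver to a single queue via a loader tunable. This is a sketch of what that looks like, not the exact text from the old wiki page; `hw.igb.num_queues` is the tunable for FreeBSD's legacy igb(4) driver, and on pfSense custom tunables conventionally go in /boot/loader.conf.local:

```shell
# /boot/loader.conf.local  (pfSense convention for custom loader tunables)
# Limit the igb(4) driver to one RX/TX queue per interface.
# Workaround only -- it hides whatever is actually panicking, as noted above.
hw.igb.num_queues=1
```

A reboot is needed for loader tunables to take effect.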
I appreciate you both taking the time to respond, but 'we don't know why this fixes it' feels like a poor reason not to include it on a troubleshooting page.
-
The bug didn't reappear, but it's possible your hardware has a different bug or problem.
Putting "try this, it might fix it but we don't know why" on the wiki is definitely a bad thing. We need to know why the hardware is crashing with more than one queue.
In most cases it's as simple as posting the full crash report that shows up after the panic and reboot. The backtrace will likely have better information, and the message buffer may have some clues as well.
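If the GUI copy of the report is gone, the raw dump may still be on disk. A sketch of where to look, assuming a stock pfSense/FreeBSD crash-dump setup (paths and file names are the usual savecore(8) conventions, not confirmed for this box):

```shell
# Panic textdumps are written to /var/crash by savecore(8) after the reboot.
ls /var/crash

# Each textdump tarball contains the ddb output, including the backtrace;
# bsdtar's -O extracts a member to stdout.
tar -xOf /var/crash/textdump.tar.0 ddb.txt | less
```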
-
"Putting "try this, it might fix it but we don't know why""
hehehehee - haahahahah… Why is that, Jim?? ROFL.... Best remark I have seen all day, anywhere!!
-
The bug didn't reappear, but it's possible your hardware has a different bug or problem.
Putting "try this, it might fix it but we don't know why" on the wiki is definitely a bad thing. We need to know why the hardware is crashing with more than one queue.
In most cases it's as simple as posting the full crash report that shows up after the panic and reboot. The backtrace will likely have better information, and the message buffer may have some clues as well.
Well, I sent the full crash logs via the GUI, so they're available somewhere. Sadly they seem to have been deleted automatically from my box.
What I'm failing to understand is the difference between this and all the other tweaks on the troubleshooting page. Are they all better understood than this one?
-
Each of the crashes had an identical backtrace:
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100057 td 0xfffff80003d15560
rn_match() at rn_match+0x11d/frame 0xfffffe00f0fd2650
fib4_lookup_nh_basic() at fib4_lookup_nh_basic+0x84/frame 0xfffffe00f0fd26b0
ip_findroute() at ip_findroute+0x31/frame 0xfffffe00f0fd26e0
ip_tryforward() at ip_tryforward+0x1f7/frame 0xfffffe00f0fd2750
ip_input() at ip_input+0x3c5/frame 0xfffffe00f0fd27b0
netisr_dispatch_src() at netisr_dispatch_src+0xa0/frame 0xfffffe00f0fd2800
ether_demux() at ether_demux+0x16d/frame 0xfffffe00f0fd2830
ether_nh_input() at ether_nh_input+0x310/frame 0xfffffe00f0fd2890
netisr_dispatch_src() at netisr_dispatch_src+0xa0/frame 0xfffffe00f0fd28e0
ether_input() at ether_input+0x26/frame 0xfffffe00f0fd2900
igb_rxeof() at igb_rxeof+0x6f4/frame 0xfffffe00f0fd2990
igb_msix_que() at igb_msix_que+0x109/frame 0xfffffe00f0fd29e0
intr_event_execute_handlers() at intr_event_execute_handlers+0xec/frame 0xfffffe00f0fd2a20
ithread_loop() at ithread_loop+0xd6/frame 0xfffffe00f0fd2a70
fork_exit() at fork_exit+0x85/frame 0xfffffe00f0fd2ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00f0fd2ab0
That code path is fairly deep in routing and packet processing, but not in an area where we typically see issues. It doesn't look like mbuf exhaustion, there's no trace of ALTQ, and it doesn't match any of the previous queue-related panics we have seen.
It's possible there is some new FreeBSD issue that only affects the specific combination of hardware you have, or it could be that the hardware just can't handle the load of multiple queues. Given what you said the hardware is, I am more inclined to blame the hardware.
-
Each of the crashes had an identical backtrace:
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100057 td 0xfffff80003d15560
rn_match() at rn_match+0x11d/frame 0xfffffe00f0fd2650
fib4_lookup_nh_basic() at fib4_lookup_nh_basic+0x84/frame 0xfffffe00f0fd26b0
ip_findroute() at ip_findroute+0x31/frame 0xfffffe00f0fd26e0
ip_tryforward() at ip_tryforward+0x1f7/frame 0xfffffe00f0fd2750
ip_input() at ip_input+0x3c5/frame 0xfffffe00f0fd27b0
netisr_dispatch_src() at netisr_dispatch_src+0xa0/frame 0xfffffe00f0fd2800
ether_demux() at ether_demux+0x16d/frame 0xfffffe00f0fd2830
ether_nh_input() at ether_nh_input+0x310/frame 0xfffffe00f0fd2890
netisr_dispatch_src() at netisr_dispatch_src+0xa0/frame 0xfffffe00f0fd28e0
ether_input() at ether_input+0x26/frame 0xfffffe00f0fd2900
igb_rxeof() at igb_rxeof+0x6f4/frame 0xfffffe00f0fd2990
igb_msix_que() at igb_msix_que+0x109/frame 0xfffffe00f0fd29e0
intr_event_execute_handlers() at intr_event_execute_handlers+0xec/frame 0xfffffe00f0fd2a20
ithread_loop() at ithread_loop+0xd6/frame 0xfffffe00f0fd2a70
fork_exit() at fork_exit+0x85/frame 0xfffffe00f0fd2ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00f0fd2ab0
That code path is fairly deep in routing and packet processing, but not in an area where we typically see issues. It doesn't look like mbuf exhaustion, there's no trace of ALTQ, and it doesn't match any of the previous queue-related panics we have seen.
It's possible there is some new FreeBSD issue that only affects the specific combination of hardware you have, or it could be that the hardware just can't handle the load of multiple queues. Given what you said the hardware is, I am more inclined to blame the hardware.
Fair enough. So if we accept that it's a hardware issue and there's a workaround that mitigates it, does it qualify for the wiki? I'd just like other people having the same problem to be able to find the workaround. I wasted more than a few hours searching for it.
-
I added a note about it a few hours ago, but I'm still not terribly happy about it being there. It's a kludge, and the actual problem underneath it needs to be addressed. If it's specific to your hardware, it's a FreeBSD issue rather than a pfSense one, and you'll need to work with them on it. It's also possible it's just the nature of that hardware and can't be fixed.
-
Thanks, I appreciate it. If I get time to dig into it further, I'll do so.