6100 Boot Loop w/ Traffic Shaper on PPPoE WAN
-
My 6100 has unfortunately discovered a way to lodge itself in a bootloop after trying to enable a PRIQ traffic shaper on my WAN interface.
It occurs on both the ix0 and ix1 interfaces (at least) and only when they're being utilized for my PPPoE WAN connection (vlan 201, if it matters).
I unfortunately did not grab the massively long stack trace that was dumped in the console, and am not overly eager to go through that again. The system does freeze up for a bit when enabling the traffic shaper, or plugging in the WAN connection while traffic shaping is enabled, then crashes into the bootloop shortly after. System functionality can be restored by disconnecting the WAN interface or by removing the shaper.
Is this something that has already been reported?
-
Do you have any of the output from the loop? Or a crash report after it rebooted?
Was it actually crashing and rebooting?
It's not an issue I'm aware of specifically.
Steve
-
@stephenw10 Thanks for the prompt response. I did not save the output while connected to the console because I was in full panic trying to get things fixed. I only nailed down the symptoms after a factory reset, when I once again tried enabling the shaper and kicked it back into a loop.
Yes, the system was crashing and rebooting continuously. I observed a "Fatal trap xx(unsure): page fault while in kernel mode" along with a few hundred lines of a dump or whatever follows. While connected to the console port, I observed this happening repeatedly. The router never initialized to the point where I could access either the console menu or the GUI unless I physically disconnected the WAN interface, in which case it would boot just fine the next time around.
I can trigger it again and grab the console output, but it might be a little bit until I can afford to take my net down to do so.
edit - /var/crash/ was unfortunately empty as well.
-
Does your PPPoE WAN have IPv6?
-
@stephenw10 I do not have it enabled. Century Link is my ISP and I believe they offer it, though? Unsure.
-
There are some issues with v6+PPPoE that might have come into play here, but that seems unlikely.
Just trying to replicate it and looking for anything unusual you might have set.
-
@stephenw10 the only other thing I can think of are a handful of packages I've got installed. I might be missing something but I didn't configure a whole lot after performing the factory reset. Packages were retained obviously, so if they have the potential to cause a crash, there could be an issue there. I can do some uninformed testing when I'm feeling grouchy enough to inconvenience my users/friends. I'll make sure to connect to the console and grab some of the output this time around.
-
Yeah if you can grab the console output when it boot loops that would confirm it. I'll see if I can replicate it here.
-
@stephenw10 Here is the output from the crash: crashlog.txt
-
Ok, great. And it's the same backtrace every time?
The odd thing there is that it doesn't appear to be in the traffic shaper.
-
@stephenw10 I can't say for certain, I figured a crash log would be a crash log, so I didn't really try to give it more than one go.
pfSense became unresponsive for quite some time after enabling a few shaper queues, so I had to pull the plug to reboot it, let it boot without WAN plugged in, then plug in WAN, at which point I received that dump. So I suppose it could be related to me bringing the OS to an abrupt halt, but that leaves me still stuck on my traffic shaper woes. Afterwards, I rebooted, factory reset, then restored my backup that doesn't utilize the traffic shaper. And just in case there are any known issues I might not have known about, I'm utilizing current versions of the following packages:
- Netgate_Firmware_Upgrade
- pfBlockerNG-devel
- Service_Watchdog
- WireGuard
-
Do you have any of the console output while it was looping? It would be good to see where it panics in the boot process and if it's the same panic as that in the crash report.
-
@stephenw10 DM'd more crash logs that were triggered by adding new queues. Unfortunately, it doesn't seem any of the changes are being committed to memory this time as things return to the most recent setting and boot normally after the crash.
-
Do you have any further details of the queues you enabled and how they were configured?
Simply enabling the shaper with a few PRIQ queues on a PPPoE WAN is not triggering it here.
-
@stephenw10 I enabled three queues each on the WAN and LAN interfaces (last forced crash happened specifically when adding the LAN ones, funny enough). Priorities 3, 7 and 13 on each side I think.
All had Codel Active Queue enabled, with one default queue on each interface. The queue limit was 50 most of the time, I believe; I also tried setting it to 1000 when things originally crashed. Bandwidth was set to 940 Mbps on either interface.
-
Hmm, do you have the actual config queues section that was generated?
-
@stephenw10 Not on hand. The last crash reverted the save, so I don't have the full thing. I can try to grab something again in a few days here.
-
OK, great. I haven't been able to replicate it here yet.
-
@stephenw10 Sorry for the delayed reply, I wish I had an easier way to test this without inconveniencing some people. Alright, I sent you a log file and a config file. I created the shaper config, saved a copy of it, then applied it. The router stalled for 15 minutes, at which point I disconnected the power and plugged it back in. The boot stalled at "Bootstrapping clock" for more than a handful of minutes, then I sent an Enter keystroke over PuTTY's console connection and the attached crash began. After collecting the evidence I unplugged the SFP+ connection, rebooted the router again, let the console fully come up, removed the traffic shaper via the PHP shell, and plugged the SFP+ connection back in. Everything's back to normal.
I grabbed the full router config this time before forcing the crash, so please let me know if you need anything else.
-
Ok that looks like something we should be able to work with:
Bootstrapping clock... codel_should_drop: could not found the packet mtag!
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 04
fault virtual address = 0x5010410
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80cd789d
stack pointer = 0x0:0xfffffe00c4c04ae0
frame pointer = 0x0:0xfffffe00c4c04b60
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
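
For anyone who ends up with several of these captures and wants a quick answer to "is it the same panic every time?", here is a rough Python sketch that pulls the fields out of the trap output above and compares two dumps. It is not part of pfSense; the names `parse_trap` and `same_fault` and the choice of which fields to compare are assumptions for illustration. Matching trap number, fault code and instruction pointer is only a first check and does not replace comparing the full backtraces.

```python
import re

# Field labels exactly as they appear in the trap output quoted above.
FIELDS = [
    "cpuid", "apic id", "fault virtual address", "fault code",
    "instruction pointer", "stack pointer", "frame pointer", "code segment",
]

def parse_trap(text):
    """Return a dict of fields from a 'Fatal trap' console dump.

    Whitespace is collapsed first, so it does not matter whether the
    capture has one field per line or was flattened into a single line.
    """
    flat = " ".join(text.split())
    result = {}
    head = re.search(r"Fatal trap (\d+): (.+?) cpuid", flat)
    if head:
        result["trap"] = int(head.group(1))
        result["reason"] = head.group(2)
    # A value runs until the next known label (or the end of the text).
    labels = "|".join(re.escape(name) for name in FIELDS)
    pattern = re.compile(rf"({labels}) = (.*?)(?=;? (?:{labels}) = |$)")
    for key, value in pattern.findall(flat):
        result[key] = value.strip()
    return result

def same_fault(dump_a, dump_b):
    """Rough check: same trap number, fault code and instruction pointer."""
    a, b = parse_trap(dump_a), parse_trap(dump_b)
    keys = ("trap", "fault code", "instruction pointer")
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)

# The dump from this thread, used as sample input.
SAMPLE = (
    "Bootstrapping clock... codel_should_drop: could not found the packet mtag! "
    "Fatal trap 12: page fault while in kernel mode "
    "cpuid = 0; apic id = 04 "
    "fault virtual address = 0x5010410 "
    "fault code = supervisor read data, page not present "
    "instruction pointer = 0x20:0xffffffff80cd789d "
    "stack pointer = 0x0:0xfffffe00c4c04ae0 "
    "frame pointer = 0x0:0xfffffe00c4c04b60 "
    "code segment = base 0x0, limit 0xfffff, type 0x1b "
    "= DPL 0, pres 1, long 1, def32 0, gran 1"
)

if __name__ == "__main__":
    info = parse_trap(SAMPLE)
    for key in ("trap", "reason", "fault virtual address", "instruction pointer"):
        print(f"{key}: {info[key]}")
```

To compare two saved console captures, read each file into a string and pass both to `same_fault`.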