pfsense crash 2.8.0

cayossarian

@stephenw10 I was using the memory-based file option for /var, set at 1 GB, but had never seen it get anywhere near that; usually it sits at 1% at most (right now at 55 MB). But then the crash dump ended up in there, so a page fault occurred.

My suspicion is that the crash happened but then the crash report caused a page fault when the dump filled up /var. But who knows maybe something unusual filled /var up.

I guess /var isn't the best place for a crash dump, or if that's the only option then I'll have to remove the memory-based option. I did increase the size to 4 GB, but who knows how big the dump could be.

Thanks,

Bill

Gertjan

@cayossarian said in pfsense crash 2.8.0:

the crash report caused a page fault when the dump filled up /var

Crash dumps are not stored 'somewhere' in /var/ but, afaik, in the swap space (partition).

Obtaining Panic Information for Developers

It starts by saying they are stored in /var/crash/

and at the bottom you'll find: Install without Swap Space, which tells me something different. And actually, as you said, that is more logical: what happens when there is a file system issue? The system goes down with a trace.
Also: the small Netgate appliances don't even have 4 GB for their /var/ ...

Maybe (me guessing even more) /var/crash/ contains some sort of symlink, or just a filename or an indication that a crash dump exists in the swap?

@cayossarian said in pfsense crash 2.8.0:

But who knows maybe something unusual filled /var

Your mission, as an admin: go have a look. Which folder contains gigabyte-sized files?
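One way to do that look is sketched below: walk a directory tree and list the largest regular files. This is only an illustration; `largest_files` is a hypothetical helper name, and it assumes a Python 3 interpreter is available on the box, which may not be the case on every install.

```python
import os
import stat


def largest_files(root, top=10):
    """Return up to `top` (size_bytes, path) tuples, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat() so symlinks are reported as links, not followed
                info = os.lstat(path)
            except OSError:
                # file vanished or is unreadable; skip it
                continue
            if stat.S_ISREG(info.st_mode):
                sizes.append((info.st_size, path))
    return sorted(sizes, reverse=True)[:top]
```

Calling `largest_files("/var")` and printing the result should show quickly whether the space is going to /var/crash, /var/log, or somewhere else.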

stephenw10

You should still see the backtrace at the console if it panics even without SWAP to store it.

cayossarian

This post is deleted!

stephenw10

Backtrace:

db:1:pfs> bt
Tracing pid 11 tid 100003 td 0xfffff8026f5fd740
kdb_enter() at kdb_enter+0x33/frame 0xfffffe008e21eb20
panic() at panic+0x43/frame 0xfffffe008e21eb80
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe008e21ebe0
trap_pfault() at trap_pfault+0x46/frame 0xfffffe008e21ec30
calltrap() at calltrap+0x8/frame 0xfffffe008e21ec30
--- trap 0xc, rip = 0xffffffff80d15b8d, rsp = 0xfffffe008e21ed00, rbp = 0xfffffe008e21ed60 ---
callout_process() at callout_process+0x1ad/frame 0xfffffe008e21ed60
handleevents() at handleevents+0x186/frame 0xfffffe008e21eda0
cpu_activeclock() at cpu_activeclock+0x6a/frame 0xfffffe008e21edd0
cpu_idle() at cpu_idle+0xa6/frame 0xfffffe008e21edf0
sched_idletd() at sched_idletd+0x546/frame 0xfffffe008e21eef0
fork_exit() at fork_exit+0x7b/frame 0xfffffe008e21ef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe008e21ef30
--- trap 0xf0db229f, rip = 0x2a49b49e2199f62b, rsp = 0x996070dacc2370c0, rbp = 0x468b9de920125c59 ---

Unfortunately that's not very revealing. Doesn't really point to anything specific.

The message buffer has some entries I would investigate though.

<6>igc0: link state changed to DOWN
<6>igc0: link state changed to UP
<6>igc0: link state changed to DOWN
<6>igc0: link state changed to UP
<6>igc0: link state changed to DOWN
<6>igc0: link state changed to UP

What is igc0? Was the link intentionally being reconnected?

<6>arp: 192.168.65.70 moved from 00:14:2d:e2:70:18 to 2c:3b:70:e9:08:61 on igc1.65
<3>arp: 2c:3b:70:e9:08:61 attempts to modify permanent entry for 192.168.65.70 on igc1.65
<6>arp: 192.168.65.70 moved from 00:14:2d:e2:70:18 to 2c:3b:70:e9:08:61 on igc1.65

What are those devices, and are they something that should be sharing an IP address? Also, that permanent entry implies either it's a local NIC or you're using static ARP, which is almost always a bad idea.
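To spot this kind of conflict across a longer message buffer, something like the following could help. It is a rough sketch: the regular expression is inferred only from the three log lines quoted above, and `macs_per_ip` is a made-up helper name, not part of pfSense.

```python
import re
from collections import defaultdict

# Pattern inferred from the "arp: ... moved from ... to ... on ..." lines above
MOVED = re.compile(
    r"arp: (?P<ip>\S+) moved from (?P<old>\S+) to (?P<new>\S+) on (?P<ifname>\S+)"
)


def macs_per_ip(lines):
    """Map (ip, interface) to the sorted MACs seen claiming that IP."""
    seen = defaultdict(set)
    for line in lines:
        m = MOVED.search(line)
        if m:
            seen[(m["ip"], m["ifname"])].update((m["old"], m["new"]))
    # Only keep addresses that more than one MAC has claimed
    return {key: sorted(macs) for key, macs in seen.items() if len(macs) > 1}


sample = [
    "<6>arp: 192.168.65.70 moved from 00:14:2d:e2:70:18 to 2c:3b:70:e9:08:61 on igc1.65",
    "<3>arp: 2c:3b:70:e9:08:61 attempts to modify permanent entry for 192.168.65.70 on igc1.65",
    "<6>arp: 192.168.65.70 moved from 00:14:2d:e2:70:18 to 2c:3b:70:e9:08:61 on igc1.65",
]
conflicts = macs_per_ip(sample)
```

Any key left in `conflicts` is an IP that at least two MAC addresses have announced on the same interface.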

<7>sonewconn: pcb 0xfffff801c6f85000 (127.0.0.1:853 (proto 6)): Listen queue overflow: 193 already in queue awaiting acceptance (1 occurrences), euid 0, rgid 0, jail 0
<7>sonewconn: pcb 0xfffff801c6f85000 (127.0.0.1:853 (proto 6)): Listen queue overflow: 193 already in queue awaiting acceptance (6547 occurrences), euid 0, rgid 0, jail 0
<7>sonewconn: pcb 0xfffff801c6f85000 (127.0.0.1:853 (proto 6)): Listen queue overflow: 193 already in queue awaiting acceptance (1234 occurrences), euid 0, rgid 0, jail 0

It looks like Unbound is unable to answer queries over TLS fast enough and is exhausting the listen queue for some reason.
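To get a feel for how bad the overflow is, the occurrence counts can be totalled per listening socket. The sketch below assumes the line format is exactly as in the three entries above; `overflow_totals` is an illustrative name only.

```python
import re
from collections import Counter

# Pattern inferred from the sonewconn lines quoted above
OVERFLOW = re.compile(
    r"sonewconn: pcb \S+ \((?P<addr>\S+) \(proto \d+\)\): "
    r"Listen queue overflow: (?P<queued>\d+) already in queue awaiting acceptance "
    r"\((?P<count>\d+) occurrences\)"
)


def overflow_totals(lines):
    """Sum the reported overflow occurrences per listening address:port."""
    totals = Counter()
    for line in lines:
        m = OVERFLOW.search(line)
        if m:
            totals[m["addr"]] += int(m["count"])
    return totals


sample = [
    "<7>sonewconn: pcb 0xfffff801c6f85000 (127.0.0.1:853 (proto 6)): Listen queue overflow: 193 already in queue awaiting acceptance (1 occurrences), euid 0, rgid 0, jail 0",
    "<7>sonewconn: pcb 0xfffff801c6f85000 (127.0.0.1:853 (proto 6)): Listen queue overflow: 193 already in queue awaiting acceptance (6547 occurrences), euid 0, rgid 0, jail 0",
    "<7>sonewconn: pcb 0xfffff801c6f85000 (127.0.0.1:853 (proto 6)): Listen queue overflow: 193 already in queue awaiting acceptance (1234 occurrences), euid 0, rgid 0, jail 0",
]
totals = overflow_totals(sample)
```

Here every dropped connection is on 127.0.0.1:853, the local DNS over TLS listener, so the pressure is coming from something on the firewall itself talking to Unbound.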

cayossarian

This post is deleted!

stephenw10

Hmm, so 192.168.65.70 is the pfSense interface in that VLAN? And 2c:3b:70:e9:08:61 should not be using it?

None of that should ever cause a panic, but you should address it, at least to clean up the logs so other, more important events aren't hidden.

cayossarian

This post is deleted!

stephenw10

Are both interfaces actually connected? Both on the same subnet? That's often asking for trouble. I would try to use only one interface there.

cayossarian

@stephenw10 I don’t have control of the panel, but thanks for asking, as I can open a support ticket with SPAN.