Wan periodic reset causes system reboot.

stephenw10

Nice. Also interesting.

Ok so set the line in /etc/pfSense-ddb.conf. I used:

script kdb.enter.default=capture on; bt; show registers; show pcpu; capture off; dump; reset

Then reboot to apply that.

I then tested it by running sysctl debug.kdb.panic=1 which immediately panics the box and runs the script. At the console you see:

panic: kdb_sysctl_panic
cpuid = 3
time = 1697460855
KDB: enter: panic
[ thread pid 1455 tid 100508 ]
Stopped at      kdb_enter+0x32: movq    $0,0x2344f43(%rip)
db:0:kdb.enter.default> capture on
db:0:kdb.enter.default>  bt
Tracing pid 1455 tid 100508 td 0xfffffe00b7ceaac0
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00b13afa10
vpanic() at vpanic+0x163/frame 0xfffffe00b13afb40
panic() at panic+0x43/frame 0xfffffe00b13afba0
kdb_sysctl_panic() at kdb_sysctl_panic+0x61/frame 0xfffffe00b13afbd0
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x90/frame 0xfffffe00b13afc20
sysctl_root() at sysctl_root+0x216/frame 0xfffffe00b13afca0
userland_sysctl() at userland_sysctl+0x176/frame 0xfffffe00b13afd50
sys___sysctl() at sys___sysctl+0x5c/frame 0xfffffe00b13afe00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00b13aff30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00b13aff30
--- syscall (202, FreeBSD ELF64, __sysctl), rip = 0xb03aaf1e18a, rsp = 0xb03a86d9c88, rbp = 0xb03a86d9cc0 ---
db:0:kdb.enter.default>  show registers
cs                        0x20
ds                        0x3b
es                        0x3b
fs                        0x13
gs                        0x1b
ss                        0x28
rax                       0x12
rcx         0xffffffff814589e2
rdx                      0x3f8
rbx                      0x100
rsp         0xfffffe00b13afa10
rbp         0xfffffe00b13afa10
rsi                 0xc3b4cdc4
rdi                        0x4
r8                0x7ac3b4cdc4
r9          0xfffffe00b7ceaac0
r10         0xfffffe00b13af8f0
r11         0xcedfc2df9afff59c
r12                          0
r13                          0
r14         0xffffffff814b6685
r15         0xfffffe00b7ceaac0
rip         0xffffffff80d388c2  kdb_enter+0x32
rflags                    0x86
kdb_enter+0x32: movq    $0,0x2344f43(%rip)
db:0:kdb.enter.default>  show pcpu
cpuid        = 3
dynamic pcpu = 0xfffffe008efd7f00
curthread    = 0xfffffe00b7ceaac0: pid 1455 tid 100508 critnest 1 "sysctl"
curpcb       = 0xfffffe00b7ceafe0
fpcurthread  = 0xfffffe00b7ceaac0: pid 1455 "sysctl"
idlethread   = 0xfffffe0011fbde40: tid 100006 "idle: cpu3"
self         = 0xffffffff84013000
curpmap      = 0xfffff8012468fd38
tssp         = 0xffffffff84013384
rsp0         = 0xfffffe00b13b0000
kcr3         = 0xffffffffffffffff
ucr3         = 0xffffffffffffffff
scr3         = 0x0
gs32p        = 0xffffffff84013404
ldt          = 0xffffffff84013444
tss          = 0xffffffff84013434
curvnet      = 0xfffff80001203980
db:0:kdb.enter.default>  capture off
db:0:kdb.enter.default>  dump
Dumping 702 out of 8050 MB:..3%..12%..21%..32%..42%..51%..62%..71%..83%..92%
Dump complete
db:0:kdb.enter.default>  reset
Uptime: 3m30s

After rebooting you should see the crash report in the gui with the vmcore offered to download.

If that's all working then delete that core and try to panic it by removing the interface again. Hopefully the core is not bigger than 1G if you can trigger it soon enough after boot.
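As a rough sanity check on the sizing point above, the comparison can be sketched in plain sh. This is only an illustration: the function name dump_fits is made up here, and both numbers are typed in by hand (for example the 702 MB from the "Dumping 702 out of 8050 MB" line and a 1G swap), nothing is read from the system.

```shell
# Sketch only: compare a dump size against a swap size, both in MB.
# dump_fits is a hypothetical helper name; the caller supplies both values.
dump_fits() {
    swap_mb=$1
    dump_mb=$2
    if [ "$dump_mb" -lt "$swap_mb" ]; then
        echo "ok: ${dump_mb} MB dump fits in ${swap_mb} MB swap"
    else
        echo "too small: ${dump_mb} MB dump needs more than ${swap_mb} MB swap"
        return 1
    fi
}
```

For example, dump_fits 1024 702 reports that the test dump shown above would fit in a 1G swap. The dump grows with uptime and memory use, so a figure taken soon after boot is a lower bound.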

Steve

RobbieTT

@stephenw10
Thanks Steve - as the issue is intermittent for me I probably need more swap.

Can I just boot from the USB installer and manually tweak the existing partitions using gpart delete / resize and whatever ZFS uses for regrow?

(It's been a long time since I have used partition commands but probably not much has changed over a couple of decades... other than my memory...)

Hmm, it may be easier to get a new install USB, but does it offer an option to set the swap partition size during the install (i.e. I don't remember one)?


stephenw10

Yes, you can just set the size during the install:

Screenshot from 2023-10-16 14-33-55.png

RobbieTT

@stephenw10
Ok, even my phat fingers can cope with that.

Now all I need is some WAN time to myself.


RobbieTT

I've racked up the Supermicro and it has taken over pfSense duties, leaving the Netgate 6100 free for testing. What could possibly go wrong?

IMG_2387 copy.jpeg


stephenw10

So bluuuuuue!

RobbieTT

@stephenw10

It's a Monday night, rack mood lights to blue.


RobbieTT

@stephenw10

Reinstalled everything on the 6100 and presuming you guys are running more 23.09d than anything else, I pushed it on to the latest dev load. I'll run 23.05.1 on the other device for now, so much swapping around today. Probably missed something along the way.

Anyway, partitioned for a 4 GB Swap - hopefully that will be spacious enough for you:

2023-10-20 at 17.14.08.png

[23.09-BETA]/root: gpart show
=>       40  115189680  nda0  GPT  (55G)
         40     532480     1  efi  (260M)
     532520       1024     2  freebsd-boot  (512K)
     533544        984        - free -  (492K)
     534528    8388608     3  freebsd-swap  (4.0G)
    8923136  106264576     4  freebsd-zfs  (51G)
  115187712       2008        - free -  (1.0M)

[23.09-BETA]/root:
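For anyone wanting to double-check the sizes in that gpart output, the sector counts in the third column can be converted by hand. A minimal sketch, assuming 512-byte sectors (gpart show reports in sectors; the sector size is an assumption here and can be passed as a second argument). The name sectors_to_mib is made up for this example.

```shell
# Sketch only: convert a gpart sector count to MiB.
# Assumes 512-byte sectors unless a second argument says otherwise.
sectors_to_mib() {
    sectors=$1
    sector_bytes=${2:-512}
    echo $(( sectors * sector_bytes / 1024 / 1024 ))
}
```

With the swap partition above, sectors_to_mib 8388608 gives 4096, which matches the (4.0G) that gpart prints.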

I should get some quiet WAN time tomorrow to do interface testing and hopefully achieve a kernel dump. No doubt it will be more intermittent than usual, just to be difficult.

I'll remember to run your script too:

script kdb.enter.default=capture on; bt; show registers; show pcpu; capture off; dump; reset


stephenw10

Excellent, that looks good. Let's hope it reveals some useful data. Thanks.

stephenw10

You can try manually triggering a panic to make sure it catches a coredump. Run: sysctl debug.kdb.panic=1

RobbieTT

@stephenw10
Sorry Steve, this proved to be beyond me. I guess I will have to wait for the GUI button to be implemented, or for a genuinely idiot-proof step-by-step guide to be written, as this has eaten through way too many hours over too many days.

I think I hit the assumed-knowledge barrier too often: steps were given, only to be belatedly supplemented with instructions like 'using console mode', 'use kernel debug mode option 6', 'did you edit some .conf file', 'follow thread x' or 'install package x, but only by method y'.

So what did work:

got console working from macOS (mislabeled as GNU screen in pfSense docs)
got the swap partition size changed via console
fresh install
installed pfSense-kernel-debug-pfSense pkg from the GUI command line
ran kdb.enter.default=capture on; (etc) script from regular CLI
reboots (many)
kdb.enter.default=capture shown under /root
reboot into kernel debug mode via console (option 6 etc)
trigger panic via CLI using sysctl debug.kdb.panic=1
console scrolls through something that looks like a core dump...
crash report in /var/crash with info and text dump files
no core dump offered in the GUI
no core dump file found in /var/crash
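The last three points can be checked in one go from the shell. A small sketch, assuming the usual savecore file naming (vmcore.N for a full dump, textdump.tar.N for a textdump, info.N for the summary); the function name crash_dir_summary is made up for illustration.

```shell
# Sketch only: count the kinds of crash files in a directory.
# Assumes savecore-style names: vmcore.*, textdump.*, info.*
crash_dir_summary() {
    dir=${1:-/var/crash}
    for kind in vmcore textdump info; do
        count=$(find "$dir" -maxdepth 1 -name "${kind}*" | wc -l | tr -d ' ')
        echo "$kind: $count"
    done
}
```

Running crash_dir_summary with no argument looks at /var/crash. A result with textdump and info files but "vmcore: 0" would fit the symptoms listed above, where the box wrote a textdump and not a full core.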

Clearly I am typing with a little frustration (sorry about that) but perhaps you can spot something useful in the above.


stephenw10

I'm sorry. Yes it will be much better when there's a gui option.

You shouldn't need to add the debug kernel just to get the coredump.

The important steps are:

Make sure you have enough SWAP space (you do).
Edit /etc/pfSense-ddb.conf so it contains the different default line like:

# $FreeBSD$
#
#  This file is read when going to multi-user and its contents piped thru
#  ``ddb'' to define debugging scripts.
#
# see ``man 4 ddb'' and ``man 8 ddb'' for details.
#

script lockinfo=show locks; show alllocks; show lockedvnods
script pfs=bt ; show registers ; show pcpu ; run lockinfo ; acttrace ; ps ; alltrace

# kdb.enter.panic       panic(9) was called.
#script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset
script kdb.enter.default=capture on; bt; show registers; show pcpu; capture off; dump; reset

# kdb.enter.witness	witness(4) detected a locking error.
script kdb.enter.witness=run lockinfo

Reboot.
(Optionally) Run sysctl debug.kdb.panic=1 to test the setup. You should see it writing out the coredump to swap in the console after all the backtraces scroll past.
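Before rebooting, the edit can be checked with a few lines of sh. This is a sketch under assumptions: the name ddb_conf_check is made up, and it assumes the last uncommented kdb.enter.default line in the file is the one that takes effect. It only reads the file it is given.

```shell
# Sketch only: report what the active kdb.enter.default script would do.
# Assumes the last uncommented definition in the file wins.
ddb_conf_check() {
    conf=$1
    line=$(grep '^script kdb.enter.default=' "$conf" | tail -n 1)
    if [ -z "$line" ]; then
        echo "no active kdb.enter.default script"
        return 1
    fi
    case "$line" in
        *textdump*)
            # The stock line writes a textdump, not a full vmcore.
            echo "textdump configured"
            return 1
            ;;
        *"; dump;"*|*"; dump")
            echo "full dump configured"
            ;;
        *)
            echo "no dump command"
            return 1
            ;;
    esac
}
```

Usage would be ddb_conf_check /etc/pfSense-ddb.conf. With the file contents shown above it should print "full dump configured", because the textdump line is commented out and the active line ends in "dump; reset".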

Steve

RobbieTT

@stephenw10 said in Wan periodic reset causes system reboot.:

Edit /etc/pSense-ddb.conf so it contains the different default line like:

Hmmm, no such file found on this device. No idea why!


stephenw10

Oh sorry I typo'd that.

Should be /etc/pfSense-ddb.conf

RobbieTT

@stephenw10

Haha - should have spotted that.

[23.09-BETA]/root: cat /etc/pfSense-ddb.conf
# $FreeBSD$
#
#  This file is read when going to multi-user and its contents piped thru
#  ``ddb'' to define debugging scripts.
#
# see ``man 4 ddb'' and ``man 8 ddb'' for details.
#

script lockinfo=show locks; show alllocks; show lockedvnods
script pfs=bt ; show registers ; show pcpu ; run lockinfo ; acttrace ; ps ; alltrace

# kdb.enter.panic       panic(9) was called.
# script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset
script kdb.enter.default=capture on; bt; show registers; show pcpu; capture off; dump; reset

# kdb.enter.witness	witness(4) detected a locking error.
script kdb.enter.witness=run lockinfo
[23.09-BETA]/root:

Now, do I have a typo of my own?


stephenw10

Looks fine to me. Reboot to apply it and then try a test panic.

RobbieTT

@stephenw10

Wife watching Bake Off on catch-up; I would die a painful death.

I'll be brave when she is elsewhere.


RobbieTT

@stephenw10
I think I have it. Now, how do I get this massive vmcore and info file to you?

<118>Netgate pfSense Plus 23.09-BETA amd64 20231020-0600
<118>Bootup complete
<6>ng0: changing name to 'pppoe0'
pf_test6: kif == NULL, if_xname pppoe0
<6>ng0: changing name to 'pppoe0'


Fatal trap 12: page fault while in kernel mode

cpuid = 3; apic id = 18
fault virtual address	= 0x10
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80f4e116
stack pointer	        = 0x0:0xfffffe00850b6b60
frame pointer	        = 0x0:0xfffffe00850b6b90
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 2 (clock (3))
rdi: fffff80203712800 rsi: 000000000000001c rdx: fffff8013760d878
rcx: fffff8013760d878  r8: 00000000ffffffbd  r9: 0000000000000018
rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe00850b6b90
r10: fffff802033dd8c0 r11: fffff8016f5e5000 r12: 0000000000010300
r13: fffff80203676b98 r14: fffffe00850b6b68 r15: 0000000000000018
trap number		= 12
panic: page fault
cpuid = 3
time = 1697905286
KDB: enter: panic


stephenw10

You can upload it here: https://nc.netgate.com/nextcloud/index.php/s/ywzFPM3F8GZnRdb

Or I can download it from somewhere if that's easier, just send me a link in chat.

RobbieTT

@stephenw10 said in Wan periodic reset causes system reboot.:

https://nc.netgate.com/nextcloud/index.php/s/ywzFPM3F8GZnRdb

Uploaded to your link. Usual privacy request, or I'll come looking for you.

If you can acknowledge they arrived ok, that would be great.
