Wan periodic reset causes system reboot.
-
Anything on beta 23.09?
-
No, nothing substantive. Netgate did ask me to produce results with a modified kernel but the moving set of instructions left me in a bit of a hole with a router that should have been in production.
I am building-up a new server with pfSense+ so I can test with more freedom but my testing hours with a live WAN are limited. I could do with more help really.
It would really help if Netgate provided a complete version to test with, rather than being left to modify stuff myself in order to provide data for them. They expect quite a bit from a paying customer (although I acknowledge this would be different if they could replicate the issue on their own dev systems).
Anyway, I remain committed and will still invest the time needed where I can.
-
Yes, I'm sorry about that. Not being able to replicate it ourselves makes everything much more difficult. It's especially annoying here because I have essentially an identical setup to you. The only significant difference is the connection speed.
We are working to add a coredump implementation in the gui to make this process much easier.
In the meantime I'd be happy to work with you here if you're able to.
If you're able to, I would simply reinstall with a much larger SWAP size to avoid the need for an external drive with SWAP.
Also, one thing I didn't realise until I tested it was that changes to the pfSense-ddb.conf file are read in at boot, so the system needs to be rebooted normally to apply them before the panic is triggered.
Steve
-
Clearly I am happy to work with you Steve.
I just looked at the swap in the GUI dashboard, only to find there isn't one showing on my Supermicro system. Did I miss a step or something?
[23.05.1-RELEASE][admin@Router-7.redacted.me]/root: swapinfo -h
Device Size Used Avail Capacity

/root: top
last pid: 76190; load averages: 0.14, 0.12, 0.09 up 0+02:00:50 18:24:55
67 processes: 1 running, 66 sleeping
CPU: 0.2% user, 0.0% nice, 0.1% system, 0.1% interrupt, 99.6% idle
Mem: 139M Active, 391M Inact, 870M Wired, 56K Buf, 29G Free
ARC: 223M Total, 39M MFU, 177M MRU, 622K Anon, 1091K Header, 5669K Other
     164M Compressed, 419M Uncompressed, 2.56:1 Ratio

  PID USERNAME  THR PRI NICE  SIZE   RES STATE   C  TIME   WCPU COMMAND
34097 unbound    16  20    0  425M  316M kqread  3  0:51  3.03% unbound
36766 root        1  20    0   14M 3856K CPU6    6  0:00  0.21% top
I don't recall setting a swap on my Netgate 6100 or on this machine. The 6100 shows a swap of 1024 MiB - never seen it used though.
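A quick way to confirm the same thing from the shell is to count the device lines in the swapinfo output. This is only a sketch: `count_swap_devices` is a made-up helper name, and the second sample line is a hypothetical illustration of what a system with an active 1G swap partition might print, not output from either machine.

```shell
#!/bin/sh
# count_swap_devices: read `swapinfo` output on stdin and print how many
# active swap devices it lists (lines whose first field starts with /dev/).
# Hypothetical helper, not part of pfSense or FreeBSD.
count_swap_devices() {
    awk '$1 ~ /^\/dev\// { n++ } END { print n + 0 }'
}

# Sample 1: header only, as in the swapinfo output above (no active swap).
printf 'Device Size Used Avail Capacity\n' | count_swap_devices

# Sample 2: hypothetical output for a box with a 1G swap partition in use.
printf 'Device Size Used Avail Capacity\n/dev/nvd0p3 1.0G 0B 1.0G 0%%\n' | count_swap_devices
```

On the firewall itself this would be run as `swapinfo -h | count_swap_devices`; a result of 0 means no swap is active, whatever the partition table says.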
-
Mmm, yes, by default the Plus installer sets 1G for swap. The CE installer at one time used half the RAM size by default if there was sufficient drive space. I'm not sure why you have none.
A 6100 here dumps ~750MB from the kernel when configured to do so but that's after running only a short time. If you can trigger this quickly on the 6100 it would be worth trying since the worst case is that it just fails to hold the dump and reboots. If you're not using the 6100 you can reinstall it and add more SWAP, I would expect 2GB to be more than enough.
Steve
-
@stephenw10
Under gpart I can see 1.0G of freebsd-swap - is the value used by pfSense?
[23.05.1-RELEASE][admin@Router-7.redacted.me]/root: gpart show
=>        40  231270320  nvd0  GPT  (110G)
          40     532480     1  efi  (260M)
      532520       1024     2  freebsd-boot  (512K)
      533544        984        - free -  (492K)
      534528    2097152     3  freebsd-swap  (1.0G)
     2631680  228636672     4  freebsd-zfs  (109G)
   231268352       2008        - free -  (1.0M)

[23.05.1-RELEASE][admin@Router-7.redacted.me]/root:
It's still a bit odd for it to be missing from the GUI though.
I'm happy to use either the Netgate 6100 or the Supermicro (SYS-510D-8C-FN6P) for testing; so your choice with repeatability vs flexibility.
-
@RobbieTT said in Wan periodic reset causes system reboot.:
I can see 1.0G of freebsd-swap - is the value used by pfSense?
See /etc/fstab
[23.05.1-RELEASE][root@pfSense.bhf.net]/root: cat /etc/fstab
# Device           Mountpoint  FStype   Options  Dump  Pass#
/dev/gpt/efiboot0  /boot/efi   msdosfs  rw       0     0
/dev/nvd0p3        none        swap     sw       0     0
When I got my 4100 (with pfSense 22.05 at the time?), the swap line wasn't using "/dev/nvd0p3" but something else. The result: there was a swap partition, but pfSense wasn't using it.
"nvd0p3" is my swap partition. I had to change that by editing /etc/fstab.
AFAIK it was referencing some disk ID, not the partition device name.
If your system starts to use the swap, consider removing stuff; typically, when you use pfBlockerNG, use fewer large DNS files ;)
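The check described here (does the swap line in /etc/fstab point at a device that actually exists?) can be sketched as a small script. The function names `swap_entries` and `report_swap_entries` are invented for illustration, and the demo runs against a throwaway copy of the fstab shown above rather than the real file.

```shell
#!/bin/sh
# swap_entries: print the device field of every non-comment line whose
# FStype (third field) is "swap" in an fstab-style file.
# Hypothetical helper names, used only for this sketch.
swap_entries() {
    awk '$1 !~ /^#/ && $3 == "swap" { print $1 }' "$1"
}

# report_swap_entries: flag swap entries whose device node does not exist.
report_swap_entries() {
    for dev in $(swap_entries "$1"); do
        if [ -e "$dev" ]; then
            echo "ok: $dev"
        else
            echo "missing: $dev"
        fi
    done
}

# Demo against a temporary copy of the fstab quoted above.
demo=$(mktemp)
cat > "$demo" <<'EOF'
# Device Mountpoint FStype Options Dump Pass#
/dev/gpt/efiboot0 /boot/efi msdosfs rw 0 0
/dev/nvd0p3 none swap sw 0 0
EOF
swap_entries "$demo"
rm -f "$demo"
```

On the firewall, `report_swap_entries /etc/fstab` would print a "missing:" line when the entry names a device that is not present, which is the situation described in this post.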
-
@Gertjan
I think that is my question: given the gpart above, should /etc/fstab be changed from /dev/nvd1p3 to /dev/nvd0p3?
My presumption is that pfSense uses the freebsd-swap shown in gpart, but I am not certain, nor do I know why the install points to the wrong location.
I don't think I am close to needing swap due to lack of RAM though:
This swap discussion is for kernel dumps, not day-to-day use.
[edit:] Changing to nvd0p3 and rebooting did indeed bring the GUI swap graph back:
-
Nice. Also interesting.
Ok, so set the line in /etc/pfSense-ddb.conf. I used:
script kdb.enter.default=capture on; bt; show registers; show pcpu; capture off; dump; reset
Then reboot to apply that.
I then tested it by running
sysctl debug.kdb.panic=1
which immediately panics the box and runs the script. At the console you see:
panic: kdb_sysctl_panic
cpuid = 3
time = 1697460855
KDB: enter: panic
[ thread pid 1455 tid 100508 ]
Stopped at kdb_enter+0x32: movq $0,0x2344f43(%rip)
db:0:kdb.enter.default> capture on
db:0:kdb.enter.default> bt
Tracing pid 1455 tid 100508 td 0xfffffe00b7ceaac0
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00b13afa10
vpanic() at vpanic+0x163/frame 0xfffffe00b13afb40
panic() at panic+0x43/frame 0xfffffe00b13afba0
kdb_sysctl_panic() at kdb_sysctl_panic+0x61/frame 0xfffffe00b13afbd0
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x90/frame 0xfffffe00b13afc20
sysctl_root() at sysctl_root+0x216/frame 0xfffffe00b13afca0
userland_sysctl() at userland_sysctl+0x176/frame 0xfffffe00b13afd50
sys___sysctl() at sys___sysctl+0x5c/frame 0xfffffe00b13afe00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00b13aff30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00b13aff30
--- syscall (202, FreeBSD ELF64, __sysctl), rip = 0xb03aaf1e18a, rsp = 0xb03a86d9c88, rbp = 0xb03a86d9cc0 ---
db:0:kdb.enter.default> show registers
cs 0x20
ds 0x3b
es 0x3b
fs 0x13
gs 0x1b
ss 0x28
rax 0x12
rcx 0xffffffff814589e2
rdx 0x3f8
rbx 0x100
rsp 0xfffffe00b13afa10
rbp 0xfffffe00b13afa10
rsi 0xc3b4cdc4
rdi 0x4
r8 0x7ac3b4cdc4
r9 0xfffffe00b7ceaac0
r10 0xfffffe00b13af8f0
r11 0xcedfc2df9afff59c
r12 0
r13 0
r14 0xffffffff814b6685
r15 0xfffffe00b7ceaac0
rip 0xffffffff80d388c2 kdb_enter+0x32
rflags 0x86
kdb_enter+0x32: movq $0,0x2344f43(%rip)
db:0:kdb.enter.default> show pcpu
cpuid = 3
dynamic pcpu = 0xfffffe008efd7f00
curthread = 0xfffffe00b7ceaac0: pid 1455 tid 100508 critnest 1 "sysctl"
curpcb = 0xfffffe00b7ceafe0
fpcurthread = 0xfffffe00b7ceaac0: pid 1455 "sysctl"
idlethread = 0xfffffe0011fbde40: tid 100006 "idle: cpu3"
self = 0xffffffff84013000
curpmap = 0xfffff8012468fd38
tssp = 0xffffffff84013384
rsp0 = 0xfffffe00b13b0000
kcr3 = 0xffffffffffffffff
ucr3 = 0xffffffffffffffff
scr3 = 0x0
gs32p = 0xffffffff84013404
ldt = 0xffffffff84013444
tss = 0xffffffff84013434
curvnet = 0xfffff80001203980
db:0:kdb.enter.default> capture off
db:0:kdb.enter.default> dump
Dumping 702 out of 8050 MB:..3%..12%..21%..32%..42%..51%..62%..71%..83%..92%
Dump complete
db:0:kdb.enter.default> reset
Uptime: 3m30s
After rebooting you should see the crash report in the gui with the vmcore offered to download.
If that's all working then delete that core and try to panic it by removing the interface again. Hopefully the core is not bigger than 1G if you can trigger it soon enough after boot.
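To judge whether a given dump would have fitted in swap, the "Dumping N out of M MB" console line can be compared against the swap size. The snippet below is a rough sketch under two assumptions: the helper name `dump_fits` is invented, and the swap size is passed in by hand in the same MB units the console prints.

```shell
#!/bin/sh
# dump_fits: read a ddb "Dumping N out of M MB" console line on stdin and
# report whether N fits in the swap size (MB) given as the first argument.
# Hypothetical helper, for illustration only.
dump_fits() {
    awk -v swap="$1" '
        /Dumping [0-9]+ out of [0-9]+ MB/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "Dumping") {
                    size = $(i + 1)
                }
            }
            if (size + 0 <= swap + 0) {
                print "fits: " size " MB of " swap " MB"
            } else {
                print "too big: " size " MB of " swap " MB"
            }
        }'
}

# The dump line from the console output above, checked against 1G of swap.
echo 'Dumping 702 out of 8050 MB:..3%..12%' | dump_fits 1024
```

The dump grows with uptime and load, so a result that fits shortly after boot says little about a panic that takes days to trigger.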
Steve
-
@stephenw10
Thanks Steve - as the issue is intermittent for me I probably need more swap.
Can I just boot from the USB installer and manually tweak the existing partitions using gpart delete / resize and whatever ZFS uses for regrow?
(It's been a long time since I have used partition commands but probably not much has changed over a couple of decades... other than my memory...)
Hmm, may be easier to get a new install USB but does it offer an option to set the swap partition size during the install (ie I don't remember one)?
-
Yes, you can just set the size during the install:
-
@stephenw10
Ok, even my phat fingers can cope with that.
Now all I need is some WAN time to myself.
-
I've racked-up the Supermicro and it has taken-over for pfSense duties, leaving the Netgate 6100 free for testing. What could possibly go wrong?
-
So bluuuuuue!
-
Reinstalled everything on the 6100 and presuming you guys are running more 23.09d than anything else, I pushed it on to the latest dev load. I'll run 23.05.1 on the other device for now, so much swapping around today. Probably missed something along the way.
Anyway, partitioned for a 4 GB Swap - hopefully that will be spacious enough for you:
[23.09-BETA]/root: gpart show
=>        40  115189680  nda0  GPT  (55G)
          40     532480     1  efi  (260M)
      532520       1024     2  freebsd-boot  (512K)
      533544        984        - free -  (492K)
      534528    8388608     3  freebsd-swap  (4.0G)
     8923136  106264576     4  freebsd-zfs  (51G)
   115187712       2008        - free -  (1.0M)

[23.09-BETA]/root:
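As a sanity check on the sizes gpart reports, the sector counts can be converted by hand. This sketch assumes 512-byte sectors (the usual case, but not confirmed for these drives), and `sectors_to_mib` is a made-up helper name.

```shell
#!/bin/sh
# sectors_to_mib: convert a gpart sector count to MiB.
# Assumes 512-byte sectors unless a second argument gives the sector size.
# Hypothetical helper, for illustration only.
sectors_to_mib() {
    echo $(( $1 * ${2:-512} / 1048576 ))
}

sectors_to_mib 2097152   # freebsd-swap on the Supermicro: prints 1024
sectors_to_mib 8388608   # freebsd-swap on the reinstalled 6100: prints 4096
```

So the new partition gives 4096 MiB, against the roughly 700 to 750 MB dumps mentioned earlier in the thread.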
I should get some quiet WAN time tomorrow to do interface testing and hopefully achieve a kernel dump. No doubt it will be more intermittent than usual, just to be difficult.
I'll remember to run your script too:
script kdb.enter.default=capture on; bt; show registers; show pcpu; capture off; dump; reset
-
Excellent, that looks good. Let's hope it reveals some useful data. Thanks.
-
You can try manually triggering a panic to make sure it catches a coredump. Run:
sysctl debug.kdb.panic=1
-
@stephenw10
Sorry Steve, this proved to be beyond me. I guess I will have to wait for the GUI button to be implemented, or for a genuinely idiot-proof step-by-step guide to be written, as this has eaten through way too many hours over too many days.
I think I hit the assumed-knowledge barrier too often: steps were given, only to be belatedly supplemented with instructions like 'using console mode', 'use kernel debug mode option 6', 'did you edit some .conf file', 'follow thread x' or 'install package x, but only by method y'.
So what did work:
- got console working from macOS (mislabeled as GNU screen in pfSense docs)
- got the swap partition size changed via console
- fresh install
- installed pfSense-kernel-debug-pfSense pkg from the GUI command line
- ran kdb.enter.default=capture on; (etc) script from regular CLI
- reboots (many)
- kdb.enter.default=capture shown under /root
- reboot into kernel debug mode via console (option 6 etc)
- trigger panic via CLI using sysctl debug.kdb.panic=1
- console scrolls through something that looks like a core dump...
- crash report in /var/crash with info and text dump files
- no core dump offered in the GUI
- no core dump file found in /var/crash
Clearly I am typing with a little frustration (sorry about that) but perhaps you can spot something useful in the above.
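For anyone checking the same thing, the contents of the crash directory can be sorted into full dumps and text dumps with a few lines of shell. This is a sketch that assumes the usual FreeBSD savecore naming (vmcore.N for a full kernel dump, textdump.tar.N for a text dump, info.N for the summary); `classify_crash_files` is an invented name.

```shell
#!/bin/sh
# classify_crash_files: list a crash directory and label each file as a full
# kernel dump (vmcore.*), a text dump (textdump.tar.*) or something else.
# Hypothetical helper; the file name patterns are an assumption based on
# savecore's usual naming.
classify_crash_files() {
    for f in "$1"/*; do
        [ -e "$f" ] || continue
        case "$(basename "$f")" in
            vmcore.*)       echo "coredump: $f" ;;
            textdump.tar.*) echo "textdump: $f" ;;
            *)              echo "other: $f" ;;
        esac
    done
}

# Demo on a throwaway directory that mimics the result described above:
# info and text dump files, but no vmcore.
demo=$(mktemp -d)
touch "$demo/info.0" "$demo/textdump.tar.0"
classify_crash_files "$demo"
rm -rf "$demo"
```

On the firewall this would be `classify_crash_files /var/crash`; only a "coredump:" line indicates a full kernel dump was saved.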
-
I'm sorry. Yes it will be much better when there's a gui option.
You shouldn't need to add the debug kernel just to get the coredump.
The important steps are:
- Make sure you have enough SWAP space (you do).
- Edit /etc/pfSense-ddb.conf so that it contains the modified kdb.enter.default line, like:
# $FreeBSD$
#
# This file is read when going to multi-user and its contents piped thru
# ``ddb'' to define debugging scripts.
#
# see ``man 4 ddb'' and ``man 8 ddb'' for details.
#

script lockinfo=show locks; show alllocks; show lockedvnods
script pfs=bt ; show registers ; show pcpu ; run lockinfo ; acttrace ; ps ; alltrace

# kdb.enter.panic panic(9) was called.
#script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset
script kdb.enter.default=capture on; bt; show registers; show pcpu; capture off; dump; reset

# kdb.enter.witness witness(4) detected a locking error.
script kdb.enter.witness=run lockinfo
- Reboot.
- (Optionally) Run sysctl debug.kdb.panic=1 to test the setup. You should see it writing out the coredump to swap in the console after all the backtraces scroll past.
Steve
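Before triggering a test panic, it may be worth confirming that the edit to the ddb config took effect. The following is a minimal sketch, not an official check: the function names are invented, and it only looks for an uncommented kdb.enter.default script that calls ddb's dump command rather than the stock "textdump dump".

```shell
#!/bin/sh
# active_default_script: print the uncommented kdb.enter.default line(s)
# from a ddb.conf-style file. Hypothetical helper name.
active_default_script() {
    grep -E '^[[:space:]]*script[[:space:]]+kdb\.enter\.default=' "$1"
}

# does_full_dump: succeed when the active script calls ddb's "dump" command,
# ignoring the "textdump dump" form used by the stock line.
does_full_dump() {
    active_default_script "$1" |
        sed 's/textdump dump//g' |
        grep -Eq '(^|[;=[:space:]])dump([;[:space:]]|$)'
}

# Demo on a throwaway file holding the two candidate lines from the post.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset
script kdb.enter.default=capture on; bt; show registers; show pcpu; capture off; dump; reset
EOF
if does_full_dump "$demo"; then
    echo "full dump configured"
else
    echo "no full dump configured"
fi
rm -f "$demo"
```

On the firewall the same check would be `does_full_dump /etc/pfSense-ddb.conf`. Note that the script line belongs in that file; it is not something to type at the regular shell prompt.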