Firewall rebooted unexpectedly

michmoor

Netgate 6100 rebooted unexpectedly.
I have some crash dump files that I can upload.

Crash report begins.  Anonymous machine information:

amd64
15.0-CURRENT
FreeBSD 15.0-CURRENT #0 plus-RELENG_24_03-n256311-e71f834dd81: Fri Apr 19 00:28:14 UTC 2024     root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-24_03-main/obj/amd64/Y4MAEJ2R/var/jenkins/workspace/pfSense-Plus-snapshots-24_03-main/sources/FreeBS

Crash report details:

No PHP errors found.

Filename: /var/crash/info.0
Dump header from device: /dev/nda0p3
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 371712
  Blocksize: 512
  Compression: none
  Dumptime: 2024-09-05 15:16:17 -0400
  Hostname: GAFW
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 15.0-CURRENT #0 plus-RELENG_24_03-n256311-e71f834dd81: Fri Apr 19 00:28:14 UTC 2024
    root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-24_03-main/obj/amd64/Y4MAEJ2R/var/j
  Panic String: page fault
  Dump Parity: 2857159027
  Bounds: 0
  Dump Status: good

michmoor

It rebooted again... something's failing, I think.

The SSD is still in a good state:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

stephenw10

Upload the crash report here: https://nc.netgate.com/nextcloud/s/mWWHieq9ZHL6seF

michmoor

@stephenw10
Files uploaded. I also have a TAC ticket open. I'm not seeing any signs of hardware failure, as was suggested, but I could be wrong.

stephenw10

It doesn't look like hardware; all those crashes are almost identical.

Backtrace:

db:1:pfs> bt
Tracing pid 12 tid 100043 td 0xfffff80001688740
kdb_enter() at kdb_enter+0x33/frame 0xfffffe00850ca270
panic() at panic+0x43/frame 0xfffffe00850ca2d0
trap_fatal() at trap_fatal+0x40f/frame 0xfffffe00850ca330
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850ca390
calltrap() at calltrap+0x8/frame 0xfffffe00850ca390
--- trap 0xc, rip = 0xffffffff846626a7, rsp = 0xfffffe00850ca460, rbp = 0xfffffe00850ca490 ---
export_pflow() at export_pflow+0x77/frame 0xfffffe00850ca490
pf_detach_state() at pf_detach_state+0x45b/frame 0xfffffe00850ca4d0
pf_state_insert() at pf_state_insert+0x854/frame 0xfffffe00850ca570
pf_test_rule() at pf_test_rule+0x28f8/frame 0xfffffe00850ca9c0
pf_test() at pf_test+0x1382/frame 0xfffffe00850cab90
pf_check_out() at pf_check_out+0x22/frame 0xfffffe00850cabb0
pfil_mbuf_out() at pfil_mbuf_out+0x38/frame 0xfffffe00850cabe0
ip_output() at ip_output+0xb60/frame 0xfffffe00850cace0
ip_forward() at ip_forward+0x3c2/frame 0xfffffe00850cad90
ip_input() at ip_input+0x705/frame 0xfffffe00850cadf0
swi_net() at swi_net+0x138/frame 0xfffffe00850cae60
ithread_loop() at ithread_loop+0x257/frame 0xfffffe00850caef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00850caf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850caf30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Looks like an issue in pflow. Do you have that enabled?

The only other thing I see is:
<6>pid 67263 (pftop), jid 0, uid 0: exited on signal 6 (core dumped)
That could just be a symptom of the panic though.
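For anyone comparing several dumps, here is a minimal Python sketch that pulls out the function that was running when the page fault hit, i.e. the first frame after the trap line. The helper name `faulting_function` and the regexes are assumptions based only on the ddb output format shown above; this is not an official tool.

```python
import re

# A trap marker looks like: "--- trap 0xc, rip = 0x..., rsp = ..., rbp = ... ---"
TRAP_RE = re.compile(r"^--- trap (?:0x[0-9a-f]+|\d+), rip = ")
# A frame looks like: "export_pflow() at export_pflow+0x77/frame 0x..."
FRAME_RE = re.compile(r"^(\w+)\(\) at ")


def faulting_function(backtrace):
    """Return the name of the first frame after the first trap marker, or None."""
    lines = [line.strip() for line in backtrace.splitlines()]
    for i, line in enumerate(lines):
        if TRAP_RE.match(line):
            for later in lines[i + 1:]:
                frame = FRAME_RE.match(later)
                if frame:
                    return frame.group(1)
            return None
    return None


# Excerpt copied from the backtrace above.
SAMPLE = """\
calltrap() at calltrap+0x8/frame 0xfffffe00850ca390
--- trap 0xc, rip = 0xffffffff846626a7, rsp = 0xfffffe00850ca460, rbp = 0xfffffe00850ca490 ---
export_pflow() at export_pflow+0x77/frame 0xfffffe00850ca490
pf_detach_state() at pf_detach_state+0x45b/frame 0xfffffe00850ca4d0
"""

print(faulting_function(SAMPLE))
```

If every dump reports the same function here, that supports a software bug rather than failing hardware.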

michmoor

@stephenw10
I do have pflow enabled.
It's been working great since the 24. update. Why is it acting up now?

stephenw10

Good question. And it's set to NetFlow v5, so it's not this: https://redmine.pfsense.org/issues/15446

What else has changed?

michmoor

@stephenw10
I can't see the config history because it's now flooded with (system): related messages.

The Auto Configuration Backup / Restore has no backups for the device. Is this normal?

This started yesterday during the workday, so there were definitely no changes at that point. Later that night I updated a pfBlockerNG DNSBL feed, but it's not related to pfBlockerNG.

Anything else I can check? Any other clues in the crash dumps?

stephenw10

Hmm, ACB not seeing backups is probably unrelated. But check general connectivity from the firewall itself. Also check whether a different box using the same key can see the backups.

This looks like a bug in pflow to me; we are looking into it.

How often is it panicking? Can you test disabling pflow?

michmoor

@stephenw10
I can disable pflow for now.

The restart events are below:
9/5 - 3:20pm EDT
9/5 - 3:40pm EDT
9/5 - 11:50pm EDT
9/6 - 3:30am EDT
9/6 - 5:40am EDT
9/6 - 7:00am EDT
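
To show how irregular the panics are, here is a rough Python sketch that computes the gap between each pair of restarts listed above. The year 2024 is an assumption taken from the dump timestamp, and `gaps_in_minutes` is just an illustrative helper name.

```python
from datetime import datetime

# Restart times from the list above, in 24-hour form (year assumed from the dump header).
REBOOTS = [
    "2024-09-05 15:20",
    "2024-09-05 15:40",
    "2024-09-05 23:50",
    "2024-09-06 03:30",
    "2024-09-06 05:40",
    "2024-09-06 07:00",
]


def gaps_in_minutes(stamps):
    """Return the number of minutes between each pair of consecutive timestamps."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in stamps]
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]


print(gaps_in_minutes(REBOOTS))
```

The gaps range from 20 minutes to just over 8 hours, so there is no obvious fixed interval.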

stephenw10

Hmm, OK, it appears it probably is that bug. Or at least the same fix applies.

Something must have changed though for it to suddenly start hitting it.

michmoor

@stephenw10 Even though the Redmine issue points to it being related to IPFIX?

The only thing that "recently" changed was a NAT Port Forward rule and DHCP settings on 9/5 @ 09:32am EDT

I see there is a patch created.

stephenw10

There is a patch, but it's a compile-time patch. It's fixed in 24.08 but would need a rebuild for 24.03.

Yes, in the original bug report it only affected IPFIX, which is why I initially thought it could not be that. But Kristof believes the root cause is the same here, and the fix is the same.

It is odd, though, that you were not hitting it before. Something must have changed. Hard to imagine a port forward would have done it.

michmoor

@stephenw10
I honestly don't know what could've changed within 24 hours that's specific to pflow. I added an additional collector configuration a while back.

I reviewed my changes from yesterday and confirmed that only the changes I stated were made. Considering the bulk of the reboots happened while I was asleep, and as far as I know I don't sleepwalk (maybe I do), it wasn't anything I did overnight that caused those reboots.

So as of now the fix is ready, but it will be released with 24.08?
And the workaround is to disable pflow?

stephenw10

Well, the first thing is to confirm it really is pflow by disabling it and making sure the panic doesn't happen again.