24.03 Crashing...?

rocketboy001

@stephenw10 - OK, I realized that I had the 192.168.100.0/24 subnet set on a site-to-site OpenVPN connection. It never occurred to me that this would cause an issue (I can't recall having any, and I don't even remember seeing these messages logged before). I removed the route on the connection and have not seen the error logged since (it's been an hour or so).

I wonder if there's a way to resolve that: the Starlink modem is at 192.168.100.1 and that's not configurable. Not having the subnet on the VPN connection is not the end of the world, but it is inconvenient.

Guess I'll have to wait to see if the crash happens again.

stephenw10

Mmm, it shouldn't cause a kernel panic. But it might be part of a combination of things.

rocketboy001

@stephenw10 -

Welp, even with the subnet off the VPN I am still seeing the two messages:

arpresolve: can't allocate llinfo for 100.64.0.1 on em1

arp: 26:12:ac:1a:80:01 is using my IP address 192.168.100.2 on em1!

Interestingly (and I should have checked this before), the 26:12:ac:1a:80:01 MAC is the Starlink router/modem (which is in bypass mode).

stephenw10

I'd guess it uses that IP when it has no uplink so you can configure it.

Seems like you still have that IP address on the firewall though. Try running ifconfig and check whether it's still present.

rocketboy001

@stephenw10

Yeah, it's on the Starlink interface (even after a reboot). Which is odd because I have the option set to ignore DHCP leases from 192.168.100.1... Did something change in 24.03 that's causing it to ignore that option? Would explain why I've not seen that IP on the interface before (as far as I recall) and not had any issues with that subnet on the VPN. Not sure it explains the crash...

em1: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        description: WAN_Starlink
        options=4e100bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 00:26:55:ec:d9:34
        inet 100.122.205.144 netmask 0xffc00000 broadcast 100.127.255.255
        inet 192.168.100.2 netmask 0xffffff00 broadcast 192.168.100.255
        inet6 fe80::226:55ff:feec:d934%em1 prefixlen 64 scopeid 0x4
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
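
For anyone reading along, here is a minimal Python sketch for reading the `inet` lines in output like the above. It assumes the FreeBSD-style hex netmask format shown here; the function names are made up for illustration and nothing in it is pfSense-specific. It converts each netmask to a prefix length and lists any address that falls inside the modem's 192.168.100.0/24 subnet.

```python
import ipaddress
import re

# The two IPv4 lines copied from the em1 output above.
IFCONFIG_EM1 = """\
inet 100.122.205.144 netmask 0xffc00000 broadcast 100.127.255.255
inet 192.168.100.2 netmask 0xffffff00 broadcast 192.168.100.255
"""

# "inet " (with the trailing space) does not match "inet6" lines.
INET_RE = re.compile(r"inet (\S+) netmask (0x[0-9a-fA-F]{8})")


def parse_inet_lines(text):
    """Return an IPv4Interface per 'inet' line, converting the hex netmask."""
    result = []
    for addr, hexmask in INET_RE.findall(text):
        mask = ipaddress.IPv4Address(int(hexmask, 16))
        result.append(ipaddress.ip_interface(f"{addr}/{mask}"))
    return result


def addresses_in(text, subnet):
    """List the configured addresses that fall inside the given subnet."""
    net = ipaddress.ip_network(subnet)
    return [str(i) for i in parse_inet_lines(text) if i.ip in net]


interfaces = [str(i) for i in parse_inet_lines(IFCONFIG_EM1)]
modem_subnet_hits = addresses_in(IFCONFIG_EM1, "192.168.100.0/24")
print(interfaces)
print(modem_subnet_hits)
```

With the output above, the second address is the only one that lands in the modem subnet, which matches the ARP conflict message for 192.168.100.2.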

stephenw10

Nothing changed as far as I know. Is that actually the DHCP server though?
Check /var/db/dhclient.leases.em1

Is that other IP address a VIP on WAN_Starlink?

rocketboy001

@stephenw10

The 192.168.100.1 address is what all the documentation for pfSense with Starlink tells you to reject leases from, because even in bypass mode the modem/router can hand out an IP. /var/db/dhclient.leases.em1 does not seem to show such a lease, although the classless-routes option does include a route to 192.168.100.1/32.

lease {
  interface "em1";
  fixed-address 100.122.205.144;
  next-server 10.10.10.10;
  option subnet-mask 255.192.0.0;
  option routers 100.64.0.1;
  option domain-name-servers 1.1.1.1,8.8.8.8;
  option interface-mtu 1500;
  option dhcp-lease-time 300;
  option dhcp-message-type 5;
  option dhcp-server-identifier 100.64.0.1;
  option classless-routes 32,192,168,100,1,0,0,0,0,32,34,120,255,244,0,0,0,0,0,100,64,0,1;
  renew 2 2024/5/21 01:44:47;
  rebind 2 2024/5/21 01:46:36;
  expire 2 2024/5/21 01:47:17;
}
lease {
  interface "em1";
  fixed-address 100.122.205.144;
  next-server 10.10.10.10;
  option subnet-mask 255.192.0.0;
  option routers 100.64.0.1;
  option domain-name-servers 1.1.1.1,8.8.8.8;
  option interface-mtu 1500;
  option dhcp-lease-time 300;
  option dhcp-message-type 5;
  option dhcp-server-identifier 100.64.0.1;
  option classless-routes 32,192,168,100,1,0,0,0,0,32,34,120,255,244,0,0,0,0,0,100,64,0,1;
  renew 2 2024/5/21 01:44:53;
  rebind 2 2024/5/21 01:46:42;
  expire 2 2024/5/21 01:47:23;
}
lease {
  interface "em1";
  fixed-address 100.122.205.144;
  next-server 10.10.10.10;
  option subnet-mask 255.192.0.0;
  option routers 100.64.0.1;
  option domain-name-servers 1.1.1.1,8.8.8.8;
  option interface-mtu 1500;
  option dhcp-lease-time 300;
  option dhcp-message-type 5;
  option dhcp-server-identifier 100.64.0.1;
  option classless-routes 32,192,168,100,1,0,0,0,0,32,34,120,255,244,0,0,0,0,0,100,64,0,1;
  renew 2 2024/5/21 01:49:15;
  rebind 2 2024/5/21 01:51:04;
  expire 2 2024/5/21 01:51:45;
}
lease {
  interface "em1";
  fixed-address 100.122.205.144;
  next-server 10.10.10.10;
  option subnet-mask 255.192.0.0;
  option routers 100.64.0.1;
  option domain-name-servers 1.1.1.1,8.8.8.8;
  option interface-mtu 1500;
  option dhcp-lease-time 300;
  option dhcp-message-type 5;
  option dhcp-server-identifier 100.64.0.1;
  option classless-routes 32,192,168,100,1,0,0,0,0,32,34,120,255,244,0,0,0,0,0,100,64,0,1;
  renew 2 2024/5/21 01:53:48;
  rebind 2 2024/5/21 01:55:37;
  expire 2 2024/5/21 01:56:18;
}
lease {
  interface "em1";
  fixed-address 100.122.205.144;
  next-server 10.10.10.10;
  option subnet-mask 255.192.0.0;
  option routers 100.64.0.1;
  option domain-name-servers 1.1.1.1,8.8.8.8;
  option interface-mtu 1500;
  option dhcp-lease-time 300;
  option dhcp-message-type 5;
  option dhcp-server-identifier 100.64.0.1;
  option classless-routes 32,192,168,100,1,0,0,0,0,32,34,120,255,244,0,0,0,0,0,100,64,0,1;
  renew 2 2024/5/21 01:58:30;
  rebind 2 2024/5/21 02:00:19;
  expire 2 2024/5/21 02:01:00;
}
lease {
  interface "em1";
  fixed-address 100.122.205.144;
  next-server 10.10.10.10;
  option subnet-mask 255.192.0.0;
  option routers 100.64.0.1;
  option domain-name-servers 1.1.1.1,8.8.8.8;
  option interface-mtu 1500;
  option dhcp-lease-time 300;
  option dhcp-message-type 5;
  option dhcp-server-identifier 100.64.0.1;
  option classless-routes 32,192,168,100,1,0,0,0,0,32,34,120,255,244,0,0,0,0,0,100,64,0,1;
  renew 2 2024/5/21 02:03:33;
  rebind 2 2024/5/21 02:05:22;
  expire 2 2024/5/21 02:06:03;
}
lease {
  interface "em1";
  fixed-address 100.122.205.144;
  next-server 10.10.10.10;
  option subnet-mask 255.192.0.0;
  option routers 100.64.0.1;
  option domain-name-servers 1.1.1.1,8.8.8.8;
  option interface-mtu 1500;
  option dhcp-lease-time 300;
  option dhcp-message-type 5;
  option dhcp-server-identifier 100.64.0.1;
  option classless-routes 32,192,168,100,1,0,0,0,0,32,34,120,255,244,0,0,0,0,0,100,64,0,1;
  renew 2 2024/5/21 02:07:54;
  rebind 2 2024/5/21 02:09:43;
  expire 2 2024/5/21 02:10:24;
}
lease {
  interface "em1";
  fixed-address 100.122.205.144;
  next-server 10.10.10.10;
  option subnet-mask 255.192.0.0;
  option routers 100.64.0.1;
  option domain-name-servers 1.1.1.1,8.8.8.8;
  option interface-mtu 1500;
  option dhcp-lease-time 300;
  option dhcp-message-type 5;
  option dhcp-server-identifier 100.64.0.1;
  option classless-routes 32,192,168,100,1,0,0,0,0,32,34,120,255,244,0,0,0,0,0,100,64,0,1;
  renew 2 2024/5/21 02:12:21;
  rebind 2 2024/5/21 02:14:10;
  expire 2 2024/5/21 02:14:51;
}
lease {
  interface "em1";
  fixed-address 100.122.205.144;
  next-server 10.10.10.10;
  option subnet-mask 255.192.0.0;
  option routers 100.64.0.1;
  option domain-name-servers 1.1.1.1,8.8.8.8;
  option interface-mtu 1500;
  option dhcp-lease-time 300;
  option dhcp-message-type 5;
  option dhcp-server-identifier 100.64.0.1;
  option classless-routes 32,192,168,100,1,0,0,0,0,32,34,120,255,244,0,0,0,0,0,100,64,0,1;
  renew 2 2024/5/21 02:16:50;
  rebind 2 2024/5/21 02:18:39;
  expire 2 2024/5/21 02:19:20;
}

Clarification on my part: the 100.64.0.1 address is the Starlink gateway address. The public IP is 100.122.205.144. Starlink uses CGNAT.
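
The `classless-routes` line in those leases is easier to read once decoded. This is a rough Python sketch, assuming the value is the raw DHCP option 121 payload in the RFC 3442 layout (prefix length, then only the significant destination octets, then a 4-octet router). The function name is made up for illustration, and this is not how dhclient itself parses the option.

```python
import ipaddress

# Value copied from the lease file above.
RAW = "32,192,168,100,1,0,0,0,0,32,34,120,255,244,0,0,0,0,0,100,64,0,1"


def decode_classless_routes(raw):
    """Decode a comma-separated option 121 payload into (network, router) pairs."""
    octets = [int(x) for x in raw.split(",")]
    routes = []
    i = 0
    while i < len(octets):
        prefix_len = octets[i]
        i += 1
        # Only ceil(prefix_len / 8) destination octets are transmitted.
        n = (prefix_len + 7) // 8
        dest = octets[i:i + n] + [0] * (4 - n)
        i += n
        router = octets[i:i + 4]
        i += 4
        if prefix_len > 32 or len(router) != 4:
            raise ValueError("truncated or malformed classless-routes value")
        network = ipaddress.ip_network(
            ".".join(map(str, dest)) + f"/{prefix_len}", strict=False
        )
        routes.append((str(network), ".".join(map(str, router))))
    return routes


routes = decode_classless_routes(RAW)
for network, router in routes:
    print(network, "via", router)
```

Read that way, the lease carries host routes for 192.168.100.1 and 34.120.255.244 with a router of 0.0.0.0 (which RFC 3442 treats as on-link), plus a default route via 100.64.0.1.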

rocketboy001

@rocketboy001

I'm an idiot. I was looking back through some Starlink pfSense setup info and was reminded that in order to have access to the modem/router in bypass mode you have to set up a VIP on the WAN interface. And you even asked about that... dumb.

Anyway, I removed the VIP so we'll see what happens now. Still, that does not explain the crash: prior to 24.03 it had been working fine, even with the 192.168.100.0 subnet in the OpenVPN config.

stephenw10

Ah so 192.168.100.2 was the VIP? That seems reasonable if the modem is usually at 192.168.100.1. And odd that the modem would start using .2 itself.

You could probably use any IP for the VIP in the subnet there though. So maybe 192.168.100.10. If the modem suddenly starts using that too it would confirm some unexpected behaviour.

rocketboy001

@stephenw10

Yep, 192.168.100.2 was the VIP.

The only reason to set it up that way is to have LAN access to the modem via the app. Once it's set up and connected you can access it without the config on pfSense, as long as the dish has connectivity to the Starlink network. I've never actually needed the LAN access so I'll just leave it be for now; it's easy enough to add back in if needed.

No crashes since the last one, but I am still seeing

arpresolve: can't allocate llinfo for 100.64.0.1 on em1

in the logs... There are a few posts about it on here and elsewhere. Some people see that logged when there is an outage on that connection; I don't think I'm seeing that (or at least I'm not noticing it). Some suggestions were to change the monitor IP for the connection from the gateway address (100.64.0.1) to something else. Maybe I'll try that.

stephenw10

You must be able to ARP for the gateway address in order to send packets to it though. If you're seeing that log entry I would expect to find the entry missing from the ARP table and no traffic possible.

It could be the gateway monitoring pings tripping something on the gateway, causing it to stop responding. In which case monitoring some other external IP would prevent that.
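
One way to check the ARP side of that is to look for the gateway in the output of `arp -an`. Below is a small Python sketch that classifies the gateway entry. The sample lines and the MAC address in them are hypothetical, written in what I assume is the usual FreeBSD `arp -an` layout, and the function name is made up for illustration.

```python
import re

# Hypothetical sample in the assumed FreeBSD `arp -an` format; not real output
# from this system.
SAMPLE_ARP = """\
? (100.64.0.1) at 02:00:00:00:00:01 on em1 expires in 1193 seconds [ethernet]
? (192.168.1.50) at (incomplete) on em0 expired [ethernet]
"""

ARP_RE = re.compile(r"\((?P<ip>[\d.]+)\) at (?P<mac>\S+) on (?P<ifname>\S+)")


def gateway_arp_state(arp_output, gateway, ifname):
    """Return 'resolved', 'incomplete' or 'missing' for the gateway entry."""
    for m in ARP_RE.finditer(arp_output):
        if m["ip"] == gateway and m["ifname"] == ifname:
            return "incomplete" if m["mac"] == "(incomplete)" else "resolved"
    return "missing"


print(gateway_arp_state(SAMPLE_ARP, "100.64.0.1", "em1"))
```

If the state comes back as missing or incomplete at the moment the arpresolve message is logged, that would fit the expectation that no traffic can reach the gateway at that point.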

mikebenna

I'm seeing a crash every couple of weeks ("Fatal trap 12: page fault while in kernel mode") at exactly the same instruction pointer (0xffffffff80f246e2). I'm posting here on the expectation that the same rip likely means a similar underlying cause.

My setup is somewhat different, however. I'm running in a Proxmox VM with no CARP on the WAN interfaces (but I am using CARP on the internal networks).

I'm also running 24.03-RELEASE (amd64). Other than the occasional crash, the system is normal.

Jun 23 19:25:16 Bouncer kernel: Fatal trap 12: page fault while in kernel mode
Jun 23 19:25:16 Bouncer kernel: cpuid = 1; apic id = 01
Jun 23 19:25:16 Bouncer kernel: fault virtual address   = 0x1c
Jun 23 19:25:16 Bouncer kernel: fault code              = supervisor read data, page not present
Jun 23 19:25:16 Bouncer kernel: instruction pointer     = 0x20:0xffffffff80f246e2
Jun 23 19:25:16 Bouncer kernel: stack pointer           = 0x28:0xfffffe005d698ae0
Jun 23 19:25:16 Bouncer kernel: frame pointer           = 0x28:0xfffffe005d698b70
Jun 23 19:25:16 Bouncer kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
Jun 23 19:25:16 Bouncer kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
Jun 23 19:25:16 Bouncer kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
Jun 23 19:25:16 Bouncer kernel: current process         = 2 (clock (1))
Jun 23 19:25:16 Bouncer kernel: rdi: 0000000000000000 rsi: 0000000000000000 rdx: fffffe005d698cf8
Jun 23 19:25:16 Bouncer kernel: rcx: 0000000000000000  r8: 000000000000041c  r9: 0000000000000000
Jun 23 19:25:16 Bouncer kernel: rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe005d698b70
Jun 23 19:25:16 Bouncer kernel: r10: 0000000000002014 r11: 000000000000e08c r12: 0000000000000000
Jun 23 19:25:16 Bouncer kernel: r13: 000000000000041c r14: fffff80132b30000 r15: 0000000000000034
Jun 23 19:25:16 Bouncer kernel: trap number             = 12
Jun 23 19:25:16 Bouncer kernel: panic: page fault
Jun 23 19:25:16 Bouncer kernel: cpuid = 1
Jun 23 19:25:16 Bouncer kernel: time = 1719195652
Jun 23 19:25:16 Bouncer kernel: KDB: enter: panic
Jun 23 19:25:16 Bouncer kernel: ---<<BOOT>>---


Jun 22 10:57:51 Bouncer kernel: Fatal trap 12: page fault while in kernel mode
Jun 22 10:57:51 Bouncer kernel: cpuid = 2; apic id = 02
Jun 22 10:57:51 Bouncer kernel: fault virtual address   = 0x1c
Jun 22 10:57:51 Bouncer kernel: fault code              = supervisor read data, page not present
Jun 22 10:57:51 Bouncer kernel: instruction pointer     = 0x20:0xffffffff80f246e2
Jun 22 10:57:51 Bouncer kernel: stack pointer           = 0x0:0xfffffe005d693ae0
Jun 22 10:57:51 Bouncer kernel: frame pointer           = 0x0:0xfffffe005d693b70
Jun 22 10:57:51 Bouncer kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
Jun 22 10:57:51 Bouncer kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
Jun 22 10:57:51 Bouncer kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
Jun 22 10:57:51 Bouncer kernel: current process         = 2 (clock (2))
Jun 22 10:57:51 Bouncer kernel: rdi: 0000000000000000 rsi: 0000000000000000 rdx: fffffe005d693cf8
Jun 22 10:57:51 Bouncer kernel: rcx: 0000000000000000  r8: 0000000000000157  r9: 0000000000000000
Jun 22 10:57:51 Bouncer kernel: rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe005d693b70
Jun 22 10:57:51 Bouncer kernel: r10: 000000000000201c r11: 000000000000e0c4 r12: 0000000000000000
Jun 22 10:57:51 Bouncer kernel: r13: 0000000000000157 r14: fffff8006047c000 r15: 0000000000000034
Jun 22 10:57:51 Bouncer kernel: trap number             = 12
Jun 22 10:57:51 Bouncer kernel: panic: page fault
Jun 22 10:57:51 Bouncer kernel: cpuid = 2
Jun 22 10:57:51 Bouncer kernel: time = 1719078955
Jun 22 10:57:51 Bouncer kernel: KDB: enter: panic
Jun 22 10:57:51 Bouncer kernel: ---<<BOOT>>---


May 16 12:05:02 Bouncer kernel: Fatal trap 12: page fault while in kernel mode
May 16 12:05:02 Bouncer kernel: cpuid = 1; apic id = 01
May 16 12:05:02 Bouncer kernel: fault virtual address   = 0x1c
May 16 12:05:02 Bouncer kernel: fault code              = supervisor read data, page not present
May 16 12:05:02 Bouncer kernel: instruction pointer     = 0x20:0xffffffff80f246e2
May 16 12:05:02 Bouncer kernel: stack pointer           = 0x28:0xfffffe005d698ae0
May 16 12:05:02 Bouncer kernel: frame pointer           = 0x28:0xfffffe005d698b70
May 16 12:05:02 Bouncer kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
May 16 12:05:02 Bouncer kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
May 16 12:05:02 Bouncer kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
May 16 12:05:02 Bouncer kernel: current process         = 2 (clock (1))
May 16 12:05:02 Bouncer kernel: rdi: 0000000000000000 rsi: 0000000000000000 rdx: fffffe005d698cf8
May 16 12:05:02 Bouncer kernel: rcx: 0000000000000000  r8: 0000000000000564  r9: 0000000000000000
May 16 12:05:02 Bouncer kernel: rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe005d698b70
May 16 12:05:02 Bouncer kernel: r10: 0000000000002021 r11: 000000000000e0e7 r12: 0000000000000000
May 16 12:05:02 Bouncer kernel: r13: 0000000000000564 r14: fffff8010ff6ba80 r15: 0000000000000034
May 16 12:05:02 Bouncer kernel: trap number             = 12
May 16 12:05:02 Bouncer kernel: panic: page fault
May 16 12:05:02 Bouncer kernel: cpuid = 1
May 16 12:05:02 Bouncer kernel: time = 1715886225
May 16 12:05:02 Bouncer kernel: KDB: enter: panic
May 16 12:05:02 Bouncer kernel: ---<<BOOT>>---


Apr 29 11:50:16 Bouncer kernel: Fatal trap 12: page fault while in kernel mode
Apr 29 11:50:16 Bouncer kernel: cpuid = 2; apic id = 02
Apr 29 11:50:16 Bouncer kernel: fault virtual address   = 0x1c
Apr 29 11:50:16 Bouncer kernel: fault code              = supervisor read data, page not present
Apr 29 11:50:16 Bouncer kernel: instruction pointer     = 0x20:0xffffffff80f246e2
Apr 29 11:50:16 Bouncer kernel: stack pointer           = 0x28:0xfffffe005d693ae0
Apr 29 11:50:16 Bouncer kernel: frame pointer           = 0x28:0xfffffe005d693b70
Apr 29 11:50:16 Bouncer kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
Apr 29 11:50:16 Bouncer kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
Apr 29 11:50:16 Bouncer kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
Apr 29 11:50:16 Bouncer kernel: current process         = 2 (clock (2))
Apr 29 11:50:16 Bouncer kernel: rdi: 0000000000000000 rsi: 0000000000000000 rdx: fffffe005d693cf8
Apr 29 11:50:16 Bouncer kernel: rcx: 0000000000000000  r8: 0000000000000564  r9: 0000000000000000
Apr 29 11:50:16 Bouncer kernel: rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe005d693b70
Apr 29 11:50:16 Bouncer kernel: r10: 0000000000002021 r11: 000000000000e0e7 r12: 0000000000000000
Apr 29 11:50:16 Bouncer kernel: r13: 0000000000000564 r14: fffff800510b0000 r15: 0000000000000034
Apr 29 11:50:16 Bouncer kernel: trap number             = 12
Apr 29 11:50:16 Bouncer kernel: panic: page fault
Apr 29 11:50:16 Bouncer kernel: cpuid = 2
Apr 29 11:50:16 Bouncer kernel: time = 1714416542
Apr 29 11:50:16 Bouncer kernel: KDB: enter: panic
Apr 29 11:50:16 Bouncer kernel: ---<<BOOT>>---


Apr 26 14:18:52 Bouncer kernel: Fatal trap 12: page fault while in kernel mode
Apr 26 14:18:52 Bouncer kernel: cpuid = 0; apic id = 00
Apr 26 14:18:52 Bouncer kernel: fault virtual address   = 0x1c
Apr 26 14:18:52 Bouncer kernel: fault code              = supervisor read data, page not present
Apr 26 14:18:52 Bouncer kernel: instruction pointer     = 0x20:0xffffffff80f246e2
Apr 26 14:18:52 Bouncer kernel: stack pointer           = 0x0:0xfffffe005d69dae0
Apr 26 14:18:52 Bouncer kernel: frame pointer           = 0x0:0xfffffe005d69db70
Apr 26 14:18:52 Bouncer kernel: code segment            = base 0x0, limit 0xfffff, type 0x1b
Apr 26 14:18:52 Bouncer kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
Apr 26 14:18:52 Bouncer kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
Apr 26 14:18:52 Bouncer kernel: current process         = 2 (clock (0))
Apr 26 14:18:52 Bouncer kernel: rdi: 0000000000000000 rsi: 0000000000000000 rdx: fffffe005d69dcf8
Apr 26 14:18:52 Bouncer kernel: rcx: 0000000000000000  r8: 0000000000000546  r9: 0000000000000000
Apr 26 14:18:52 Bouncer kernel: rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe005d69db70
Apr 26 14:18:52 Bouncer kernel: r10: 000000000000201c r11: 000000000000e0c4 r12: 0000000000000000
Apr 26 14:18:52 Bouncer kernel: r13: 0000000000000546 r14: fffff800059ce540 r15: 0000000000000034
Apr 26 14:18:52 Bouncer kernel: trap number             = 12
Apr 26 14:18:52 Bouncer kernel: panic: page fault
Apr 26 14:18:52 Bouncer kernel: cpuid = 0
Apr 26 14:18:52 Bouncer kernel: time = 1714166019
Apr 26 14:18:52 Bouncer kernel: KDB: enter: panic
Apr 26 14:18:52 Bouncer kernel: ---<<BOOT>>---

I've saved the last two crash reports if that helps.
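
To double-check that these panics share a signature, a short Python sketch like the one below can group them by fault address and instruction pointer and convert the `time =` values to UTC. It only covers the two most recent panics, trimmed to the relevant lines copied from above, and it assumes the message layout shown in these logs.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Relevant lines copied from the two most recent panics above.
LOG = """\
Jun 23 19:25:16 Bouncer kernel: fault virtual address   = 0x1c
Jun 23 19:25:16 Bouncer kernel: instruction pointer     = 0x20:0xffffffff80f246e2
Jun 23 19:25:16 Bouncer kernel: time = 1719195652
Jun 22 10:57:51 Bouncer kernel: fault virtual address   = 0x1c
Jun 22 10:57:51 Bouncer kernel: instruction pointer     = 0x20:0xffffffff80f246e2
Jun 22 10:57:51 Bouncer kernel: time = 1719078955
"""

FAULT_RE = re.compile(r"fault virtual address\s+= (0x[0-9a-f]+)")
RIP_RE = re.compile(r"instruction pointer\s+= 0x[0-9a-f]+:(0x[0-9a-f]+)")
TIME_RE = re.compile(r"kernel: time = (\d+)")

# Pair each fault address with the instruction pointer of the same panic.
signatures = Counter(zip(FAULT_RE.findall(LOG), RIP_RE.findall(LOG)))

# The 'time =' value is seconds since the Unix epoch at the moment of the panic.
panic_times = [
    datetime.fromtimestamp(int(t), tz=timezone.utc).isoformat()
    for t in TIME_RE.findall(LOG)
]

print(signatures)
print(panic_times)
```

Both entries collapse to a single (0x1c, 0xffffffff80f246e2) signature. A fault address that small, with most registers zero, looks consistent with a read through a NULL pointer at a small structure offset, though that is only a guess without the backtrace.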

VM config:

3GB RAM, balloon=0
Processors: 1 socket, 4 cores, kvm64
BIOS: UEFI
Machine: q35
3 Network devices.
- Two are virtio
- One PCI passthrough of Intel(R) I219-V SPT-H(2).

This system is part of an HA pair so the reboot doesn't have a big impact on me, but page faults always feel like good things to chase down.

The other system in the HA pair is also a VM under Proxmox and has not crashed since the last release was installed.

Let me know if I can help further.

stephenw10

Same backtrace?

Are you also running HAProxy? If so update the HAProxy package. There is an update to address a known bug causing that panic.

mikebenna

@stephenw10, damn you're fast. Yes, I'm also running HAProxy, and yes, there was a pending update I hadn't noticed. I have now applied it. I guess the only thing to do now is wait and see if the problem is resolved. Thanks so much for the response!

stephenw10

No worries, let us know if that fixes it.