PFSense crashes with page fault/kernel panic on VMWare

geovaneg

Hi,

Our pfSense VPN server (hosted on VMware), which had been stable for approximately 20 days, crashed and rebooted.
I am trying to identify the cause in order to prevent further occurrences.
I would appreciate any tips.

Geovane

Relevant crash report information:

Dump header from device: /dev/label/swap0
Architecture: amd64
Architecture Version: 1
Dump Length: 157184
Blocksize: 512
Dumptime: Tue Sep 15 10:54:31 2020
Magic: FreeBSD Text Dump
Version String: FreeBSD 11.3-STABLE #243 abf8cba50ce(RELENG_2_4_5): Tue Jun 2 17:53:37 EDT 2020
root@buildbot1-nyi.netgate.com:/build/ce-crossbuild-245/obj/amd64/YNx4Qq3j/build/ce-crossbuild-245/source
Panic String: page fault
Dump Parity: 3990052888
Bounds: 0
Dump Status: good

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address = 0xe00000001
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80e35eb3
stack pointer = 0x28:0xfffffe000039c5d0
frame pointer = 0x28:0xfffffe000039c5d0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 56043 (syslogd)
Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 06
fault virtual address = 0x3520030519
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80e91aca
stack pointer = 0x0:0xfffffe0171994750
frame pointer = 0x0:0xfffffe01719947d0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq257: vmx0)
trap number = 12
panic: page fault
cpuid = 3
KDB: enter: panic

System Information:

2.4.5-RELEASE-p1 (amd64)
built on Tue Jun 02 17:51:17 EDT 2020
FreeBSD 11.3-STABLE
CPU Type: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

4 CPUs: 4 package(s)

AES-NI CPU Crypto: Yes (active)
Hardware crypto: AES-CBC,AES-XTS,AES-GCM,AES-ICM
Kernel PTI: Disabled
MDS Mitigation: Inactive

6 GB RAM

Low CPU (35%) and memory usage

A single vmx0 NIC with a public IPv4 address in the DMZ

System Function:

VPN GW/Server

Installed Packages:
Cron, Iftop, Open-VM-Tools (10.1.0_3,1), openvpn-client-export.
Main Services:
IPsec VPN, reaching 600 simultaneous mobile clients.
OpenVPN, no more than 30 concurrent clients.

VMware Platform:
VMware ESXi, 6.0.0, 5050593

stephenw10

Do you have the backtrace from either of those? Everything between db:0:kdb.enter.default> bt and db:0:kdb.enter.default> ps.
The page fault panics are not very descriptive. They hit different processes, though, which leans towards a hardware issue. That is unlikely in VMware.
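
For anyone pulling that section out of a saved crash report by hand, here is a minimal Python sketch. It assumes the report has been saved as plain text and that the ddb prompt lines look like the two quoted above. The function name `extract_backtrace` and the whitespace normalization are illustrative assumptions, not anything pfSense ships.

```python
def extract_backtrace(dump_text,
                      start="db:0:kdb.enter.default> bt",
                      end="db:0:kdb.enter.default> ps"):
    """Return the lines found between the 'bt' and 'ps' ddb prompts.

    Runs of whitespace are collapsed before comparing, so a prompt
    written with extra spaces still matches (an assumption about how
    the report may be formatted).
    """
    collected = []
    inside = False
    for raw in dump_text.splitlines():
        line = " ".join(raw.split())  # normalize whitespace
        if not inside:
            if line == start:
                inside = True
            continue
        if line == end:
            break
        collected.append(line)
    return collected
```

Usage would be along the lines of `print("\n".join(extract_backtrace(open("crash.txt").read())))`, where `crash.txt` is a hypothetical file holding the saved report.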

Steve

Frogg

I had something similar on VMware 6.5: the VM was crashing randomly, anywhere from a few minutes to a few hours apart.

It was a bug in the virtual network driver.

Try replacing VMXNET3 with a different network adapter type in your VM network configuration.

I hope it solves your problem.

jimp

Usually random instability on VMWare is from:

Running an outdated version of ESX incompatible with the base OS
-OR, more likely:
The VM is using an outdated VM Hardware version. Shut down the VM and upgrade the compatibility level to match the latest available, then start it again.

geovaneg

@stephenw10

Hi Steve,

Thanks for the post.
This was the first crash of this machine (let's call it VPN3-server) after updating pfSense to version 2.4.5-RELEASE-p1. I updated the machine in the hope of solving an unexpected crash-and-reboot problem that had occurred on it and on another identical machine, also a VPN server (VPN4-server).
If you think it's relevant, I can try to recover information from the previous crash of the other machine (VPN4-server). Unfortunately, I didn't save the information from the earlier VPN3-server crash. Do you think it's important? Is there anything specific you would like me to recover from the previous crash?

geovaneg

@Frogg

Hi Frogg,

Thanks for the tip, I'll keep this trick up my sleeve ;-)
But for now, I intend to keep the paravirtualized driver for performance reasons.

Thank you.

Geovane

geovaneg

@jimp

Hi Jimp,

Thanks for the post.

Your hypotheses are interesting, let's explore them.
I talked to the virtualization infrastructure team and I have some answers:

Our version of VMware is ESXi, 6.0.0, 5050593. It's not really up to date, but I didn't think that was a problem.
The hardware version of the virtual machine is 11, the latest available for that version of VMware. The configuration template used was “Generic FreeBSD-64-Bits”.
The NIC is paravirtualized (VMXNET 3).
The current disk controller is the SCSI “LSI Logic Parallel”. Do you think VMware's paravirtualized alternative would be compatible and more appropriate?

Jimp, the hypothesis of configuration problems in the virtualization environment is interesting, but there is a situation that contradicts this possibility:

I have another virtual PFSense with identical virtualization settings and version of PfSense that does not have this problem. Its function is only FW / GW of the wireless network and does not run VPN services. This machine has recorded almost a year of uptime in the same virtualization environment as the VPN servers that had an unexpected crash and reboot.

jimp

FreeBSD 11.x is not supported by ESX 6.0, see https://www.vmware.com/resources/compatibility/search.php?deviceCategory=software&details=1&operatingSystems=232&productNames=15&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc&testConfig=16

And ESX 6.0 is not "not really up to date", it's ancient in terms of ESX versions. It was released over 5 years ago and went EOL over 6 months ago.

Update your ESX version.

geovaneg

@jimp

Hi Jimp,

I will request this from the virtualization team.

Thank you.

Frogg

@geovaneg

I know this post is about 6.5 and not 6.0, but maybe you will find it interesting:
https://www.linkedin.com/pulse/linux-virtual-machine-crash-vmxnet3-nic-vmware-esxi-65-han-yong-lim

If I were you, I would try changing the NIC, just to test whether the trouble comes from there.

geovaneg

@stephenw10

Hi Steve,

I didn't understand exactly what you asked for at first ... sorry.
I was in a rush, and there are language problems ...

I copied the lines below: "Everything between db:0:kdb.enter.default> bt and db:0:kdb.enter.default> ps"

Thanks

db:0:kdb.enter.default> bt
Tracing pid 12 tid 100065 td 0xfffff8000618e620
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe0171994400
vpanic() at vpanic+0x19b/frame 0xfffffe0171994460
panic() at panic+0x43/frame 0xfffffe01719944c0
trap_pfault() at trap_pfault/frame 0xfffffe0171994510
trap_pfault() at trap_pfault+0x49/frame 0xfffffe0171994570
trap() at trap+0x29d/frame 0xfffffe0171994680
calltrap() at calltrap+0x8/frame 0xfffffe0171994680
--- trap 0xc, rip = 0xffffffff80e91aca, rsp = 0xfffffe0171994750, rbp = 0xfffffe01719947d0 ---
ip_input() at ip_input+0x5da/frame 0xfffffe01719947d0
netisr_dispatch_src() at netisr_dispatch_src+0xa2/frame 0xfffffe0171994820
ether_demux() at ether_demux+0x15b/frame 0xfffffe0171994850
ether_nh_input() at ether_nh_input+0x32c/frame 0xfffffe01719948b0
netisr_dispatch_src() at netisr_dispatch_src+0xa2/frame 0xfffffe0171994900
ether_input() at ether_input+0x26/frame 0xfffffe0171994920
vmxnet3_rxq_eof() at vmxnet3_rxq_eof+0x752/frame 0xfffffe01719949b0
vmxnet3_legacy_intr() at vmxnet3_legacy_intr+0xe0/frame 0xfffffe01719949e0
intr_event_execute_handlers() at intr_event_execute_handlers+0xe9/frame 0xfffffe0171994a20
ithread_loop() at ithread_loop+0xe7/frame 0xfffffe0171994a70
fork_exit() at fork_exit+0x83/frame 0xfffffe0171994ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0171994ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:0:kdb.enter.default> ps

stephenw10

Ok, that's pretty generic. No way to pin down stuff in netisr_dispatch really.

The thing to check is if other crashes have the same or very similar backtraces and panic strings.
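
One rough way to make that comparison is sketched below in Python. It assumes each backtrace is available as a list of text lines in the ddb format `func() at func+0x.../frame 0x...`. The helper names `faulting_path` and `same_crash_path`, and the default depth of 3 frames, are hypothetical choices for illustration only.

```python
import re

# Matches a ddb frame line such as "ip_input() at ip_input+0x5da/frame 0x..."
FRAME_RE = re.compile(r"^(\w+)\(\) at ")

def faulting_path(bt_lines):
    """Function names between the first '--- trap' marker and the next one.

    The frames above the first marker (kdb_enter, vpanic, trap, ...) belong
    to the panic handling itself, so they are skipped.
    """
    names = []
    seen_trap = False
    for raw in bt_lines:
        line = raw.strip()
        if line.startswith("--- trap"):
            if seen_trap:
                break
            seen_trap = True
            continue
        if seen_trap:
            match = FRAME_RE.match(line)
            if match:
                names.append(match.group(1))
    return names

def same_crash_path(bt_a, bt_b, depth=3):
    """True if both backtraces share the same top `depth` faulting frames."""
    a = faulting_path(bt_a)[:depth]
    b = faulting_path(bt_b)[:depth]
    return bool(a) and a == b
```

Only function names are compared, not offsets or frame addresses, since those can differ between crashes even when the code path is the same.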

However, as JimP said, ESXi 6.0 doesn't support FreeBSD 11 and hence pfSense 2.4. You really need to get that upgraded before doing anything else.

Steve