Crash reports being generated, configuration bug?
I submitted a crash report this morning after fighting with what I thought was a bad VM instance. I upgraded my VM install from 2.2.6 to 2.3.4 x64, and that is when the problems began: the system was crashing and rebooting about every 10 minutes. I loaded up a new VM with a fresh install of pfSense 2.3.4 x64 and restored the config that I had backed up from the prior VM; same issue. I then loaded up an old Dell PowerEdge 1850 with pfSense 2.3.4 x64 and restored the config. After that, things seemed solid for a few days, but then I had another crash and reboot last Friday, and another this morning. I've checked the ISO to make sure I didn't get a bad download.
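For anyone wanting to repeat the ISO check, here is a minimal sketch in Python. It assumes the download page publishes a SHA-256 digest for the image; the function names are my own and the file name in the usage note is only a placeholder, not the real image name.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large ISO is never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare the local digest with the published one, ignoring case and whitespace."""
    return sha256_of(path) == published_hex.strip().lower()
```

Usage would look something like `matches_published("pfSense-installer.iso", "<digest copied from the download page>")`, which returns True only when the download is byte-for-byte intact.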
Listing order of actions for readability:
VM (ESXi 5.5) had been running x64 2.2.6 for years, solid as a rock.
Upgraded VM x64 2.2.6 to x64 2.3.4 using built-in upgrade tool. Crashing started.
Loaded up fresh VM with x64 2.3.4 and restored config from previous VM. Crashing continued.
Loaded up physical box with x64 2.3.4 and restored config. Stable for a few days, then crashing continued.
Any devs available to look up my crash report and possibly give me some insight? Would have been submitted from 184.108.40.206 this morning around 8:30am MDT (-6).
I've downloaded the config and combed over it but see nothing out of the norm. My next step is going to be a fresh install and reconfiguring by hand using the prior config XML as a guide. Hoping I don't need to put the time in to do that unless the crash report reveals nothing.
Did you remove all conflicting packages before upgrading?
There are numerous packages that went extinct after 2.2.6.
Also, there might be issues with limiters and NAT (not certain whether that has been sorted or not).
I only had one package, the OpenVPN export tool package. I did not remove it before upgrading.
EDIT: I'll go through my config and check for limiters, I do not think I was using any.
EDIT2: No limiters are set up, and none before the upgrade either.
You'll have to post the details of the crash report here, or at least the first few parts of the IP address of the firewall (IPv4 and IPv6) so we can find the report. I looked for a report from the IP address on your forum post, but there were none from that subnet.
Without the contents of the crash report, it's not possible to accurately speculate about anything that is happening.
The only random bit of speculation I can offer is that unless you have updated ESX 5.5 to the latest patches, it may not be compatible with FreeBSD 10.x, and definitely not FreeBSD 11. Assuming it's an OS-level crash, I'd look at upgrading your ESX install to at least 6.0.x first.
Thank you for the info, jimp. I was worried it was something strange with ESXi, which is why we loaded it on physical hardware. I'm thinking we can rule that out unless there's something strange left in my config from having been previously used on ESXi.
It crashed again about 5 minutes ago so I have a fresh report I can paste here.
EDIT: Let me attach as a txt file. Pasting into the response was a terrible idea. :)
Crash report attached.
Just to add more info that may or may not be useful, the configuration is pretty standard. We have 5 public IPs, 4 of which are set up as Proxy ARP. We have 3 OpenVPN server instances, 2 of them site-to-site and one remote access. There is still only one package loaded, the OpenVPN client export.
Current hardware is a Dell PowerEdge 1850, dual Xeon 3.2 GHz, 4 GB ECC. Full install to an HDD mirror provided by the built-in PERC controller.
Things to note: PowerD is enabled. SSH is currently enabled with LAN access only. Static Route Filtering is enabled. IPv6 is blocked. No traffic shaping or limiters. Everything else is default out of box config iirc.
That seems familiar but I can't recall the specific cause…
em0: discard frame w/o packet header
db:0:kdb.enter.default> bt
Tracing pid 0 tid 100035 td 0xfffff80003b5a4b0
pmap_kextract() at pmap_kextract+0x3c/frame 0xfffffe0120580920
bounce_bus_dmamap_load_buffer() at bounce_bus_dmamap_load_buffer+0x1bb/frame 0xfffffe0120580990
bus_dmamap_load_mbuf_sg() at bus_dmamap_load_mbuf_sg+0x72/frame 0xfffffe01205809f0
lem_get_buf() at lem_get_buf+0x92/frame 0xfffffe0120580a50
lem_rxeof() at lem_rxeof+0x1cf/frame 0xfffffe0120580af0
lem_handle_rxtx() at lem_handle_rxtx+0x33/frame 0xfffffe0120580b30
taskqueue_run_locked() at taskqueue_run_locked+0xe5/frame 0xfffffe0120580b80
taskqueue_thread_loop() at taskqueue_thread_loop+0xa8/frame 0xfffffe0120580bb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe0120580bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0120580bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
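If it helps when comparing several crash reports, here is a rough Python sketch that pulls the function names out of a backtrace like the one above and picks out the frames belonging to a given driver. The regular expression is only an assumption based on the frame format seen in this one report, and the helper names are made up for illustration.

```python
import re

# Matches frames such as "lem_rxeof() at lem_rxeof+0x1cf/frame 0xfffffe0120580af0".
# The pattern is inferred from this single report and may not cover other layouts.
FRAME_RE = re.compile(r"(\w+)\(\) at \w+\+(0x[0-9a-f]+)/frame (0x[0-9a-f]+)")


def parse_backtrace(text):
    """Return (function, offset, frame_address) tuples, innermost frame first."""
    return [
        (m.group(1), int(m.group(2), 16), m.group(3))
        for m in FRAME_RE.finditer(text)
    ]


def functions_with_prefix(frames, prefix):
    """List the function names in the trace that start with a driver prefix."""
    return [name for name, _offset, _addr in frames if name.startswith(prefix)]


# Three frames copied from the report; works whether they sit on one line or many.
SAMPLE = (
    "pmap_kextract() at pmap_kextract+0x3c/frame 0xfffffe0120580920 "
    "bus_dmamap_load_mbuf_sg() at bus_dmamap_load_mbuf_sg+0x72/frame 0xfffffe01205809f0 "
    "lem_get_buf() at lem_get_buf+0x92/frame 0xfffffe0120580a50"
)

if __name__ == "__main__":
    frames = parse_backtrace(SAMPLE)
    print(functions_with_prefix(frames, "lem_"))
```

Running the same check over each submitted report would show quickly whether every crash passes through the `lem_` receive path or whether some of them come from somewhere else.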
Seems to be close to https://redmine.pfsense.org/issues/6330 which ended up being a hardware issue IIRC. In this case since the hardware is virtual, I'd go back to leaning toward an ESX version issue.
Also similar to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=213257 but that's still an open issue in the e1000 driver, not clear why you might be hitting that one then, if it is.
You could try switching to vmxnet NICs, but I'd focus more on getting up to ESX 6.x if that's feasible. At least see if you can reproduce it on ESX 6.x in lab conditions.
Oh, very interesting. So the issue may have followed me to physical hardware, as I believe the NICs in my PE 1850 are Intel PRO/1000 adapters that use the same driver.
We are readying an APU2 to try instead of the PE 1850, but your suggestion of switching to the VMXNet adapter may be a quicker determination of NIC issues.
Thank you Jimp! This by far makes more sense of the issue than anything else I've done on my own.
Just going to follow up on this and bring some closure to this thread. I continued to have crashes with the APU2 unit as well. I tried a fresh install and reconfiguration by hand instead of restoring the config, which still resulted in many crashes per day. We resorted to OPNsense and reconfigured by hand, and things have been stable since deploying it on the APU2 this past Sunday. I did submit a few more crash reports in hopes that there would be some key info there to help the guys behind pfSense, if it is indeed some kind of bug. I will revisit this issue when I can afford some more downtime, or when 2.4 is released.
Thanks for all of the input and help, sorry we couldn't get it figured out. Some kind of bizarre quirk specific to my configuration/environment I'm sure.