pfSense 2.7.2 in Hyper-V freezing with no crash report after reboot
-
Hello,
thanks for your reply.
Those are connections from our monitoring server. No other users conncted to the system before the freeze.
-
Anything shown at the console? Can you setup a serial console and log that output?
-
Yes I checked console connection, but all you see are syslogd messages of user logins sometimes multiple days old.
The affected systems are unfortunately all live systems, which have to be made functional again as soon as possible, so I don't have much time for longer tests.
I have not yet been able to cause a freeze on my test systems.
Normally a freeze occurs every 2 - 4 weeks. However, there is no guarantee of this.
-
We have currently moved the HA proxy from one of the affected systems to another system to see if the freezes are still occurring
Perhaps this will help to narrow down the potential sources of error
-
@Techniker_ctr said in pfSense 2.7.2 freezing with no crash report after reboot:
The console is also no longer accessible in some cases.
Hmm, so sometimes the console was still responsive? I assume you were not able to connect out at that point?
-
It seems so. In all the freezes I've seen, the console was no longer functional, but my colleague said he could access a device via console.
At the next freeze I will actively check if more info is available via console
-
Make sure you try using
ctl+t
at the console. That can often show a stuck process when nothing else is responding. -
Hey,
We had another outage today, unfortunately no further information could be retrieved via the console.
The system that failed is an updated system that runs on ufs and does not use a ha proxy.
Failure pattern is the same again. Normal load and suddenly no response from the system and total traffic failure. As alsways no crash report.
The following are the system loads
RAM
Proccesses:
Any other Idea how we can track down the cause?
-
Do you see anything in Hyper-V? Usage spikes etc?
-
I appreciate your help
I have checked our Grafana for the HyperV again and can indeed detect a CPU spike on the system, just before the freeze.
And a complete loss of traffic
-
Hmm, so like some process uses all the CPU cycles until it's rebooted. Yet nothing is logged....
-
Today we had another freeze. Same symptomps, no logs, no console. All we see is a CPU Spike right before the freeze.
Any other Ideas how to continue? Are we the only ones with this issue?
We are currently running 71 pfSense 2.7.2 and around 10 of them are showing the issue. We tried reinstalls on fresh 2.7.2 images to make sure there is no issue with the updating from 2.6.
All our 2.6 and 2.5.2 System run without these issues.
-
we just had another freeze. This time i have a couple of new errors in the logs before the logs completely stop.
Maybe this helps narrowing it down.
Nov 22 16:55:35 kernel Copyright (c) 1992-2023 The FreeBSD Project. Nov 22 16:55:35 kernel ---<<BOOT>>--- Nov 22 16:55:35 syslogd kernel boot file is /boot/kernel/kernel Nov 22 16:52:00 sshd 11196 Accepted publickey for root from xxx.xxx.xxx.xxx port 6014 ssh2: RSA SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxx Nov 22 16:51:56 sshd 8691 Accepted publickey for root from xxx.xxx.xxx.xxx port 13587 ssh2: RSA SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxx Nov 22 16:51:19 pfSctl 2755 could not finish read in a reasonable time. Action of event might not be completed. Nov 22 16:51:16 sshd 53470 Disconnected from user root xxx.xxx.xxx.xxx port 52790 Nov 22 16:51:16 sshd 53470 Received disconnect from xxx.xxx.xxx.xxx port 52790:11: disconnected by user Nov 22 16:51:16 sshd 26642 fatal: Timeout before authentication for xxx.xxx.xxx.xxx port 27224 Nov 22 16:51:16 rc.gateway_alarm 22904 >>> Gateway alarm: WANGWv6 (Addr:xxx:xxx:xxx::xxx Alarm:0 RTT:50.416ms RTTsd:167.822ms Loss:0%) Nov 22 16:50:46 php-fpm 60747 /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use GW_WAN. Nov 22 16:50:46 check_reload_status 416 Reloading filter Nov 22 16:50:46 check_reload_status 416 Restarting OpenVPN tunnels/interfaces Nov 22 16:50:46 check_reload_status 416 Restarting IPsec tunnels Nov 22 16:50:46 check_reload_status 416 updating dyndns GW_WAN Nov 22 16:50:46 check_reload_status 416 updating dyndns WANGWv6 Nov 22 16:50:46 sshd 50953 Disconnected from user root xxx.xxx.xxx.xxx port 60047 Nov 22 16:50:46 sshd 50953 Received disconnect from xxx.xxx.xxx.xxx port 60047:11: disconnected by user Nov 22 16:50:46 rc.gateway_alarm 10435 >>> Gateway alarm: GW_WAN (Addr:xxx.xxx.xxx.xxx Alarm:0 RTT:52.696ms RTTsd:60.464ms Loss:0%) Nov 22 16:50:38 sshd 17812 fatal: Timeout before authentication for xxx.xxx.xxx.xxx port 33797 Nov 22 16:50:33 sshd 18507 fatal: Timeout before authentication for xxx.xxx.xxx.xxx port 39653 Nov 22 16:50:28 sshd 18455 fatal: Timeout before authentication for xxx.xxx.xxx.xxx port 21224 Nov 22 16:50:28 sshd 17705 fatal: Timeout before authentication for xxx.xxx.xxx.xxx port 1337 Nov 22 16:50:23 sshd 25669 Accepted publickey for root from xxx.xxx.xxx.xxx port 11618 ssh2: RSA SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxx Nov 22 16:50:01 pfSctl 17188 could not finish read in a reasonable time. Action of event might not be completed. Nov 22 16:49:58 sshd 99832 Disconnected from user root xxx.xxx.xxx.xxx port 60268 Nov 22 16:49:58 sshd 99832 Received disconnect from xxx.xxx.xxx.xxx port 60268:11: disconnected by user Nov 22 16:49:58 sshd 18141 Accepted publickey for root from xxx.xxx.xxx.xxx port 15849 ssh2: RSA SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxx Nov 22 16:49:58 sshd 1365 Disconnected from user root xxx.xxx.xxx.xxx port 53098 Nov 22 16:49:58 sshd 1365 Received disconnect from xxx.xxx.xxx.xxx port 53098:11: disconnected by user Nov 22 16:49:58 sshd 53096 Disconnected from user root xxx.xxx.xxx.xxx port 65117 Nov 22 16:49:58 sshd 53096 Received disconnect from xxx.xxx.xxx.xxx port 65117:11: disconnected by user Nov 22 16:49:58 sshd 97774 fatal: Timeout before authentication for xxx.xxx.xxx.xxx port 9367 Nov 22 16:49:58 rc.gateway_alarm 69423 >>> Gateway alarm: WANGWv6 (Addr:xxx:xxx:xxx::xxx Alarm:1 RTT:26.765ms RTTsd:25.516ms Loss:21%) Nov 22 16:48:51 check_reload_status 416 Reloading filter Nov 22 16:48:51 check_reload_status 416 Restarting OpenVPN tunnels/interfaces Nov 22 16:48:51 check_reload_status 416 Restarting IPsec tunnels Nov 22 16:48:51 check_reload_status 416 updating dyndns GW_WAN Nov 22 16:48:51 sshd 50248 Disconnected from user root xxx.xxx.xxx.xxx port 25550 Nov 22 16:48:51 sshd 50248 Received disconnect from xxx.xxx.xxx.xxx port 25550:11: disconnected by user Nov 22 16:48:51 sshd 51781 Disconnected from user root xxx.xxx.xxx.xxx port 45454 Nov 22 16:48:51 rc.gateway_alarm 59700 >>> Gateway alarm: GW_WAN (Addr:xxx.xxx.xxx.xxx Alarm:1 RTT:509.029ms RTTsd:1028.814ms Loss:0%) Nov 22 16:48:51 sshd 51781 Received disconnect from xxx.xxx.xxx.xxx port 45454:11: disconnected by user Nov 22 16:48:41 sshd 1365 Accepted publickey for root from xxx.xxx.xxx.xxx port 53098 ssh2: RSA SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxx Nov 22 16:48:35 sshd 99832 Accepted publickey for root from xxx.xxx.xxx.xxx port 60268 ssh2: RSA SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxx Nov 22 16:48:29 sshd 97374 Accepted publickey for root from xxx.xxx.xxx.xxx port 55569 ssh2: RSA SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxx Nov 22 16:48:22 sshd 53683 fatal: Timeout before authentication for xxx.xxx.xxx.xxx port 4106 Nov 22 16:48:22 sshd 52565 fatal: Timeout before authentication for xxx.xxx.xxx.xxx port 34558 Nov 22 16:47:53 sshd 78140 Disconnected from user root xxx.xxx.xxx.xxx port 49005 Nov 22 16:47:53 sshd 78140 Received disconnect from xxx.xxx.xxx.xxx port 49005:11: disconnected by user
-
Hmm, unfortunately nothing there really looks like an issue. Certainly not something that would cause it to stop responding entirely.
We could try running dtrace against it to see what's using the cycles but I think we'd need some way to trigger it before it stopped responding.
If you create a test instance in the same hypervisor does that also stop responding?
-