Partial lock-up

clarknova

2.0-BETA5 (amd64)
built on Sat Jan 29 18:46:16 EST 2011

Working remotely over HTTPS, I had just created a port forward rule with an associated firewall rule (on WAN) and applied the changes. I then went to the WAN firewall rules page, checked the new rule, and hit the move arrow to move it up. At this point the web UI timed out. I tried reloading the page but it timed out again.

I had an ssh session open to a host on pfsense's LAN side. This session remained responsive, but DNS resolution from the LAN host was failing, and I could not get a ping response from pfsense from either the LAN host or the remote host (WAN), although normally I can. I tried pinging pfsense's LAN interface by IP address from the LAN host and this failed too. So pfsense continued routing even though it was not otherwise responding on the WAN or LAN interface.

I called my wife at home and had her check the VGA console, which appeared normal. She hit Enter and the console menu reloaded. At this point I had her reboot using console option 5, which worked.

Unbound is installed, but it was more than DNS that failed, so I don't know if it's to blame.

I have a cron job recording the output of netstat -m to /var/log/netstat-m.log every hour, but the file is missing after the reboot. I'm not sure if that's related but I find it odd.

I looked in /var/crash but the only thing there is a 5-byte file, 'minfree', that contains only '2048'.

Is there something else I can look at for clues to the cause of this? Anything else I can check on the console before rebooting if it happens again?

jimp

Could have been mbuf exhaustion or some other similar issue… things to look for might be:

netstat -m
netstat -ni
top -SH

See if anything unusual is taking up the CPU time.

Upgrade ASAP to a snapshot from Feb 2 or later, though, so you can rule out the FTP proxy as a possible cause as well.

/var/crash only gets data in the case of a kernel panic, not a slowdown or hang.

clarknova

I was recording netstat -m to a log file hourly. Not sure why the file disappeared on reboot, but mbufs were x/7500 minutes before the wheels came off. The "max mbuf clusters" value reported by netstat -m is invariably 32768, and my total was nowhere near that number shortly before the problem occurred.
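For comparing the logged totals against the maximum, a small filter can pull the cluster line out of each netstat -m capture. This is only a sketch: it assumes the FreeBSD output contains a line of the form "current/cache/total/max mbuf clusters in use", and the function name mbuf_cluster_usage is made up for illustration.

```shell
#!/bin/sh
# Sketch: read `netstat -m` output on stdin and report how close the
# mbuf cluster total is to the configured maximum.
# Assumes a line such as:
#   1024/2306/3330/32768 mbuf clusters in use (current/cache/total/max)
# mbuf_cluster_usage is a hypothetical helper name, not a pfSense tool.
mbuf_cluster_usage() {
    awk '/mbuf clusters in use/ {
        # The first field holds current/cache/total/max separated by "/".
        split($1, f, "/")
        if (f[4] > 0) {
            printf "clusters total=%d max=%d (%.1f%% of max)\n", f[3], f[4], 100 * f[3] / f[4]
            found = 1
        }
    }
    # Exit non-zero if no usable cluster line was seen.
    END { if (!found) exit 1 }'
}
```

Something like `netstat -m | mbuf_cluster_usage` on the console would then give a one-line summary, and the same filter can be run over a saved log.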

I will update to the latest snap tonight and try again. Do you know why my log file (/var/log/netstat-m.log) would have disappeared during the lock-reboot process? This is a full install.

jimp

The log folder is usually kept, I thought… though the actual clog files are reset.

You might try writing it in /root/ just to see if it makes a difference.
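Something along these lines could do the logging. It is a rough sketch, not anything shipped with pfSense: the snapshot function name is invented, and the log path is whatever you pass in.

```shell
#!/bin/sh
# Sketch: append a timestamped copy of a command's output to a log file.
# Usage: snapshot LOGFILE COMMAND [ARGS...]
# "snapshot" is a hypothetical helper name chosen for this example.
snapshot() {
    logfile=$1
    shift
    {
        # Header line with the time and the command that was run.
        printf '=== %s: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
        # Keep stderr too, and note a failure instead of losing it silently.
        "$@" 2>&1 || printf '(command exited with status %s)\n' "$?"
    } >> "$logfile"
}
```

Saved as a script with a final line such as `snapshot /root/netstat-m.log netstat -m`, it could be run from the hourly cron job. The same function can be called by hand from the console before rebooting, for example with `netstat -ni`, so the output survives the reboot.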

_igor_

I have had the same lock-ups since the Feb 02 update. The web interface completely stops responding. Normally it works again after I restart the webConfigurator twice (!) and wait about 30 seconds to 1 minute. There are no log entries, not even one recording the restart of lighty. Strange.

It happens really often: sometimes when I open the log pages, sometimes when I open a service page, so it can hit any page I try to load. top shows nothing: 100% idle. Everything else seems to work normally: I can browse the internet and all services appear to run as usual.

Today when it happened, unbound died at the same moment. That has never happened before, so it may have been pure coincidence.

jimp

Update to the Feb 3 snapshot and see if it can still be reproduced. I'm not sure if it'll really impact this particular issue, but several other patches were added yesterday evening to fix other issues.

mromero

On the i386 snapshot of 3 Feb I just had a hard lockup.

On reboot I noticed a "No Core Dump" message as the boot messages were flying past.

Borat still smiling.

Upgrading to today's snapshot. Wish me luck.