Increasing mbuf and state table size precede total lockup

clarknova

2.0-BETA4 (i386)
built on Wed Sep 8 14:34:56 EDT 2010
FreeBSD 8.1-RELEASE
Platform nanobsd (2g)
net5501 on CF (UDMA)

I was using the Aug 30 snapshot. Up until 2 days ago my mbuf usage, as reported on the dashboard, was fairly steady, usually around x/3000 or so. State table normally ran around 3000/48000.

Two days ago I noticed mbuf usage and state table size were much higher, and yesterday higher again, around x/12000 and 11000/48000 respectively. CPU, RAM, disk usage appeared normal (20/23/13).

Late last night (thank goodness) pfsense became unresponsive: no DNS forwarder, no routing, no sshd, no serial console, no ping response, no web UI. Totally locked as far as I could tell (forgot to check arp response, but I doubt it). I rebooted it and immediately updated to the latest snap.

I had remote syslogging enabled. These are the final entries immediately before lockup:

$ grep pfsense /var/log/syslog.1 | less

Sep  8 17:02:01 pfsense kernel: Bump sched buckets to 64 (was 0)
Sep  8 23:09:51 pfsense ppp: [wan_link0] LCP: no reply to 1 echo request(s)
Sep  8 23:09:51 pfsense ppp: [wan_link4] LCP: no reply to 1 echo request(s)
Sep  8 23:09:51 pfsense ppp: [wan_link3] LCP: no reply to 1 echo request(s)
Sep  8 23:09:51 pfsense ppp: [wan_link1] LCP: no reply to 1 echo request(s)
Sep  8 23:09:51 pfsense ppp: [wan_link2] LCP: no reply to 1 echo request(s)
Sep  8 23:09:51 pfsense ppp: [wan_link5] LCP: no reply to 1 echo request(s)
Sep  8 23:10:01 pfsense ppp: [wan_link4] LCP: no reply to 2 echo request(s)
Sep  8 23:10:01 pfsense ppp: [wan_link0] LCP: no reply to 2 echo request(s)
Sep  8 23:10:01 pfsense ppp: [wan_link2] LCP: no reply to 2 echo request(s)
Sep  8 23:10:01 pfsense ppp: [wan_link1] LCP: no reply to 2 echo request(s)
Sep  8 23:10:01 pfsense ppp: [wan_link3] LCP: no reply to 2 echo request(s)
Sep  8 23:10:01 pfsense ppp: [wan_link5] LCP: no reply to 2 echo request(s)
Sep  8 23:10:11 pfsense ppp: [wan_link0] LCP: no reply to 3 echo request(s)
Sep  8 23:10:11 pfsense ppp: [wan_link4] LCP: no reply to 3 echo request(s)
Sep  8 23:10:11 pfsense ppp: [wan_link3] LCP: no reply to 3 echo request(s)
Sep  8 23:10:11 pfsense ppp: [wan_link1] LCP: no reply to 3 echo request(s)
Sep  8 23:10:11 pfsense ppp: [wan_link2] LCP: no reply to 3 echo request(s)
Sep  8 23:10:11 pfsense ppp: [wan_link5] LCP: no reply to 3 echo request(s)
Sep  8 23:10:21 pfsense ppp: [wan_link4] LCP: no reply to 4 echo request(s)
Sep  8 23:10:21 pfsense ppp: [wan_link0] LCP: no reply to 4 echo request(s)
Sep  8 23:10:21 pfsense ppp: [wan_link2] LCP: no reply to 4 echo request(s)
Sep  8 23:10:21 pfsense ppp: [wan_link1] LCP: no reply to 4 echo request(s)
Sep  8 23:10:21 pfsense ppp: [wan_link3] LCP: no reply to 4 echo request(s)
Sep  8 23:10:21 pfsense ppp: [wan_link5] LCP: no reply to 4 echo request(s)
Sep  8 23:10:31 pfsense ppp: [wan_link0] LCP: no reply to 5 echo request(s)
Sep  8 23:10:31 pfsense ppp: [wan_link0] LCP: peer not responding to echo requests
Sep  8 23:10:31 pfsense ppp: [wan_link0] LCP: state change Opened --> Stopping
Sep  8 23:10:31 pfsense ppp: [wan_link0] Link: Leave bundle "wan"

All this tells me is that my MLPPP links all went unresponsive at around the same time. The switch and modems did not go offline, and everything was functioning immediately after power-cycling pfsense, so it's fair to say the problem was with pfsense, or at least it was apparently the only victim.
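For anyone wanting to confirm that the links really dropped together, here is a minimal sketch that pulls the first missed LCP echo per link out of a syslog file. The function name first_missed_echo is made up for illustration, and it assumes log lines shaped exactly like the excerpt above.

```shell
#!/bin/sh
# For each ppp link, print the timestamp of its first missed LCP echo.
# Assumes syslog lines shaped like the excerpt above:
#   Sep  8 23:09:51 pfsense ppp: [wan_link0] LCP: no reply to 1 echo request(s)
# first_missed_echo is a hypothetical helper name, not part of pfSense.
first_missed_echo() {
    awk '/LCP: no reply to 1 echo request/ {
        link = $6
        gsub(/\[|\]/, "", link)      # strip the surrounding brackets
        if (!(link in seen)) {
            seen[link] = 1
            print link, $1, $2, $3   # link name, month, day, time
        }
    }' | sort
}
```

Usage would be something like `grep pfsense /var/log/syslog.1 | first_missed_echo`. If every link shows the same timestamp, that points at the box itself rather than at any single modem or line.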

Today again (on the new snapshot) it appears that mbuf and state table count are steadily increasing, presently:


State table size: 4152/48000
MBUF usage: 551/1155

Which is already higher than I'd ever seen prior to 2 days ago.

Any idea what's going on here, how to diagnose or correct it?

Thanks.

jimp

The cause of the state table growth can be found by looking at the state table itself and/or Diagnostics > States Summary.

If something is causing an unusually high traffic load (e.g. a virus), it should stand out in the state summary as one source with many destinations.
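The same check can be roughed out from a shell. This is only a sketch: count_states_by_source is a made-up helper name, and it assumes that `pfctl -ss` prints one state per line with the arrow pointing from source to destination. The exact layout differs between pf versions and when NAT is involved, so adjust the field handling to what your box actually prints.

```shell
#!/bin/sh
# Count pf states per source address, busiest sources first.
# Assumed (simplified) line layouts, third field onward:
#   all tcp 192.168.1.10:51234 -> 203.0.113.5:80   ESTABLISHED:ESTABLISHED
#   all udp 203.0.113.7:53 <- 192.168.1.20:4000    SINGLE:NO_TRAFFIC
# Assumption: the arrow points from the source to the destination.
# count_states_by_source is a hypothetical helper, not a pfSense tool.
count_states_by_source() {
    awk '{
        src = ""
        for (i = 4; i < NF; i++) {
            if ($i == "->") { src = $3; break }        # source is on the left
            if ($i == "<-") { src = $(i + 1); break }  # source is on the right
        }
        if (src == "") next           # skip lines that do not look like states
        sub(/:[0-9]+$/, "", src)      # strip the trailing :port (IPv4 form)
        count[src]++
    }
    END { for (s in count) print count[s], s }' | sort -rn
}
```

Run it as `pfctl -ss | count_states_by_source | head`. One internal address holding most of the states would be the kind of outlier described above.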

If that is the case, it shouldn't lock up quite so easily, but more detail is definitely required to know for sure.

clarknova

Thanks for your reply.

What about mbuf usage? It has climbed steadily to 1252/1920 at present. In a week or so I expect it to lock up again if I don't reboot it preemptively.

jimp

mbuf usage is a little harder to judge. Some systems ride close to the max with almost no load (or seem to anyhow) but it doesn't really mean there is a problem.
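Since a single reading is hard to judge, one option is to record the trend over time. Below is a minimal sketch, assuming the FreeBSD `netstat -m` layout where the first line reads like "1252/668/1920 mbufs in use (current/cache/total)". parse_mbufs and the mbuf-trend tag are names invented here for illustration.

```shell
#!/bin/sh
# Print "current cache total" from the "mbufs in use" line of `netstat -m`.
# Assumed FreeBSD layout:
#   1252/668/1920 mbufs in use (current/cache/total)
# parse_mbufs is a hypothetical helper name.
parse_mbufs() {
    awk '/mbufs in use/ {
        split($1, n, "/")        # first field holds the three counters
        print n[1], n[2], n[3]
        exit
    }'
}

# Hypothetical usage from cron, sending each sample to syslog so that a
# remote syslog server keeps the history even after a hard lockup:
#   netstat -m | parse_mbufs | logger -t mbuf-trend
```

Sending the samples through syslog rather than to a local file is a deliberate choice: on nanobsd a local log may not survive the reboot, while remotely logged entries would show whether the counters climb steadily or jump shortly before a hang.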

clarknova

The problem in this case is that pfsense locked up, and the only thing different that I can see is steadily climbing mbuf numbers.

clarknova

This does not appear to be an issue on the September 13 snapshot.

clarknova

I spoke too soon. Not only did this continue to be an issue on nanobsd/net5501, but I changed hardware and software version and it continues to be a problem.

I'm now running 2.0-BETA4 (i386)
built on Thu Oct 14 01:16:12 EDT 2010
FreeBSD 8.1-RELEASE-p1

on a SM X7SPA-H (Atom D510, 4GB) and seeing the exact same symptoms: mbuf usage increases steadily until uptime reaches approx 7 days, then total hard lockup.