Buffer errors, packet loss, latency



  • 2.0-RC1 (amd64)
    built on Sun Feb 13 23:53:14 EST 2011

    Lately I'm seeing daily spikes in latency on my RRD quality graphs. Healthy rtt would be 40-50 ms, but I'm getting ~30% of the samples over 100 ms. Normally I fix this by lowering the bandwidth of my HFSC parent queue, but that doesn't appear to help at all, suggesting it's not a congestion issue this time.

    When I manually ping my WAN gateway, I get 2.2% packet loss, high jitter, and occasionally this:

    
    ping: sendto: No buffer space available
    
    

    Some search results from 2007 suggest it might be a states table exhaustion problem, but the Dashboard is currently reporting 57061/389000 states and 9% memory usage, so it doesn't seem likely.

    Could this just be a symptom of a DSL problem? It wouldn't be out of the question, but I don't see any other evidence of that just yet.

    Any ideas?



  • I've seen a number of reports of what seems like a memory (mbuf) leak.  What are the mbuf stats?



  • 4062/4865 currently, but always growing (since August). I've seen them get up to 25,000+ before the whole thing tanks (I don't know whether it was a freeze, panic, or something else at the time). I have a cron job that records the output of netstat -m to a file hourly, if you think it might be informative.
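For reference, an hourly recording job like the one described might look like this crontab entry (the log path here is an assumption, not the poster's actual path):

```shell
# Append mbuf statistics to a log once an hour (path illustrative)
0 * * * * /usr/bin/netstat -m >> /var/log/netstat-m.log
```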



  • Yes please, and also show what services you are running on the box.
    Also the type of NICs.



  • Intel 82574L Gigabit Ethernet x2 (em)
    Unbound and Cron packages installed
    Status: Services shows:

    cron: The cron utility is used to manage commands on a schedule.
    dhcpd: DHCP Service
    ntpd: NTP clock sync
    openvpn: OpenVPN client
    unbound: Unbound is a validating, recursive, and caching DNS resolver.

    all running. Does that answer your question about which services I'm running?

    I cleared the attached log when I updated to the RC, 1 day 21 hours ago.

    netstat-m.log.txt



  • I'm now thinking this doesn't have anything to do with uptime. Although a reboot appeared to fix the latency problems in the past, I think it may have been coincidence, because I normally do my updates at off-peak times.

    Could it be something in the way the traffic shaper works? That's what it looks like to me, as if high-priority traffic isn't getting prioritized at all.

    The first screenshot shows the Quality graph. When things are working well it stays in the 40-60 range. The second graph is throughput during the same period. Note that Quality is poor even when the upload rate is below maximum.

    You can see in the throughput graph that I lowered the WAN parent queue from 4000 kbit to 2500 kbit in an attempt to restore quality, but it appears to have had no such effect.






  • Sounds like about the same issue I am having. Probably should merge our threads. My box is also using the em driver; I think that is where the problem lies. I have zero packages and no traffic shaping. It is running mostly defaults.



  • @Kevin:

    Sounds like about the same issue I am having. Probably should merge our threads. My box is also using the em driver; I think that is where the problem lies. I have zero packages and no traffic shaping. It is running mostly defaults.

    I just looked at your thread and I have had very similar symptoms, reported here: http://forum.pfsense.org/index.php/topic,32897.0.html

    I suspect my issues are all related, and they appear to be related to yours too. I'm using the SM X7SPA-H, incidentally.

    On further examination of my packet loss symptoms, it appears that the packet loss and latency problems show up when my uplink is running at max. That would be normal with no QoS in place, but regardless of where I limit my WAN parent queue, as soon as packets start to drop, they drop from queues that shouldn't be full or dropping, such as icmp and voip, which never run up to their full allocation.



  • Mine are X7SPE-HF using dual 82574L Intel NICs. Right now I only have 2 VoIP phones connected, registered to an external server. They lose registration after only a few minutes with no traffic at all.



  • I just realized my traffic shaper is failing to classify a lot of the traffic it used to. So bulk traffic is squeezing out interactive traffic, as they all land in the default queue for reasons unknown to me.



  • Still having the same issue on the latest RC1, March 2.

    Is there any information I can send to help resolve this?

    It passes traffic for a few minutes, then quits. Best I can tell, only the NIC stops.



  • We think there is an mbuf leak in the Intel NIC code. We used the Yandex drivers in 1.2.3 and now I am starting to wish we had done the same for RC1. It looks like we will be importing the Yandex drivers soon, so please keep an eye on the snapshot server. Hopefully by tomorrow.



  • @Kevin:  Since you can reproduce this so quickly please email me at sullrich@gmail.com and I will work with you as soon as we have a new version available.



  • @sullrich:

    We think there is an mbuf leak in the Intel NIC code. We used the Yandex drivers in 1.2.3 and now I am starting to wish we had done the same for RC1. It looks like we will be importing the Yandex drivers soon, so please keep an eye on the snapshot server. Hopefully by tomorrow.

    Great news. Thanks for the update. Please let me know if I can do any testing or provide any info to help.



  • 2.0-RC1 (amd64)
    built on Thu Mar 3 19:27:51 EST 2011

    Although I am thoroughly pleased with the new Yandex driver, I'm still seeing what looks like an mbuf leak. This is from a file that records 'netstat -m' every 4 hours, since my last upgrade.

    [2.0-RC1]:grep "mbuf clusters" netstat-m.log
    8309/657/8966/131072 mbuf clusters in use (current/cache/total/max)
    8389/577/8966/131072 mbuf clusters in use (current/cache/total/max)
    8484/610/9094/131072 mbuf clusters in use (current/cache/total/max)
    8630/720/9350/131072 mbuf clusters in use (current/cache/total/max)
    8815/663/9478/131072 mbuf clusters in use (current/cache/total/max)
    8958/744/9702/131072 mbuf clusters in use (current/cache/total/max)
    9055/775/9830/131072 mbuf clusters in use (current/cache/total/max)
    9086/744/9830/131072 mbuf clusters in use (current/cache/total/max)
    9192/766/9958/131072 mbuf clusters in use (current/cache/total/max)
    9331/771/10102/131072 mbuf clusters in use (current/cache/total/max)
    9627/731/10358/131072 mbuf clusters in use (current/cache/total/max)
    9873/757/10630/131072 mbuf clusters in use (current/cache/total/max)
    
    
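A quick way to see the leak rate is to diff the "current" field between consecutive samples. Here is a minimal sketch using awk; the sample data is copied from the first lines of the log above, and the file path is illustrative:

```shell
# Extract the "current" count (first /-separated field) from saved
# netstat -m samples and print the growth between consecutive samples.
cat > /tmp/netstat-m.sample <<'EOF'
8309/657/8966/131072 mbuf clusters in use (current/cache/total/max)
8389/577/8966/131072 mbuf clusters in use (current/cache/total/max)
8484/610/9094/131072 mbuf clusters in use (current/cache/total/max)
EOF
awk -F/ 'NR > 1 { print $1 - prev } { prev = $1 }' /tmp/netstat-m.sample
# prints 80, then 95: clusters gained per 4-hour sample
```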


  • I am still seeing the same issues as of the March 15 snapshot.  Traffic stops passing after a short time.  I will upgrade again tomorrow.

    Any more insight or ideas on what is happening? The box is still connected via serial port to a PC Ermal has remote access to.



  • Hello,
    I am using intel nics on the x64 rc1 snapshots (2 x Intel PRO/1000 MT Dual Port Server Adapter + 1 Intel PRO/1000 on the motherboard) and I am seeing my MBUF growing every time… I confirm... for example now on the dashboard I read:
    mbuf usage: 5153 /6657

    even if the firewall has only 76 states active... but the number/max is growing every hour... even if I don't have any problem related to this, traffic passes with no problems...



  • @mdima:

    even if I don't have any problem related to this, traffic passes with no problems…

    The problem comes when the mbufs in use (the numbers you see) run into the max (which you don't see, unless you run 'netstat -m' from the shell), at which point everything stops rather precipitously.



  • @clarknova:

    @mdima:

    even if I don't have any problem related to this, traffic passes with no problems…

    The problem comes when the mbufs in use (the numbers you see) run into the max (which you don't see, unless you run 'netstat -m' from the shell), at which point everything stops rather precipitously.

    I know, I don't have any problems at all; it's just that I see the mbuf count growing constantly, and the values are very high compared with another firewall I am using (x86 RC1 with 3Com NICs, which sits around 200-300 mbufs, max 1200).

    On my x64 RC1, netstat -m reports:

    5147/1510/6657 mbufs in use (current/cache/total)
    5124/1270/6394/25600 mbuf clusters in use (current/cache/total/max)

    I don't know if it's normal or not; I just confirm that with Intel NICs on x64 these values seem to be growing constantly…



  • @mdima:

    5124/1270/6394/25600 mbuf clusters in use (current/cache/total/max)

    I think when that 6394 hits 25600 you will see a panic. You can bump up that 25600 value by putting

    
    kern.ipc.nmbclusters="131072"
    
    

    (or the value of your choice) into /boot/loader.conf.local and reboot. This uses more RAM, but not a lot more, and buys you time between reboots.
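As a back-of-envelope illustration of how much time the larger ceiling buys: assuming the leak continues at roughly the rate seen in the log earlier in the thread (around 256 clusters per 4-hour sample; all numbers here are assumptions for the sketch, not measurements):

```shell
# Rough estimate of hours until the mbuf cluster ceiling is reached.
# Inputs are assumptions taken loosely from the log earlier in the thread.
max=131072          # kern.ipc.nmbclusters ceiling
total=10630         # clusters currently allocated
per_sample=256      # clusters leaked per sample
hours_per_sample=4  # sampling interval
samples_left=$(( (max - total) / per_sample ))
echo "$(( samples_left * hours_per_sample )) hours until exhaustion"
# prints: 1880 hours until exhaustion
```

At that rate the larger ceiling postpones the crash by a couple of months, which matches the "buys you time between reboots" point above, but it doesn't fix the leak itself.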



  • 
    kern.ipc.nmbclusters="131072"
    
    

    Thanks, I put this setting in System > Advanced > System Tunables, and now netstat -m shows 131072 as the max value…
    Anyway, I hope this problem will be solved, because from what I understand, if there's a buffer leak, any value you set will be reached sooner or later...

