Heavy Disk I/O

pdg

Hi,

I am new to pfSense and I am seeing heavy disk I/O every few minutes. My config is below; I would like to know how to go about diagnosing the root cause of this problem.

Please note that the disk I/O is so high that the system stalls completely: RRD graphs show gaps, and keyboard input on the console gets no response. However, commands that are already running, such as top and iostat, keep running fine.

H/w
Processor: 3.2GHz i5 4th generation (CPU Intel 4570)
RAM : 16 GB
HDD : 500 GB * 2 (GEOM Mirror)

Pkgs Installed:
arping-2.14_1-amd64
bandwidthd-2.0.1_6-amd64
iftop-0.17-amd64
iperf-2.0.5-amd64
lightsquid-1.8_2-amd64
mtr-0.85_1-amd64
nmap-6.47-amd64
ntopng-1.2.1-amd64
p7zip-9.20.1_2-amd64
sarg-2.3.9-amd64
snort-2.9.7.0-amd64
squid-3.4.10_2-amd64
squidguard-squid3-1.4_7-amd64
suricata-2.0.6-amd64
zip-3.0_1-amd64

Thanks,
Pd

heper

Go to the shell/console

type: top -SH -mio

usual suspects: squid/lightsquid/ntop/snort/suricata

Tons of processes use (almost) 100% I/O for a very short time … data wants to move as fast as possible.

Probably one of them is hogging 100% I/O for a longer time. Find it and let us know; perhaps someone can help you out with other settings.

pdg

dear heper,

Thanks for the prompt response. I have run the command. Is there anything specific I should watch for, in terms of fields or values?

Thanks,

pdg

Hi,

I see syncer and bufdaemon hitting 100% usage.

Best,
Pd

heper

I'd look for a process that holds the 100% for more than 10-15 seconds.
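
One way to spot that, as a rough sketch: capture top in batch I/O mode and count how many snapshots in a row a process stays above a threshold. The name sustained_io, the 90% threshold and the 5-snapshot run are illustrative choices, and the awk assumes the FreeBSD io-mode column order (PID USERNAME VCSW IVCSW READ WRITE FAULT TOTAL PERCENT COMMAND); adjust it if your top prints something different.

```shell
# sustained_io THRESHOLD MIN_RUN
# Reads batch-mode `top -m io` output on stdin and prints "PID COMMAND" once a
# process has been at or above THRESHOLD percent for MIN_RUN snapshots in a row.
# Assumptions: every snapshot has a header line starting with "PID",
# PERCENT is column 9 (ending in "%") and COMMAND is column 10.
sustained_io() {
    awk -v t="${1:-90}" -v n="${2:-5}" '
        $1 == "PID" { snap++; next }              # header marks a new snapshot
        $1 ~ /^[0-9]+$/ && $9 ~ /%$/ {
            pct = $9; sub(/%$/, "", pct)          # "96.00%" -> "96.00"
            if (pct + 0 < t) next
            if (last[$1] == snap) next            # PID already counted in this snapshot
            run[$1] = (last[$1] == snap - 1) ? run[$1] + 1 : 1
            last[$1] = snap
            if (run[$1] == n) print $1, $10
        }'
}
```

On the firewall this could be fed with something like `top -b -SH -m io -s 2 -d 30 | sustained_io 90 5`, i.e. 30 snapshots two seconds apart, so five in a row is roughly 10 seconds.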

pdg

I have observed that system response stalls the moment bufdaemon goes 100%.

I find that happening frequently.

Best,

bmeeks

I support the Snort and Suricata packages on pfSense. Neither of those packages will generate much disk I/O unless you are getting huge numbers of alerts per second. If that is true, then they will be busy writing to their log files.

Those two packages can easily be removed and then reinstalled without losing their settings, so you could temporarily remove them to see if that impacted the disk I/O issue.

Bill

pdg

Hi, Snort and Suricata were my first suspects, so I disabled them, but the problem persists.

Any other ideas?

charliem

bufdaemon is an internal daemon started by the kernel. With the HW you listed, it should run very quickly, but it appears to be failing while still holding a lock.

Do you see any HW related error messages in your logs? Do you have remote syslogging enabled? As a test you could use only a single HDD rather than the 2 * mirror.
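
A quick way to check for such messages is to filter the kernel message buffer and the system log for disk-related keywords. This is only a sketch: disk_errors is a made-up helper name, and the keyword list (ATA/CAM timeouts, GEOM_MIRROR events, generic I/O errors) is an assumption about what a failing disk or mirror tends to log, not an exhaustive set. The "timeout" pattern in particular is broad and will also match unrelated lines.

```shell
# disk_errors: read log text on stdin and print lines that look disk-related.
# The pattern list is illustrative; extend it for your controller and driver.
disk_errors() {
    grep -E -i 'GEOM_MIRROR|READ_DMA|WRITE_DMA|FLUSHCACHE|CAM status|timeout|I/O error|g_vfs_done'
}
```

Typical use would be `dmesg | disk_errors`, or `clog /var/log/system.log | disk_errors` if your pfSense version keeps the system log in clog format. `gmirror status` shows whether the mirror itself is degraded or rebuilding.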

pdg

Hi Charliem,

1. I have tried disabling one of the mirrors, but it made no difference
2. I do not have remote syslogging enabled

It may be a long shot, but do you think having multiple VLANs on a single physical NIC could cause this effect in any way? This machine is also the router for the org. Then again, since we only have ~200 machines, I wonder whether it really matters.

Is there anything else I could do to narrow down the root cause?

Thanks!

charliem

Sorry I don't have specific suggestions. Possibilities as I see them are:

Something really is generating so much I/O that it starves the other system threads. The standard way to track these hogs down is with the various systat displays (for example systat -vmstat or systat -iostat). It sounds like you are already doing this. As a side note, I see pfSense does not include the gstat tool.

Or alternately something is wrong and interfering with these system threads, not allowing them to complete in a timely manner. This is where I was thinking HW error (disk i/o), network errors, waiting on remote logging, or something similar.

I guess a third possibility is somehow the number of buffers or size of buffers is too high, an estimation based on RAM amount that goes wrong. But 16G should not be a problem AFAIK.
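
For the first possibility, since iostat keeps running during the stalls, its extended output can be filtered for saturated devices. A minimal sketch, assuming the last column of `iostat -x` is the busy percentage (%b), as on FreeBSD; busy_disks and the 90% default are illustrative.

```shell
# busy_disks THRESHOLD
# Reads `iostat -x` output on stdin and prints "device busy%" for every
# device whose last column (%b on FreeBSD) is at or above THRESHOLD.
# Header lines are skipped because their last field is not a number.
busy_disks() {
    awk -v t="${1:-90}" '
        NF >= 2 && $NF ~ /^[0-9.]+$/ && $NF + 0 >= t { print $1, $NF }'
}
```

Something like `iostat -x -w 2 -c 30 | busy_disks 90` would sample for a minute. Keep in mind that the first report from iostat is an average since boot, not a live reading.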

Multiple VLANs on a single NIC should be OK, and not cause pauses like this, but if you have another NIC to throw in there that might be an easy check. Do you see any I/O errors on the interfaces?

pdg

Hi,

I ran netstat -ni and netstat -s to check for errors or failures, but didn't find any.
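
For reference, a sketch of how the error columns can be checked mechanically rather than by eye. It assumes the FreeBSD `netstat -ni` layout in which the last six columns are Ipkts Ierrs Idrop Opkts Oerrs Coll, and it only looks at the <Link#N> rows; if_errors is a made-up name.

```shell
# if_errors: read `netstat -ni` output on stdin and print
# "iface Ierrs Idrop Oerrs Coll" for link-level rows with any non-zero counter.
# Columns are counted from the right so rows without an address still line up.
if_errors() {
    awk '
        $3 ~ /^<Link#/ {
            ierrs = $(NF-4); idrop = $(NF-3); oerrs = $(NF-1); coll = $NF
            if (ierrs + idrop + oerrs + coll > 0) print $1, ierrs, idrop, oerrs, coll
        }'
}
```

Usage would be `netstat -ni | if_errors`; no output means all link-level error counters are zero.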

Is there anything else I could look at? I am really running out of ideas.

Thanks.

Derelict

top -m io

maybe that'll catch something?

Like heper said.

You're looking for processes using I/O.

I know when I do cat /dev/zero > /root/delete.me cat shoots right to the top. :)

mir

What are the numbers for MBUF usage?
I have heard someone mention that multiport NICs sometimes consume all the mbufs.
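
The numbers can be read from `netstat -m`. As a sketch, the helper below (hypothetical name mbuf_cluster_pct) pulls the "mbuf clusters in use (current/cache/total/max)" line out of that output and prints total as a percentage of max. It assumes that line format, which is what FreeBSD prints.

```shell
# mbuf_cluster_pct: read `netstat -m` output on stdin and print
# "total/max (percent)" for the mbuf cluster pool.
mbuf_cluster_pct() {
    awk '
        / mbuf clusters in use / {
            split($1, v, "/")                     # v[3] = total, v[4] = max
            if (v[4] > 0) printf "%d/%d (%.1f%%)\n", v[3], v[4], 100 * v[3] / v[4]
        }'
}
```

Usage: `netstat -m | mbuf_cluster_pct`. A value that climbs toward 100% around the time of the stalls would point at mbuf exhaustion.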

pdg

Hi
I am not using multiport NICs.
I am attaching the Mbuf graph for the week.

Kernel Value is kern.ipc.nmbclusters="0" as per /boot/loader.conf
and sysctl output is
kern.ipc.nmbclusters: 26584

Looking forward to further guidance.

Thanks,
Pd

[Attachment: MBuf.png]