Heavy Disk I/O



  • Hi,

    I am new to pfSense, and I am seeing heavy disk I/O every few minutes. My config is below; I would like to know how to go about diagnosing the root cause of this problem.

    Please note the disk I/O is so high that the system stalls completely: RRD graphs show gaps, and keyboard input on the console gets no response. However, commands such as top/iostat that are already running keep running fine.

    H/w
    Processor: 3.2GHz i5 4th generation (CPU Intel 4570)
    RAM : 16 GB
    HDD : 500 GB * 2 (GEOM Mirror)

    Pkgs Installed:
    arping-2.14_1-amd64
    bandwidthd-2.0.1_6-amd64
    iftop-0.17-amd64
    iperf-2.0.5-amd64
    lightsquid-1.8_2-amd64
    mtr-0.85_1-amd64
    nmap-6.47-amd64
    ntopng-1.2.1-amd64
    p7zip-9.20.1_2-amd64
    sarg-2.3.9-amd64
    snort-2.9.7.0-amd64
    squid-3.4.10_2-amd64
    squidguard-squid3-1.4_7-amd64
    suricata-2.0.6-amd64
    zip-3.0_1-amd64

    Thanks,
    Pd



  • Go to the shell/console.

    type:  top -SH -mio

    usual suspects: squid/lightsquid/ntop/snort/suricata

    Tons of processes use (almost) 100% I/O for a very short time … data wants to move as fast as possible.

    Probably one of them is hogging 100% I/O for a longer time. Find it and let us know; perhaps someone can help you out with other settings.



  • dear heper,

    Thanks for the prompt response. I have run the command. Is there anything specific I should watch for, in terms of fields or values?

    Thanks,



  • Hi,

    I see syncer and bufdaemon hitting 100% usage.

    Best,
    Pd



  • I'd look for a process that holds 100% for more than 10-15 seconds.
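
A rough way to automate that check is to run top in batch mode and count how many consecutive snapshots a command stays near 100%. The sketch below is only an illustration: the function name `sustained_io` is hypothetical, and it assumes each snapshot starts with a `last pid:` line and that process rows follow the `PID USERNAME VCSW IVCSW READ WRITE FAULT TOTAL PERCENT COMMAND` layout of `top -m io`. Check the output of your own top build before relying on it.

```shell
#!/bin/sh
# sustained_io (hypothetical helper): read batch output of `top -m io` on
# stdin and print each command whose PERCENT column stayed at or above a
# threshold for a given number of consecutive snapshots.
# Assumptions: a snapshot begins with a "last pid:" line; PERCENT is
# column 9 and COMMAND is column 10.
sustained_io() {
    min_pct=${1:-90}     # percent threshold
    min_snaps=${2:-5}    # consecutive snapshots required
    awk -v min_pct="$min_pct" -v min_snaps="$min_snaps" '
        function end_snapshot(   c) {
            # A command that was not busy in the finished snapshot
            # loses its streak.
            for (c in streak) if (!(c in busy)) delete streak[c]
            for (c in busy) delete busy[c]
        }
        /^last pid:/ { if (seen) end_snapshot(); seen = 1; next }
        # Process rows: numeric PID in column 1, percentage in column 9.
        $1 ~ /^[0-9]+$/ && $9 ~ /%$/ {
            pct = $9; sub(/%/, "", pct)
            if (pct + 0 >= min_pct && !($10 in busy)) {
                busy[$10] = 1          # count each command once per snapshot
                streak[$10]++
                if (streak[$10] == min_snaps) print $10
            }
        }
    '
}
```

Possible usage, assuming your top accepts batch mode together with the I/O display: `top -SH -m io -b -s 2 -d 30 | sustained_io 90 5`. With a 2-second delay, 5 consecutive snapshots correspond to roughly 10 seconds.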



  • I have observed that system response stalls the moment bufdaemon goes 100%.

    I find that happening frequently.

    Best,



  • I support the Snort and Suricata packages on pfSense.  Neither of those packages will generate much disk I/O unless you are getting huge numbers of alerts per second.  If that is true, then they will be busy writing to their log files.

    Those two packages can easily be removed and then reinstalled without losing their settings, so you could temporarily remove them to see if that impacted the disk I/O issue.

    Bill



  • Hi, Snort and Suricata were my first suspects, so I have disabled them, but the problem persists.

    Any other ideas?



  • bufdaemon is an internal daemon started by the kernel.  With the HW you listed, it should run very quickly, but it appears to be failing while still holding a lock.

    Do you see any HW related error messages in your logs?  Do you have remote syslogging enabled?  As a test you could use only a single HDD rather than the 2 * mirror.
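
One quick way to look for hardware-related messages is to filter the kernel message buffer for words that usually accompany disk trouble. The helper below is a sketch only: the name `hw_error_lines` is hypothetical and the pattern list is a guess at common wording, not an exhaustive list of what the disk or controller driver may log.

```shell
#!/bin/sh
# hw_error_lines (hypothetical helper): keep only lines on stdin that
# contain words commonly seen in disk or controller error messages.
# The pattern list is an assumption; extend it for your hardware.
hw_error_lines() {
    grep -iE 'error|timeout|retry|cam status|g_vfs_done'
}
```

Possible usage: `dmesg | hw_error_lines`. Since the disks are in a GEOM mirror, `gmirror status` is also worth a look to confirm the mirror is not degraded or rebuilding.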



  • Hi Charliem,

    1. I have tried disabling one of the mirror disks, but it made no difference.
    2. I do not have remote syslogging enabled

    It may be a long shot, but do you think having multiple VLANs on a single physical NIC could cause this effect in any way? This machine is also the router for the org. Then again, considering we have only ~200 machines, I wonder whether it really matters.

    Is there anything else I could do to narrow down the root cause?

    Thanks!



  • Sorry I don't have specific suggestions.  Possibilities as I see them are:

    Something really is generating so much I/O as to starve the other system threads.  The standard way to track these hogs down is by using the different systat(1) displays.  It sounds like you are already doing this.  As a side note, I see pfSense does not include the gstat tool.

    Or alternately something is wrong and interfering with these system threads, not allowing them to complete in a timely manner.  This is where I was thinking HW error (disk i/o), network errors, waiting on remote logging, or something similar.

    I guess a third possibility is somehow the number of buffers or size of buffers is too high, an estimation based on RAM amount that goes wrong.  But 16G should not be a problem AFAIK.

    Multiple VLANs on a single NIC should be OK, and not cause pauses like this, but if you have another NIC to throw in there that might be an easy check.  Do you see any I/O errors on the interfaces?



  • Hi,

    I ran netstat -ni and netstat -s to check for errors/failures, but did not find any.

    Is there anything else I could look at? I am really running out of ideas.

    Thanks.


  • top -m io

    maybe that'll catch something?

    Like heper said.

    You're looking for processes using I/O.

    I know when I do cat /dev/zero > /root/delete.me cat shoots right to the top.  :)



  • What are the numbers for mbuf usage?
    I have heard someone mention that multiport NICs sometimes consume all the mbufs.



  • Hi
    I am not using multiport NICs.
    I am attaching the Mbuf graph for the week.

    The loader tunable is kern.ipc.nmbclusters="0" in /boot/loader.conf,
    and the sysctl output is
    kern.ipc.nmbclusters: 26584

    Looking forward to further guidance.

    Thanks,
    Pd
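
For a number rather than a graph, the mbuf cluster usage can be read from `netstat -m` and compared against the limit reported by the sysctl above. This is a minimal sketch: the function name `mbuf_cluster_pct` is hypothetical, and it assumes the line of interest has the form `<current>/<cache>/<total>/<max> mbuf clusters in use (current/cache/total/max)`. It reports total over max, which counts cached clusters as well.

```shell
#!/bin/sh
# mbuf_cluster_pct (hypothetical helper): read `netstat -m` output on stdin
# and print mbuf cluster usage as a whole-number percentage of the maximum.
# Assumption: the first field of the matching line is
# current/cache/total/max.
mbuf_cluster_pct() {
    awk '/mbuf clusters in use/ {
        split($1, n, "/")
        # n[3] is the total (current + cache), n[4] is the configured maximum.
        if (n[4] > 0) printf "%d\n", (n[3] * 100) / n[4]
        exit
    }'
}
```

Possible usage: `netstat -m | mbuf_cluster_pct`. A value that sits close to 100 would suggest the cluster limit is being reached; a low value makes mbuf exhaustion an unlikely cause of the stalls.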