Wistron CM9 interrupt flood problem



  • I have an FX 5621 system (Fabiatech) which is based on a Via 1GHz fanless CPU, with 1GB RAM and a 4GB DOM solid state IDE disk module. It has 2 Gigabit ports and 4 10/100 ethernet ports.
    I have also added a Wistron CM9 mini-PCI wireless a/b/g card which I believe is based on an Atheros chipset.
    Installing either pfSense 1.2 or 1.2.3 RC1 works OK.
    The two gigabit interfaces are configured as WAN and LAN, the wireless card (which shows up as ath0) as OPT1, and then the rest of the interfaces as OPT2,3,4, and 5.

    The system runs extremely slowly and the CPU utilization is between 80 and 100% even with nothing connected to the machine other than a PC to view the setup webpage. This happens with both 1.2.2 and 1.2.3 RC1.
    Checking this out on the shell with    systat -vmstat 1    and with    top -S    shows that CPU % user and system time are very low (max 1 to 3% each) and all the CPU utilization is coming from handling interrupts - specifically ath0 int 16 which consumes 75-80% of CPU resources.
    Does anybody know whether this problem with interrupts from the wireless card is specific to the CPU, to the Wistron CM9, or to some other factor? Either way, since the Wistron CM9 is meant to be FreeBSD compliant, how do I configure it or what else do I do to fix this issue?

    Thanks!

    James

    Thanks in advance and apologies if my questions are a bit naive - I am a bit new to FreeBSD and pfSense. I will be happy to post the output of any diagnostic you tell me to run on the system, which I seem to be able to do so far.



  • Have you tried the system without the CM9?

    The shell command vmstat -i shows interrupt counts and interrupt rates. Please report the output of vmstat -i.
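
    As a rough aid for reading that output, here is a minimal Python sketch that parses text in the vmstat -i column layout and flags sources with a high interrupt rate. The function names, the sample text and the threshold of 10000 interrupts per second are assumptions made for illustration; they are not part of pfSense or FreeBSD.

```python
import re

# Sample text in the column layout printed by `vmstat -i`.
SAMPLE = """\
interrupt                    total      rate
irq14: ata0                   2676         4
irq16: ath0 rl0+          70603746    118661
cpu0: timer                1188570      1997
Total                     71794992    120663
"""

# source ":" device names, then the total count and the rate per second.
LINE = re.compile(
    r"^(?P<source>\S+):\s+(?P<devices>.*?)\s+(?P<total>\d+)\s+(?P<rate>\d+)\s*$"
)


def parse_vmstat_i(text):
    """Return one dict per interrupt source; header and Total lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            rows.append(
                {
                    "source": match.group("source"),
                    "devices": match.group("devices"),
                    "total": int(match.group("total")),
                    "rate": int(match.group("rate")),
                }
            )
    return rows


def noisy_sources(rows, threshold=10000):
    """Return the rows whose rate exceeds the (assumed) threshold."""
    return [row for row in rows if row["rate"] > threshold]


if __name__ == "__main__":
    for row in noisy_sources(parse_vmstat_i(SAMPLE)):
        print(row["source"], row["devices"], row["rate"])
```

    The device column lists every driver attached to the interrupt line, so a flagged line identifies the line, not necessarily the device at fault.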

    Have you tried a snapshot build of pfSense 1.2.3? These builds use a more up to date version of FreeBSD than is used in pfSense 1.2 (and possibly pfSense 1.2.3 RC1, but I don't recall).

    The operating system on a PC or "PC like" system relies on the BIOS to tell it which interrupt line is used by a device. It would be worth checking the date (and version) of the BIOS on your system and checking if there are any BIOS updates available. If the BIOS is incorrect about the interrupt routing, the OS could (for example) attach an interrupt handler for a device to irq17 while the device actually interrupts on irq16. This will cause a high interrupt rate on irq16 because there is no handler on irq16 to clear the interrupt condition.



  • After a minute or two of operation the pattern is clear already - the system CPU is 80-90% busy with interrupts from ath0 on irq16:

    $ vmstat -i
    interrupt                    total      rate
    irq14: ata0                   2676         4
    irq16: ath0 rl0+          70603746    118661
    cpu0: timer                1188570      1997
    Total                     71794992    120663

    If I remove the card, it is still busy on irq16, but this time on the LAN!

    $ vmstat -i
    interrupt                    total      rate
    irq14: ata0                   2092         8
    irq16: rl0 et1            31838537    126846
    cpu0: timer                 500128      1992
    Total                     32340757    128847

    Very odd - what do you suggest next?



  • @jrdecastro:

    After a minute or two of operation the pattern is clear already - the system CPU is 80-90% busy with interrupts from ath0 on irq16:

    You haven't quite got the right interpretation of the numbers.

    IRQ16 is shared by a number of devices including ath0, rl0 and et1. Any one of those devices could request an interrupt, which the hardware and the low-level interrupt handlers see as an irq16 interrupt. The operating system's low-level interrupt handlers then call the driver of each device known to be on irq16, and it's the responsibility of each driver to determine whether its device has requested an interrupt and, if so, to service it.

    The numbers suggest that rl0 and/or et1 (and/or some as yet unidentified device) is requesting an interrupt which is not handled (the interrupt request condition is not cleared).
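
    The shared-line behaviour described above can be sketched with a small Python model. It is a deliberately simplified illustration, not FreeBSD's actual interrupt code: the Device class, the dispatch function and the max_passes cap are hypothetical, and picking et1 as the device whose request is never cleared is arbitrary.

```python
class Device:
    """Hypothetical model of a device sitting on a shared interrupt line."""

    def __init__(self, name, clears_request=True):
        self.name = name
        self.clears_request = clears_request  # False models an unhandled request
        self.pending = False

    def raise_irq(self):
        self.pending = True

    def handler(self):
        """Driver handler: service the device if it is the one interrupting."""
        if self.pending and self.clears_request:
            self.pending = False
            return True  # interrupt claimed and cleared
        return False  # not ours, or the request could not be cleared


def line_asserted(devices):
    """A level-triggered shared line stays asserted while any request is pending."""
    return any(device.pending for device in devices)


def dispatch(devices, max_passes=1000):
    """Call every handler on the line until it is deasserted; return the pass count.

    max_passes only keeps the model from looping forever; real hardware
    has no such cap, which is why an uncleared request floods the CPU.
    """
    passes = 0
    while line_asserted(devices) and passes < max_passes:
        for device in devices:
            device.handler()
        passes += 1
    return passes


if __name__ == "__main__":
    healthy = [Device("ath0"), Device("rl0"), Device("et1")]
    healthy[2].raise_irq()
    print("healthy line:", dispatch(healthy), "pass(es)")

    stuck = [Device("ath0"), Device("rl0"), Device("et1", clears_request=False)]
    stuck[2].raise_irq()
    print("stuck line:", dispatch(stuck), "pass(es)")
```

    In the healthy case the line is cleared after a single pass. In the stuck case every pass calls all three handlers and the count runs up to the cap, which is why the interrupts get charged to whichever drivers share the line and not only to the device at fault.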

    If necessary, recable so that rl0 and et1 are OPTx interfaces. Is rl0 enabled? If so, disable it through the web GUI (Interfaces -> OPTx) and see if that makes a difference. Repeat for et1. (The pfSense WAN and LAN interfaces can't, as far as I can tell, be disabled through the web GUI.)

    If you haven't already done so, try a recent snapshot build of pfSense 1.2.3 to pick up the latest device drivers. Snapshot builds can be downloaded from http://snapshots.pfsense.org/FreeBSD_RELENG_7_2/pfSense_RELENG_1_2/?C=M;O=D (If I recall correctly, support for the et devices has been added relatively recently and others have reported some issues.)



  • Tried the latest version dated 5 Oct 09 and got the same result. However, a bit of reading of the forums pointed out that there seems to be a problem with the agere chipset handling the et0 and et1 gigabit ports which can be tamed if you disable one of them in the BIOS (et0 corresponding to the 5th LAN interface). So I tried that and changed from:

    et0 WAN
    et1 LAN
    ath0 OPT1
    rl0  OPT2
    rl1  OPT3
    rl2  OPT4
    rl3  OPT5

    to:

    et0 LAN
    rl0 WAN
    ath0 OPT1
    rl1  OPT2
    rl2 OPT3
    rl3 OPT4

    (I disabled et0, so et1 became et0.) CPU utilization then dropped to basically 1% and the interrupt storm is gone.

    Looks like the new build does not solve the chipset problem but at least there is a workaround even if it does mean losing one gigabit port.

    I would be very interested in a proper solution which preserves both gigabit ports, but for the time being I will carry on configuring it this way - use one 100M port for the WAN, the gigabit port for the LAN, the ath0 for wireless, and the remaining 100M ports for other stuff.  I will continue experimenting and let you know if I run into trouble. Your tip to look at BIOS upgrades, even if it did not give me the expected solution, at least pointed me in the direction where I found the "disable one gigabit port" workaround so I can get this thing working, so thanks a lot!

    When is 2.0 expected or is 1.2.3 the alpha for 2.0? From what I read in the forums perhaps 2.0 will solve this (then again, perhaps not yet?)
    Thanks again for your time - you have been extremely helpful!
    James



  • @jrdecastro:

    I would be very interested in a proper solution which preserves both gigabit ports,

    Then it's probably worthwhile reporting this to Fabiatech and your supplier. There might be something "quirky" about the use of the et devices in this box. They might know of a fix. It would also be worthwhile searching the FreeBSD bug reports to see if it has been reported or a fix has been found.



  • Thanks again
    I contacted the supplier (LinITX) and they told me to get in touch with Scott Ullrich as he supposedly has an FX5621 and it is apparently working according to them. They claim their own unit works OK with 1.2.2 but on CF embedded not CD -> hard disk version.
    I am using the CD install to hard disk (a DOM solid state disk on module 4GB unit) because I will want to install packages and that will be easier to do on the disk version than on the CF embedded version.

