Wistron CM9 interrupt flood problem

jrdecastro

I have an FX 5621 system (Fabiatech) which is based on a Via 1GHz fanless CPU, with 1GB RAM and a 4GB DOM solid state IDE disk module. It has 2 Gigabit ports and 4 10/100 ethernet ports.
I have also added a Wistron CM9 mini-PCI wireless a/b/g card which I believe is based on an Atheros chipset.
Installing either pfSense 1.2 or 1.2.3 RC1 works OK.
The two gigabit interfaces are configured as WAN and LAN, the wireless card (which shows up as ath0) as OPT1, and then the rest of the interfaces as OPT2,3,4, and 5.

The system runs extremely slowly and the CPU utilization is between 80 and 100% even with nothing connected to the machine other than a PC to view the setup webpage. This happens with both 1.2.2 and 1.2.3 RC1.
Checking this out on the shell with systat -vmstat 1 and with top -S shows that CPU % user and system time are very low (max 1 to 3% each) and all the CPU utilization is coming from handling interrupts - specifically ath0 int 16 which consumes 75-80% of CPU resources.
Does anybody know whether this problem with interrupts from the wireless card is specific to the CPU, to the Wistron CM9, or to some other factor? Either way, since the Wistron CM9 is meant to be FreeBSD compliant, how do I configure it or what else do I do to fix this issue?

Thanks!

James

Thanks in advance, and apologies if my questions are a bit naive - I am a bit new to FreeBSD and pfSense. I will be happy to try and post the output of any diagnostic you tell me to run on the system, which I seem to be able to do so far.

wallabybob

Have you tried the system without the CM9?

The shell command vmstat -i shows interrupt counts and interrupt rates. Please report the output of vmstat -i.

Have you tried a snapshot build of pfSense 1.2.3? These builds use a more up to date version of FreeBSD than is used in pfSense 1.2 (and possibly pfSense 1.2.3 RC1, but I don't recall).

The operating system on a PC or "PC like" system relies on the BIOS to tell it which interrupt line is used by a device. It would be worth checking the date (and version) of the BIOS on your system and checking if there are any BIOS updates available. If the BIOS is incorrect about the interrupt routing, the OS could (for example) attach an interrupt handler for a device to irq17 while the device actually interrupts on irq16. This will cause a high interrupt rate on irq16 because there is no handler on irq16 to clear the interrupt condition.

jrdecastro

After a minute or two of operation the pattern is clear already - the system CPU is 80-90% busy with interrupts from ath0 on irq16:

$ vmstat -i
interrupt                          total       rate
irq14: ata0                         2676          4
irq16: ath0 rl0+                70603746     118661
cpu0: timer                      1188570       1997
Total                           71794992     120663
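
For anyone wanting to check this kind of output automatically, here is a rough Python sketch that picks out lines with a suspiciously high rate. It is only an illustration: the function names and the 10000 interrupts/second threshold are my own assumptions, not anything from pfSense or FreeBSD, and the sample text is modeled on the output above.

```python
SAMPLE = """\
interrupt total rate
irq14: ata0 2676 4
irq16: ath0 rl0+ 70603746 118661
cpu0: timer 1188570 1997
Total 71794992 120663
"""


def parse_vmstat_i(text):
    """Return (name, total, rate) tuples from `vmstat -i` style text."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header line and the summary line.
        if len(parts) < 3 or parts[0] in ("interrupt", "Total"):
            continue
        try:
            total, rate = int(parts[-2]), int(parts[-1])
        except ValueError:
            continue  # not a data line
        rows.append((" ".join(parts[:-2]), total, rate))
    return rows


def find_storms(rows, threshold=10000):
    """Rows whose rate is at or above the (assumed) threshold."""
    return [row for row in rows if row[2] >= threshold]


for name, total, rate in find_storms(parse_vmstat_i(SAMPLE)):
    print(f"{name}: {rate} interrupts/s")
```

The timer at roughly 2000/s stays below the threshold, so only the irq16 line is reported.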

If I remove the card, it is still busy on irq16, but this time on the LAN!

$ vmstat -i
interrupt                          total       rate
irq14: ata0                         2092          8
irq16: rl0 et1                  31838537     126846
cpu0: timer                       500128       1992
Total                           32340757     128847

Very odd - what do you suggest next?

wallabybob

@jrdecastro:

After a minute or two of operation the pattern is clear already - the system CPU is 80-90% busy with interrupts from ath0 on irq16:

You haven't quite got the right interpretation of the numbers.

IRQ16 is shared by a number of devices including ath0, rl0 and et1. Any one of those devices could request an interrupt, which will be seen by the hardware and the low-level interrupt handlers as an irq16 interrupt. The operating system's low-level interrupt handlers will call each of the device drivers with a device known to be on irq16, and it's the responsibility of each device driver to determine whether its device has requested an interrupt and, if so, to service it.

The numbers suggest that rl0 and/or et1 (and/or some as yet unidentified device) is requesting an interrupt which is not handled (the interrupt request condition is not cleared).
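
To make the idea concrete, here is a toy Python model of a shared interrupt line. It is only a sketch and not how the FreeBSD kernel is written: the Device class, service_line and demo are made-up names, and it assumes a level-triggered line that keeps firing until the requesting device's own handler clears the request.

```python
class Device:
    """A device on a shared, level-triggered interrupt line (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.pending = False  # True while the device is requesting an interrupt

    def handler(self):
        """Driver handler: check our own device and clear its request if set."""
        if self.pending:
            self.pending = False
            return True
        return False


def service_line(on_line, attached, max_rounds=1000):
    """Count how often the line fires before it goes quiet (capped at max_rounds)."""
    rounds = 0
    while any(dev.pending for dev in on_line) and rounds < max_rounds:
        for dev in attached:  # every attached handler is called on each interrupt
            dev.handler()
        rounds += 1
    return rounds


def demo(attach_et1):
    ath0, rl0, et1 = Device("ath0"), Device("rl0"), Device("et1")
    et1.pending = True  # et1 requests an interrupt
    attached = [ath0, rl0] + ([et1] if attach_et1 else [])
    return service_line([ath0, rl0, et1], attached)


print(demo(attach_et1=True))   # 1: et1's handler clears the request
print(demo(attach_et1=False))  # 1000: nobody clears it, so the line fires until the cap
```

In the second case the ath0 and rl0 handlers are called every time but find nothing to do, which matches the picture of a busy irq16 whose count is charged to whichever drivers happen to share the line.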

If necessary, recable so that rl0 and et1 are OPTx interfaces. Is rl0 enabled? If so, disable it through the web GUI (Interfaces -> OPTx) and see if that makes a difference. Repeat for et1. (The pfSense WAN and LAN interfaces don't seem to be able to be disabled through the web GUI.)

If you haven't already done so, try a recent snapshot build of pfSense 1.2.3 to pick up the latest device drivers. Snapshot builds can be downloaded from http://snapshots.pfsense.org/FreeBSD_RELENG_7_2/pfSense_RELENG_1_2/?C=M;O=D (If I recall correctly, support for the et devices has been added relatively recently and others have reported some issues.)

jrdecastro

I tried the latest version, dated 5 Oct 09, and got the same result. However, a bit of reading in the forums turned up what seems to be a problem with the Agere chipset handling the et0 and et1 gigabit ports, which can be tamed if you disable one of them in the BIOS (et0 corresponding to the 5th LAN interface). So I tried that and changed from:

et0 WAN
et1 LAN
ath0 OPT1
rl0 OPT2
rl1 OPT3
rl2 OPT4
rl3 OPT5

to:

et0 LAN
rl0 WAN
ath0 OPT1
rl1 OPT2
rl2 OPT3
rl3 OPT4

(I disabled et0, so et1 became et0.) CPU utilization has now gone down to basically 1% and the interrupt storm is gone.

Looks like the new build does not solve the chipset problem but at least there is a workaround even if it does mean losing one gigabit port.

I would be very interested in a proper solution which preserves both gigabit ports, but for the time being I will carry on configuring it this way - use one 100M port for the WAN, the gigabit port for the LAN, the ath0 for wireless, and the remaining 100M ports for other stuff. I will continue experimenting and let you know if I run into trouble. Your tip to look at BIOS upgrades, even if it did not give me the expected solution, at least pointed me in the direction where I found the "disable one gigabit port" workaround so I can get this thing working, so thanks a lot!

When is 2.0 expected or is 1.2.3 the alpha for 2.0? From what I read in the forums perhaps 2.0 will solve this (then again, perhaps not yet?)
Thanks again for your time - you have been extremely helpful!
James

wallabybob

@jrdecastro:

I would be very interested in a proper solution which preserves both gigabit ports,

Then it's probably worthwhile reporting this to Fabiatech and your supplier. There might be something "quirky" about the use of the et devices in this box. They might know of a fix. It would also be worthwhile searching the FreeBSD bug reports to see if it has been reported or a fix has been found.

jrdecastro

Thanks again.
I contacted the supplier (LinITX) and they told me to get in touch with Scott Ullrich, as he supposedly has an FX5621 and it is apparently working according to them. They claim their own unit works OK with 1.2.2, but on the CF embedded version, not the CD-to-hard-disk install.
I am using the CD install to hard disk (a DOM solid state disk on module 4GB unit) because I will want to install packages and that will be easier to do on the disk version than on the CF embedded version.