{done} for someone to figure out my problem - figured out my own problem

GoldServe

Here is my background. I've used NET4801 and WRAP boards, each with three Ethernet ports and a wireless card. Running 1.2rc2 through 1.2rc4, they all died in the same way when torrenting or under very high network traffic. Because of the high load from the wireless interrupts, I concluded that the NET4801 and WRAP boards were too slow for my purpose.

Now I have a 1 GHz mini-ITX Eden board with two NIC ports and a PCI slot, to which I've added a mini-PCI card for wireless. I have everything set up properly with dual-WAN load balancing and failover. Everything works perfectly except when I run torrents at high throughput. The CPU does not go over 20%, so that can't be the problem. Miniupnpd complains in the system logs that it has run out of buffer space (even though I've got 512 MB of RAM), so I even disabled miniupnpd. Now the box drops all traffic, WAN and LAN (WiFi), for a few seconds and then recovers on its own. The common denominator between all the boxes is the wireless card. Could the Atheros drivers somehow be leaking memory or something?

This is what a ping from a local host to the box looks like:

Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
Reply from 192.168.1.1: bytes=32 time=410ms TTL=64
Reply from 192.168.1.1: bytes=32 time=1756ms TTL=64
Reply from 192.168.1.1: bytes=32 time=913ms TTL=64
Reply from 192.168.1.1: bytes=32 time=388ms TTL=64
Reply from 192.168.1.1: bytes=32 time=100ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
Reply from 192.168.1.1: bytes=32 time<1ms TTL=64
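For anyone who wants to compare runs of a trace like the one above, here is a minimal Python sketch that counts the longest run of consecutive timeouts and the worst round-trip time. It assumes Windows-style ping output saved as text; the function name `summarize_ping` is made up for illustration and is not part of pfSense or any tool mentioned in this thread.

```python
import re

def summarize_ping(lines):
    """Return (longest_timeout_run, max_rtt_ms) for Windows-style ping output.

    Hypothetical helper: it only understands the two line shapes seen above,
    "Reply from ... time=410ms ..." / "time<1ms" and "Request timed out."
    """
    longest = 0   # longest run of consecutive timeouts seen so far
    current = 0   # length of the timeout run we are currently in
    max_rtt = 0   # worst reply time in milliseconds
    for line in lines:
        line = line.strip()
        if line.startswith("Request timed out"):
            current += 1
            longest = max(longest, current)
            continue
        match = re.search(r"time[=<](\d+)ms", line)
        if match:
            current = 0  # a reply ends the current outage
            max_rtt = max(max_rtt, int(match.group(1)))
    return longest, max_rtt

# Usage with a shortened excerpt of the trace above.
sample = [
    "Reply from 192.168.1.1: bytes=32 time<1ms TTL=64",
    "Reply from 192.168.1.1: bytes=32 time=410ms TTL=64",
    "Reply from 192.168.1.1: bytes=32 time=1756ms TTL=64",
    "Request timed out.",
    "Request timed out.",
    "Request timed out.",
    "Reply from 192.168.1.1: bytes=32 time<1ms TTL=64",
]
print(summarize_ping(sample))
```

Counting consecutive timeouts rather than total timeouts makes it easier to tell one long stall from many scattered single drops.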

I am offering $50 to anyone who can guide me through debugging this really annoying issue and getting it fixed. In my opinion it is the one thing keeping pfSense from being perfect.

This has happened on three different setups of mine, so I seriously think there is a bug somewhere.

Thanks for looking.

GoldServe

I think I've solved my problem, and I've isolated it to the Atheros wireless card.

From a machine on another network, I pinged the box and found it was still alive. The only interface that seemed to have died was the Atheros interface. After reading up online, I ran athstats and found TX buffer underrun errors. That is a bad thing: it appears to reset the card, which causes the timeouts. From the remote machine, I SSHed into the box and confirmed this was the case, although ifconfig -v ath0 did not show the card with the OACTIVE flag. I guess the card was reset before I could see the flag.

Anyway, my solution was to increase hw.ath.txbuf from 200 to 2000, though this may only delay the point at which the interface dies. With 400 and 800, the interface took progressively longer to die.

If anyone knows why this happens or has a permanent fix, please post it here! Hope this helps some people with wireless cards.

jahonix

What does your state table look like?
Running torrents is likely to exceed the preset of 10,000 states. Just increase it and test again.

In addition, I would change the "Firewall Optimization Options" under "System: Advanced functions" to "aggressive".

Please report back what you find!

GoldServe

Upped the state table to 30,000 and tried aggressive. No help. My states are nowhere near 30,000, and it seems like ath0 is what gets dropped. I can still ping the router from the WAN side with no dropped packets.

eri--

Please post /tmp/rules.debug, the output of pfctl -vvsr, dmesg and athstats, and a list of the services you are running.

GoldServe

rules.debug: http://www.pastebin.ca/892342

pfctl -vvsr: http://www.pastebin.ca/892343

dmesg: http://www.pastebin.ca/892348

services:
dnsmasq (DNS Forwarder): Running
dhcpd (DHCP Service): Running
miniupnpd (UPnP Service): Running

athstats:
via:/tmp# athstats
1135628 tx management frames
8274 tx frames discarded prior to association
1164 tx discarded empty frame
39 tx failed 'cuz FIFO underrun
23325 tx failed 'cuz bogus xmit rate
2389 tx frames with rts enabled
965 tx frames with 11g protection
12618 rx failed 'cuz of FIFO overrun
1122744 rx management frames
84760 beacon setup failed 'cuz no mbuf
815798812 beacons transmitted
307 periodic calibration failures
1 rate control checks
1 tx used alternate antenna
Antenna profile:
[2] tx 1143860 rx 1150871

I've sort of fixed my problems for now by raising the ath TX and RX buffers to:
hw.ath.txbuf: 2000
hw.ath.rxbuf: 2000

Is there an upper limit on these values, and are other people with wireless cards on pfSense seeing this problem under high load?

Thanks!

eri--

I would say you most likely have signal problems or interference on your channel.

GoldServe

I have very good signal. I'm running another Atheros card in the laptop, on the 5.9 GHz A band. The two are no more than 20 feet apart.

Pretty sure it is the TX underrun error, because when I get ping timeouts, I check athstats from the WAN side and see that number shoot up. I'm guessing the card then goes into reset, and a few seconds later my connection comes back. Windows never re-associates, so I don't lose the connection completely, but the card does not send out data for a few seconds.
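To make that before/after comparison less of an eyeball job, here is a rough Python sketch that diffs two saved athstats dumps and reports which counters moved. It assumes each counter line has the "<number> <description>" shape shown in the athstats paste earlier in this thread; `parse_athstats` and `counter_deltas` are names invented for this example, and the "after" numbers are made up purely for illustration.

```python
import re

def parse_athstats(text):
    """Map each counter description to its integer value.

    Lines that do not start with a number (the shell prompt, the
    "Antenna profile:" section) are skipped.
    """
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\s+(.+?)\s*$", line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters

def counter_deltas(before_text, after_text):
    """Return {description: increase} for every counter that changed."""
    before = parse_athstats(before_text)
    after = parse_athstats(after_text)
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value != before.get(name, 0)
    }

# "before" uses two lines from the paste above; "after" is hypothetical.
before = """39 tx failed 'cuz FIFO underrun
12618 rx failed 'cuz of FIFO overrun
Antenna profile:
[2] tx 1143860 rx 1150871"""
after = """57 tx failed 'cuz FIFO underrun
12618 rx failed 'cuz of FIFO overrun
Antenna profile:
[2] tx 1150000 rx 1160000"""

for name, delta in counter_deltas(before, after).items():
    print(f"+{delta}  {name}")
```

If the underrun counter is the only error counter that jumps during a stall, that points at the same problem; if other counters move as well, the cause may be something else.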

cybrsrfr

I have a Netgear wg311t with the Atheros chipset. It also shows great signal and works for a short while, then goes up and down. I've reproduced the same issue on a friend's pfSense machine with a different wg311t.

I have used other Atheros wireless devices with pfSense that work great with no issues. So I think it is a problem specific to the Atheros chipset that the wg311t uses.

GoldServe

It's possible…but you should take a look at the athstats when it does go down. If you see many tx underruns, then you've got the same problem. You've gotta hit it with heavy traffic though.