Primary igb driver crash - manually disabling CARP on the primary under high load



  • The crash always occurs on the primary (about 80% of the time when I run the test), even if I swap the hard drives so that the primary and secondary hardware trade roles, so I am fairly sure it is not a bad network card, motherboard, etc.  If I disable CARP on the primary under light load it works fine with no crash.  If I then run the same test and disable CARP on the secondary while the primary's CARP is disabled, and then re-enable CARP on the primary so that it becomes master again, it also works fine with no crash.  This is very repeatable and happens with both pfSense 2.0.3 and 2.1.1-PRERELEASE (build of Feb 19th).

    From the crash dumps, it appears the crash always happens in the igb driver.

    If I just unplug a network cable on the primary it does not crash that way either, even under high load.  The master stays running and becomes backup, and the slave takes over and becomes master, just as you would expect.  It is only when I disable CARP (during the iperf test) that the primary crashes; disabling CARP is how I force the primary to become backup.
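
    For reference, here is one way to watch the state transition from the console during these tests (a minimal sketch assuming the 2.1-era carp(4) pseudo-interfaces; carp0 is illustrative, substitute the actual interface):
    # while true; do ifconfig carp0 | grep 'carp:'; sleep 1; done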

    I have two identical Dell R320 firewalls in a cluster.
    2 quad-port Intel Gigabit ET2 network cards in each firewall (8 ports per firewall)
    Onboard (unsupported) Broadcom NICs disabled in the BIOS
    Hyperthreading disabled in the BIOS
    4 cores

    I am running my production firewall config with an iperf3 test through the firewall: a client on the outside goes through the built-in load balancer (relayd) to a server running iperf3 on the inside.

    Test command lines on the client and the server, using port 4444:
    Server# iperf3 -p 4444 -s -4
    Client# iperf3 -c 192.168.1.10 -p 4444 -t 500

    loader.conf.local:
    kern.ipc.nmbclusters="131072"
    hw.igb.num_queues=1
    hw.igb.txd="4096"
    hw.igb.rxd="4096"
    hw.igb.rx_process_limit="1000"
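
    To confirm those tunables took effect after a reboot, they can be read back with sysctl (assuming the stock FreeBSD igb(4) driver exposes them read-only under the hw.igb tree, which I believe it does):
    # sysctl kern.ipc.nmbclusters
    # sysctl -a | grep '^hw.igb'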

    # netstat -m (on the primary, under the iperf test, before disabling CARP)
    20615/2050/22665 mbufs in use (current/cache/total)
    20507/1723/22230/131072 mbuf clusters in use (current/cache/total/max)
    20505/1255 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/104/104/65536 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/32768 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/16384 16k jumbo clusters in use (current/cache/total/max)
    46173K/4374K/50547K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0/0/0 sfbufs in use (current/peak/max)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

    # vmstat -i (on the primary, under the iperf test, before disabling CARP)
    interrupt                          total      rate
    irq20: atapci0                    48357          5
    irq22: ehci1                      18467          2
    irq23: ehci0                      27944          3
    cpu0: timer                    18377583      1999
    irq256: igb0:que 0              3388510        368
    irq257: igb0:link                      4          0
    irq258: igb1:que 0              3394503        369
    irq259: igb1:link                      2          0
    irq260: igb2:que 0                30518          3
    irq261: igb2:link                      2          0
    irq262: igb3:que 0                23436          2
    irq263: igb3:link                      2          0
    irq264: igb4:que 0                626670        68
    irq265: igb4:link                      2          0
    cpu3: timer                    18377412      1999
    cpu2: timer                    18377397      1999
    cpu1: timer                    18377405      1999
    Total                          81068214      8822

    top -P (on the primary, under the iperf test, before disabling CARP)
    last pid: 55340;  load averages:  0.07,  0.02,  0.00    up 0+02:34:29  19:49:38
    42 processes:  1 running, 41 sleeping
    CPU 0:  0.0% user,  0.0% nice,  0.0% system, 48.5% interrupt, 51.5% idle
    CPU 1:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
    CPU 2:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
    CPU 3:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
    Mem: 64M Active, 24M Inact, 153M Wired, 220K Cache, 26M Buf, 3645M Free
    Swap: 8192M Total, 8192M Free

    PID USERNAME    THR PRI NICE  SIZE    RES STATE  C  TIME  WCPU COMMAND
      265 root          1  76  20  6908K  1388K kqread  3  0:55  0.00% check_rel
    72286 root          1  58    0  151M 45284K accept  1  0:09  0.00% php
    26203 root          1  44    0  5784K  1468K select  2  0:02  0.00% apinger
    85021 root          1  76  20  8296K  1732K wait    1  0:01  0.00% sh
    69713 _relayd        1  44    0 12320K  3828K kqread  2  0:01  0.00% relayd
    63080 root          1  44    0 26268K  7140K kqread  1  0:01  0.00% lighttpd

    Crash dump:
    http://textdump.net/read/4300/


  • Netgate Administrator

    The new igb(4) drivers have been backed out of 2.1.1 because they were unreliable for a lot of people. The Feb 19th snapshot may or may not have had the unstable drivers in it, so the first thing I suggest is updating to a new snapshot.
    https://forum.pfsense.org/index.php/topic,72763.0.html

    Alternatively, go to the 2.1 release, which had the stable driver.

    Are you not using that specifically because of the issue with AltQ?

    Steve



  • Thanks for the response.  I know for sure that the old 2.1 drivers are back in the Feb 19th 2.1.1-PRERELEASE build, because the driver obeys the hw.igb.num_queues setting; the updated driver that was in a few pre-releases before the 19th ignored it.  As I mentioned, this happens even with 2.0.3.

    If someone is doing ordinary testing, or running in production where the system is not stressed, I suspect they would never see the problem.  They would have to be under heavy load and manually disable CARP on the primary.  I can only get it to crash by disabling CARP while the system is under high stress from iperf.  The previous complaints about the updated drivers were that they would crash igb with just an iperf test, so I suspect this issue is different, or perhaps the old driver has the same bug but only shows it under the conditions I mentioned.
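
    A more direct way to confirm which driver version a given build is running (rather than inferring it from tunable behavior) is the device description sysctl; the unit number 0 below is just the first igb port:
    # sysctl dev.igb.0.%desc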


  • Netgate Administrator

    Ah, well, it was worth asking, but it seems you're aware of that. 2.0.3 was still built on FreeBSD 8.1, so it's not a fair comparison with 2.1.
    Do you have independent pfSync interfaces?

    Steve



  • Yeah, I have a dedicated pfSync interface.
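
    If it helps, the assignment can be double-checked from the shell (a sketch assuming pfsync(4) reports its syncdev in ifconfig output):
    # ifconfig pfsync0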

    I am going to check for new firmware and install any updates on the Dell R320s and the network cards one more time (it was last done about 2 or 3 months ago), just in case something changed recently.

    I expect to have paid support with pfsense.org going soon, but I know the pfSense devs don't write the drivers, so I am not sure what they can do.  If it is a deep driver bug, I assume Intel would have to fix it (I thought I read that Intel developed the driver, but I am not sure).

    It would be interesting to see the list of commands that get executed when CARP is disabled, so I could try them one at a time manually, see whether I can reproduce the crash, and know which command triggered it.  A rough sketch of what that manual test might look like is below.
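
    These steps are my guesses at what disabling CARP amounts to on the 2.1-era carp(4) pseudo-interfaces, not a confirmed list of what the GUI runs (carp0 is illustrative):
    # ifconfig carp0          (note the carp: MASTER/BACKUP state line)
    # sysctl net.inet.carp.allow=0          (stop CARP processing system-wide)
    # ifconfig carp0 down          (take a single carp interface down by hand)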



  • Is there a specific reason you need to disable CARP under high load?



  • It is not a need… it is more about the uncertainty surrounding the issue.  I am a little concerned that something might come up where I need to manually switch to the backup without having physical access to unplug a cable as a workaround.  I worry more that the problem might show up in a way that I have not tested yet, since I do not understand the real cause of the problem.
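
    One possibility would be to demote the primary remotely instead of disabling CARP outright (untested against this crash, and carp0/advskew 240 are illustrative values): raising the advertisement skew should let the secondary win the election and preempt, assuming net.inet.carp.preempt is enabled, which I believe pfSense sets.
    # ifconfig carp0 advskew 240          (demote: the secondary should take over as master)
    # ifconfig carp0 advskew 0          (restore the primary's original skew)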

    I have some i350 cards coming in to test in place of the quad-port Intel ET2 cards, but since they use the same igb driver I suspect I will have the same issue.

    EDIT:  Keep in mind that "high load" as I defined it above is only a single TCP connection pushing 600 Mbit of traffic, which could happen fairly often depending on what is transferred between interfaces (backups, file shares, deployments to production servers, etc.).  Manually disabling CARP would of course be infrequent.