CARP + igb NIC Kernel Panic in 2.0.1 Release
I have 2 Dell PE R300 machines running the 2.0.1 release with CARP on every interface, and everything has been running fine. I use the onboard Broadcom NICs (bge0/1) for the WAN and SYNC interfaces respectively, and an Intel N2XX-AIPCI02 Gigabit ET Quad Port PCI Express card for everything else (LAN and 3 OPTs with VLANs). We are getting ready to add BGP and multi-homing with another ISP, so yesterday I added a second Intel card, same model as above, to move some interfaces onto. I started with the CARP Backup, as I thought that was the safe way to go. Part of my plan was to move the CARP SYNC to one of the new interfaces and use the 2 onboard Broadcom NICs for the 2 WAN connections.
Everything seemed fine. pfSense saw all of the new interfaces, and I reassigned a few things to them, including the CARP SYNC (igb7 originally). I connected everything up and put the box back in service, it took the backup role on the CARP interfaces, and I was happy. I disconnected the WAN interface of the Master and saw the Backup take over - traffic was still flowing fine, no interruptions. At this point I was still happy, so I shut down the Master using 6) Halt on the console. Once it was offline I pulled power from it and started disconnecting cables, getting ready to add the new NIC. All of a sudden (well, maybe a minute after pulling the power plug) I got a bunch of alarms on my phone saying the WAN connection was down. Looking up at the console for the Backup, I saw it was rebooting. WTF?! I quickly connected the Master's cables back up and powered it on to recover the network - only a 5 minute outage. ;)
So, after a ton of messing around and step-by-step troubleshooting, I have come to this conclusion: any time I use ANY of the Intel NICs as the CARP SYNC interface on the box configured as Backup, it will fail over and run fine UNTIL that SYNC interface goes down. If the interface stays up (the Master isn't powered down, it is connected to a switch, etc.), it runs fine. Anywhere from 20 seconds to 3 minutes after the interface goes down (e.g. the Master's power supply fails), there appears to be a kernel panic and crash dump, and then pfSense restarts. I have not confirmed this behavior on a Master - hopefully I will this afternoon. On the Backup, I can easily and reliably recreate this using any of the Intel NICs, old card and new. The first few times it crashed I allowed the UI to submit the crash reports to the developers; after that I wanted to complete my testing first, to make sure I wasn't doing something wrong and to try to narrow it down.
I have tried uninstalling all of the installed packages (only 2) and I get the same result each time. All of my tests have been done with v2 - I downgraded to 2.0 Beta 5, then RC1 and RC3, with a similar result each time. The only difference I saw was that at one point (can't recall which version) I didn't get the immediate crash dump flying by on the console; it just froze with this screen:
Anyone have any thoughts about this?
Thanks in advance.
I can confirm this also happens when the configured Master has an igb interface as its CARP SYNC and that interface goes down - just duplicated it 4 times. The best part is, if I leave all interfaces connected, so it thinks it should boot up as the Master, but with the igb SYNC interface disconnected, seconds after it finishes booting it panics, dumps, and reboots. Fun to watch! haha
Hard to tell without a backtrace, but it looks like a memory allocation issue, such as mbufs.
Thanks Jim. I'll give that a shot this afternoon.
That appears to have worked great! Everything ran fine through my tests, both Master and Backup. Thanks very much for your help. Much appreciated.
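For anyone hitting similar symptoms, one way to check the mbuf theory mentioned above is to look at the "mbuf clusters in use" line that FreeBSD's `netstat -m` prints, and see how close the total is to the max (the limit set by the `kern.ipc.nmbclusters` tunable). Below is a minimal Python sketch, assuming that line has the usual `current/cache/total/max` layout. The sample line and its numbers are made up for illustration and are not from the boxes in this thread, and the helper names and the 90% threshold are arbitrary choices.

```python
import re

# Hypothetical sample in the style of FreeBSD `netstat -m` output.
# The numbers are invented for illustration, not taken from this thread.
SAMPLE = "24500/1100/25600/25600 mbuf clusters in use (current/cache/total/max)"


def cluster_usage(line):
    """Parse an 'mbuf clusters in use' line into (current, cache, total, max)."""
    m = re.match(r"\s*(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use", line)
    if m is None:
        raise ValueError("not an 'mbuf clusters in use' line")
    return tuple(int(x) for x in m.groups())


def near_limit(line, threshold=0.9):
    """Return True if total clusters are at or above `threshold` of the max."""
    current, cache, total, maximum = cluster_usage(line)
    if maximum == 0:
        # A max of 0 is treated here as "no limit reported" (an assumption).
        return False
    return total / maximum >= threshold


if __name__ == "__main__":
    print(cluster_usage(SAMPLE))
    print(near_limit(SAMPLE))
```

To use it on a real box, paste in the matching line from `netstat -m` in place of `SAMPLE`. If the total sits at or near the max while the igb interfaces are assigned, that would be consistent with the memory allocation guess, though only a backtrace would confirm it.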