New latency every 30 seconds with 2.4.2 caused by radvd 2.17_3 *****HAVE A TEMPORARY FIX*****

tcsac

I've noticed since upgrading to 2.4.2 that I'm getting nasty latency spikes almost exactly every 30 seconds. I've tried hitting the firewall's internal IP from multiple hosts, and they all see the same behavior (I've got 3 VLANs and the IP on each VLAN exhibits the same thing). Pings between hosts on the same subnet do not see this behavior; however, pings that cross subnets (and have to go through the router) do. The spikes are really nasty too, 300ms+ on most occasions.

Anyone else seeing similar behavior? It's made my network almost unusable for anything WAN facing.

I'm running a Supermicro SYS-5018D-FN4T with a Chelsio T520-CR if that makes a difference. Again, this JUST started after updating to 2.4.2; I had 0 issues with prior versions of pfSense.

Happy to grab logs if they'd help, but I don't see anything obvious.
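To catch spikes like the ones described above without watching the terminal, a small filter over ping output may help. This is only a sketch: the function name flag_spikes, the 100 ms default threshold, and the 192.168.1.1 address in the usage comment are assumptions for illustration, not details from the setup above.

```shell
# Print only the ping replies whose round-trip time exceeds a threshold (ms).
# Hypothetical usage (address is a placeholder for the router's VLAN IP):
#   ping 192.168.1.1 | flag_spikes 100
flag_spikes() {
    threshold="${1:-100}"
    awk -v limit="$threshold" '
        /time=/ {
            # Take the number that follows "time=" (e.g. "time=312.4 ms").
            split($0, parts, "time=")
            split(parts[2], rtt, " ")
            if (rtt[1] + 0 > limit) print
        }'
}
```

With ping's default one-second interval, the gap between the icmp_seq values of consecutive flagged lines roughly gives the spike period, so a gap near 30 would match the behavior reported here.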

tcsac

I guess I'll reply to my own post. I've narrowed it down: every time there is a latency spike, radvd jumps up to 30% CPU usage, so I think I've found the culprit. Now the question is what changed in 2.4.2 that might be causing that behavior. I'm assuming radvd is running for an he.net IPv6 tunnel I use.

I see the radvd version changed with the upgrade, I think I'm closing in on the issue at least :)

Nov 21 21:56:30 kernel radvd: 1.9.1 -> 2.17_3 [pfSense]
Nov 21 21:56:57 radvd 76672 version 2.17 started

**so I'm not spamming here, I'll just reply to this. Killing radvd stopped all of my issues. I'm going to leave it down for now - I can live without IPv6 for the moment - let me know what else I can provide to help the troubleshooting.
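For anyone who wants to check for the same correlation between radvd CPU usage and the spikes, here is a rough sketch. The function name radvd_cpu and the one-second sampling loop are assumptions; it only relies on the standard ps keywords %cpu and comm. Note that ps reports a decaying average on FreeBSD, so short bursts may look smaller than they do in top.

```shell
# Sum the %CPU column for every process named radvd.
# Expects "ps -axo %cpu,comm" style input on stdin (header line is ignored).
radvd_cpu() {
    awk '$2 == "radvd" { total += $1 } END { printf "%.1f\n", total + 0 }'
}

# Hypothetical sampling loop, left commented out so that pasting the
# function has no side effects:
# while :; do printf '%s ' "$(date +%T)"; ps -axo %cpu,comm | radvd_cpu; sleep 1; done
```

Running the loop next to a continuous ping makes it easier to see whether the CPU jumps and the latency spikes share timestamps.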

tcsac

bump - anybody?

tcsac

Alright, so I'm hoping someone from the Netgate crew can help me understand protocol here. I've seen REPEATEDLY people get flamed for posting to the bug tracker without posting an issue on the forum first. I track down almost exactly where the issue resides and post it on the forum and get nothing but crickets. Not so much as a "hey, give us X log" from anyone from Netgate. This is a very easily reproducible issue: I can literally downgrade and have it go away, then upgrade and have it reappear. So what are the steps to getting this fixed? This is feeling like a Plex issue at this point… aka: acknowledged as broken but nobody cares enough to bother fixing it.

chpalmer

I'm not seeing this. And with the lack of response I'm guessing nobody else is either.

Talk more about your total network. No Puma equipped cable modems do ya?

tcsac

PFsense directly attached to a pair of ADSL modems that are in bridge mode (Netgear 7550). I won't say the rest of the network is irrelevant, but it's kind of irrelevant given the behavior exists even directly attached to the pfsense box.

Flole

Having the same issue here, did this ever get fixed?

chpalmer

I missed the response 8 mos. ago.. Is this happening with 2.4.3 for you? Are you on a bonded connection or just load balancing?

Flole

I'm using LACP with 4 interfaces and VLANs on the bond, running pfSense 2.4.3. I started radvd in debug mode but saw nothing that indicates what might cause the problem. Whenever that spike happens, pings either get lost or go all the way up to 750ms. I also have some messages in dmesg about a "listen queue overflow". I am using Snort, but not on the interfaces that I am having problems with, so I don't think it is related (just wanted to mention it as it's an installed package).

Flole

Unfortunately this still exists in 2.4.4. I also noticed some latency spikes when PPPoE reconnects, so maybe this is related to IPv6 in general? dpinger and dhcp are using lots of CPU on the PPPoE reconnection event.

Only thing in dmesg that could be related is

sa6_recoverscope: embedded scope mismatch: xxxxxx sin6_scope_id was overridden

a few times.

Flole

I just removed the LACP and that fixed this issue (or at least made it better).
What I have noticed since then is that I get a ping spike on every second block of logs (the first block is fine, with no ping spike, but the second one causes a ping spike):

[Nov 04 04:10:33] radvd (46168): ioctl(SIOCGIFINDEX) succeeded on em2.1
[Nov 04 04:10:33] radvd (46168): ioctl(SIOCGIFFLAGS) succeeded on em2.1
[Nov 04 04:10:33] radvd (46168): em2.1 is up
[Nov 04 04:10:33] radvd (46168): em2.1 is running
[Nov 04 04:10:33] radvd (46168): em2.1 supports multicast
[Nov 04 04:10:33] radvd (46168): sysctl ifdata succeeded on em2.1
[Nov 04 04:10:33] radvd (46168): ioctl(SIOCGIFMEDIA) succeeded on em2.1
[Nov 04 04:10:33] radvd (46168): em2.1 is active


[Nov 04 04:10:36] radvd (46168): ioctl(SIOCGIFINDEX) succeeded on em2.1
[Nov 04 04:10:36] radvd (46168): ioctl(SIOCGIFFLAGS) succeeded on em2.1
[Nov 04 04:10:36] radvd (46168): em2.1 is up
[Nov 04 04:10:36] radvd (46168): em2.1 is running
[Nov 04 04:10:36] radvd (46168): em2.1 supports multicast
[Nov 04 04:10:36] radvd (46168): sysctl ifdata succeeded on em2.1
[Nov 04 04:10:36] radvd (46168): ioctl(SIOCGIFMEDIA) succeeded on em2.1
[Nov 04 04:10:36] radvd (46168): em2.1 is active

However, to me that's more of a workaround than a fix. Does that help with finding the issue, @chpalmer?
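To line these blocks up against ping results, it may help to pull out just the start time of each block. A rough sketch, assuming the debug output has been saved to a file in the bracketed format shown above; the function name block_starts and the file name radvd-debug.log are hypothetical.

```shell
# Print the time and interface for each line that opens an interface-check
# block, i.e. each "ioctl(SIOCGIFINDEX) succeeded" line in radvd debug output.
# Hypothetical usage:
#   block_starts < radvd-debug.log
block_starts() {
    awk 'index($0, "ioctl(SIOCGIFINDEX) succeeded") {
        t = $3               # third field looks like "04:10:33]"
        sub(/\]$/, "", t)    # drop the trailing bracket
        print t, $NF         # time, then interface name (last field)
    }'
}
```

Comparing the printed times with the timestamps of the slow pings should show whether it really is every second block that coincides with a spike.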

aharrison

I was seeing the exact same thing and it was driving me crazy.

Services > DHCPv6 Server & RA > LAN > DHCPv6 Server

Disabled the above - the issue went away instantly like magic

Flole

@aharrison What NIC are you using? I think I have found the issue.

aharrison

It's the Netgate RCC-VE 2440 System

Flole

Well I guess someone has to look into this now.... Most likely it's some kind of driver issue.... Are you using LACP?

Flole

Also it would be interesting to know if you've touched the tunables. I'm trying to figure out what all these 3 machines have in common, obviously it's not the NIC that caused the issue here.

aharrison

Sorry for the delay. LACP is not in use - it's a rather simple one-WAN, two-LAN configuration with a couple of VLANs in play. Minimal changes have been made to the defaults.

tcsac

@Flole - my fix was to disable radvd per aharrison's suggestion (apologies, I'm just seeing this now; it never notified me). I am also doing LACP from Chelsio NICs to a pair of Cisco Nexus switches. I still have LACP enabled, but I disabled radvd and the lag spikes are gone. I stumbled upon this again because upgrading to 2.4.4 re-enabled radvd and I couldn't recall what I had done the last time.

**I should note that disabling radvd isn't a fix as it means no dhcpv6 on the LAN which isn't really ideal. This is definitely a bug in radvd 2.17 and it's unfortunate it hasn't been addressed. It's not the CPU or hyperthreading or anything related to the NICs - it's 100% caused by radvd, you can see the process spike (even with priority set to low) and a corresponding lag spike goes with it. Disable radvd and the spikes disappear.

tcsac

@aharrison @Flole @chpalmer

I believe I have a fix - I've been running this for about 20 minutes with no lag spikes. I won't call it ideal, or even great, but it proves without a doubt that the issue is with the 2.x builds of radvd and not a network card, or VLAN, or LACP, or whatever other excuse gets offered. It's radvd.

I installed an older 1.x binary I was able to find on the freebsd packages mirror to replace the 2.17 binary. It seems to work perfectly fine (it's advertising as expected) and no more lag issues. Steps below (1.15 was the newest version I could find):

First, stop radvd (disable advertisements from the GUI).
Next, SSH into the system, open a shell from the console menu, and run:
cd /usr/local/sbin
mv radvd radvd.bak
mv radvdump radvdump.bak
cd /tmp
fetch http://pkg.freebsd.org/FreeBSD:10:amd64/release_3/All/radvd-1.15.txz
tar xf radvd-1.15.txz
cd /tmp/usr/local/sbin
cp radvd* /usr/local/sbin/

restart radvd from the GUI and you should be good to go.

Hopefully someone at Netgate will address this more formally. As far as I can tell at this point it breaks nothing. If you do run into issues you should have no problem backing out: delete the new radvd and radvdump binaries, then rename radvd.bak and radvdump.bak back to radvd and radvdump.
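The back-out can be sketched as a small shell function. This is an illustration under assumptions, not a tested procedure: the function name restore_radvd is made up, and it expects the .bak files to sit in /usr/local/sbin as in the steps above (a different directory can be passed as the first argument). Stop radvd from the GUI before running it and restart it afterwards.

```shell
# Put the original radvd binaries back by renaming the .bak copies over
# the replacements. Directory defaults to /usr/local/sbin.
restore_radvd() {
    dir="${1:-/usr/local/sbin}"
    for name in radvd radvdump; do
        if [ -f "$dir/$name.bak" ]; then
            # mv -f overwrites the 1.x binary with the saved original
            mv -f "$dir/$name.bak" "$dir/$name"
        else
            echo "no backup found for $name in $dir" >&2
        fi
    done
}

# Hypothetical usage, commented out so that pasting has no side effects:
# restore_radvd
```

Renaming over the existing file removes the replacement binary and restores the original in one step, which avoids a window where no radvd binary exists.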