New latency every 30 seconds with 2.4.2 caused by radvd 2.17_3

  • I've noticed since upgrading to 2.4.2 that I'm getting nasty latency spikes almost exactly every 30 seconds.  I've tried pinging the firewall's internal IP from multiple hosts, and they all see the same behavior (I've got 3 VLANs, and the IP on each VLAN exhibits the same thing).  Pings between hosts on the same subnet do not see this behavior; however, pings that cross subnets (and have to go through the router) do.  The spikes are really nasty too: 300ms+ on most occasions.

    Anyone else seeing similar behavior?  It's made my network almost unusable for anything WAN facing.

    I'm running a Supermicro SYS-5018D-FN4T with a Chelsio T520-CR if that makes a difference.  Again, this JUST started after updating to 2.4.2; 0 issues with prior versions of pfSense.

    Happy to grab logs if they'd help, but I don't see anything obvious.
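
    One way to confirm the "almost exactly every 30 seconds" pattern from ping data is sketched below. It is a minimal illustration, not something the poster ran: it assumes RTT samples collected at a fixed interval (one per second here), treats lost pings (`None`) as spikes, and uses a hypothetical 100 ms threshold. The names `find_spikes` and `spike_period` are made up for this example.

```python
from statistics import median


def find_spikes(rtts_ms, threshold_ms=100.0):
    """Return the index of the first sample in each run of high-RTT samples.

    A sample counts as "high" if it is None (lost ping) or above threshold_ms.
    The threshold is an assumption; pick one well above the normal RTT.
    """
    spikes = []
    prev_high = False
    for i, rtt in enumerate(rtts_ms):
        high = rtt is None or rtt > threshold_ms
        if high and not prev_high:
            spikes.append(i)
        prev_high = high
    return spikes


def spike_period(rtts_ms, interval_s=1.0, threshold_ms=100.0):
    """Return the median time in seconds between spike starts, or None."""
    spikes = find_spikes(rtts_ms, threshold_ms)
    if len(spikes) < 2:
        return None
    gaps = [(b - a) * interval_s for a, b in zip(spikes, spikes[1:])]
    return median(gaps)


if __name__ == "__main__":
    # Synthetic data: 0.4 ms baseline with spikes 30 samples apart.
    samples = [0.4] * 95
    for i in (10, 40, 70):
        samples[i] = 320.0
    samples[41] = None  # a lost ping right after a spike belongs to the same run
    print(spike_period(samples))
```

    A median close to 30 across a few minutes of samples would point at a periodic task on the router rather than random congestion.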

  • I guess I'll reply to my own post. I've narrowed it down: every time there is a latency spike, radvd jumps to 30% CPU usage, so I think I've found the culprit. Now the question is what changed in 2.4.2 that might be causing that behavior.  I'm assuming radvd is running for an IPv6 tunnel I use.

    I see the radvd version changed with the upgrade, I think I'm closing in on the issue at least :)

    Nov 21 21:56:30 kernel radvd: 1.9.1 -> 2.17_3 [pfSense]
    Nov 21 21:56:57 radvd 76672 version 2.17 started

    **so I'm not spamming here, I'll just reply to this.  Killing radvd stopped all of my issues.  I'm going to leave it down for now - I can live without IPv6 for the moment - let me know what else I can provide to help the troubleshooting.

  • bump - anybody?

  • Alright, so I'm hoping someone from the Netgate crew can help me understand protocol here.  I've REPEATEDLY seen people get flamed for posting to the bug tracker without posting an issue on the forum first.  I track down almost exactly where the issue resides, post it on the forum, and get nothing but crickets.  Not so much as a "hey, give us X log" from anyone at Netgate.  This is a very easily reproducible issue: I can literally downgrade and have it go away, then upgrade and have it reappear.  So what are the steps to getting this fixed?  This is feeling like a Plex issue at this point… aka: acknowledged as broken but nobody cares enough to bother fixing it.

  • I'm not seeing this. And with the lack of response, I'm guessing nobody else is either.

    Talk more about your total network.  No Puma equipped cable modems do ya?

  • pfSense is directly attached to a pair of ADSL modems that are in bridge mode (Netgear 7550).  I won't say the rest of the network is irrelevant, but it's kind of irrelevant given the behavior exists even when directly attached to the pfSense box.

  • Having the same issue here, did this ever get fixed?

  • I missed the response 8 mos. ago.. Is this happening with 2.4.3 for you? Are you on a bonded connection or just load balancing?

  • I'm using LACP with 4 interfaces and VLANs on the bond, running pfSense 2.4.3. I started radvd in debug mode, but nothing in the output indicates what might cause the problem. Whenever the spike happens, pings get lost or go all the way up to 750ms. I also have some messages in dmesg about a "listen queue overflow". I am using Snort, but not on the interfaces that I am having problems with, so I don't think it is related (just wanted to mention it as it's an installed package).

  • Unfortunately this still exists in 2.4.4. I also noticed some latency spikes when PPPoE reconnects, so maybe this is related to IPv6 in general? dpinger and dhcp are using lots of CPU on the PPPoE reconnection event.

    Only thing in dmesg that could be related is

    sa6_recoverscope: embedded scope mismatch: xxxxxx sin6_scope_id was overridden

    a few times.

  • I just removed the LACP and that fixed this issue (or at least made it better).
    What I have noticed since then is that I get a ping spike on every second block of log lines (so the first block is fine, no ping spike, but the second one causes a ping spike):

    [Nov 04 04:10:33] radvd (46168): ioctl(SIOCGIFINDEX) succeeded on em2.1
    [Nov 04 04:10:33] radvd (46168): ioctl(SIOCGIFFLAGS) succeeded on em2.1
    [Nov 04 04:10:33] radvd (46168): em2.1 is up
    [Nov 04 04:10:33] radvd (46168): em2.1 is running
    [Nov 04 04:10:33] radvd (46168): em2.1 supports multicast
    [Nov 04 04:10:33] radvd (46168): sysctl ifdata succeeded on em2.1
    [Nov 04 04:10:33] radvd (46168): ioctl(SIOCGIFMEDIA) succeeded on em2.1
    [Nov 04 04:10:33] radvd (46168): em2.1 is active
    [Nov 04 04:10:36] radvd (46168): ioctl(SIOCGIFINDEX) succeeded on em2.1
    [Nov 04 04:10:36] radvd (46168): ioctl(SIOCGIFFLAGS) succeeded on em2.1
    [Nov 04 04:10:36] radvd (46168): em2.1 is up
    [Nov 04 04:10:36] radvd (46168): em2.1 is running
    [Nov 04 04:10:36] radvd (46168): em2.1 supports multicast
    [Nov 04 04:10:36] radvd (46168): sysctl ifdata succeeded on em2.1
    [Nov 04 04:10:36] radvd (46168): ioctl(SIOCGIFMEDIA) succeeded on em2.1
    [Nov 04 04:10:36] radvd (46168): em2.1 is active

    However, to me that's more of a workaround than a fix. Does that help with finding the issue, @chpalmer?
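
    To line these blocks up against ping timestamps, the debug output can be split into blocks and the gaps between them measured. The sketch below is only an illustration based on the line format shown above: it assumes each block begins with an `ioctl(SIOCGIFINDEX)` line, that month names are in English (`%b` is locale-dependent), and it supplies an arbitrary placeholder year because the log has none. `block_starts` and `gaps_seconds` are hypothetical helper names.

```python
import re
from datetime import datetime

# Matches lines such as:
# [Nov 04 04:10:33] radvd (46168): ioctl(SIOCGIFINDEX) succeeded on em2.1
LINE_RE = re.compile(
    r"^\s*\[(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2})\] "
    r"radvd \((?P<pid>\d+)\): (?P<msg>.*)$"
)


def block_starts(lines, year=2000):
    """Return the timestamp of each block's first line.

    A block is assumed to start at an ioctl(SIOCGIFINDEX) message.
    The year is a placeholder; only differences between times are used.
    """
    starts = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a radvd debug line
        if m["msg"].startswith("ioctl(SIOCGIFINDEX)"):
            stamp = f"{year} {m['ts']}"
            starts.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return starts


def gaps_seconds(starts):
    """Return the seconds elapsed between consecutive block starts."""
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    log = [
        "[Nov 04 04:10:33] radvd (46168): ioctl(SIOCGIFINDEX) succeeded on em2.1",
        "[Nov 04 04:10:33] radvd (46168): em2.1 is up",
        "[Nov 04 04:10:36] radvd (46168): ioctl(SIOCGIFINDEX) succeeded on em2.1",
        "[Nov 04 04:10:36] radvd (46168): em2.1 is active",
    ]
    print(gaps_seconds(block_starts(log)))
```

    Comparing the block start times with the times of the ping spikes would show whether every second block really coincides with a spike.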

  • I was seeing the exact same thing and it was driving me crazy.

    Services > DHCPv6 Server & RA > LAN > DHCPv6 Server

    Disabled the above - the issue went away instantly like magic

  • @aharrison What NIC are you using? I think I have found the issue.

  • It's the Netgate RCC-VE 2440 System

  • Well I guess someone has to look into this now.... Most likely it's some kind of driver issue.... Are you using LACP?

  • Also it would be interesting to know if you've touched the tunables. I'm trying to figure out what all these 3 machines have in common, obviously it's not the NIC that caused the issue here.

  • Sorry for the delay. LACP is not in use - it's a rather simple one-WAN, two-LAN configuration with a couple of VLANs in play. Minimal changes have been made to the defaults.