QUIC hell



  • Hello,

    We have been running pfSense at an SME office (~20 people, WAN is 100 Mbps symmetrical) for
    the last ~2 years with an unremarkable(*) history of issues, but we had a strange incident
    today and I'm wondering if anybody else has seen something like it.

    The issue:
    WAN throughput for all services drops, ping to the outside reports a loss rate >50%.
    The dashboard shows the WAN link >90% busy outbound.
    A WAN packet trace shows mostly outbound IPv6 traffic to <some-ipv6-ip> UDP port 443.
    IPv6 + UDP ? WTF !?

    We run dual-stack IPv4/IPv6, so the initial panic reaction (DoS attack? Intrusion?) was
    to disable IPv6 to kill that malicious thing. A little later the issue returned on IPv4,
    but we were ready for it and cut off the offending LAN host right away.
    After a little more "research" (asking the host's owner "what the hell are you doing?!"),
    this turned out to be just a large upload to Google Drive.

    Correlating with the traces, this is using QUIC (a UDP-based protocol(**)) within
    Chrome (on Mac) as the client for Google Drive.
    With QUIC disabled in Chrome, the upload is actually fine: with TCP in control we
    don't see any negative effects on other services.

    Soooo, I'm wondering how to adjust our setup to this and if others have seen - and solved - such an issue before:

    • Running uploads to Google Drive in our context is certainly legitimate and using a common
        client like Chrome (with QUIC enabled by default) is not something we want to restrict.
    • However, this legitimate client bringing down the WAN for all kinds of other essential
        services is a serious issue.
    • So what is there to do ? Rate-limit all UDP traffic ?

    An internet search only shows a handful of other reports on this topic (e.g. ***), but I can't believe we are the only ones affected …

    Regards,
    Martin.

    (*) This is a compliment!

    (**) See https://en.wikipedia.org/wiki/QUIC

    (***) http://kb.fortinet.com/kb/documentLink.do?externalID=FD36680</some-ipv6-ip>



  • I've noticed it before; I even went and blocked it outbound. Blocking it only affects things that use it. Chrome/Chromium anywhere (Windows, Linux, BSD) will generate it, I think even on Android. It is supposed to make things "faster", but both ends have to be expecting it, so right now destinations like Google (any of the apps, even Gmail) are pretty much the only ones using it. There is an extension for Chrome (search for SPDY) that will indicate its use.

    On my home network, there was no difference enabled/disabled, but I'm not even close to saturating the link.

    And yes, it was quite bizarre seeing UDP destined to port 443. I had the same WTF moment.
    It's fairly well defined, so if you rate limit UDP with dest port == 443 you should only be throttling the one thing.
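
    To make that match concrete, here is a minimal Python sketch (not pfSense configuration) of the classification such a rule would perform. The `Packet` type and the `is_probable_quic` helper are hypothetical names invented for illustration; the real rule would of course be set up in the firewall itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified view of a packet as a firewall rule sees it."""
    protocol: str   # "udp" or "tcp"
    dst_port: int
    direction: str  # "out" (LAN -> WAN) or "in"


def is_probable_quic(pkt: Packet) -> bool:
    """Match the narrow class described above: outbound UDP to port 443."""
    return (
        pkt.direction == "out"
        and pkt.protocol == "udp"
        and pkt.dst_port == 443
    )


if __name__ == "__main__":
    samples = [
        Packet("udp", 443, "out"),  # QUIC candidate, would be throttled
        Packet("tcp", 443, "out"),  # ordinary HTTPS, untouched
        Packet("udp", 53, "out"),   # DNS, untouched
    ]
    for pkt in samples:
        print(pkt, "->", "limit" if is_probable_quic(pkt) else "pass")
```

    Anything that does not match all three conditions is left alone, which is why ordinary HTTPS over TCP and other UDP services such as DNS should not be affected.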



  • Policy: general rate limiting by internal switch port(s) ?



  • Thanks for the input.

    I'll give it a try with a simple block rule for outbound v4/v6-UDP-port-443 (and -80) - this should trigger an auto-fall-back to TCP instead of QUIC, without an explicit config change on the host. I'll post results here.

    That said, I'm still wondering why our experience with QUIC is so bad while there are so few other posts on QUIC …

    Thanks,
    Martin.



  • @mnbokaem:

    Thanks for the input.

    I'll give it a try with a simple block rule for outbound v4/v6-UDP-port-443 (and -80) - this should trigger an auto-fall-back to TCP instead of QUIC, without an explicit config change on the host. I'll post results here.

    That said, I'm still wondering why our experience with QUIC is so bad while there are so few other posts on QUIC …

    Thanks,
    Martin.

    I don't recall seeing a problem with the traffic, just that UDP to 443 was weird and raised the same concerns, but my environment is different from yours (fewer users, typical home setup), so perhaps I didn't trigger the bad behavior.

    As for why "… so few other posts on QUIC …": you only noticed it because of a symptom; I noticed it because I was looking at the firewall logs. I typically have it set up with a default deny on LAN, then add ports/services as needed (yes, I know the arguments for and against this), so I saw a lot of blocked traffic and had to dig into it. I'm glad "Professor Google" had an answer on what the traffic is.



  • QUIC is supposed to be very good about congestion. I wonder why your network got flooded to such high loss rates.



  • Having that level of issues when things are congested suggests a connectivity problem of some sort. Maybe a duplex mismatch if your CPE is forced to 100 Mb full duplex and your WAN (or whatever's plugged into the CPE) is set to autonegotiate.



  • I believe I am experiencing the same issue. When I upload to YouTube, my internet packet loss is around 50%. It is only with YouTube, nothing else.

    It looks like pfSense's network graph doesn't handle it well either. It reports the wrong bandwidth being used.

    https://www.dslreports.com/forum/r30684240-Speed-Are-youtube-video-uploads-uncapped
    The last post in this thread seems to mention the issue.



  • I noticed many blocked packets during video playback using a Chrome browser on Debian for Netflix. The previous beta unstable was the only version that allowed Linux and Netflix to run natively. It was during this time that Google started their QUIC experiment and the logs started to show the UDP being blocked. I just have ports 80 and 443 use TCP only and not TCP/UDP.
    No extra rules needed. That's the benefit of running a default-deny setup: just reset the LAN out rule for those ports to TCP only.
    UDP port 443 LOL, another great idea from Google. Sorry, but I do not trust that company and never will again.



  • @NotAnAlias:

    I believe I am experiencing the same issue. When I upload to YouTube, my internet packet loss is around 50%. It is only with YouTube, nothing else.

    It looks like pfSense's network graph doesn't handle it well either. It reports the wrong bandwidth being used.

    In that case it looks like QUIC's congestion control is broken.

    The graph is absolutely correct; it uses the NIC's counters. You can pass more traffic out of an interface than your ISP allows. The only limit to how much you can push out your WAN NIC is its link speed, probably 100 Mb or 1 Gb. TCP won't exceed your ISP-imposed speed limit because of its congestion control. Plain UDP has no such congestion control, so you can blast as much as you want, but everything past your connection's upload limit will be dropped upstream. QUIC's congestion control is supposed to prevent sending more traffic than your link allows. Since it isn't doing that here, it can overrun your connection so badly that it causes significant problems.
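
    A back-of-the-envelope sketch of that arithmetic, assuming the sender never backs off and the upstream drops everything above the cap evenly across all traffic. The function name and the example rates are made up for illustration; they are not measurements from this thread.

```python
def expected_loss(offered_mbps: float, uplink_cap_mbps: float) -> float:
    """Fraction of traffic dropped upstream when the offered load exceeds the cap.

    Assumes a sender that ignores loss and a drop policy that hits all
    packets equally, so unrelated traffic (e.g. ping) sees the same rate.
    """
    if uplink_cap_mbps <= 0:
        raise ValueError("uplink cap must be positive")
    if offered_mbps <= uplink_cap_mbps:
        return 0.0
    return 1.0 - uplink_cap_mbps / offered_mbps


if __name__ == "__main__":
    # Hypothetical numbers: a 25 Mbps upload cap with increasing offered load.
    for offered in (20, 25, 50, 100):
        print(f"{offered:>3} Mbps offered -> {expected_loss(offered, 25):.0%} loss")
```

    Under these assumptions, a loss rate of around 50% corresponds to roughly twice as much traffic leaving the WAN NIC as the upstream link will carry.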



  • @cmb:

    @NotAnAlias:

    I believe I am experiencing the same issue. When I upload to YouTube, my internet packet loss is around 50%. It is only with YouTube, nothing else.

    It looks like pfSense's network graph doesn't handle it well either. It reports the wrong bandwidth being used.

    In that case it looks like QUIC's congestion control is broken.

    The graph is absolutely correct; it uses the NIC's counters. You can pass more traffic out of an interface than your ISP allows. The only limit to how much you can push out your WAN NIC is its link speed, probably 100 Mb or 1 Gb. TCP won't exceed your ISP-imposed speed limit because of its congestion control. Plain UDP has no such congestion control, so you can blast as much as you want, but everything past your connection's upload limit will be dropped upstream. QUIC's congestion control is supposed to prevent sending more traffic than your link allows. Since it isn't doing that here, it can overrun your connection so badly that it causes significant problems.

    That's the fun part of networking: the slowest link matters. You have a gig to your pfSense box, 100 to the cable modem, then the other side of that is clamped to 25 upstream. Almost makes one want to put in a queue for outbound traffic clamped to 25 Mbps.
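
    For anyone curious what "clamped to 25 Mbps" means mechanically, here is a minimal token-bucket sketch in Python. It is only an illustration of the idea, not how pfSense implements its shaper; the class, the `simulate` helper, and the burst size are all assumptions. It also drops excess packets instead of queueing them, which a real shaper would normally do.

```python
class TokenBucket:
    """Toy rate limiter: tokens are bytes, refilled at the configured rate."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate_bytes_per_s = rate_bps / 8
        self.capacity = float(burst_bytes)
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = 0.0

    def allow(self, size_bytes: int, now: float) -> bool:
        """Return True if a packet of size_bytes may be sent at time now."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


def simulate(offered_bps: float, shaper: TokenBucket,
             packet_bytes: int = 1500, duration_s: float = 1.0):
    """Offer evenly spaced packets at offered_bps; return (sent, passed)."""
    interval = packet_bytes * 8 / offered_bps
    sent = passed = 0
    now = 0.0
    while now < duration_s:
        sent += 1
        if shaper.allow(packet_bytes, now):
            passed += 1
        now += interval
    return sent, passed


if __name__ == "__main__":
    # Hypothetical: a host blasting 100 Mbps into a 25 Mbps clamp.
    sent, passed = simulate(100_000_000, TokenBucket(25_000_000, 15_000))
    print(f"offered {sent} packets, passed {passed} ({passed / sent:.0%})")
```

    The point of doing this on the firewall is that the excess is discarded (or delayed) locally, before it reaches the upstream link where it would otherwise crowd out everyone else's traffic.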