Latency spikes while transferring data
Hello all, I have been struggling with this all day and can't seem to find a solution. When I run a speedtest on Comcast's site, the latency skyrockets during both the download and upload tests, with upload being the worst. I've tried enabling/disabling TSO, LRO, Checksum Offload, MSI-X, and fastforwarding, with no difference. Enabling polling basically kills the interface (no data can be sent).
System has 2x Intel 82573L interfaces.
Edit: I've also tried commenting out extra loader.conf stuff and extra Tunables, with no difference.
I'll include ping results, loader.conf, and Tunables below. Any suggestions are greatly appreciated! :)
Ping during speedtest:
/root(4): ping bing.com
PING bing.com (188.8.131.52): 56 data bytes
64 bytes from 184.108.40.206: icmp_seq=0 ttl=120 time=13.405 ms
64 bytes from 220.127.116.11: icmp_seq=1 ttl=120 time=13.757 ms
64 bytes from 18.104.22.168: icmp_seq=2 ttl=120 time=13.666 ms
64 bytes from 22.214.171.124: icmp_seq=3 ttl=120 time=12.597 ms
64 bytes from 126.96.36.199: icmp_seq=4 ttl=120 time=13.712 ms
64 bytes from 188.8.131.52: icmp_seq=5 ttl=120 time=13.684 ms
64 bytes from 184.108.40.206: icmp_seq=6 ttl=120 time=14.116 ms
64 bytes from 220.127.116.11: icmp_seq=7 ttl=120 time=12.549 ms
64 bytes from 18.104.22.168: icmp_seq=8 ttl=120 time=13.600 ms
64 bytes from 22.214.171.124: icmp_seq=9 ttl=120 time=18.159 ms
64 bytes from 126.96.36.199: icmp_seq=10 ttl=120 time=59.694 ms
64 bytes from 188.8.131.52: icmp_seq=11 ttl=120 time=91.635 ms
64 bytes from 184.108.40.206: icmp_seq=12 ttl=120 time=104.155 ms
64 bytes from 220.127.116.11: icmp_seq=13 ttl=120 time=107.565 ms
64 bytes from 18.104.22.168: icmp_seq=14 ttl=120 time=126.674 ms
64 bytes from 22.214.171.124: icmp_seq=15 ttl=120 time=153.777 ms
64 bytes from 126.96.36.199: icmp_seq=16 ttl=120 time=142.239 ms
64 bytes from 188.8.131.52: icmp_seq=17 ttl=120 time=159.227 ms
64 bytes from 184.108.40.206: icmp_seq=18 ttl=120 time=178.841 ms
64 bytes from 220.127.116.11: icmp_seq=19 ttl=120 time=175.886 ms
64 bytes from 18.104.22.168: icmp_seq=20 ttl=120 time=16.405 ms
64 bytes from 22.214.171.124: icmp_seq=21 ttl=120 time=16.096 ms
64 bytes from 126.96.36.199: icmp_seq=22 ttl=120 time=13.026 ms
64 bytes from 188.8.131.52: icmp_seq=23 ttl=120 time=13.708 ms
64 bytes from 184.108.40.206: icmp_seq=24 ttl=120 time=22.384 ms
64 bytes from 220.127.116.11: icmp_seq=25 ttl=120 time=14.192 ms
64 bytes from 18.104.22.168: icmp_seq=26 ttl=120 time=9764.597 ms
64 bytes from 22.214.171.124: icmp_seq=27 ttl=120 time=8770.282 ms
64 bytes from 126.96.36.199: icmp_seq=28 ttl=120 time=7776.026 ms
64 bytes from 188.8.131.52: icmp_seq=29 ttl=120 time=6775.911 ms
64 bytes from 184.108.40.206: icmp_seq=30 ttl=120 time=5781.350 ms
64 bytes from 220.127.116.11: icmp_seq=31 ttl=120 time=4780.921 ms
64 bytes from 18.104.22.168: icmp_seq=32 ttl=120 time=3779.889 ms
64 bytes from 22.214.171.124: icmp_seq=33 ttl=120 time=2784.440 ms
64 bytes from 126.96.36.199: icmp_seq=34 ttl=120 time=1784.390 ms
64 bytes from 188.8.131.52: icmp_seq=35 ttl=120 time=784.341 ms
64 bytes from 184.108.40.206: icmp_seq=36 ttl=120 time=15.181 ms
64 bytes from 220.127.116.11: icmp_seq=37 ttl=120 time=13.373 ms
loader.conf:
autoboot_delay="3"
vm.kmem_size="435544320"
vm.kmem_size_max="535544320"
kern.ipc.nmbclusters="70000"
debug.acpi.disabled="thermal"
#kern.hz="2000"
#kern.timecounter.hardware="TSC"
#kern.timecounter.smp_tsc="1"
#kern.timecounter.smp_tsc_adjust="1"
ahci_load="YES" # AHCI
aio_load="YES" # Async IO system calls
autoboot_delay="3" # reduce boot menu delay from 10 to 3 seconds.
cc_htcp_load="YES" # H-TCP Congestion Control for more aggressive increase in speed on higher
    # latency, high bandwidth networks.
amdtemp_load="YES" # amd K8, K10, K11 thermal sensors
#hw.em.enable_aim="1" # enable Intel's Adaptive Interrupt Moderation to reduce load for igb(4) (default 1)
#hw.em.max_interrupt_rate="19200" # maximum number of interrupts per second generated by single igb(4) (default 8000)
hw.em.num_queues="1" # number of queues supported on the hardware NIC (default 0),(Intel i350 = 8 queues)
    # For saturated network set to zero(0) which will auto tune and the driver will create
    # as many queues as CPU cores up to a max of eight(8). For lightly loaded networks set to
    # one(1) to reduce interrupts, lower latency and increase efficiency. (vmstat -i)
hw.em.enable_msix="1" # enable MSI-X interrupts for PCI-E devices so nic polling is not needed anymore (default 1)
hw.em.eee_setting="0" # Enable Energy Efficient Ethernet
dev.em.0.eee_control="0"
dev.em.1.eee_control="0"
hw.em.txd="4096" # igb under load will not drop packets by using transmit and receive descriptor
hw.em.rxd="4096" # rings in main memory which point to packet buffers. the igb driver transfers packet
    # data to and from main memory independent of the CPU, using the descriptor rings as
    # lists of packet transmit and receive requests to carry out. Increase each if
    # your machine or network is saturated or if you have plenty of ram. (default 1024)
hw.em.rx_process_limit="-1" # maximum number of received packets to process at a time, The default is
    # too low for most firewalls. recommend "-1" which means unlimited (default 100)
#hw.em.rx_abs_int_delay="33" # Default receive interrupt delay limit in usecs
#hw.em.tx_abs_int_delay="66" # Default transmit interrupt delay limit in usecs
#hw.em.rx_int_delay="33" # Default receive interrupt delay in usecs
#hw.em.tx_int_delay="66" # Default transmit interrupt delay in usecs
net.isr.bindthreads="1" # bind a network thread to a real single CPU core _IF_ you are using a single network
    # queue (hw.igb.num_queues="1") and network processing is using less then 90% of the
    # single CPU core. For high bandwidth systems settting bindthreads to "0" will spread
    # the network processing load over multiple cpus allowing the system to handle more
    # throughput. The default is faster for most systems with multiple queues. (default 0)
net.link.ifqmaxlen="55" # network interface output queue length in number of packets. we recommend the
    # number of packets the interface can transmit in 50 milliseconds. it is more efficient to
    # send packets out of a queue then to re-send them from an application, especially from
    # high latency wireless devices. if your upload speed is 25 megabit then set to
    # around "107". Do NOT set too high as to avoid excessive buffer bloat. (default 50)
    # calculate: bandwidth divided by 8 bits times 1000 divided by the MTU times 0.05 seconds
    # ( ( (25/8) * 1000 ) / 1.448 ) * 0.05 = 107.90 packets in 50 milliseconds.
net.inet.tcp.tcbhashsize="1024"
kern.ipc.somaxconn="2048"
# On some systems HPET is almost 2 times faster than default ACPI-fast
# Useful on systems with lots of clock_gettime / gettimeofday calls
# See http://old.nabble.com/ACPI-fast-default-timecounter,-but-HPET-83--faster-td23248172.html
# After revision 222222 HPET became default: http://svnweb.freebsd.org/base?view=revision&revision=222222
#kern.timecounter.hardware="HPET"
# Tweaks hardware
#coretemp_load="YES" #intel
legal.intel_wpi.license_ack="1"
legal.intel_ipw.license_ack="1"
boot_multicons="YES"
boot_serial="YES"
comconsole_speed="9600"
console="comconsole,vidconsole"
hw.usb.no_pf="1"
Tunables:
vfs.forcesync default (0)
debug.pfftpproxy 1
vfs.read_max default (32)
net.inet.ip.portrange.first default (1024)
net.inet.tcp.blackhole default (2)
net.inet.udp.blackhole default (1)
net.inet.ip.random_id default (1)
net.inet.tcp.drop_synfin default (1)
net.inet.ip.redirect default (1)
net.inet6.ip6.redirect default (1)
net.inet6.ip6.use_tempaddr default (0)
net.inet6.ip6.prefer_tempaddr default (0)
net.inet.tcp.syncookies 0
net.inet.tcp.recvspace 65535
net.inet.tcp.sendspace 65535
net.inet.ip.fastforwarding 0
net.inet.tcp.delayed_ack D default (0)
net.inet.udp.maxdgram default (57344)
net.link.bridge.pfil_onlyip default (0)
net.link.bridge.pfil_member default (1)
net.link.bridge.pfil_bridge default (0)
net.link.tap.user_open default (1)
kern.randompid default (347)
net.inet.ip.intr_queue_maxlen default (1000)
hw.syscons.kbd_reboot default (0)
net.inet.tcp.inflight.enable 0
net.inet.tcp.log_debug default (0)
net.inet.icmp.icmplim default (0)
net.inet.tcp.tso default (1)
net.inet.udp.checksum default (1)
kern.ipc.maxsockbuf default (4262144)
net.inet.tcp.mssdflt 1460
net.inet.tcp.msl 9000
net.inet.ip.redirect 0
net.inet.raw.maxdgram 9216
net.inet.raw.recvspace 9216
kern.ipc.somaxconn 1024
dev.em.0.fc 0
dev.em.1.fc 0
net.inet.ip.check_interface 1
net.inet.ip.process_options 0
net.inet.icmp.drop_redirect 1
net.inet.tcp.drop_synfin 1
net.inet.tcp.fast_finwait2_recycle 1
net.inet.tcp.icmp_may_rst 0
net.inet.tcp.path_mtu_discovery 0
net.inet.tcp.nolocaltimewait 1
net.inet.tcp.rfc3042 0
net.inet6.icmp6.nodeinfo 0
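For anyone checking the net.link.ifqmaxlen value: the comment in the loader.conf above gives a rule of thumb (the number of packets the interface can send in 50 milliseconds). Here is a small Python sketch of that arithmetic. The function names are made up for illustration, and the 1.448 figure is simply the packet size in kilobytes that the comment uses.

```python
def ifqmaxlen_for(upload_mbit, window_s=0.05, packet_kbytes=1.448):
    """Packets the link can send in window_s, per the loader.conf comment."""
    kbytes_per_s = upload_mbit / 8 * 1000       # megabit/s -> kilobytes/s
    packets_per_s = kbytes_per_s / packet_kbytes
    return packets_per_s * window_s


def upload_mbit_for(ifqmaxlen, window_s=0.05, packet_kbytes=1.448):
    """Inverse: the upload speed a given queue length corresponds to."""
    return ifqmaxlen / window_s * packet_kbytes * 8 / 1000


if __name__ == "__main__":
    # The comment's own example: 25 megabit upload -> roughly 107 packets.
    print(f"25 Mbit/s -> {ifqmaxlen_for(25):.1f} packets")
    # The value actually configured above is 55.
    print(f"55 packets -> {upload_mbit_for(55):.2f} Mbit/s")
```

By this rule the configured value of 55 corresponds to an upload speed of a bit under 13 megabit. Whether that matches the real upload speed of the connection is not stated in the post.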
A speedtest will purposely flood the device with traffic to see how fast the link will go. Your "ping" packets will always get on the end of a long queue at the router when trying to go out (upstream) and so it will be "big number" of milliseconds before they even escape the router.
To get around that you need to play with the Traffic Shaper and let it prioritize ACK packets, ping packets, little stuff. Then those packets can "jump the queue" and get out ahead of all those waiting speed test packets.
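To make the "jump the queue" idea concrete, here is a toy Python sketch comparing a plain FIFO queue with a two-level strict priority queue. Everything in it is an assumption for illustration: the 2 Mbit/s upstream rate, the 128-byte cutoff for "little stuff", and the drain function itself. A real shaper is far more involved than this.

```python
from collections import deque

LINK_BYTES_PER_MS = 250  # assumed 2 Mbit/s upstream: 2_000_000 / 8 / 1000


def drain(packets, prioritize_small, small_limit=128):
    """Return {name: time in ms at which the packet finishes sending}.

    All packets are assumed to be queued already at time zero.
    """
    high, low = deque(), deque()
    for name, size in packets:
        if prioritize_small and size <= small_limit:
            high.append((name, size))   # ACKs, pings, other small packets
        else:
            low.append((name, size))    # bulk traffic
    clock_ms = 0.0
    sent_at = {}
    while high or low:
        # Strict priority: the low queue is served only when high is empty.
        name, size = (high or low).popleft()
        clock_ms += size / LINK_BYTES_PER_MS
        sent_at[name] = clock_ms
    return sent_at


# 100 full-size speedtest packets are waiting, then one 84-byte ping arrives.
backlog = [(f"bulk{i}", 1500) for i in range(100)] + [("ping", 84)]

fifo = drain(backlog, prioritize_small=False)
shaped = drain(backlog, prioritize_small=True)

if __name__ == "__main__":
    print(f"FIFO:     ping leaves after {fifo['ping']:.1f} ms")
    print(f"priority: ping leaves after {shaped['ping']:.1f} ms")
```

With these made-up numbers the ping waits behind about 600 ms of bulk traffic in the FIFO case and goes out almost immediately in the priority case, while the bulk packets finish at essentially the same time either way.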
Start reading about traffic shaping…
I think the question here is why there's a latency spike while transferring data, not how to shape the bandwidth. Even if you shape the traffic, a latency spike will still occur, since it depends on how much data is being transferred to the other station.
I think the question here is why there's a latency spike while transferring data
A speedtest will purposely flood the device with traffic
Thanks for the info! I confirmed the same spike on my Ubiquiti edge lite. I guess I never looked at latency while speed testing before; the only reason I did this time was that I had added a lot of the tweaks seen in the first post.
Would traffic shaping be a benefit or a hindrance for online gaming (Xbox 360, PS3, PC, PS4 soon)? I have heard shaping can introduce some latency. I also have a few other devices (not mine) that are sometimes in use while I'm gaming, generally for YouTube, Netflix, or BitTorrent, though the link doesn't get saturated very often.
The router currently has 1GB DDR2 RAM and a single core AMD Athlon64 2000+ 1GHz.
If shaping would help, is there a guide that explains how everything works/what it means?
Also, regarding the timecounter, which should I be using, ACPI-safe or TSC?
Thanks again! :)
Edit: One more question, does anyone know why the interfaces stop working when polling is enabled?
Latency that high is caused by "buffer bloat". TCP uses dropped packets to determine when a link is congested. Because buffers are too large on many residential connections, packets queue up instead of being dropped, and the latency goes sky high because you have a long line of packets waiting to be processed. You can control the buffer bloat on your sending side by rate limiting the upload speed of your router/firewall to 80%-90% of your actual upload speed.
Unfortunately, you can't control buffer bloat on your receiving end. It's up to your ISP to properly size those buffers.
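As a rough illustration of why an oversized buffer hurts, and what the 80%-90% rule works out to, here is a short Python sketch. The 5 Mbit/s upload speed and the 256 KiB buffer are assumed example numbers, not measurements from this thread, and the function names are invented for the example.

```python
def queue_delay_ms(buffer_bytes, link_mbit):
    """Time needed to drain a full buffer at the link rate, in milliseconds."""
    return buffer_bytes * 8 / (link_mbit * 1_000_000) * 1000


def shaper_limit_mbit(measured_upload_mbit, fraction):
    """Upload limit to configure so the queue builds in the router, not the modem."""
    return measured_upload_mbit * fraction


if __name__ == "__main__":
    upload = 5.0            # assumed measured upload speed in Mbit/s
    buffer = 256 * 1024     # assumed modem buffer size in bytes

    # A packet arriving behind a full buffer waits this long before it is sent.
    print(f"full buffer adds about {queue_delay_ms(buffer, upload):.0f} ms")

    low = shaper_limit_mbit(upload, 0.80)
    high = shaper_limit_mbit(upload, 0.90)
    print(f"set the upload limit between {low:.2f} and {high:.2f} Mbit/s")
```

Keeping the limit slightly below the real upload speed means the modem's buffer never fills, so any queueing happens in the router where it can be kept short or prioritized.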
Thanks for the buffer bloat explanation. Figures Comcast would have a huge buffer. :|
I did try the traffic shaping wizard, and that worked great to almost eliminate the upload latency spike while saturated, though download still hits about 200ms, which isn't all that bad I guess. I'll continue to research and tweak it.
The only thing left I can't figure out is why polling causes the network interfaces to stop working. I would think Intel Pro NICs would support that feature. Is there something required for polling that isn't compiled into the kernel by default?