Latency spikes while transferring data

Ryu.Hayabusa

Hello all, I have been struggling with this all day and can't seem to find a solution. When I run a speedtest on Comcast's site, the latency skyrockets during both the download and upload tests, with upload being the worst. I've tried enabling and disabling TSO, LRO, checksum offload, MSI-X, and fastforwarding, with no difference. Enabling polling basically kills the interface (no data can be sent).

System has 2x Intel 82573L interfaces.

Edit: I've also tried commenting out extra loader.conf stuff and extra Tunables, with no difference.

I'll include ping results, loader.conf, and Tunables below. Any suggestions are greatly appreciated! :)

Ping during speedtest:

/root(4): ping bing.com
PING bing.com (204.79.197.200): 56 data bytes
64 bytes from 204.79.197.200: icmp_seq=0 ttl=120 time=13.405 ms
64 bytes from 204.79.197.200: icmp_seq=1 ttl=120 time=13.757 ms
64 bytes from 204.79.197.200: icmp_seq=2 ttl=120 time=13.666 ms
64 bytes from 204.79.197.200: icmp_seq=3 ttl=120 time=12.597 ms
64 bytes from 204.79.197.200: icmp_seq=4 ttl=120 time=13.712 ms
64 bytes from 204.79.197.200: icmp_seq=5 ttl=120 time=13.684 ms
64 bytes from 204.79.197.200: icmp_seq=6 ttl=120 time=14.116 ms
64 bytes from 204.79.197.200: icmp_seq=7 ttl=120 time=12.549 ms
64 bytes from 204.79.197.200: icmp_seq=8 ttl=120 time=13.600 ms
64 bytes from 204.79.197.200: icmp_seq=9 ttl=120 time=18.159 ms
64 bytes from 204.79.197.200: icmp_seq=10 ttl=120 time=59.694 ms
64 bytes from 204.79.197.200: icmp_seq=11 ttl=120 time=91.635 ms
64 bytes from 204.79.197.200: icmp_seq=12 ttl=120 time=104.155 ms
64 bytes from 204.79.197.200: icmp_seq=13 ttl=120 time=107.565 ms
64 bytes from 204.79.197.200: icmp_seq=14 ttl=120 time=126.674 ms
64 bytes from 204.79.197.200: icmp_seq=15 ttl=120 time=153.777 ms
64 bytes from 204.79.197.200: icmp_seq=16 ttl=120 time=142.239 ms
64 bytes from 204.79.197.200: icmp_seq=17 ttl=120 time=159.227 ms
64 bytes from 204.79.197.200: icmp_seq=18 ttl=120 time=178.841 ms
64 bytes from 204.79.197.200: icmp_seq=19 ttl=120 time=175.886 ms
64 bytes from 204.79.197.200: icmp_seq=20 ttl=120 time=16.405 ms
64 bytes from 204.79.197.200: icmp_seq=21 ttl=120 time=16.096 ms
64 bytes from 204.79.197.200: icmp_seq=22 ttl=120 time=13.026 ms
64 bytes from 204.79.197.200: icmp_seq=23 ttl=120 time=13.708 ms
64 bytes from 204.79.197.200: icmp_seq=24 ttl=120 time=22.384 ms
64 bytes from 204.79.197.200: icmp_seq=25 ttl=120 time=14.192 ms
64 bytes from 204.79.197.200: icmp_seq=26 ttl=120 time=9764.597 ms
64 bytes from 204.79.197.200: icmp_seq=27 ttl=120 time=8770.282 ms
64 bytes from 204.79.197.200: icmp_seq=28 ttl=120 time=7776.026 ms
64 bytes from 204.79.197.200: icmp_seq=29 ttl=120 time=6775.911 ms
64 bytes from 204.79.197.200: icmp_seq=30 ttl=120 time=5781.350 ms
64 bytes from 204.79.197.200: icmp_seq=31 ttl=120 time=4780.921 ms
64 bytes from 204.79.197.200: icmp_seq=32 ttl=120 time=3779.889 ms
64 bytes from 204.79.197.200: icmp_seq=33 ttl=120 time=2784.440 ms
64 bytes from 204.79.197.200: icmp_seq=34 ttl=120 time=1784.390 ms
64 bytes from 204.79.197.200: icmp_seq=35 ttl=120 time=784.341 ms
64 bytes from 204.79.197.200: icmp_seq=36 ttl=120 time=15.181 ms
64 bytes from 204.79.197.200: icmp_seq=37 ttl=120 time=13.373 ms

loader.conf:

autoboot_delay="3"
vm.kmem_size="435544320"
vm.kmem_size_max="535544320"
kern.ipc.nmbclusters="70000"
debug.acpi.disabled="thermal"

#kern.hz="2000"
#kern.timecounter.hardware="TSC"
#kern.timecounter.smp_tsc="1"
#kern.timecounter.smp_tsc_adjust="1"

ahci_load="YES"                    # AHCI
aio_load="YES"                     # Async IO system calls
autoboot_delay="3"                 # reduce boot menu delay from 10 to 3 seconds.
cc_htcp_load="YES"                 # H-TCP Congestion Control for more aggressive increase in speed on higher
                                   # latency, high bandwidth networks.

amdtemp_load="YES"                 # amd K8, K10, K11 thermal sensors

#hw.em.enable_aim="1"              # enable Intel's Adaptive Interrupt Moderation to reduce load for igb(4) (default 1)
#hw.em.max_interrupt_rate="19200"  # maximum number of interrupts per second generated by single igb(4) (default 8000)

hw.em.num_queues="1"               # number of queues supported on the hardware NIC (default 0),(Intel i350 = 8 queues)
                                   # For saturated network set to zero(0) which will auto tune and the driver will create
                                   # as many queues as CPU cores up to a max of eight(8). For lightly loaded networks set to
                                   # one(1) to reduce interrupts, lower latency and increase efficiency. (vmstat -i)

hw.em.enable_msix="1"              # enable MSI-X interrupts for PCI-E devices so nic polling is not needed anymore (default 1)
hw.em.eee_setting="0"              # Enable Energy Efficient Ethernet
dev.em.0.eee_control="0"
dev.em.1.eee_control="0"

hw.em.txd="4096"                   # igb under load will not drop packets by using transmit and receive descriptor
hw.em.rxd="4096"                   #  rings in main memory which point to packet buffers. the igb driver transfers packet
                                   #  data to and from main memory independent of the CPU, using the descriptor rings as
                                   #  lists of packet transmit and receive requests to carry out. Increase each if
                                   #  your machine or network is saturated or if you have plenty of ram. (default 1024)
hw.em.rx_process_limit="-1"        # maximum number of received packets to process at a time, The default is
                                   # too low for most firewalls. recommend "-1" which means unlimited (default 100)
#hw.em.rx_abs_int_delay="33"         # Default receive interrupt delay limit in usecs
#hw.em.tx_abs_int_delay="66"         # Default transmit interrupt delay limit in usecs
#hw.em.rx_int_delay="33"             # Default receive interrupt delay in usecs
#hw.em.tx_int_delay="66"             # Default transmit interrupt delay in usecs

net.isr.bindthreads="1"            # bind a network thread to a real single CPU core _IF_ you are using a single network
                                   # queue (hw.igb.num_queues="1") and network processing is using less than 90% of the
                                   # single CPU core. For high bandwidth systems setting bindthreads to "0" will spread
                                   # the network processing load over multiple cpus allowing the system to handle more
                                   # throughput. The default is faster for most systems with multiple queues. (default 0)

net.link.ifqmaxlen="55"            # network interface output queue length in number of packets. we recommend the
                                   # number of packets the interface can transmit in 50 milliseconds. it is more efficient to
                                   # send packets out of a queue than to re-send them from an application, especially from
                                   # high latency wireless devices. if your upload speed is 25 megabit then set to
                                   # around "107". Do NOT set too high as to avoid excessive buffer bloat. (default 50)
                                   # calculate: bandwidth divided by 8 bits times 1000 divided by the MTU times 0.05 seconds
                                   # ( ( (25/8) * 1000 ) / 1.448 ) * 0.05 = 107.90 packets in 50 milliseconds.

net.inet.tcp.tcbhashsize="1024"
kern.ipc.somaxconn="2048"

# On some systems HPET is almost 2 times faster than default ACPI-fast
# Useful on systems with lots of clock_gettime / gettimeofday calls
# See http://old.nabble.com/ACPI-fast-default-timecounter,-but-HPET-83--faster-td23248172.html
# After revision 222222 HPET became default: http://svnweb.freebsd.org/base?view=revision&revision=222222
#kern.timecounter.hardware="HPET"

# Tweaks hardware
#coretemp_load="YES" #intel
legal.intel_wpi.license_ack="1"
legal.intel_ipw.license_ack="1"
boot_multicons="YES"
boot_serial="YES"
comconsole_speed="9600"
console="comconsole,vidconsole"
hw.usb.no_pf="1"
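
As a side note on the net.link.ifqmaxlen comment above, the 50-millisecond sizing rule can be checked with a few lines of Python. This is only a sketch of the arithmetic in that comment. The function name and its default arguments are made up for illustration, and 1.448 is the per-packet size in kilobytes that the comment itself uses.

```python
def ifq_packets(upload_mbit: float,
                packet_kbytes: float = 1.448,
                window_s: float = 0.05) -> float:
    """Number of packets the link can transmit in window_s seconds.

    Hypothetical helper that mirrors the formula in the loader.conf comment:
    ( ( (bandwidth / 8) * 1000 ) / packet size ) * 0.05
    """
    kbytes_per_s = upload_mbit / 8 * 1000        # megabits/s -> kilobytes/s
    packets_per_s = kbytes_per_s / packet_kbytes
    return packets_per_s * window_s


if __name__ == "__main__":
    # 25 megabit upload, the example used in the comment
    print(f"{ifq_packets(25):.2f} packets in 50 ms")
```

For a 25 megabit upload this lands just under 108 packets, which agrees with the "around 107" figure in the comment.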

Tunables:

vfs.forcesync 	 				default (0) 
debug.pfftpproxy 				1 	
vfs.read_max 	 				default (32) 
net.inet.ip.portrange.first 	 	default (1024) 
net.inet.tcp.blackhole 	 		default (2) 	
net.inet.udp.blackhole 	 		default (1) 
net.inet.ip.random_id 	 		default (1) 
net.inet.tcp.drop_synfin 	 		default (1) 	
net.inet.ip.redirect 	 			default (1) 
net.inet6.ip6.redirect 			default (1) 	
net.inet6.ip6.use_tempaddr 	 	default (0) 	
net.inet6.ip6.prefer_tempaddr  	default (0) 	
net.inet.tcp.syncookies 		 	0 	
net.inet.tcp.recvspace 		 	65535 	
net.inet.tcp.sendspace 		 	65535 	
net.inet.ip.fastforwarding 		0 	
net.inet.tcp.delayed_ack 	 	default (0) 	
net.inet.udp.maxdgram 		 	default (57344) 	
net.link.bridge.pfil_onlyip 	 	default (0) 	
net.link.bridge.pfil_member 	 	default (1) 	
net.link.bridge.pfil_bridge 	 	default (0) 	
net.link.tap.user_open 	 		default (1) 	
kern.randompid 				default (347) 	
net.inet.ip.intr_queue_maxlen 	default (1000) 	
hw.syscons.kbd_reboot 		 	default (0) 	
net.inet.tcp.inflight.enable 	 	0 	
net.inet.tcp.log_debug 		 	default (0) 	
net.inet.icmp.icmplim 		 	default (0) 	
net.inet.tcp.tso 			 	default (1) 	
net.inet.udp.checksum 		 	default (1) 	
kern.ipc.maxsockbuf 			default (4262144) 	
net.inet.tcp.mssdflt 			 	1460 	
net.inet.tcp.msl 				9000 	
net.inet.ip.redirect 			 	0 	
net.inet.raw.maxdgram 			9216 	
net.inet.raw.recvspace 		 	9216 	
kern.ipc.somaxconn 			1024 	
dev.em.0.fc 					0 	
dev.em.1.fc 				 	0 	
net.inet.ip.check_interface 	 	1 	
net.inet.ip.process_options 		0 	
net.inet.icmp.drop_redirect 	 	1 	
net.inet.tcp.drop_synfin 		 	1 	
net.inet.tcp.fast_finwait2_recycle 	1 	
net.inet.tcp.icmp_may_rst 	 	0 	
net.inet.tcp.path_mtu_discovery 	0 	
net.inet.tcp.nolocaltimewait 		1 	
net.inet.tcp.rfc3042 				0 	
net.inet6.icmp6.nodeinfo 	 	0

phil.davis

A speedtest will purposely flood the device with traffic to see how fast the link will go. Your "ping" packets will always land at the end of a long queue at the router when trying to go out (upstream), so it will be a "big number" of milliseconds before they even escape the router.
To get around that you need to play with the Traffic Shaper and let it prioritize ACK packets, ping packets, little stuff. Then those packets can "jump the queue" and get out ahead of all those waiting speed test packets.
Start reading about traffic shaping…
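
To make the "jump the queue" idea concrete, here is a rough Python sketch comparing a plain first-in-first-out queue with one that sends small packets first. It is a toy model, not how any particular shaper is implemented: the 5 megabit upload, the 200-packet backlog, the 128-byte "small packet" cutoff, and all function names are assumptions chosen for illustration.

```python
def drain_order(queue, prioritize_small=False, small_limit=128):
    """Return the packets in the order they would leave the router."""
    if not prioritize_small:
        return list(queue)                       # plain FIFO
    small = [p for p in queue if p["size"] <= small_limit]
    bulk = [p for p in queue if p["size"] > small_limit]
    return small + bulk                          # small packets go first


def departure_ms(queue, name, link_mbit, **kwargs):
    """Milliseconds until the named packet has been fully sent."""
    sent_bits = 0
    for packet in drain_order(queue, **kwargs):
        sent_bits += packet["size"] * 8
        if packet["name"] == name:
            return sent_bits / (link_mbit * 1000)  # Mbit/s == 1000 bits per ms
    raise KeyError(name)


# Hypothetical backlog: 200 full-size speedtest packets, then one ping.
queue = [{"name": f"bulk{i}", "size": 1500} for i in range(200)]
queue.append({"name": "ping", "size": 84})

if __name__ == "__main__":
    print("FIFO:     %.1f ms" % departure_ms(queue, "ping", 5))
    print("Priority: %.1f ms" % departure_ms(queue, "ping", 5, prioritize_small=True))
```

With these made-up numbers the ping waits about 480 ms behind the backlog in the FIFO case and well under 1 ms when small packets are sent first.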

m4st3rc1p0

I think the question here is why there's a latency spike while transferring data, not how to shape the bandwidth. Even if you shape the traffic, a latency spike will still occur, since it depends on how much data is being transferred to the other station.

doktornotor

@m4st3rc1p0:

I think the question here is why there's a latency spike while transferring data

Uhm… because

@phil.davis:

A speedtest will purposely flood the device with traffic

?

Ryu.Hayabusa

Thanks for the info! I confirmed the same spike on my Ubiquiti edge lite. I guess I never looked at latency while speed testing before; the only reason I did this time was that I had added a lot of the tweaks seen in the first post.

Would traffic shaping be a help or a hindrance for online gaming (Xbox 360, PS3, PC, PS4 soon)? I have heard shaping can introduce some latency. I also have a few other devices (not mine) that are sometimes in use while I'm gaming, generally for YouTube, Netflix, or BitTorrent, though the link doesn't get saturated very often.

The router currently has 1GB DDR2 RAM and a single core AMD Athlon64 2000+ 1GHz.

If shaping would help, is there a guide that explains how everything works/what it means?

Also, regarding the timecounter, which should I be using, ACPI-safe or TSC?

Thanks again! :)

Edit: One more question, does anyone know why the interfaces stop working when polling is enabled?

Harvy66

Latency that high is because of "buffer bloat". TCP uses dropped packets to determine when a link is congested. Because buffers are too large on many residential connections, packets pile up in the buffer instead of being dropped, and the latency goes sky high because you have a long line of packets waiting to be processed. You can control the buffer bloat on your sending side by rate limiting the upload speed of your router/firewall to 80%-90% of your actual upload speed.

Unfortunately, you can't control buffer bloat on your receiving end. It's up to your ISP to properly size those buffers.
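
A quick way to see why buffer size matters: once a buffer is full, a new packet has to wait for everything ahead of it to drain at the link rate. The Python sketch below does that arithmetic and also computes the 80%-90% shaper limit mentioned above. The buffer sizes and the 5 megabit upload are invented for illustration, not measurements from this connection, and the function names are hypothetical.

```python
def queue_delay_ms(buffer_bytes: int, link_mbit: float) -> float:
    """Worst-case wait for a packet arriving behind a full buffer."""
    return buffer_bytes * 8 / (link_mbit * 1000)   # Mbit/s == 1000 bits per ms


def shaper_limit_mbit(measured_upload_mbit: float, fraction: float = 0.85) -> float:
    """Upload limit to configure so the queue builds in your own router."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return measured_upload_mbit * fraction


if __name__ == "__main__":
    upload = 5.0  # assumed upload speed in megabits per second
    for kib in (64, 256, 1024):
        delay = queue_delay_ms(kib * 1024, upload)
        print(f"{kib:5d} KiB buffer -> {delay:7.1f} ms of added latency when full")
    print(f"Shaper limit at 85%: {shaper_limit_mbit(upload):.2f} Mbit/s")
```

The delay grows in direct proportion to the buffer size, which is why an oversized buffer somewhere upstream shows up as a large ping time only while the link is saturated.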

Ryu.Hayabusa

Thanks for the buffer bloat explanation. Figures Comcast would have a huge buffer. :|

I did try the traffic shaping wizard, and it almost eliminated the upload latency spike while saturated. Download still hits about 200ms, which isn't all that bad I guess. I'll continue to research and tweak it.

The only thing left I can't figure out is why polling causes the network interfaces to stop working. I would think Intel Pro NICs would support that feature. Is there something in the kernel that isn't compiled in by default which is required for polling?