Solved: local<->local rsync/scp stalling at the same spot only when routed via pfSense
-
Solved – it's an asymmetric routing issue. See last posts.
There have been a few posts on this forum (and elsewhere) over recent years discussing scp stalling midway through a transfer, similarly for rsync, and ssh sessions dropped at random. The answers given, when they worked, usually had to do with MSS or MTU, sometimes ACKs. But they were all of the sort 'I changed this and then it worked' -- and I suspect the change just altered the packet timing enough to dodge a still-unresolved issue. Here's a pretty solid little test setup showing the still-existing problem:
UbuntuBox1 (amd64) on RFC1918 subnet A, one interface, connected to an unmanaged switch.
UbuntuBox2 (i386) with DNS entries on RFC1918 subnet A plus an IP alias on subnet B, one interface, connected to the same switch.
pfSense running as a VM on an Ubuntu server, with LAN on subnet A and a virtual IP on subnet B, same virtual and physical interface, plugged into the same switch.
On box 1, when transferring a box 1 directory full of certificates and keys (max size 8 KB, mostly less, maybe 50 files) to box 2, whether by scp or rsync:
When box 1 addresses box 2 on the 'A' subnet (no pfSense involvement), it works normally.
When box 1 addresses box 2 on the 'B' subnet (pfSense routing in the LAN interface and out the IP alias, same physical interface), it gets to 2112 KB transferred, in the same file in the list, then just stalls for about 15 minutes before giving up. It is 100% repeatable. This is on the 2.1.5 release, as that's the last one where the postfix package's libraries all install properly.
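For concreteness, the two invocations look roughly like this (hostnames and paths here are stand-ins for illustration, not the real ones):

# Addressed on subnet A -- same broadcast domain, no pfSense in the path; completes normally
rsync -av /etc/ssl/local/ box2-a.example.lan:/tmp/certs/
# Addressed on subnet B -- pfSense routes in the LAN interface and out the IP alias; stalls at ~2112 KB
rsync -av /etc/ssl/local/ box2-b.example.lan:/tmp/certs/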
The logs show blocked packets from box 1 (subnet A) to box 2 (subnet B): TCP flags A, then 2 seconds later TCP FPA, then FPA again at roughly 15-second intervals (three times), and finally TCP A.
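For anyone who wants to watch the stall on the wire rather than in the firewall log, something like the following on box 1 should show those retransmitted FPA segments (interface name and box 2's subnet-B address are assumptions):

# Watch the ssh/rsync session from box 1 to box 2's subnet-B address
sudo tcpdump -ni eth0 host 192.168.2.20 and tcp port 22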
There is a firewall rule to pass all IPv4 traffic either way between RFC1918 nets, which both test subnets fall within.
There are two other boxes like box 2, except amd64, that work normally on the same B subnet. Both are VMs: one on the same machine as pfSense, the other on a different machine (QEMU/KVM, Linux host). I changed the network card in box 2; same issue.
SSH connections get dropped every 10 minutes or so; reconnects are instant and the logs are silent.
Verified that 1500-byte pings work from box 1 to box 2's B address. Tried the pfSense options 'IP Do-Not-Fragment compatibility' and 'Disable Firewall Scrub'. All 'disable hardware offloading' boxes are checked. Tried device polling; no change.
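The usual form of that MTU check on Linux, for anyone reproducing this, is roughly (address is a stand-in):

# Full-MTU probe with DF set: 1472 bytes of payload + 28 bytes of IP/ICMP headers = 1500 bytes on the wire
ping -M do -s 1472 -c 4 192.168.2.20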
I can see posts relating to this problem going back to at least 2011 and through this year, with no real resolution other than for problems that wouldn't pass packet 1. You can google 'rsync stalls 2112' and 'scp stalls 2112kb' to see various shots at fixing it. Retrying, either immediately or after an interval, gives the same result.
The resolutions found online to date are along the lines of: 'we kicked it and tweaked it, and if you preserve partial transfers and keep retrying, eventually it will work.'
There was this suggested Linux sysctl magic:
net.core.rmem_default = 524288
net.core.rmem_max = 524288
net.core.wmem_default = 524288
net.core.wmem_max = 524288
net.ipv4.tcp_wmem = 4096 87380 524288
net.ipv4.tcp_rmem = 4096 87380 524288
net.ipv4.tcp_mem = 524288 524288 524288
net.ipv4.tcp_rfc1337 = 1
net.ipv4.ip_no_pmtu_disc = 0
net.ipv4.tcp_sack = 1
net.ipv4.tcp_fack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_ecn = 0
net.ipv4.route.flush = 1
Which, when saved and the box rebooted, changed ... nothing whatsoever.
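Aside, for anyone who wants to try those settings: they belong in /etc/sysctl.conf (or a file under /etc/sysctl.d/) and can also be applied without rebooting:

# Re-read /etc/sysctl.conf and apply the settings immediately
sudo sysctl -p
# Or set a single value at runtime to test it
sudo sysctl -w net.ipv4.tcp_sack=1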
My hunch is that it's a timing thing, as the i386 box is 'just slow enough' to hit whatever magic timing reveals the issue. I did try moving a single file bigger than 2112 KB, and it reliably stalls right there at 2112 KB. One person having this trouble reported 2118 KB.
The Ubuntu boxen are 14.04 LTS, Linux kernel 3.13.0-57-generic #95-Ubuntu SMP Fri Jun 19 09:27:48 UTC 2015 i686 i686 i686 GNU/Linux, and the amd64 version of the same. I need help here. Ideas?
-
P.S. No packet errors reported by the interfaces on either endpoint, nor on pfSense.
-
A completely 'clean install', with the systems plugged directly into the same unmanaged switch, still exhibits exactly the same problem: a 2112 KB 'hard stop' on rsync and ssh transfers. With all the thousands upon thousands of ssh and rsync users out there, and pfSense being the difference between the hang and normal operation, I do think I'm correct to bring this up here. It could well be an artifact of FreeBSD, since that codebase is found in a fair few routers.
Any ideas?
-
I rsync over ssh all the time through pfSense. Never had to do anything special and never had any problems.
-
I don't have time to dig into your setup too deeply, but I've been using rsync with one or both sides behind pfSense for years without issues.
I'm not doubting you are experiencing problems, but I don't think the problem is endemic to pfSense.
Doh! Beaten to the punch by Derelict…
-
Solved.
Box-subnet-A.foo.com sends a TCP packet to Box-subnet-B.foo.com. Box-subnet-B.foo.com cooks up an ACK and related traffic with a destination on subnet A. Little does anyone suspect that Box-subnet-B.foo.com is pulling double duty, standing in for Box-Broken-Subnet-A.foo.com, and so has a little-suspected interface on subnet A.
So, when it's time for B to answer A, it notices there's a direct path out its happenstance subnet-A interface, avoiding the router altogether. It sends the ACK and related traffic out the subnet-A interface with the subnet-A return address, which the original sender doesn't recognize at all and drops, and the transfer stalls.
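For anyone diagnosing something similar: on the double-homed box you can ask the kernel outright how it plans to reach the far end (the address is a stand-in):

# On box B: which route/interface will be used to answer box 1's subnet-A address?
ip route get 192.168.1.10
# A direct route out the subnet-A interface here, rather than 'via' the pfSense subnet-B address,
# means the replies are bypassing the firewall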
I Just Hate Asymmetric Routing Artifacts. I always seem to suspect them last, at the cost of a day.
Removed the subnet A interface on box B --- all good.
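On Ubuntu that amounts to deleting the extra address (interface and address below are examples) and removing the matching stanza from /etc/network/interfaces so it doesn't come back on boot:

# Drop box B's subnet-A address, leaving only its subnet-B address
sudo ip addr del 192.168.1.20/24 dev eth0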
-
Thanks to the two folks who posted here. I too have used rsync and ssh via pfSense for years without issues; that's what had me so stumped. That asymmetric routing can break things this way is, I think, a bug in routing in general.
I'm guessing that with the NDP thing, this issue would have been avoided if I were using IPv6. Any v6-capable folks agree or not? Does v6 avoid asymmetric routing issues?
P.S. The above is just the diagnosis. The easiest answer I could find was to add an outbound NAT rule on the LAN interface that takes all traffic coming from subnet A in on the LAN port and destined for the problematic double-homed machine(s), and NATs it to the LAN interface's address on the subnet B network. Works like a charm.
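Roughly, the rule the GUI builds corresponds to something like this in pf terms (interface name and addresses are examples only; pfSense generates the real ruleset from the GUI entry):

# Traffic from subnet A to the double-homed box gets the firewall's subnet-B address as its source,
# so the box's replies must come back through the firewall instead of going straight out its subnet-A interface
nat on em1 from 192.168.1.0/24 to 192.168.2.20 -> 192.168.2.1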