Netgate 7100 NAT/routing poor performance issue
-
@stephenw10 Akismet says my edit is spam :/ and doesn't allow me to edit. Are you able to edit it?
In fact the response from the LAN interface is correct:
# iperf3 -c 172.22.2.1
Connecting to host 172.22.2.1, port 5201
[  4] local 172.22.2.2 port 51774 connected to 172.22.2.1 port 5201
I must have made a pasting mistake. So no issue there anyway.
netstat -i also shows almost no drops or errors:
-
And here are those problematic iperf tests - from a local VM to an external host over the 7100:
# iperf3 -c 94.240.XX.19
Connecting to host 94.240.XX.19, port 5201
[  4] local 172.22.2.2 port 59614 connected to 94.240.XX.19 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  27.8 MBytes   233 Mbits/sec  125   41.0 KBytes
[  4]   1.00-2.00   sec  32.6 MBytes   273 Mbits/sec  289   79.2 KBytes
[  4]   2.00-3.00   sec  28.7 MBytes   241 Mbits/sec  269   18.4 KBytes
[  4]   3.00-4.00   sec  32.6 MBytes   273 Mbits/sec  249   29.7 KBytes
[  4]   4.00-5.00   sec  31.6 MBytes   265 Mbits/sec  234   43.8 KBytes
[  4]   5.00-6.00   sec  32.8 MBytes   275 Mbits/sec  224   50.9 KBytes
[  4]   6.00-7.00   sec  30.6 MBytes   257 Mbits/sec  280   41.0 KBytes
[  4]   7.00-8.00   sec  33.2 MBytes   278 Mbits/sec  213   31.1 KBytes
[  4]   8.00-9.00   sec  33.2 MBytes   278 Mbits/sec  230   69.3 KBytes
[  4]   9.00-10.00  sec  33.4 MBytes   280 Mbits/sec  254   42.4 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   316 MBytes   265 Mbits/sec  2367   sender
[  4]   0.00-10.00  sec   316 MBytes   265 Mbits/sec          receiver
And with -P10:

[root@jumbo ~]# iperf3 -c 94.240.XX.19 -P10
Connecting to host 94.240.XX.19, port 5201
[...]
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  37.7 MBytes  31.6 Mbits/sec  993    sender
[  4]   0.00-10.00  sec  37.5 MBytes  31.4 Mbits/sec         receiver
[  6]   0.00-10.00  sec  26.3 MBytes  22.1 Mbits/sec  746    sender
[  6]   0.00-10.00  sec  26.1 MBytes  21.9 Mbits/sec         receiver
[  8]   0.00-10.00  sec  28.5 MBytes  23.9 Mbits/sec  873    sender
[  8]   0.00-10.00  sec  28.2 MBytes  23.7 Mbits/sec         receiver
[ 10]   0.00-10.00  sec  45.3 MBytes  38.0 Mbits/sec  1097   sender
[ 10]   0.00-10.00  sec  45.0 MBytes  37.8 Mbits/sec         receiver
[ 12]   0.00-10.00  sec  21.0 MBytes  17.6 Mbits/sec  714    sender
[ 12]   0.00-10.00  sec  20.8 MBytes  17.4 Mbits/sec         receiver
[ 14]   0.00-10.00  sec  26.2 MBytes  22.0 Mbits/sec  820    sender
[ 14]   0.00-10.00  sec  25.9 MBytes  21.7 Mbits/sec         receiver
[ 16]   0.00-10.00  sec  24.4 MBytes  20.4 Mbits/sec  653    sender
[ 16]   0.00-10.00  sec  24.2 MBytes  20.3 Mbits/sec         receiver
[ 18]   0.00-10.00  sec  51.3 MBytes  43.0 Mbits/sec  1227   sender
[ 18]   0.00-10.00  sec  50.9 MBytes  42.7 Mbits/sec         receiver
[ 20]   0.00-10.00  sec  27.4 MBytes  23.0 Mbits/sec  819    sender
[ 20]   0.00-10.00  sec  27.2 MBytes  22.8 Mbits/sec         receiver
[ 22]   0.00-10.00  sec  26.6 MBytes  22.3 Mbits/sec  796    sender
[ 22]   0.00-10.00  sec  26.3 MBytes  22.0 Mbits/sec         receiver
[SUM]   0.00-10.00  sec   315 MBytes   264 Mbits/sec  8738   sender
[SUM]   0.00-10.00  sec   312 MBytes   262 Mbits/sec          receiver
Retries aren't that bad, but the throughput is very poor :(
-
Hmm, that number of retries is not great though.
Do you see packet loss to the IP if you run a ping whilst testing?
-
No, 0% packet loss during ping.

But I might have found what causes this problem... or at least where the problem is located. Our top switch is connected directly to a switch port of their Mikrotik router. And as soon as our ISP sets our downlink port to Off (cutting us off from the Internet as a result), iperf from our local VM to our external machine instantly goes up to 940 Mbps.

They say their configuration is good, but it doesn't seem so to me. I also see "Redirect host (New nexthop:)" messages (from the ISP gateway) when I ping our external host from an internal VM. I don't know why the pings go as far as the ISP gateway instead of staying within our switches.

Update: on second thought I believe it's rather a fault of our configuration. Maybe I misconfigured something on the 7100? After all, the problem is limited to our internal hosts that are NATed on pfSense. Any hints on what I should look into? What could make packets outgoing from my local network hit the ISP gateway/router before arriving at our external machine?
-
Oh, you could have a routing issue here if traffic to either target is forced via its gateway. pfSense has a rule to prevent that, though: traffic from the WAN to some other address in the WAN subnet bypasses the route-to rules that would otherwise force it. Your other hosts or routers may not.
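One way to check for that from a host on the segment is to capture ICMP redirects directly. A minimal sketch, assuming a Linux/FreeBSD host with tcpdump available; the interface name eth0 is a placeholder:

```shell
# ICMP type 5 is "Redirect". Any packets matching this filter mean a
# router on the local segment is telling this host to use a different
# next hop - i.e. traffic is taking a detour through that router.
tcpdump -ni eth0 'icmp[icmptype] == 5'
```

Run it while the iperf test is going; on a correctly routed segment it should stay silent.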
Do you see ICMP redirects anywhere? That would be a sure sign something is wrong.
-
@stephenw10 Yes, I see ICMP redirects on the connections that are causing problems here - when pinging from our internal host(s) - over the XG-7100 - to our external host:
[root@172.22.2.2 ~]# ping 94.240.XX.19
PING our.external.host (94.240.XX.19) 56(84) bytes of data.
From our.isp.gateway.ip (94.240.XX.254): icmp_seq=1 Redirect Host(New nexthop: our.external.host (94.240.XX.19))
64 bytes from our.external.host (94.240.XX.19): icmp_seq=1 ttl=63 time=6.25 ms
From our.isp.gateway.ip (94.240.XX.254): icmp_seq=2 Redirect Host(New nexthop: our.external.host (94.240.XX.19))
64 bytes from our.external.host (94.240.XX.19): icmp_seq=2 ttl=63 time=4.64 ms
From our.isp.gateway.ip (94.240.XX.254): icmp_seq=3 Redirect Host(New nexthop: our.external.host (94.240.XX.19))
64 bytes from our.external.host (94.240.XX.19): icmp_seq=3 ttl=63 time=2.54 ms
From our.isp.gateway.ip (94.240.XX.254): icmp_seq=4 Redirect Host(New nexthop: our.external.host (94.240.XX.19))
64 bytes from our.external.host (94.240.XX.19): icmp_seq=4 ttl=63 time=2.50 ms
^C
--- our.external.host ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 2.503/3.987/6.257/1.573 ms
The ISP has confirmed that all traffic initiated from our internal hosts (172.22.2.0/24) and targeted at our external host (94.240.XX.19) goes via their Mikrotik switch interface (the traffic is seen there). It should not do that, as the 94.240.XX.19 host is connected to a switch before the ISP router (please see the network map scheme in the first post).
I don't know what could cause that. We have a rather classic manual outbound NAT setup (including NATing 172.22.2.0/24 over WAN). We also have several Virtual IPs mapped to WAN, with 1:1 NAT for them. But it doesn't matter whether I ping from an internal host that is NATed to the WAN IP address or from one that has its own 1:1-mapped public address - the result is the same: ping redirects plus iperf at ~290 Mbps. What should I look into?
-
Is the WAN subnet size set correctly on the 7100?
Otherwise check the ruleset in /tmp/rules.debug. You should see a rule like:
pass out route-to ( lagg0.4090 94.240.XX.254 ) from 94.240.XX.1 to !94.240.XX.0/24 ridentifier 1000028911 keep state allow-opts label "let out anything from firewall host itself"
In other words it only applies route-to to traffic that isn't inside the WAN subnet. But that shouldn't apply to this traffic you're seeing.
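To spell out how that automatic rule behaves, here is an annotated pf-style sketch (the addresses and interface are the ones from this thread; the comments are illustrative, not from a live ruleset):

```
# Firewall-originated traffic to destinations OUTSIDE the WAN subnet is
# policy-routed out via the WAN gateway:
pass out route-to ( lagg0.4090 94.240.XX.254 ) from 94.240.XX.1 to !94.240.XX.0/24 keep state

# Traffic whose destination IS inside 94.240.XX.0/24 does not match
# (because of the "!" negation), so it simply follows the routing table
# and reaches on-link hosts directly instead of bouncing off the gateway.
```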
-
WAN seems to be set OK (94.240.XX.1/24, GW_WAN 94.240.XX.254).
All Virtual IPs are also /24 (i.e. 94.240.XX.2/24).

I've looked into /tmp/rules.debug and yes, I have such a rule:
[23.09.1-RELEASE][root@pfsense]/root: cat /tmp/rules.debug | grep "lagg0.4090 94.240.XX.254" | grep -v IPsec
GWGW_WAN = " route-to ( lagg0.4090 94.240.XX.254 ) "
GWfailover = " route-to { ( lagg0.4090 94.240.XX.254 ) } "
pass out log route-to ( lagg0.4090 94.240.XX.254 ) from 94.240.XX.1 to !94.240.XX.0/24 ridentifier 1000012111 keep state allow-opts label "let out anything from firewall host itself"
[... many other rules ...]
Looks ok, doesn't it?
Btw, I'm not sure if this is relevant here, but I also have these rules:
pass in quick on $LAN inet from $LAN__NETWORK to <negate_networks> ridentifier 10000001 keep state label "NEGATE_ROUTE: Negate policy routing for destination" label "id:1422071308" label "gw:failover"
pass in quick on $LAN $GWfailover inet from $LAN__NETWORK to any ridentifier 1422071308 keep state label "USER_RULE: Default LAN -> any" label "id:1422071308" label "gw:failover"
as a result of this last rule in the firewall rules for the LAN interface:
where failover is a gateway group of WAN and WAN2:
-
OMG! Was it that GW failover rule that was messing things up here?
As soon as I added yet another rule ABOVE that last failover rule, set up this way (with no specific gateway selected):
pings to our external host are no longer redirected, and iperf looks OK!
Was I doing the failover the wrong way here? Or is the failover OK, but because of our specific setup (a WAN network with real hosts in it) that additional rule is required? What is the recommended setup here?
-
Aha! Yes. The policy routing via the failover gateway applies route-to to the states regardless of what the outbound rules are doing.
So, yes, you need a more specific rule above that to bypass the policy routing for traffic to the WAN subnet. Which sounds like exactly what you added.
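For reference, a sketch of what such a rule pair can look like in rules.debug terms (the macros mirror the ones quoted earlier in this thread; the label texts are illustrative):

```
# More specific rule first: traffic for the on-link WAN subnet gets NO
# route-to, so it follows the routing table straight to the local host.
pass in quick on $LAN inet from $LAN__NETWORK to 94.240.XX.0/24 keep state label "USER_RULE: bypass policy routing for WAN subnet"

# Catch-all below it: everything else is policy-routed via the failover group.
pass in quick on $LAN $GWfailover inet from $LAN__NETWORK to any keep state label "USER_RULE: Default LAN -> any"
```

Since these are "quick" rules, pf stops at the first match, so the ordering of the two rules is what makes the bypass work.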
-
THANK YOU VERY MUCH for helping me analyze this weird issue, which finally led me to the solution! Your support and input were amazing! Thank you!