SG-1000 Very Poor Performance

mkernalcon

I manage a network for a small business, and part of that is supporting our remote worker (just one but soon to be two). Before I came to the company, they were on Sonicwall, and he had a second Sonicwall set up for IPSEC as his connection. Now he has an SG-1000 that I set up to connect back here over OpenVPN, and provide a connection for his laptop and phone (Avaya 9608 that connects to our local IP Office here). He has a 100/100 "Business" cable line, with a Technicolor DPC3848VE modem providing an address in 192.168.0.0/24 (I do not have access to the modem - connections to HTTP or HTTPS on the modem connect but give no data). I'm not totally sure how things are configured WANward of the modem - he has a static public IP, but the first hop after the modem on traceroute is in 10.0.0.0/8. Our end is multi-wan, we have a 100/100 fiber connection and a 25/5 fixed wireless, static IP on each - I'm making good use of policy-based routing and other features to optimize these connections. OpenVPN (there are three servers running, TCP/1194 and UDP/1194 for laptop connections, and another UDP port just for remote workers, i.e. just him for now) is set up on 127.0.0.1 with port forwards as recommended in the Multi-WAN guide.

Many days there is no issue - I can't verify for sure that it is performing as well as it could, but it is performing well enough to keep the user happy (which keeps me happy). Some days, the connection is seriously inconsistent and almost unusably slow. The kicker is that it's not just OpenVPN connections that go bad either - it seems to be the connection between the SG-1000 and the modem.

First evidence: I had to disable gateway alarms. I saw several alarms show up in logs with about 20% packet loss and 2000-4000ms ping. Of course, this gateway is the DHCP gateway, 192.168.0.1, the modem. Directly connected with about 3ft of ethernet cable that has not had issues prior to the SG-1000. These got more and more common until one day the alarm triggered several times an hour. That day I disabled gateway checking, reworked DNS to skip the SG-1000 entirely (turned off DNS resolver and forwarder and modified DHCP accordingly - this was a reaction to unbound using more CPU than I thought was warranted, and a few totally unrelated unbound issues on this end), and things got better (although still lower performance than expected).

Second evidence: The typical SLOW mode. This morning was a great example of this. I noticed in my phone system logs that his phone cut its connection and never retried. My connections over the VPN to his SG-1000 were very inconsistent: I could ssh in, but it would take many seconds to display all the text on the login screen (not like an old serial terminal where it's slow but consistent; it would do a few lines and stop), and I could get a command or two across before the connection died. Connection to webconfigurator was similar. However, this was not limited to the VPN - accessing regular internet via the laptop behind the SG-1000 was just as slow for him. He was able to disconnect the wired interface, and the laptop connected straight to his modem - a HUGE amount faster for him.

Third evidence: I got speedtest-cli on his SG-1000, and even after some more tweaks (disabling IPv6, disabling hardware checksum offload, trying both manual 100base-TX mode and autoselect), I don't see more than about 5/5 Mbps on his connection, where in real-world tests I expect at least 40/40.

With all this I expect to see errors, collisions, something strange on the interface, but there is nothing. There is some funky firewalling happening in the modem, as evidenced by ping -f tests, which consistently show about 7-8% loss to the modem and 80-90% loss past the modem. It's clearly throttling at least ICMP.

CPU usage is not crazy - according to top -aSH, I mostly see it hover around 98% idle, with dips here and there (php-fpm is a big one when I open up webconfigurator). Load average is higher, with values almost always above 0.4, typically around 0.8 and often poking above 1.0, although I'm not sure what is causing this.

I get that I'm not going to get huge performance out of this little box, but I do expect it to be able to route more than 5 Mbps consistently, and hopefully give me usable OpenVPN performance.

TL;DR: an SG-1000 that I administer (from many many miles away) is always giving mediocre performance at best, and often slows down to unusable. What further steps can I take to diagnose/fix this issue?

Let me know what config or command output you'd like to see.

mkernalcon

Hmm, I've potentially fixed it. Will monitor over the next few days to make sure it holds up.

I (of course) have Disable TSO set in Advanced->Networking. However, it seems that there is a default system tunable that sets net.inet.tcp.tso to 1, ignoring this value (and any value you may set in /boot/loader.conf.local as per the tuning guide). Set to 0 and verified using sysctl on the command line. I'm really hoping this kills the last of my issues with the pfsense install on this end too (on a massively overpowered 2x X5650 HP server that still acts like it's hitting bottlenecks sometimes).

Seems to me like this is not the proper intention, and that this should be taken out of future versions, or at least documented to keep folks like me from pulling their hair out too much :P
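For anyone wanting to repeat the check described above, here is a minimal sketch of reading the tunable from a script. It assumes the usual `name: value` format that `sysctl` prints on FreeBSD; the function names are made up for illustration, and `read_tso_from_system` only works when run on the box itself.

```python
import subprocess


def parse_sysctl_line(line):
    """Split one 'name: value' line of sysctl output into (name, value)."""
    name, sep, value = line.partition(":")
    if not sep:
        raise ValueError(f"not a sysctl line: {line!r}")
    return name.strip(), value.strip()


def tso_enabled(sysctl_output):
    """Return True if net.inet.tcp.tso is reported as non-zero."""
    for line in sysctl_output.splitlines():
        if not line.strip():
            continue
        name, value = parse_sysctl_line(line)
        if name == "net.inet.tcp.tso":
            return int(value) != 0
    raise KeyError("net.inet.tcp.tso not found in output")


def read_tso_from_system():
    """Run `sysctl net.inet.tcp.tso` locally and report whether TSO is on."""
    result = subprocess.run(
        ["sysctl", "net.inet.tcp.tso"],
        capture_output=True,
        text=True,
        check=True,
    )
    return tso_enabled(result.stdout)
```

Running this before and after a reboot would show whether the value set by hand actually sticks.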

TheNarc

@mkernalcon Looks like you can safely ignore the tunable:
https://forum.netgate.com/topic/106131/disable-hardware-tcp-segmentation-offload

mkernalcon

@thenarc ...well that's confusing. Where exactly can I verify TSO per-interface then?

And regardless of theory, in practice on this SG-1000, with Disable TSO checked, and the tunables at their default state, I get about 5x worse throughput on speedtest than if I change the value of that tunable (or use sysctl to do it). It's an instant difference too. Maybe it's something weird about the cpsw interfaces?

TheNarc

@mkernalcon That is certainly interesting. I have no experience with the SG-1000 so I can't really comment on it specifically. I just thought that other posting might be relevant if you were just wondering whether you needed to set that tunable. But if your testing has unequivocally shown that you get poor performance until you set it to 0, then I can't explain it in the context of that other post. It also claims that the tunable flips back to a 1 on a reboot, which will be a problem for you if that's the only workaround you have. Maybe you could use the shellcmd package to set it back to 0 after every reboot. Note that I don't know enough to say whether that's a good idea, and hopefully someone who knows more about this will stop by :)

mkernalcon

Alright, this one is still most definitely unsolved.

Today I ran all the same tests as yesterday, and it is now performing a bit better with TSO enabled. However, it does seem like that's not the whole story, like there's another, more important factor governing these slowdowns. Running tests to the same speedtest server gives VASTLY variable results even with no change: anywhere from less than 1 Mbit/s to over 20, with ping varying from about 80ms to over 2 seconds.

Where else can I look for this issue?

TheNarc

@mkernalcon When connecting from behind the SG-1000, is he double-NATed? Or is DHCP turned off on the SG-1000? I have a Technicolor modem/router as well that I've put in bridged mode. It's not the same modem as his, but hopefully his can be put in bridged mode as well. I don't know if that would be acceptable for his setup, since then everything would need to connect via the SG-1000, but it may be another test worth running. Note that, while generally speaking double-NATing isn't an ideal situation, I don't have any specific theory for how it could be causing the issues you're observing. It's just another thought for something to try.

mkernalcon

@thenarc Yes, his setup is a double NAT (modem gives 192.168.0.0/24, SG-1000 gives 192.168.28.0/24, and 192.168.0.0/24 shows up nowhere else in my entire setup).

And unfortunately I seem to not have access to the modem configuration - plus I don't want to tank his other connections because this box isn't routing right - oh and the modem is his only wireless access point. So no, I don't have the ability to de-NAT the WAN interface unfortunately. I have a rule to allow all ipv4 traffic with source 192.168.0.0/24 on WAN. I can't really imagine why the double-NAT would cause a performance issue, so hopefully it's unrelated to that.

moikerz

So you are using the wifi on the upstream router? That's hardly ideal. (Yes, it's an upstream router+modem, since it has wifi.) So if a wireless client decides to pull 50Mbps, then the SG1000 is going to have to deal with a drop in its WAN bandwidth. That situation could easily be causing you problems, and the solution is to get another WAP and disable the one on the upstream router. What else is running off that upstream router?

As for the 100Mbps connection, that's basically the effective limit of the SG1000. I wouldn't expect much more than that, especially if you're trying to run other services such as OpenVPN as well.

Lastly, the double-NAT also isn't an ideal scenario, especially for VoIP. For OpenVPN it should be OK though.

mkernalcon

Yes, he's using the upstream wifi, but pretty much when he's working (i.e. when he expects reasonable performance on the SG-1000 side), there is little to no usage of the upstream side. I'll talk to him about that, but I don't expect that's the problem.

Although I'd be extremely happy with 100 Mbps out of this thing, I don't need that (and I don't expect that - this is a big reason I don't want to put him entirely behind this router). However, I expect it to be able to route better than a few Mbps, and it is not doing that consistently.

Double-NAT is never ideal, I know that. If it helps, the phone is connected through the openvpn tunnel (it connects using H.323 to the IP Office that is sitting local here, using its local address which routes through the tunnel - the phone has no knowledge of the vpn, and doesn't hit the internet at all).

mkernalcon

Here's another symptom: bad latency to the modem (again, there is a direct connection here over about 4' of cat5e). I expect pings less than 1ms, but here are my statistics: round-trip min/avg/max/stddev = 1.145/2.539/53.022/3.831 ms. Frequent 10+ms pings in this.

Over the same time period, a ping to a host on the LAN side (through a gigabit switch) gives: round-trip min/avg/max/stddev = 0.229/0.412/9.816/0.888 ms - it was only one singular ping that ventured above 0.7, almost all were about 0.3

I'm beginning to think the modem is doing something very strange to cause these issues.
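To put the two ping summaries above side by side, here is a rough sketch that parses the `round-trip min/avg/max/stddev` line and compares the averages. The regex and the names `parse_ping_summary`, `modem`, and `lan` are illustrative only; the two input strings are copied from the output quoted above.

```python
import re

# Matches the summary line printed at the end of a BSD-style ping run.
PING_SUMMARY = re.compile(
    r"round-trip min/avg/max/stddev = "
    r"([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)


def parse_ping_summary(text):
    """Return the min/avg/max/stddev values (in ms) as a dict of floats."""
    match = PING_SUMMARY.search(text)
    if match is None:
        raise ValueError("no ping summary line found")
    keys = ("min", "avg", "max", "stddev")
    return dict(zip(keys, (float(v) for v in match.groups())))


modem = parse_ping_summary(
    "round-trip min/avg/max/stddev = 1.145/2.539/53.022/3.831 ms"
)
lan = parse_ping_summary(
    "round-trip min/avg/max/stddev = 0.229/0.412/9.816/0.888 ms"
)

# How many times slower the modem link is on average than the LAN host.
avg_ratio = modem["avg"] / lan["avg"]
print(f"modem avg is {avg_ratio:.1f}x the LAN avg")
```

By this measure the single-cable hop to the modem averages roughly six times the latency of a host reached through a switch.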

TheNarc

@mkernalcon I can't tell with 100% certainty, but I'm pretty sure the DPC3848VE is a Puma 6 modem. https://badmodems.com/Forum/app.php/badmodems Note the Cisco 3848V entry, and it seems that Cisco sold this model to Technicolor. This thread further suggests that the Technicolor 3848VE is also a Puma 6 modem. If so, it's hot garbage and he needs to get his ISP to replace it. He can run a test too: http://www.dslreports.com/tools/puma6

moikerz

I apologise for my earlier post - it came across very snarky! I understand you're working with what you've got so far. Definitely pushing everything through the SG1000 is the most-controlled - thus the most ideal - scenario.

Don't expect too much accuracy below 1ms timings. ~1 is fine. Don't sweat the small stuff, it's likely variances in timings and interrupts. It does sound like that upstream router/modem should [at minimum] be replaced.

mkernalcon

Thanks @TheNarc , I had a hunch that the modem was doing something nasty - I just never think about leased hardware being bad from design. Glad to know I wasn't barking up the wrong tree on this side. Advised to get a replacement from his provider or consider buying his own.

And yes, he has always had mild connection issues - he noticed occasional poor call quality on the phone even when it was SW->IPSEC->SW (behind the same modem) before I touched anything, but the sg-1000 seems to have much more trouble with it.

And @moikerz - no worries; you'd probably yell at me worse if I told you how many VLANs I am successfully running off of how small a LAGG on the main office LAN (and that's probably not even my worst sin on this setup). Turns out that the IT culture at a construction company doesn't ask for much and is willing to put up with a lot. Of course what they do ask for tends to be either trivially easy or impossible shy of some hacks. Try telling some of the hard-hat side of the company that they are behind too many layers of NAT and let me know how that goes :P

And the only reason I mentioned the pings is because I saw SEVERAL pings over 10ms across a single cable. I do recognize that ping is a very poor indicator of connection quality.