Site-to-site VPN bandwidth problem
-
My setup: I have two VMs running pfSense 2.0.3. Each has:
1 Intel(R) Xeon(R) CPU X5690 @ 3.47GHz
4 GB RAM
I set up an IPsec site-to-site VPN to do replication between our office and co-location. It's been working flawlessly for over a year. The replication traffic is about 80~120 GB a day.
Our ISP is the same at both places: COX Communications.
Co-location has 90 down / 70 up
Office has 50 down / 50 up
About 2-3 weeks ago COX did an "upgrade" at the co-location. Ever since then we get about 30~50 for about an hour, then it drops off to about 10~13 until the replication finishes. I have an RRD graph of it.
Now, I assumed the problem is whatever change they made in their "upgrade". But no, according to them the problem has to be with the way I set up pfSense. So the guy at the co-location wants to replace the pfSense firewalls with 2 SonicWALLs, something he is familiar with. :)
Here is what I have tried to debug it.
I set up a VM at the co-location to download a file across the VPN. The download had the same 10~13 bandwidth cap, so it's not a problem with the SAN-to-SAN software.
I connected to an FTP site from the co-location to the office over the Internet and got the normal 40~50 speed. Note the FTP server is on a different IP address than the firewall. Odd. Either the VPN is the problem or the different IP address changed something, but then why can the VPN get 30~50 for an hour?
I read the forums here. I restarted the IPsec VPN; that didn't help. I rebooted the firewall; that didn't help. I looked over the configuration and it looked OK to me, but I can post it if that would help. I looked at the IPsec logs and they look normal to me.
I talked with the guy at COX and he wanted to see a traceroute from both places. I ran them and the path looks different: it now goes through 10 new hops that weren't there the last time I looked.
The last thing I tried was traffic shaping to limit the upload speed to 30. It did cap the first hour at 30, but after that it still drops back to 10~13.
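For anyone wanting to repeat the traceroute comparison, here is a minimal Python sketch of how two hop lists could be compared to pick out the hops that appeared after a routing change. The function name `new_hops` and the addresses are hypothetical (documentation-range IPs, not the real paths from this thread).

```python
def new_hops(before, after):
    """Return hops that appear in `after` but not in `before`, in path order."""
    seen = set(before)
    return [hop for hop in after if hop not in seen]


# Hypothetical hop lists, e.g. copied by hand from two traceroute runs.
old_path = ["192.0.2.1", "198.51.100.7", "203.0.113.20"]
new_path = ["192.0.2.1", "198.51.100.9", "198.51.100.10", "203.0.113.20"]

print(new_hops(old_path, new_path))
```

Keeping the old traceroute output around makes this kind of before/after check possible when talking to the ISP.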
![status_rrd_graph_img.php11 - Copy.png](/public/imported_attachments/1/status_rrd_graph_img.php11 - Copy.png)
-
From what you are describing, this sounds suspiciously like traffic shaping / throttling on Cox's part. If you have not changed anything at all on the pfSense firewalls and the only change was the 'upgrade' in the data center (aka let us put in some equipment that lets us monitor / throttle your bandwidth), then your assumption that the problem is with the provider is correct. My guess is they are shaping the IPsec traffic, allowing full speed for ~1 hour and then backing the speed down to 10-13. Keep in mind that these providers are all getting ready for the coming "pay to play" model, where if you don't pay extra for bandwidth guarantees your traffic might suffer (see stories about Netflix and others in discussions to pay providers to give their services priority).
Did you FTP the traffic for over an hour to see if the speed changed? If you did, did you do it in the same direction (upload / download) as traffic through the tunnel? Did you run FTP through the tunnel for an hour and see if it dropped?
You may end up having to let them put in the SonicWALLs to "prove" that it is not the pfSense boxes, but the simple statement of "nothing changed other than your co-lo service upgrade" should be enough for them to get it through their thick heads that the problem is on their end. My suggestion would be to request a level 3 engineer and arrange a time for them to watch the traffic, so that when it drops they can check their systems and see why. I see this all too often: a vendor provides a line or equipment and leaves some default setting on, or a new usage policy is put in place and no one knows about it.
Check your service level guarantees. Do they limit the amount of time traffic can run at full speed? Do they state that the ISP must provide you full speed, or do you "share" bandwidth? Look at your terms of service and confirm there is nothing in there about limiting your speed in certain use cases. Check for both your office and the co-location facility.
Hope you can get this sorted out. I highly doubt it is the pfSense boxes, as I have pushed around 140 Mbps through a pair of very old PowerEdge 1950 units with 10-15 percent CPU usage. The newer units can go much higher and it barely touches the CPUs.
Good luck and let us know what you find out!
-
Did you FTP the traffic for over an hour to see if the speed changed?
No, it was just long enough to move a 4G ISO. I could try a longer test after hours if that would help. But I did run the FTP test while the replication traffic was running at the throttled 10~13 speed. The FTP transfer got the remaining 30~40 of the bandwidth.
If you did, did you do it in the same direction (upload / download) as traffic through the tunnel?
It was in the same direction as the replication traffic office -> co-location.
Did you run FTP through the tunnel for an hour and see if it dropped?
I have 2 tunnels, one for iSCSI and one for management. I used Windows Explorer to move a 4G ISO across the management tunnel to the co-location. Then I looked at the RRD graphs: while that ISO was moving, it took most of the bandwidth away from the iSCSI tunnel until the transfer was done.
Check your service level guarantees. Does it limit the amount of time traffic can run full speed?
I'll look into that.
Thanks for the info, nice to know I'm not crazy. :)
-
I would run the FTP test and see whether the bandwidth gets throttled for the FTP transfer after the same amount of active transfer time as it does for the IPsec transfer. Many, many ISPs are giving burst speeds for a period of time and then throttling for a period of time after the usage stops. The FTP test might prove this.
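If the long FTP run is logged as periodic throughput readings, the drop-off point can be picked out with something like the Python sketch below. This is only an illustration under assumptions: the function name `find_throttle_point`, the 50% drop threshold, the three-sample window, and the sample values are all hypothetical, not measurements from this thread.

```python
def find_throttle_point(samples, drop_ratio=0.5, sustain=3):
    """Find when throughput falls and stays below a fraction of its start rate.

    samples: list of (minutes_elapsed, rate) tuples in time order.
    The baseline is the mean of the first `sustain` samples. Returns the
    elapsed time of the first sample that begins a run of `sustain`
    consecutive readings below drop_ratio * baseline, or None.
    """
    if len(samples) <= sustain:
        return None
    baseline = sum(rate for _, rate in samples[:sustain]) / sustain
    threshold = baseline * drop_ratio
    for i in range(sustain, len(samples) - sustain + 1):
        window = samples[i:i + sustain]
        if all(rate < threshold for _, rate in window):
            return window[0][0]
    return None


# Hypothetical readings taken every 10 minutes during one long transfer.
samples = [(0, 45), (10, 44), (20, 46), (30, 45), (40, 44),
           (50, 45), (60, 12), (70, 11), (80, 12), (90, 13)]

print(find_throttle_point(samples))
```

Running the same check on an FTP transfer and on an IPsec transfer, and comparing the two results, would show whether both get cut at about the same elapsed time.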
-
I'll give that a try.
I e-mailed COX and the co-location guy and asked about the service level guarantees and about getting a level 3 engineer to take a look. They changed their tune about the problem being pfSense's fault. Now they say it's a problem on their end with the BGP setup on a COX router, which is routing the traffic over a pipe with a 20 limit.
-
Good to hear they are finally digging into it and appear to have identified the problem. I am not sure it explains the speed cut after 1 hour, though: if you only had a 20M pipe to begin with, you shouldn't have been getting the higher speed for the first hour before being cut down to 10-13. It still sounds like they are limiting the traffic. Update us once they get back to you. Sometimes it just takes asking the right questions and getting past the level 1 folks.
-
OK, it looks like the problem was with the COX router. We were getting routed to a level 3 network that was throttling our traffic. The new path still throttles our traffic, but only at ~20, which is enough to do our replication in about 8 to 10 hours, which meets our business requirement. It would be nice if we could get the full 50, but it's not as high a priority now.
That e-mail about the service level guarantees got the ball rolling again. Thanks for the help.
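As a rough check on the replication window, the arithmetic can be sketched in Python. This assumes the speeds quoted in this thread are in Mbit/s and that GB means 10^9 bytes (neither is stated explicitly above), and it ignores IPsec and protocol overhead. The helper name `transfer_hours` is hypothetical.

```python
def transfer_hours(gigabytes, mbps):
    """Hours needed to move `gigabytes` (10^9 bytes each) at `mbps` Mbit/s."""
    bits = gigabytes * 1e9 * 8
    seconds = bits / (mbps * 1e6)
    return seconds / 3600


# Daily replication volume from the thread (80~120 GB) at a few link rates.
for rate in (12, 20, 50):
    low = transfer_hours(80, rate)
    high = transfer_hours(120, rate)
    print(f"{rate} Mbit/s: {low:.1f} to {high:.1f} hours")
```

Under those assumptions, 80 GB at 20 Mbit/s works out to just under 9 hours, which is consistent with the 8 to 10 hours reported above.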
![status_rrd_graph_img.php.05-16-2014 - Copy.png](/public/imported_attachments/1/status_rrd_graph_img.php.05-16-2014 - Copy.png)