Network timeouts on individual transfers
-
I'm running R1Soft software that replicates backups to a second datacenter over a 1 Gbps line with a guaranteed bandwidth reservation. Exactly 30 minutes (plus 1-5 seconds) after a transfer starts, it stops. So if a backup starts at 11:00, it stops at 11:30:01. If a second backup starts at 11:10, it stops at 11:40:01. If a third, small backup starts at 11:30 and ends at 11:45, it completes. So the transfers do not all stop at the same moment, as they would if someone unplugged the network cable; each one stops 30 minutes after its own start.
When I have run tools like iperf, I have not found an issue, but I haven't run it for 30 minutes... I have also run the R1Soft FTP module instead of the HTTPS-based module, and there the backup completes with no issues. The same goes for Proxmox Backup Server: no issues, even with backups that run for hours. So I assume the R1Soft software handles connections poorly in its protocol, but the underlying issue must be something on my end. R1Soft doesn't know what the cause is.
4/17/21 8:46:14 AM Error Manager Could not read message from socket channel
4/17/21 8:46:14 AM Error Manager Connection reset unexpectedly.
Any idea how to troubleshoot this, or at least document the error somehow? I just moved to a different second datacenter and I see the same problem with no firewall on the remote end, just a brand new server, so that rules out anything on the remote end. At the main datacenter, which I have had all along, I run pfSense as the firewall. I have run SNMP monitors and iperf between the locations, but they don't record any issues; the link seems stable with no packet loss.
I wonder if it is somehow related to a flip-flop or high-availability setup (which is not active, and I only have one network port active on the backup server). It would be awesome to find a way to trace this.
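One way to document this independently of R1Soft might be a small long-lived TCP test that logs exactly when the connection dies. Below is a minimal sketch in Python, assuming you can run a listener on the remote server. The function names `run_sink` and `run_sender`, and the host and port in the usage note, are made up for illustration and are not part of R1Soft or pfSense.

```python
import socket
import threading
import time


def run_sink(listen_host, listen_port, ready=None):
    """Accept one connection and discard everything received (remote side)."""
    with socket.create_server((listen_host, listen_port)) as srv:
        if ready is not None:
            ready.set()  # signal that the listener is up
        conn, _ = srv.accept()
        with conn:
            while conn.recv(65536):
                pass  # throw the data away until the peer closes


def run_sender(host, port, max_seconds, chunk=b"x" * 1024, pause=0.5):
    """Send small chunks until the connection fails or max_seconds elapse.

    Returns (elapsed_seconds, error); error is None if the time limit
    was reached without a failure.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=10) as sock:
        while True:
            elapsed = time.monotonic() - start
            if elapsed >= max_seconds:
                return elapsed, None
            try:
                sock.sendall(chunk)
            except OSError as exc:
                # Log wall-clock time and connection age at the moment of failure.
                stamp = time.strftime("%H:%M:%S")
                print(f"{stamp} connection failed after {elapsed:.1f}s: {exc}")
                return elapsed, exc
            time.sleep(pause)
```

For example, run `run_sink("0.0.0.0", 5001)` on the remote server and `run_sender("remote.example", 5001, 40 * 60)` on the backup server (hypothetical host and port). If the sender also fails at roughly 1800 seconds, that would point away from R1Soft and toward something on the path.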
-
No advice on how to track this down?
Internal traffic does not time out; only connections leaving the pfSense firewall are stopped after 30 minutes of transmitting. I tried moving the backup server inside my own network and it worked without issues.
-
@fireix
Possibly some sort of state timeout. Basically, connections that are in use (transmitting packets) don't time out. The timeout counter starts after the last packet is transmitted.
However, I'm not familiar with your tool. Maybe it opens multiple connections to the other host, some of which sit idle while syncing. You can look up the details on state timeouts in the pfSense docs: https://docs.netgate.com/pfsense/en/latest/config/advanced-firewall-nat.html.
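To illustrate the idle-timeout idea, here is a toy model in Python. It is only a sketch of the general principle, not pfSense's actual implementation. The `StateEntry` class and `survives` helper are hypothetical, and the 1800-second value is picked only to match the 30 minutes reported in this thread, not taken from any pfSense default.

```python
from dataclasses import dataclass


@dataclass
class StateEntry:
    """Toy model of one firewall state entry (illustration only)."""
    idle_timeout: float        # seconds of silence allowed before expiry
    last_packet: float = 0.0   # time of the most recent packet

    def expired(self, now: float) -> bool:
        # The clock runs from the last packet, not from connection start.
        return now - self.last_packet > self.idle_timeout

    def packet(self, now: float) -> bool:
        """Register a packet at time `now`; False if the state was already gone."""
        if self.expired(now):
            return False
        self.last_packet = now
        return True


def survives(packet_times, idle_timeout):
    """Return True if every packet finds the state still alive."""
    state = StateEntry(idle_timeout=idle_timeout)
    return all(state.packet(t) for t in packet_times)


# A connection sending every 10 s for an hour never goes idle long enough.
busy = range(0, 3600, 10)
# A connection that goes quiet for 1890 s exceeds the assumed 1800 s timeout.
quiet = [0, 10, 1900]
```

With these inputs, `survives(busy, 1800)` is true and `survives(quiet, 1800)` is false, which is the pattern you would expect if one of the tool's connections sits idle while another carries the data.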
For troubleshooting, you can add a pass rule at the top of the rule set that allows access to the remote host, and set a high timeout in its advanced options.
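To check whether an idle connection really gets dropped, before and after adding such a rule, something like the following probe could help. It is a rough sketch that assumes you can run a small echo listener on the remote host. `run_echo`, `idle_probe`, and the host and port in the usage note are hypothetical names chosen for this example.

```python
import socket
import threading
import time


def run_echo(listen_host, listen_port, ready=None):
    """Accept one connection and echo back whatever arrives (remote side)."""
    with socket.create_server((listen_host, listen_port)) as srv:
        if ready is not None:
            ready.set()  # signal that the listener is up
        conn, _ = srv.accept()
        with conn:
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                conn.sendall(data)


def _recv_exact(sock, count):
    """Read exactly `count` bytes, or fewer if the peer closes first."""
    buf = b""
    while len(buf) < count:
        part = sock.recv(count - len(buf))
        if not part:
            break
        buf += part
    return buf


def idle_probe(host, port, idle_seconds, reply_timeout=10.0):
    """Return True if the connection still works after idle_seconds of silence."""
    with socket.create_connection((host, port), timeout=reply_timeout) as sock:
        sock.sendall(b"ping")
        if _recv_exact(sock, 4) != b"ping":
            return False
        time.sleep(idle_seconds)  # stay completely silent
        try:
            sock.sendall(b"ping")
            # A dropped state usually shows up as a reset or a reply timeout.
            return _recv_exact(sock, 4) == b"ping"
        except OSError:
            return False
```

Run `run_echo("0.0.0.0", 5002)` on the remote side, then try `idle_probe("remote.example", 5002, 35 * 60)` from behind the firewall. If that returns false while a shorter idle period returns true, an idle state timeout becomes a much more likely explanation.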