Routing issue - Static routes needed?
- 
 Well, yeah, I believe so. Regarding the bigger picture: we have a DMZ, a "LAN", and three high secure networks, so there's a part of the network diagram you don't see in the picture I posted:
 Internet --- | Firewall | --- DMZ --- | Firewall | --- LAN --- | High Secure Networks
 The DMZ hosts are all dual-homed (a kind of external and internal DMZ), as we didn't want any direct connection from the inside (LAN onwards) to the internet. Furthermore, we wanted the secure networks on a dedicated physical switch with another physical firewall in front of them. So that's how the topology came to be.
 The switch "homing" the secure networks uses VLANs to separate those three secure networks from each other. And as there need to be interconnections between the networks (as few as possible), the routing topic came up at some point. That's about the bigger picture, I guess. Everything runs perfectly as intended, except that some applications seem to have problems when connecting from the LAN to the secure networks (such connections are very few, mainly for security and system monitoring purposes).
- 
 In your scenario, you first need to think about the direction in which the majority of the traffic will flow; this determines the default routes that need to be installed at each location:
 High Secure Networks -> Firewall LAN interface (or none if you want it really secure)
 LAN -> Firewall LAN interface
 DMZ -> left hand firewall LAN interface
 Then you need to think about the exceptions to that rule. If it is a manageable number, you can put in static routes; for instance, on the DMZ network you could put in static routes for the LAN segment that send the traffic to the WAN interface of the right hand firewall.
 The same goes for machines on the LAN interface. I'm guessing there is another firewall between LAN and High Secure Networks, but it isn't shown?
 If it is too complicated to manage routes on many machines, then think of the DMZ network as a Y connection, with a Layer 3 switch sitting in the centre of it.
 That way all the systems use the switch as the default gateway, the switch uses the left hand side firewall LAN interface as its default gateway, and it has static routes for certain subnets to the right hand firewall WAN interface.
 The major difference between the switch and the firewall is that the switch is stateless: it doesn't care what state the connection is in, it just forwards or routes packets to the desired destination blindly. The firewall, on the other hand, is stateful, meaning that it needs to be aware of what is going on. So, for example, if a TCP data packet arrives without the firewall first having seen the TCP SYN / ACK sequence take place, it is going to drop the packet.
 The second major difference is that the switch is designed to work at wire speed; whether that is 100Mbps, 1Gbps, or 10Gbps, it can forward or route the traffic without delay.
 The firewall, on the other hand, has a lot of work to do to decide initially whether it is going to allow the packet to pass. This can severely limit the number of new connections per second it can handle, and the forwarding rate on pfSense, while good, can't come close to a switch.
 Lastly, you must remember that for every action there is an equal and opposite reaction. Well, almost: for every packet sent, there is usually a response. Your planning process must take into account both the packets being sent and the responses being received.
 Again, the switch doesn't care beyond simply forwarding or routing, but the firewall does. For example, when you make a DNS request on port UDP/53, the firewall implicitly knows to expect a response to that request from the IP you made the request to, and it will open the response port for a period of time in expectation of a response. Once it receives the response, the port is closed.
 This book is essential reading on the subject:
 http://www.amazon.com/Internetworking-TCP-Vol-1-Principles-Architecture/dp/0130183806/ref=sr_1_1?keywords=comer+and+stevens
 Here's one way to look at it:

                              Default GW = 10.1.1.2
                           Route 10.1.2.1 via 10.1.1.3
              +----------+        +----------+        +----------+                   +------+
INTERNET -----+ FIREWALL +--------+ LAYER 3  +--------+ FIREWALL +-------------------+  PC  |
          WAN |          | LAN .2 |  SWITCH  | .3 WAN |          | LAN .1        .55 |      |
              +----------+        | 10.1.1.1 |        +----------+    10.1.2.0/24    +------+
                                  +----+-----+
   Default GW = WAN_IP                 | .1           Default GW = 10.1.1.1    Default GW = 10.1.2.1
   Route 10.1.2.1 via 10.1.1.1         |
                                       | .10
                                  +----+-----+
                                  |    PC    |
                                  +----------+
                              Default GW = 10.1.1.1
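 To make the routing decision at the Layer 3 switch concrete, here is a rough Python sketch of a longest-prefix-match lookup. It is only an illustration, not switch or pfSense configuration. It reuses the addresses from the diagram above and assumes that "Route 10.1.2.1 via 10.1.1.3" is meant to cover the whole 10.1.2.0/24 subnet; the function and variable names are made up for the example.

```python
import ipaddress

def next_hop(routes, destination):
    """Return the next hop for destination using longest-prefix match."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # The most specific route (largest prefix length) wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]

# Routing table of the Layer 3 switch (10.1.1.1) as read from the diagram.
# Assumption: the static route is taken to cover all of 10.1.2.0/24.
SWITCH_ROUTES = [
    ("0.0.0.0/0", "10.1.1.2"),     # default gateway: left hand firewall LAN
    ("10.1.2.0/24", "10.1.1.3"),   # static route: right hand firewall WAN
    ("10.1.1.0/24", "direct"),     # directly connected segment
]

print(next_hop(SWITCH_ROUTES, "10.1.2.55"))  # 10.1.1.3
print(next_hop(SWITCH_ROUTES, "8.8.8.8"))    # 10.1.1.2
print(next_hop(SWITCH_ROUTES, "10.1.1.10"))  # direct
```

 Because every host only points at the switch, the exceptions live in one table instead of on each machine.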
- 
 Well, I should've mentioned that I planned the whole network topology thoroughly, and therefore already knew at planning time that I'd need static routes on all the DMZ hosts for possibly all the networks behind the "left hand side firewall" (the DMZ hosts are very few, so that is absolutely acceptable). On the other hand, I also considered a connection path of LAN Host --> LeftHandSideFirewall --> RightHandSideFirewall --> MGMT to be perfectly OK, as long as the "Bypass Firewall Rules" option was set. And it is, at least for 99% of my connections (never had any problems so far, everything working well).
 But that single 'SCP' connection that started this thread still does not make sense to me. I would understand if the firewall were to drop packets because of a state mismatch, but that's not the case, as the option is active and should allow the connections. So why does it close down? It can't be the operating system either, I think, because then it would not allow SSH traffic either. And the OS firewall has been disabled for now. So what else could it be? The application? I don't think the application would be looking at the IP level, so no. So what is it?
 I did some more testing and what I experienced is really, really strange:
- Transfer a Debian ISO image from the LAN host to the MGMT host via SCP, no static route: Works perfectly and fast!
- Transfer text files and excel sheets and other stuff from the LAN host to the MGMT host via SCP, no static route: Works perfectly and fast!
- Transfer an elasticsearch.deb file I downloaded from the LAN host to the MGMT host via SCP, no static route: Crashes!
- Transfer the same elasticsearch.deb file from the LAN host to the MGMT host via SCP, static route set: Works perfectly and fast.
 So I guess it's not only the asymmetric route that is causing the problem here. It's also somehow related to that file. Any idea how to get to the bottom of what's going on here?
 Regarding the setup, apart from the SCP problem I experience, I will go for the following 'solution' in the long term:
 I see that a layer 3 switch would solve the problem, as the "asymmetric part" of the routing would vanish. But in fact, I would have the layer 3 switch replace my LAN switch. That way, from the DMZ, everything would continue to work with static routes, but from the LAN, the layer 3 switch would do the routing part for the secure networks...

                DMZ Hosts             LAN Hosts
              (Dual Homed)             |  |  |
        +----+    +---+    +-----+   +-+--+--+-------+   +-----+   +-------------+
 -------+ FW +----+   +----+ FW1 +---+  Lay3 Switch  +---+ FW2 +---+ Secure Nets |
        +----+    +---+    +-----+   +---------------+   +-----+   +-------------+

 p.s. Thanks for the book link, I can only recommend that book as well, as well as the "TCP/IP Illustrated" series!
- 
 In your initial drawing, you appear to be running firewalls in high-availability mode using CARP in pairs A-C and B-D.
 Have you ruled out the following:
- Does the problem occur if you disable CARP temporarily? This is to rule out the firewalls switching roles during the transfer. Heavy load could lead to packet loss, which could lead to a state change.
- Are the default gateways / static routes all using the VIP address of the firewall pair and not a single firewall in particular?
- Does the size of the file have some influence on the issue? I should think not, since that elasticsearch.deb is only around 27MB.
 Some traffic capture is going to be required to find out where the problem lies:
- Run tcpdump on the host while you run the SCP, open the capture in Wireshark and have a look at what the last communication was; presumably a data packet that got no ACK, followed by retries that got no ACKs.
- Run tcpdump on the left hand firewall's LAN interface while running SCP. You should see the first packet, the firewall should respond with an ICMP redirect, and you shouldn't see any more traffic hitting it until the ICMP redirect times out. Perhaps the duration of the transfer has something to do with it? Once the ICMP redirect times out, the next packet is going to hit the left hand firewall. This will be a data packet, and it will be out of state. Technically, with the "Bypass firewall rules for traffic on the same interface" option checked, this should not cause an issue, and you should see an ICMP redirect.
- Run tcpdump on the WAN interface of the right hand side firewall while running SCP. I'd be looking for how the ICMP redirect packet is handled, and whether those last ACKs are actually being sent out.
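 If it helps, the two firewall captures described above could be started with command lines along these lines. This is a small Python helper that only builds and prints the tcpdump commands; the interface names (em0, em1), the IP addresses and the output file names are placeholders I made up, so substitute your own. The filter keeps the SSH/SCP session between the two hosts plus any ICMP involving the client, so redirects sent by the firewall are not filtered out.

```python
import shlex

def capture_command(interface, client_ip, server_ip, outfile):
    """Build a tcpdump command line for one capture point."""
    # SSH/SCP traffic between the two hosts, plus ICMP to or from the client
    # (ICMP redirects come from the firewall, not from the server).
    capture_filter = (
        f"(host {client_ip} and host {server_ip} and tcp port 22) "
        f"or (icmp and host {client_ip})"
    )
    args = ["tcpdump", "-n", "-i", interface, "-s", "0",
            "-w", outfile, capture_filter]
    return shlex.join(args)

# Hypothetical values: replace with your own addresses and interfaces.
CLIENT, SERVER = "192.168.10.20", "192.168.50.30"
CAPTURE_POINTS = [
    ("em1", "lhs_lan.pcap"),  # LAN interface of the left hand firewall
    ("em0", "rhs_wan.pcap"),  # WAN interface of the right hand firewall
]

commands = [capture_command(iface, CLIENT, SERVER, out)
            for iface, out in CAPTURE_POINTS]
for command in commands:
    print(command)
```

 The same filter expression can be reused on the host itself. Run the printed commands as root, start the SCP, and stop the captures once the transfer stalls.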
- 
- I have not disabled CARP temporarily yet; I could try that. But I checked the logs for any CARP failover entry and there was nothing.
- The default gateways and static routes are using the VIPs, yes.
- File size was my initial thought as well. But then I transferred the 240MB Debian ISO image, which did not cause any problems.
 I did another test with the ".deb" file today, without the static route. It always and consistently crashes when being transferred via SCP. When I 7z it first, it transfers... funny. Regarding disabling the CARP cluster: is it sufficient to disable it "temporarily" on the primary node? Or on both nodes, or...? I guess I will not have network downtime?
 To do thorough packet capturing I will have to wait until I find the time, I'm afraid :-/ It's not high priority, as everything seems to be working except that one stupid file.
- 
 Ok, my curiosity won in the end. I did a packet capture on both ends. I see ICMP redirects, but the client (Windows Server) ignores them and keeps sending all packets to the original left hand side gateway.
 After a while, there are lots of TCP retransmissions on the client side. On the server side I see some ACKs sent to the client that the client never received. So I did another packet capture:
- On the client
- On the left hand side firewall
- On the right hand side firewall
- On the server
 http://ianfe.dyndns.org/netdiag/scp_client.pcapng 
 http://ianfe.dyndns.org/netdiag/fwall_left.cap
 http://ianfe.dyndns.org/netdiag/fwall_right.cap
 http://ianfe.dyndns.org/netdiag/scp_server.pcap
 The left hand firewall trace looks pretty strange, but I'm not sure if this is just because Wireshark is confused about the asymmetric route ;-)
- 
 The LHS trace is what you'd expect if ICMP redirects aren't taking place. The LAN interface is bouncing all the traffic.
 Check System -> Advanced -> System Tunables and make sure net.inet.ip.redirect = 1 (default).
 A couple of things caught my eye:
- In the LHS capture, at packet 129 the LAN interface is no longer reflecting the traffic. Set logging on the firewall rule that is handling this (there should be a LAN->LAN rule in place) and see if you're getting a drop. The traffic no longer being reflected is why the copy is stopping. What's causing the traffic to suddenly drop on the LAN interface is another matter.
- In the client capture, the traffic pattern is very strange: there is a 10 second delay between packets 21 and 22, and another 5 second delay between packets 23 and 24; after that it starts running…
 Go into Status -> System Logs -> Settings.
 Check all 4 boxes in the Log Firewall Default Blocks section (default block, default pass, block bogon, block private).
 Re-run the copy and see if something pops up in the log.
- 
 Ok, I will try that tomorrow, have to leave now. 
 Regarding the strange delay:
- The first 5 seconds might be the time until the password prompt appears (I don't really know what happens there in the backend)
- The next 10 seconds of delay is me entering the password
 At least I guess that's it…
- 
 Actually, it just occurred to me that even though the LAN interface is set to forward all local traffic, there may be a question of how the word "local" is interpreted: the traffic is actually destined for a different subnet, so technically it isn't local.
 That being said, the firewall IS inspecting the traffic and IS making a state entry for it.
 The responses from the RHS firewall aren't going back to the LHS firewall, but are instead being sent directly to the client.
 Because of this, the LHS firewall eventually times out the state and stops forwarding the traffic.
 As to why it works with certain files but not others… that requires more investigation.
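 To spell out the hypothesis above, here is a toy Python model of a stateful firewall that only ever sees one direction of a connection. It is a deliberate simplification and not pf's actual state machine: the class, the 30 second "half-open" timeout and the one-packet-per-second traffic pattern are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class State:
    initiator: str
    created: float
    reply_seen: bool = False

class ToyStatefulFirewall:
    """Toy model: states that never see a reply expire after a timeout."""

    def __init__(self, half_open_timeout=30.0):  # hypothetical timeout value
        self.half_open_timeout = half_open_timeout
        self.states = {}

    def forward(self, now, src, dst, syn=False):
        """Return True if the packet is forwarded, False if it is dropped."""
        key = frozenset((src, dst))
        state = self.states.get(key)
        if state is None:
            if not syn:
                return False  # no state and not a new connection: drop
            self.states[key] = State(initiator=src, created=now)
            return True
        if src != state.initiator:
            state.reply_seen = True  # return traffic passed through us
            return True
        if not state.reply_seen and now - state.created > self.half_open_timeout:
            del self.states[key]  # half-open state expired
            return False
        return True

def simulate(replies_pass_through_firewall, seconds=60):
    """Client sends a SYN at t=0, then one data packet per second."""
    fw = ToyStatefulFirewall()
    results = [fw.forward(0, "client", "server", syn=True)]
    for t in range(1, seconds + 1):
        if replies_pass_through_firewall:
            fw.forward(t - 0.5, "server", "client")
        results.append(fw.forward(t, "client", "server"))
    return results

asymmetric = simulate(replies_pass_through_firewall=False)
symmetric = simulate(replies_pass_through_firewall=True)
print("asymmetric: first dropped packet at t =", asymmetric.index(False))  # 31
print("symmetric: all forwarded =", all(symmetric))                        # True
```

 In this model the asymmetric transfer stalls once the half-open state expires, while the symmetric one never does. It does not explain why some files go through and others don't, and whether pf really behaves like this with the bypass option set is exactly what is still open here.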
- 
 "That being said, the firewall IS inspecting the traffic and IS making a state entry for it."
 Well, but as there are rules on the LHS and RHS stating that this traffic is allowed, state entries should not matter for the direction client --> server, as the firewall rule explicitly allows that traffic anyway.
 Otherwise a protocol like RDP would not work either, would it? And I would see blocked traffic, but I don't see blocked traffic at all, only allowed traffic...
- 
 Your observations make sense, but the behavior seen in the capture seems to suggest that some sort of stateful inspection is taking place.
 Did you have a chance to look at the logs to see if anything popped up?
- 
 Well, yes. I first checked that IP redirects are sent; both firewalls would do that.
 I made sure that the logs are all active, and that looks OK. I still see no blocked traffic though, very strange. I will activate the layer 3 switch on the LAN in about 1.4 hours, and hopefully this will resolve all of that. Although it would've been interesting to know what the issue is…
