Need Help Resolving "Asymmetric Routing" Issue in a Network with pfSense and Netgear Managed Switch (GS724Tv4)
-
Hello everyone,
I'm currently facing a challenging issue with asymmetric routing in my network setup and would appreciate any insights or suggestions you might have. Here's a brief overview of my network configuration:
Internet Connection: DSL connected to a FritzBox router (IP: 192.168.10.1) using PPPoE.
Firewall: pfSense firewall, running on a Proxmox server with 4 NICs. NIC1 is the Proxmox management port connected to VLAN 100 (switch). NIC2 and NIC3 are bridged in pfSense, serving as WAN and LAN, and connected to a trunk port on the switch (g24). NIC4 is bridged to pfSense and connected to VLAN 100 (switch) to manage pfSense. The pfSense WAN interface has the IP 192.168.10.2 and is connected to the FritzBox. VLAN 100 and VLAN 10 are both bridged to NIC3 in pfSense.
Managed Switch: Netgear Managed Switch handling two VLANs - Trusted (VLAN 10) and Management (VLAN 100). The switch is connected to pfSense firewall via a trunk port. Other ports on the switch are untagged, connecting devices belonging to the respective VLANs.
Network Layout: VLAN 100 is configured with the IP range 10.76.100.0/24, and the pfSense interface for this VLAN is 10.76.100.1. VLAN 10 has the IP range 10.76.28.0/24. My pfSense is set up to handle all inter-VLAN routing, while the Netgear switch is responsible for VLAN creation and port assignments.
Issue:
I'm experiencing asymmetric routing issues, particularly in VLAN 100. Devices in VLAN 10 trying to communicate with those in VLAN 100 often encounter dropped packets or connection instability, which I suspect is due to asymmetric routing. This is especially noticeable when connecting from a VLAN 10 device to Proxmox (via its VLAN 100 IP address) in a browser and then using novnc to access one of the VMs. It constantly gives me a Code 1006 Disconnect Error 5-10 seconds after I start using the VM.
This is the firewall log:
Troubleshooting Steps Taken:
Ensured all VLAN configurations are correct on both the pfSense and Netgear switch.
Simplified pfSense firewall rules to a bare minimum for troubleshooting purposes. Currently, there are no specific rules in the WAN interface, and all traffic is allowed to pass freely in both VLAN 10 and VLAN 100.
Checked firewall rules in pfSense to ensure they aren't inadvertently blocking or rerouting traffic.
Checked https://docs.netgate.com/pfsense/en/latest/routing/static.html. Did not work.
Specific Questions:
How can I ensure that all return traffic in VLAN 100 (or other VLANs) consistently routes through the pfSense firewall?
Are there specific settings on the Netgear Managed Switch that I should look at which might be causing or contributing to this asymmetric routing issue?
Could this issue be related to the way pfSense handles VLANs or firewall rules that I might have overlooked?
Any advice or diagnostic tips you could offer would be greatly appreciated. Thank you in advance for your time and assistance!
Best regards,
-
@oliverus000 you sure it's not just a state that expired or was deleted?
Now if those were SA, yeah it would scream asymmetrical flow, but just A being blocked could just be a state is gone... Few different reasons why that could happen.. PA is just an Ack with the PSH flag set..
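As a rough illustration of the rule of thumb above, the sketch below maps the TCP flags of a blocked log entry to the likely cause. The function name and the wording of the results are made up for this example, and it assumes the short flag notation used in the firewall log (S, A, P, F, R); it is a reading aid, not something pfSense provides.

```python
def interpret_blocked_flags(flags: str) -> str:
    """Rough reading of the TCP flags on a blocked packet.

    `flags` uses the short log notation: S=SYN, A=ACK, P=PSH, F=FIN, R=RST.
    """
    seen = set(flags.upper())
    if seen == {"S", "A"}:
        # The firewall never saw the SYN that this SYN-ACK answers.
        return "SYN-ACK without state: suggests asymmetric flow"
    if seen == {"S"}:
        return "plain SYN: blocked by a rule, not a state problem"
    if "A" in seen:
        # A, PA, FA, RA: traffic of a connection that already existed.
        return "ACK on an existing connection: state expired or was removed"
    return "other: needs a closer look"


for flags in ("SA", "A", "PA", "FA"):
    print(flags, "->", interpret_blocked_flags(flags))
```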
-
@johnpoz Here is an extract of the firewall log from one session to reproduce the error (only blocks displayed):
My client x.28.20 (in VLAN10) connects via browser to Proxmox via an address in VLAN100 (x.100.221) and then I open a novnc session. After 5-10 seconds of being able to interact with the VM's bash it suddenly disconnects (with code 1006).
As I said, I am only assuming something asymmetric, but as of now I can't prove it (except for the logs from the firewall).
There is no other obvious log so I am really stuck at the moment...
-
@oliverus000
i've never used netgear switches but something in the screenshot has me wondering.
i'm assuming g24 is your uplink (vlan trunk to pfsense)?
- your pvid is set to vlan1 but that port membership doesn't include vlan1 - it shouldn't really matter ... but who knows
- how is the relationship between "vlan member" and "vlan tag" on netgear hardware? are you certain the switch isn't sending untagged frames towards pfsense? (could try to change "acceptable frame types")
- what's the reason for the port priority settings?
I've seen some buggy switches (or firmware) in the past .... so it wouldn't surprise me if something wonky is happening with tagged/untagged frames.
Also re-verify your proxmox configuration. If your virtualswitches somehow strip/change vlan tags in either direction then that could also cause a heap of trouble
-
@heper
Great points. Let me try to tackle them one by one:
- Yes, g24 is the trunk and is connected to pfsense on port vtnet2 (as in the screenshot above). No other devices are connected to any other port except the untagged ones.
- g24 has PVID 1 because I just didn't touch it and it was the default (netgear has VLAN1 as default management). Any ideas what it should be? I will try it out.
- Relationship of vlan member and vlan tag: the device I am running the browser on is connected to g4 on the switch and sends untagged frames, which should get tagged once they enter the switch and then be forwarded to g24 with the tag attached. I've changed the config for g24 to Acceptable Frame Types "VLAN Only", since this port SHOULD only handle VLAN traffic and that can only improve the setup. But the error unfortunately persists.
- Port priority was just a desperate attempt to improve the flow (I thought it might help, but it didn't). I have set every priority back to 0. But the error still persists.
- Firmware of the netgear has been updated to its newest version.
- Proxmox config: all ports are VLAN aware except the one connected to the fritzbox router.
I am becoming more and more desperate
Attached a log from wireshark during the failure of such a session:
-
@oliverus000 still not seeing any SA blocks.. I see a FA there, which could close the states..
If your flow was asymmetrical, your SA (syn,ack) would be seen on pfsense on the wrong interface for where the state was created.. Not saying you don't have asymmetrical flow.. But I do not see a smoking gun that shows that..
If your flow was truly asymmetrical you wouldn't be able to actually make a connection, because the syn,ack wouldn't be allowed because of lack of state.
It's quite possible your states are just being reset.. I believe out of the box pfsense will reset states on a wan IP change, or loss of wan, etc.
"NIC2 and NIC3 are bridged in pfSense"
You have a bridge setup in pfsense? Why? Or do you have a "bridge" in your VM software to the physical nic on the box?
-
@johnpoz
I really wanted to dig deeper on the states situation and tried to look at the state in pfsense but when I look for the state that has my destination 10.76.100.221 included the result shows no matches (while I had my browser running the usecase described above):
Why is there no State at all? Am I missing out on something? A state should be there until it really gets reset, right?
To the bridge topic: pfsense runs as a VM on proxmox and therefore I need to bridge the NICs to make them available to the VM.
-
@oliverus000 if you see no states for an IP that is part of a active conversation that is working, then that screams the traffic is not going through pfsense at all.. Not possible for traffic to be flowing through pfsense without a state..
So either the state just got flushed and you haven't noticed that conversation isn't working - or your traffic is not flowing through pfsense like you want it to..
If I was to guess, your issue would be something related to your VM and bridging setup where you don't actually have stuff isolated like you think you do.
-
@johnpoz
I tested it intensively now and I could reproduce the behaviour: when I open a pure novnc shell window with nothing else in the background, it creates a couple of states (different ports to the same IP, don't ask me why proxmox or novnc does that).
Then suddenly, when something bad happens, ALL of them are gone and the table is empty.
Is there any chance to find out why all of a sudden all states are getting flushed, even the closed ones??
-
@oliverus000 is your wan IP changing?
There is this setting..
Under advanced / misc
Then there is this setting on specific gateway under routing
I would look in your logs - do you see anything around the time you see the states go away?
But yeah if your states go away, that would for sure explain why you're seeing the blocks in your firewall.
But also keep in mind, closed states will drop off the states list after the specific timing..
-
@johnpoz
These two settings are set to do not kill states...
To the logs:
I cleared all logs before running my test. Performed the test and checked ALL logs:
System/General: nothing
System/Gateway: nothing
System/Routing: nothing
System/GUI Service: nothing special
Firewall: the same as already posted, a lot of blocks
DHCP: nothing special
Rest is empty.
:-(
-
@oliverus000
If you are running L3 switching then look at your gateways.
If you are not running L3 switching then it is not asymmetric routing as routing is layer 3.
By the way I am doing asymmetrical routing and it works on my current setup. I use Cisco for my L3 switching.
-
@coxhaus
Yes, my switch is running on L3. Can you elaborate a little bit more on what you mean by gateways? As of now I have not specifically assigned any gateway information on the switch itself. I have only created the VLANs and the port assignments (PVID and Untagged/Tagged) on the switch itself. I have not touched any routing configuration on the switch since this should be handled via pfsense. The connected clients get the gateway info from DHCP, which tells all the clients to use the specified VLAN gateway x.x.100.1 and x.x.28.1.
@johnpoz I have checked the proxmox network and bridge config and I am clueless what can be improved :-(
-
@coxhaus said in Need Help Resolving "Asymmetric Routing" Issue in a Network with pfSense and Netgear Managed Switch (GS724Tv4):
By the way I am doing asymmetrical routing and it works on my current setup
Why would anyone do asymmetrical routing on purpose? Please explain..
I have done it when there is no other way, you can do host routing to work around it.. But why would anyone design a network to be asymmetrical? My answer to that would be you're doing it wrong.
My switch is in L3 mode, but I am currently not routing anything on the switch, but I could if I wanted to - but routing on the switch does not mean you're doing asymmetrical routing.. You would use a transit network
A switch with a trunkport and then ports in access mode doesn't say asymmetrical routing - Do you have svi set on the switch for these vlans and then pointing to them as gateways on the devices in these vlans vs the IPs on pfsense in those vlans?
-
@johnpoz said in Need Help Resolving "Asymmetric Routing" Issue in a Network with pfSense and Netgear Managed Switch (GS724Tv4):
A switch with a trunkport and then ports in access mode doesn't say asymmetrical routing - Do you have svi set on the switch for these vlans and then pointing to them as gateways on the devices in these vlans vs the IPs on pfsense in those vlans?
If this question was for me, here is what I have set up (no routing for VLANs at all)
Routing in pfsense defined as follows:
Same for the other vlan10.
Firewall for 10 and 100, Pass all traffic:
DHCPs for 10 and 100:
-
@oliverus000 So you have no other IPs on the switch, other than its management IP, and you're not pointing the gateway on any clients to these IPs on the switch..
If you're not doing anything like that, then you wouldn't have asymmetrical routing.. You say the states just go away? That would be problematic..
Asymmetrical routing on a firewall causes issues when return traffic hits the firewall, but there is no state to allow the traffic..
This is a typical scenario where you would have asymmetrical traffic...
A client sends its syn to the destination via some other router that has a connection to it.. But the destination sends the syn,ack back via a different router.. When that router is a firewall as well, it never saw the syn, so it has no state to allow the return syn,ack - and would block this traffic.
Normally you would see this..
So when you send a syn and the firewall allows it, it creates a state.. And sends the traffic on.. The syn,ack back is allowed by the state.. Now you have traffic flowing in both directions, just normal acks.. if the state goes away.. Traffic in either direction would be blocked.
Until a new state is created via a syn..
If in your blocks on your firewall you were seeing SA blocked, that would scream there is an asymmetrical flow that the firewall is not going to allow.. When you see just acks blocked, this points to just a removal of a state..
Either they just timed out because there was no traffic keeping them open, or they were deleted/killed. If devices are talking to each other and there is no traffic being sent, the state will timeout and close.. Now if one of the clients says hey I wasn't done talking here is some data and sends an ack, that ack will be blocked because there is no state.. Doesn't matter which end is sending the ack..
edit: once a handshake has been completed, ie the syn / syn,ack / ack - now all traffic between these devices will have the ack flag on them..
If there is no existing state - this traffic will be blocked in either direction.. Just seeing blocks for Acks - where a connection was working before points to a loss of state.. You can see this with phones or wifi devices quite often where they will say wake up out of standby or something and try to continue a conversation they were using before.. But by this time the state has expired on the firewall, and is blocked..
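The behaviour described above can be sketched with a toy state table. This is only a minimal model written for illustration: the class `ToyFirewall`, its methods, and the assumption that the rules allow every SYN are all invented here, and real pf tracks far more (ports, sequence numbers, timeouts).

```python
from dataclasses import dataclass, field


@dataclass
class ToyFirewall:
    """Toy stateful filter: a SYN creates a state, everything else needs one."""
    states: set = field(default_factory=set)

    @staticmethod
    def _key(src: str, dst: str) -> frozenset:
        # One state entry covers both directions of a conversation.
        return frozenset((src, dst))

    def handle(self, src: str, dst: str, flags: str) -> str:
        key = self._key(src, dst)
        if flags == "S":
            # Assumption for this sketch: the rules allow every new connection.
            self.states.add(key)
            return "pass"
        return "pass" if key in self.states else "block"

    def flush(self) -> None:
        """Drop every state, as if the table had been cleared."""
        self.states.clear()


fw = ToyFirewall()
client, server = "10.76.28.20", "10.76.100.221"
print(fw.handle(client, server, "S"))    # pass, state created
print(fw.handle(server, client, "SA"))   # pass, matches the state
print(fw.handle(client, server, "A"))    # pass
fw.flush()
print(fw.handle(client, server, "PA"))   # block, no state left
print(fw.handle(server, client, "A"))    # block, either direction
print(fw.handle(client, server, "S"))    # pass, a new SYN opens a new state
```

A SYN-ACK arriving at a firewall that never saw the SYN takes the same "block" branch, which is the asymmetric case.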
edit
you can use pftop to see age of states, etc.. when they will expire, etc.. You can filter this for specific IPs, etc.
-
@johnpoz said in Need Help Resolving "Asymmetric Routing" Issue in a Network with pfSense and Netgear Managed Switch (GS724Tv4):
You say the states just go away? That would be problematic..
I would love to record it, and it's very weird behaviour. Let me describe EXACTLY what happens:
- I wait until there are no states available any more for any connection to the server x.x.100.221 on pfsense
- I refresh the window with a connection to x.x.100.221 which has a shell opened to the server via novnc.
- I have around 20 new states on different ports:
- I type stuff in my shell (really just interacting with the server, nothing fancy, just typing text or even not doing anything, just looking at the shell) and then out of nowhere BAM:
- ALL STATES ARE GONE just at the time when I got kicked out of my connection to the shell. ALL OF THEM
Now what comes to mind are two things:
- Is my pfsense detecting something and then flushing all the states, and that is what really disconnects me? (pfsense is the enemy)
- Is my connection breaking somewhere because something is bad, and that leads to the flush of all states? (proxmox is doing some stupid novnc stuff that pfsense does not like)
The reason why I can't let it go is that my IT head does not like the fact that this could also happen to any other connection from VLAN10 to VLAN100 (not only me using a novnc shell)
WHY is pfsense flushing all states without telling me the reason? I can't imagine this is happening because they all expired at the same time, especially when I have a window open connecting to the shell via novnc?
What I did now is run hping3 -S 10.76.100.221 -p 80 -c 1000 from a client in 10.76.28.x, which should send 1000 TCP SYN packets to port 80.
I have a packet loss of 10%
Is this related???
-
@oliverus000 said in Need Help Resolving "Asymmetric Routing" Issue in a Network with pfSense and Netgear Managed Switch (GS724Tv4):
I have a packet loss of 10%
I wouldn't expect there to be any packet loss on something you're just talking to locally - 10% is quite a lot.. Does it come in a bunch, ie see a bunch of loss and then it's all back to normal - or is it a packet here, packet there out of 1000 for example.. That adds up to 10%
How are you determining that you have 10% packet loss? (edit: oh I see) Is that in clumps all together now and then or just random here or there..
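To tell clumped loss from scattered loss, one option is to record per probe whether a reply came back and look at the longest run of consecutive misses. The sketch below assumes such a list of booleans has already been collected by some means; the function name `loss_profile` and the two sample patterns are made up for illustration and are not measurements from this thread.

```python
def loss_profile(received):
    """Summarise probe results given as booleans in send order."""
    lost = received.count(False)
    longest = run = 0
    for ok in received:
        run = 0 if ok else run + 1      # length of the current run of losses
        longest = max(longest, run)
    return {"loss_pct": 100 * lost / len(received), "longest_burst": longest}


# Two synthetic patterns, both with 10% loss out of 1000 probes.
scattered = [i % 10 != 0 for i in range(1000)]        # one miss every 10 probes
clumped = [not (400 <= i < 500) for i in range(1000)]  # one outage of 100 probes

print(loss_profile(scattered))  # {'loss_pct': 10.0, 'longest_burst': 1}
print(loss_profile(clumped))    # {'loss_pct': 10.0, 'longest_burst': 100}
```

The same percentage can hide very different behaviour: a long burst hints at an outage or a flush, while single scattered misses look more like a lossy link such as wifi.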
If all of the states you see are in closing or closed - then yeah I would expect them to all go away at like the same time.. But if you're saying you're losing all states, even active ones - that points to something flushing the state table..
But if you're sending data, and getting an answer the state should be active - unless you are not flowing traffic through pfsense??
Those states you show - don't show any response they are all just one sided.. 8/0 etc... that is not what a normal active conversation would look like..
ESTABLISHED:ESTABLISHED
And you should see packets on both sides of the / like
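The check for one-sided states can be written down as a tiny helper. It assumes the packet counters are available as a plain text pair such as "8/0" (one direction / the other), as quoted above; the function name `is_one_sided` is invented here, and the exact formatting pfSense uses for large counters is not handled.

```python
def is_one_sided(packets: str) -> bool:
    """True when exactly one side of an 'a/b' packet counter pair is zero."""
    left, right = (int(part.strip()) for part in packets.split("/"))
    return (left == 0) != (right == 0)


for sample in ("8/0", "8 / 12", "0/5"):
    print(sample, "one-sided" if is_one_sided(sample) else "two-way")
```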
-
@johnpoz
I changed from wifi to a cable and packet loss dropped to almost 0%. So it is most probably not really connected to my issue.
BUT your comment most probably leads to something.... You are absolutely right. There are only one-sided states and it never shows "established" when I am connected with a browser to my server... WHAT could this mean???
I only see something like this but this looks also very one sided:
-
@oliverus000 and your answer is not going back through pfsense.
So in my above example if client A talking to B sends its syn through pfsense it will open a state if the firewall rules allow the traffic. But if the answers do not flow back through pfsense then there would never be an established connection.. And even if you continue to send traffic from A through pfsense.. At some point this state will close, and now traffic from A to B would be blocked..
So this points to asymmetrical flow - but in the other direction.. So you could have something like this..
pfsense will open the state and send your traffic on - but since it never sees any return traffic.. At some point these states will expire.. And now your sender sending traffic will be blocked until he sends a new syn to open up a new state.
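One way to reason about the cases above is to write down the hops each direction takes and check whether the firewall appears in both. The sketch below is only a thought aid: the function name `firewall_view` and the hop names are hypothetical, and nothing here discovers real paths.

```python
def firewall_view(forward_hops, return_hops, firewall="pfsense"):
    """Describe what a stateful firewall sees for a given pair of paths."""
    sees_forward = firewall in forward_hops
    sees_return = firewall in return_hops
    if sees_forward and sees_return:
        return "symmetric: state can reach ESTABLISHED"
    if sees_forward:
        return "one-sided: SYN creates a state, replies bypass the firewall"
    if sees_return:
        return "one-sided: replies arrive with no matching state"
    return "firewall not in path"


# Hypothetical hop lists for a client in one VLAN and a server in another.
print(firewall_view(["switch", "pfsense", "switch"], ["switch", "pfsense", "switch"]))
print(firewall_view(["switch", "pfsense", "switch"], ["switch"]))
print(firewall_view(["switch"], ["switch", "pfsense", "switch"]))
```

The second call matches the one-sided states discussed here; the third matches the blocked SYN-ACK case from earlier in the thread.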
These are some examples of why asymmetrical flow is almost never a good idea.. That @coxhaus mentions he is doing it - on purpose?? That is horrible design.. And can be very problematic - especially when you have a stateful firewall doing the routing..
You can see this sort of issue with multihomed devices as well..
So for example my client on 192.168.1.x sends traffic to 192.168.2.x through pfsense.. But the device on 192.168.2 also has a connection in the 192.168.1 network and answers via this path then at some point pfsense will kill off the states.. And further traffic will be blocked until a new syn opens a new state..
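The multihomed case can be sketched with Python's `ipaddress` module: a host normally prefers a directly connected network over its default gateway when it picks the interface for a reply. The interface names, the host addresses (.50, .10) and the function `reply_interface` are assumptions made for this example; a real routing table also has metrics and policy rules.

```python
import ipaddress


def reply_interface(host_interfaces, reply_dst, default_if):
    """Pick the egress interface: connected networks first, else default route."""
    dst = ipaddress.ip_address(reply_dst)
    for name, cidr in host_interfaces.items():
        if dst in ipaddress.ip_interface(cidr).network:
            return name          # destination is on-link, no gateway involved
    return default_if            # everything else goes to the default gateway


# Hypothetical server with a leg in both networks; eth0 holds the default
# route via pfsense, eth1 sits directly in the client's network.
multihomed = {"eth0": "192.168.2.50/24", "eth1": "192.168.1.50/24"}
single = {"eth0": "192.168.2.50/24"}

# The client 192.168.1.10 reached the server on 192.168.2.50 through pfsense.
print(reply_interface(multihomed, "192.168.1.10", "eth0"))  # eth1, bypasses pfsense
print(reply_interface(single, "192.168.1.10", "eth0"))      # eth0, back via pfsense
```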
Asymmetrical flow, multihomed devices is just asking for problematic issues.. They should almost always be avoided..
Now you would hope that the client sending the traffic would be smart enough to figure out, hey I sent to 192.168.2.x via my gateway mac of xyz... Why is the response coming from 192.168.1.Y from mac abc.. Because such a response could be of security concern.. But many clients are stupid, and will just accept the answer.. Hey I sent to 192.168.2.x from port 4000 to port 443.. And the response even though from different IP and different mac address is to my port 4000 from a port 443..
Is this device you're talking to multihomed? Ie does it have an IP in both networks?
If you're going to talk to a device that has interfaces in network A and B from a device in network A.. You should talk to the device IP in network A.. Not B - if you talk to its B address, you are yes most likely going to have issues..
Multihoming can be very problematic.. And also a security concern.. Because your firewall has no control over this device talking to other devices in other networks - because it has a leg in multiple networks. And this can be used to circumvent firewall controls of what can talk to what.