Sticky connections not working with dual WAN
I have sticky connections enabled, but I keep getting thrown back to the login screen on more than one site that requires me to log in (one site is owned and operated on my own server at a separate location and has only 1 IP address). This has also been happening on banking websites and others.
I have confirmed on my own remote server that it is due to the IP changing and as said I have sticky connections enabled.
This happens within a minute or two, so it's not due to the states expiring, but I tried setting the timeout to 1200 seconds, killing states and browser sessions to be sure, and trying again. Still no luck, it still happens.
To confirm my setup I have 2 Fibre connections in the UK and they are not dropping out and are very stable and the following settings are set:
System > Routing > Gateways > Default gateway IPv4: LoadBalance (Load Balancing Group)
System > Routing > Gateways > Gateway Groups: Group Name: LoadBalance - Priority: Both set to: Tier 1 - Trigger Level: Trigger Level
Firewall > Rules > vLAN 1 > Outbound Rule set to: LoadBalance Group
Just out of curiosity - why would you be blocking bogon on your own internal network interface - zero sense to do such a thing..
Have you validated by sniffing on your 2 wans that traffic is actually not being sticky? Cuz yeah sites don't like it when you try and hit the same site from different IPs at the same time for the same session.
I'm not a security expert by any means, are you saying I should remove this for the vLANs and just have it on the WANs?
With regards to the sticky connections, I have verified it by server logs of websites I connect to that the IP is changing (my own remote server) so it is 100% changing.
Do you understand what is in bogon? It is IP addresses that do not route on the internet.. How would a source IP that is listed in bogon ever hit your internal interfaces?
Your rules on vlan1_trusted already all say hey only vlan1_trusted are allowed.. So any such odd IP, say a downstream network or "bogon" wouldn't be allowed in the first place by any of your rules.. There is zero point to having bogon on your internal networks... And to be honest little point on your wan either ;) These are IPs that are not meant to route on internet - how/why would you see them on your wan in the first place?
You're lucky pfsense pulls out rfc1918 space from bogon list, or you wouldn't be able to get anywhere with bogon lan side being blocked.
If your IP is changing then you don't have sticky set.. Or your states are expiring or being removed.. Do you have it setup to remove states on issue with the gateway? I think this is default??? That if there is a wan event, states are cleared?
No I wasn't sure what bogon is, so I've removed it now thanks.
I do have sticky set and have tried with the "source tracking timeout for sticky connections" set to the default of "0" and again at "1200", after closing all browser sessions and resetting states.
I'm not sure what you mean when you say "Do you have it setup to remove states on issue with the gateway? I think this is default??? That if there is a wan event, states are cleared?"?
I thought all I have to do to enable sticky connections is to enable it in "System > Advanced > Miscellaneous"? If there is something else to do, please could you be clear, as I'm not an advanced user with pfSense.
So if there is an issue with wan... say not answering ping.. Or the like - I mean its out there as a possible cause.. It might happen now and then off chance..
These are the settings I'm talking about.
Under advanced - networking
And then under misc
These 2 things could flush all your states on you..
One way to check is to look in your state table when you see this problem occur - has it been flushed?
Does it happen all the time, or does it just happen now and then?
If it's always happening, like on every connection - that points to sticky not working, or the setting not actually taking effect.. Have you tried toggling the setting off, saving, and then turning it back on and saving again? Have you looked in the actual XML to validate the setting is in there?
I've checked those both and they are already disabled, so I enabled both and disabled them.
I've gone to check the XML file but it's not clear exactly what I should be looking for? Do you happen to know, please?
It happens a lot, for example I can login to Santander for my banking, click just a few links and I'm logged out.
I log in to my own server's admin panel and again, within a few clicks, it logs me out just like Santander. The logs show the same IPs each time, so those are not changing, but they specifically show it's down to me using an IP that I didn't log in with:
Rejected session for user admin because IP (5.70.xxx.xxx) doesn't match session file (217.45.xxx.xxx)
I am also sure my connection is not dropping that often, there's just no way.
Just to confirm also, after turning the aforementioned settings on and off again I tried again with the "source tracking timeout for sticky connections" set to "1200" so it shouldn't change my IP when connected to the website for that amount of time (i.e. log me out).
However, it's still happening:
2020:06:06-02:06:04: '5.70.xxx.xxx' successful login to 'admin' after 1 attempts
2020:06:06-02:11:23: '217.45.xxx.xxx' successful login to 'admin' after 1 attempts
The second login was because my IP changed and I had to log in again.
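For anyone wanting to check their own server logs for the same symptom, here is a minimal sketch. It assumes the log keeps the format shown in the two lines above (timestamp, quoted IP, quoted user); the function name `find_ip_changes` and the regular expression are illustrative only and not part of any server software.

```python
import re

# Matches lines like:
# 2020:06:06-02:06:04: '5.70.xxx.xxx' successful login to 'admin' after 1 attempts
LOGIN_RE = re.compile(
    r"^(?P<ts>\S+): '(?P<ip>[^']+)' successful login to '(?P<user>[^']+)'"
)


def find_ip_changes(lines):
    """Return (user, old_ip, new_ip, timestamp) for each login from a new IP."""
    last_ip = {}   # user -> IP seen on that user's previous login
    changes = []
    for line in lines:
        m = LOGIN_RE.match(line.strip())
        if not m:
            continue  # ignore lines that are not successful logins
        user, ip, ts = m["user"], m["ip"], m["ts"]
        if user in last_ip and last_ip[user] != ip:
            changes.append((user, last_ip[user], ip, ts))
        last_ip[user] = ip
    return changes


# The two log lines quoted in the post above.
sample_log = [
    "2020:06:06-02:06:04: '5.70.xxx.xxx' successful login to 'admin' after 1 attempts",
    "2020:06:06-02:11:23: '217.45.xxx.xxx' successful login to 'admin' after 1 attempts",
]

for user, old_ip, new_ip, ts in find_ip_changes(sample_log):
    print(f"{ts}: {user} moved from {old_ip} to {new_ip}")
```

Run against a longer log, every reported change marks a point where the session would have been rejected.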
I actually submitted this as a bug because I believe it is (I also sought out help in the IRC channel but they couldn't help me) but they referred me to here first:
It could be a bug - but I would think a lot more than just you would be reporting it.. I would think dual wan with sticky would be a common enough sort of setup that there are quite a few out there in the field..
I don't have dual wan, or would love to try and duplicate.. That you have a server to test to makes it easy to see exactly what is happening etc..
I would have to simulate a dual wan - which I could do.. But let's see if we get any other traction - maybe someone with dual wan, even if not using it in load balancing - might be willing to try and duplicate the problem.
As temp solution - only thing I could suggest would be to turn off the load balancing and just use 2nd connection as failover.
As a temp solution, I've just set a rule that anything going to my servers or santander.co.uk & retail.santander.co.uk will use a specific gateway.
Are we just hoping someone with Dual WAN setup reads this and jumps in to help then?
Well we could call in @Derelict but don't think he is around for a few days..
Well there's no major rush as I'm not exactly down so I'll just hang on for an update and hopefully, he'll see this soon.
Thanks for your help so far Johnpoz :)
If the application doesn't work with load balancing it doesn't work with load balancing.
That's pretty much what I have. Talk to the application side about accepting sessions from multiple IP addresses.
I'm using this exact scenario, with dual wan, banking sites and quite a few users accessing them. No issues.
I did have issues in the beginning and I had to raise stickiness to 2500.
I also have raised the default weight to 2, so no line has a weight of 1.
I recall reading somewhere about an issue with load balancing, and this as a suggested workaround, but I can't recall it.
In any case, it doesn't hurt anything to use a default weight of 2 and adjust smaller lines accordingly.
I'm on 2.4.5 and this also worked flawlessly on 2.4.4.p3
@Derelict I’m sorry but I don’t understand your reply.
The application does work with load balancing (Google Chrome, Microsoft Edge etc...) but the security of the websites being visited requires that the IP doesn’t change. Isn’t that the exact purpose of sticky connections, to work around this?
Plus if someone else is now reporting the issue surely it warrants being looked into?
Look at the states when you are connected. If there are two different IP addresses being connected to, but all connections to the same IP address use the same WAN, then load balancing is doing what it is designed to do and you will need to policy route all traffic for that application out the same WAN or Failover gateway group, not a load balance gateway group.
^ great point... But my take on him saying his server was logging 2 different IPs connecting is that he was only connecting to 1 destination IPv4 address..
But your point is very valid for many of these sites that are hosted on cdn where www.whatever.com could end up being 2 different destination ips for the same site..
@Derelict I tried as you suggested.... killed all states, went to my own server and logged in via the website (as said the server only has 1 IP). I was almost immediately logged out so logged in again.
Checked the states and noticed it's using both WANs as suspected:
VLAN1_TRUSTED tcp 192.168.1.126:64519 -> 62.3.XXX.XXX:3334 TIME_WAIT:TIME_WAIT 8 / 8 2 KiB / 936 B
WAN1 tcp 217.45.XXX.XXX:8341 (192.168.1.126:64519) -> 62.3.XXX.XXX:3334 TIME_WAIT:TIME_WAIT 8 / 8 2 KiB / 936 B
VLAN1_TRUSTED tcp 192.168.1.126:64522 -> 62.3.XXX.XXX:3334 FIN_WAIT_2:FIN_WAIT_2 8 / 8 2 KiB / 4 KiB
WAN2 tcp 5.70.XXX.XXX:59341 (192.168.1.126:64522) -> 62.3.XXX.XXX:3334 FIN_WAIT_2:FIN_WAIT_2 8 / 8 2 KiB / 4 KiB
Sticky connections are on and the timeout is set to 1200.
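A rough sketch of how the check suggested earlier in the thread (does every connection to one destination use the same WAN?) could be done on a pasted copy of the state table. It assumes the text layout of the four state lines above; the function name `wans_per_destination` and the regex are illustrative, not a pfSense tool, and the WAN interface names are passed in by the caller.

```python
import re
from collections import defaultdict

# interface, protocol, source, optional "(original source)", "->", destination
STATE_RE = re.compile(
    r"^(?P<iface>\S+)\s+(?P<proto>\S+)\s+(?P<src>\S+)"
    r"(?:\s+\((?P<orig>\S+)\))?\s+->\s+(?P<dst>\S+)"
)


def wans_per_destination(lines, wan_names):
    """Map each destination (ip:port) to the set of WAN interfaces used to reach it."""
    used = defaultdict(set)
    for line in lines:
        m = STATE_RE.match(line.strip())
        # Only count states on the WAN side; LAN/VLAN states are skipped.
        if m and m["iface"] in wan_names:
            used[m["dst"]].add(m["iface"])
    return dict(used)


# The state lines quoted in the post above.
states = [
    "VLAN1_TRUSTED tcp 192.168.1.126:64519 -> 62.3.XXX.XXX:3334 TIME_WAIT:TIME_WAIT 8 / 8 2 KiB / 936 B",
    "WAN1 tcp 217.45.XXX.XXX:8341 (192.168.1.126:64519) -> 62.3.XXX.XXX:3334 TIME_WAIT:TIME_WAIT 8 / 8 2 KiB / 936 B",
    "VLAN1_TRUSTED tcp 192.168.1.126:64522 -> 62.3.XXX.XXX:3334 FIN_WAIT_2:FIN_WAIT_2 8 / 8 2 KiB / 4 KiB",
    "WAN2 tcp 5.70.XXX.XXX:59341 (192.168.1.126:64522) -> 62.3.XXX.XXX:3334 FIN_WAIT_2:FIN_WAIT_2 8 / 8 2 KiB / 4 KiB",
]

for dst, wans in wans_per_destination(states, {"WAN1", "WAN2"}).items():
    flag = "split across WANs" if len(wans) > 1 else "single WAN"
    print(dst, sorted(wans), flag)
```

Any destination that comes back with more than one WAN is a candidate for the session problem described here.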
Well your states are showing fin_wait.. and time_wait
Those states are being closed..
I would sniff this traffic to see who is sending the fin.
I'm not sure what you mean? I'm the only person connecting.
It took me a few minutes to find those details in the states so that's probably why it shows the connections are closing. But I just grabbed any 2 connections in the logs showing it was using more than 1 WAN. There were many other lines of logs showing connections on both WANs.
These connections were made over the timeframe of 1 minute and after killing states so shouldn't there only be 1 WAN IP in the logs regardless?
You can filter states.. My point was that those states are closed..
This statement "Once the states for that source expire" means what exactly... Does it mean any state, even closed states that are just waiting to time out.. Or does that state have to actually be active?
This is where I thought maybe @Derelict could help..
Let's look at this scenario... You create a connection to IP X, now that state has been set to be closed.. fin.. and you enter a time_wait state.. Is that state considered expired - so a new session, which is what you show there from a different source port, would that go out the same wan, or would it round robin to the other wan?
You could look at it both ways.. Since the state is just waiting to close, and you have this new session coming from a different source port, maybe I should round robin that connection.. Or you could look at it as hey there is ANY state from your rfc1918 address to this public IP 62.3 - so always use that wan? I am not exactly sure how it is looked at?
I could see both being valid ways of looking at it.. Hey this client has an active session to x, any new sessions it creates will go out the same wan.. Or hey this session is closed or closing... Since this is a new session ("different source port").. Maybe it should go out the other wan to load balance.
Would you be willing to do a remote session with me and I can show you all the evidence? I really think there's a bug here.
I am not saying it's not a bug or that there isn't a problem - I just don't know which specifics pfsense is using to decide whether to keep a connection sticky.. I made a bit of an edit to my previous post.
You can look at it both ways, I don't know exactly what "Once the states for that source expire" means.. Maybe once there has been a fin, that state is no longer looked at - I am not sure..
Well I'm at a loss as to what to do next.
I think it comes down to @Derelict needs to advise what further testing I can do or accept it may be a possible bug?
I hope he replies!
I still think he is out for a bit, my understanding is he wouldn't be back for a few more days... So his check into the thread was a bit unexpected to me..
We can see if @jimp has any advice as well.. This is just a bit out of my comfort level, since I do not use multiple wans in a load balancing setup.. I don't really see the point to it to be honest ;) If you need to load balance, that tells me your connections are undersized ;) hehehe
I have more experience with this sort of thing on fortinet load balancing to servers behind them, and how their sticky connections work.. And even then its not a day to day sort of thing, only get called into consult on issues - normally they give me sniffs to work with and help them figure out what is going wrong ;)
If you could show state that is clearly active, and then another state being opened - then I would agree that is not how I would understand sticky to work.
You know who might be good as well would be @stephenw10
@johnpoz I don't deny my upload is undersized for my needs... it's the best I can get at the moment though until they upgrade the infrastructure around here. It has many other advantages though such as redundancy.
Hopefully one of the people you tagged can chip in :)
I do appreciate all the help so far... thanks pal!
Sure, failover I get, but that wouldn't need to be in a load balancing setup to do that ;) heheh
What I would suggest is try and validate if this other connection is being created after the original state is closed.. You could just sniff on your client.. Do you see or send a fin at any time?
And is that when the wan changes?
It's getting a bit above me this now which is why I was hoping I could let you teamviewer in and maybe take a look?
You'd have all the answers in 5 minutes rather than going back and forth through this monkey ;)
@Daskew78 You shouldn't really care about states being closed when you have a stickiness of 1200.
As the documentation says, if you have stickiness 0, then the load balancing path is re-evaluated when connections are closed. (and we could discuss if this means fin wait etc)
But a stickiness of 1200 means that for 1200 seconds AFTER a connection is closed, if a new request comes from the same IP to the same host it will leave from the same gateway.
I insist: stickiness works fine in a multi-WAN SSL load balancing scenario.
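To make the difference between a stickiness of 0 and 1200 concrete, here is a toy model of the behaviour as described in this post. It is a sketch under stated assumptions, not how pf actually implements source tracking: the class `StickyBalancer` and its methods are hypothetical, the tracking is keyed on the source IP only, and a state counts as closed when `close_state` is called (which sidesteps the open question in this thread about FIN_WAIT and TIME_WAIT).

```python
class StickyBalancer:
    """Toy model of sticky load balancing; NOT pf's actual implementation."""

    def __init__(self, gateways, timeout):
        self.gateways = list(gateways)
        self.timeout = timeout   # seconds a source is remembered after its last state closes
        self.next_index = 0      # round-robin position
        self.tracked = {}        # source IP -> [gateway, open_states, last_close_time]

    def _round_robin(self):
        gateway = self.gateways[self.next_index % len(self.gateways)]
        self.next_index += 1
        return gateway

    def open_state(self, src, now):
        """Pick the gateway for a new connection from src at time now (seconds)."""
        entry = self.tracked.get(src)
        if entry is not None:
            _, open_states, last_close = entry
            # The mapping survives while states are open, or until the
            # timeout has elapsed since the last state closed.
            if not (open_states > 0 or (now - last_close) < self.timeout):
                entry = None
        if entry is None:
            entry = [self._round_robin(), 0, None]
            self.tracked[src] = entry
        entry[1] += 1
        return entry[0]

    def close_state(self, src, now):
        """Record that one connection from src closed at time now."""
        entry = self.tracked[src]
        entry[1] -= 1
        entry[2] = now


client = "192.168.1.126"

# Stickiness 0: the path is re-evaluated once the connection is closed.
lb0 = StickyBalancer(["WAN1", "WAN2"], timeout=0)
first_0 = lb0.open_state(client, now=0)
lb0.close_state(client, now=10)
second_0 = lb0.open_state(client, now=11)

# Stickiness 1200: the same gateway is reused for 1200 s after the close.
lb1200 = StickyBalancer(["WAN1", "WAN2"], timeout=1200)
first_1200 = lb1200.open_state(client, now=0)
lb1200.close_state(client, now=10)
second_1200 = lb1200.open_state(client, now=100)

print(first_0, second_0, first_1200, second_1200)
```

In this model the timeout-0 client flips to the other WAN as soon as its only connection closes, while the timeout-1200 client stays on the same WAN, which is the behaviour the OP expected but is not seeing.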
And consider this workaround too
Also of note, when the weights differ, even though the gateways have a specific order with repetition in the rule, pf seems to still flip back and forth, though the general ratio of the weights is respected. For example with WAN1=3, WAN2=2:
I had the same issues as you do until I made 2 the default weight on both load balancing connections.
Deeper issues are suspected, as redmine says.
Please consider testing the workaround.
Yeah sure seems that issue is exactly what you're seeing... I would do what @netblues says and that should fix up your issue I would hope.
Well if that's the case it is a bug then. But at least there appears to be a workaround.
I'll test it now, cheers again :)
We suggest to put a weight of 2 on both gateways and load balance them as both tier 1.
with a stickiness of 2500
As you see yes there is a redmine on it ;)
Currently targeted at 2.5 - but it's been pushed many times already.. So wouldn't expect... This thread could get added to that redmine I would think.. Might put a bit more weight on looking into it.
Sorry I spoke a little too soon.... should I also set the sticky connections back to "0"?
@Daskew78 NO, it won't work on web banking sites
Sorry I was replying too fast and missed your update about setting the states to 2500.
I have set it to 2500 and set each gateway to Tier 1 but I can't see where I set a weight of 2? Where is the weight setting?