Unexplainable trouble with OpenVPN connection - beer for tip to fix it :)

Gertjan

To exclude another bunch of possible reasons : skip the VPN part.

Connect to the (a) remote site, put in place some RDP NAT rules to some internal devices. You could use your own WAN IP in the rule as a source, which will exclude any security issues.
Now test.
If the speed is ok, you know it's a VPN only issue.
Still issues : it's the 'road' to the remote side, not the VPN.

keson

@Gertjan THANKS a lot for your input.
I have created a NAT and a rule on the remote site to allow direct translation of port 3389, and it behaves exactly the same way: I can barely initiate the security handshake, RDP stays on the welcome screen for long minutes, and after 10+ minutes it eventually opens a desktop, but I am unable to move the mouse or windows.

Again, when testing from my computer/phone directly over mobile data, it works blazing fast.

So this is to exclude the remote pfSense box for sure and also my home end devices.

Also as I can still be connected to the other older pfSense remote site and use RDP absolutely without any issue, I would also exclude the provider/lines.

I am now more convinced that the problem is my home pfSense. It has started to behave strangely enough that I want to reinstall it entirely. When rebooting or restoring the configuration, I can't get it to load the pfBlocker rules any more; it simply seems corrupted enough to need a fresh install.

keson

@stephenw10 I will also try to follow your advice before I entirely reinstall my box from a fresh USB. I will need to wait until the evening, when the users have left the remote site.

Just to make sure I follow your advice, by "assigning" the interfaces you refer to the page "/interfaces_assign.php" to delete the assignment and assign them again?

Gertjan

@keson said in Unexplainable trouble with OpenVPN connection - beer for tip to fix it :):

When rebooting or restoring configuration

Take note: we all share the pfSense 'core' files. What makes each system different, anywhere from "works great" to "doesn't work", is ... the config.
So, reinstalling using the same config ... yields the same result.

This is one of the reasons why you should keep your setup as simple as possible. That enables you to create a new system from scratch if needed without taking hours to set it up.

Also: RDP traffic is like any other type of traffic, pfSense couldn't care less. It's handled at the same speed as all other traffic. Except if you are filtering using 'snort' etc.

stephenw10

If the OpenVPN instances are already assigned as interfaces assigning them again will not do anything. They do not need to be assigned though, if they are not already assigned then I was suggesting assigning them. And yes on the Interfaces > Assign page.

If you see the same issue outside the tunnel it's unlikely to be a VPN issue anyway though. It's possible it could be something at the firewall at one end or it could be something in the route between them. I would still try to find what size packets you can send across that path outside the tunnel and whether you see packet loss there.
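A minimal sketch of how the packet-size test above could be automated, assuming that every payload size below the largest working one also works. The function names `largest_passing_payload` and `ping_probe` are hypothetical. The `ping -D` flag (set the Don't Fragment bit) is the FreeBSD form that would apply on a pfSense shell; on Linux the equivalent is `-M do`. The default upper bound of 1472 assumes a 1500 byte MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.

```python
import subprocess


def largest_passing_payload(probe, low=0, high=1472):
    """Binary-search the largest payload size for which probe(size) is True.

    probe is any callable taking a size in bytes and returning True when a
    packet of that size got through. Returns None if even `low` fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid       # mid works, the answer is mid or larger
        else:
            high = mid - 1  # mid fails, the answer is smaller
    return low


def ping_probe(host):
    """Build a probe that sends one ICMP echo with the DF bit set.

    Assumption: FreeBSD-style ping as found on pfSense ("-D" = Don't Fragment).
    """
    def probe(size):
        cmd = ["ping", "-D", "-c", "1", "-s", str(size), host]
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=5)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0
    return probe


if __name__ == "__main__":
    # Simulated path that drops anything above 1372 bytes of payload.
    print(largest_passing_payload(lambda size: size <= 1372))
```

On a real system the simulated probe would be replaced with something like `largest_passing_payload(ping_probe("remote.example"))`, where the hostname is a placeholder. A single lost ping would skew the result, so repeating each probe a few times is worth considering on a lossy link.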

It's common to see ruleset alerts from pfBlocker aliases before it has loaded and populated them. That's not normally a problem, the rules will load once pfBlocker updates the alias.
Try going to Status > Filter Reload and forcing a reload. If you see new ruleset alerts at that point it's a problem.

Steve

keson

Gentlemen, thanks for your feedback!

I absolutely share this approach and like keeping things simple. I do not use any special setups: a bunch of port forwards to services running on my home network, acme for Let's Encrypt certificates, pfBlockerNG to keep intruders out (allowing only a few IPs in, plus remote access from my country), and a couple of OpenVPN server connections. And a list of resolver entries for all my internal servers so I can call them by name and not by IP... Apart from that no hooks, no diverts or hacks, nothing.

I have now restored a 2 months old backup (I do weekly backups) and I can say that at least pfBlocker is healthy again. The system reported that it was reinstalling packages, which never went away due to a lock, and I found in the process list that it was the squid package keeping the system in the installation phase (I have to confirm that in the past 2 months all my pfSense boxes behaved the same: squid does not update/reinstall to the final expected "green" status). So I killed the installation and uninstalled squid completely. After that and after several reboots the system does "look ok" again.

I have checked all the settings and I think I have found the cause, but it will take 4 more days to confirm.

This is my theory:

I have a primary WAN (a local net provider) and a secondary WAN (PCIe LTE card with a SIM card). It seems there is a FUP limit on the card and most likely I ran out of data. That would explain why it worked all the time in the past weeks and "suddenly" stopped working.
Now you might ask why I use the secondary WAN and not the primary one. That is still a bit of a mystery to me. Both WANs are in a group for failover (packet loss or high latency, tier 1 and tier 2), so basically the primary GW for the local provider should always be used... Here comes the thing. All my other pfSense boxes are connected to all kinds of providers, but the one I am talking about here is on the same provider as me. Let's say my IP address is 109.72.10.20 and the other box is 109.72.20.30. To me it seems as if my pfSense box automatically always uses the LTE secondary GW for it. When I mark the second LTE GW as down, I can't get to the remote server at all.

So my conclusion is that for some routing reason (I did not create any static routes, all are created by the system automatically...) whenever I want to reach my remote pfSense box hosted by the same provider as mine, I take the LTE path, where I do not get through because of the FUP being applied. I guess it is because my IP address and the remote address are in the same class B (/16) range. Just guessing.
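The "same B subnet" guess can be checked with a few lines of Python. This is only a sketch using the example addresses from the post, and `same_network` is a hypothetical helper name. Note that two addresses falling in the same /16 does not by itself change routing: what matters is the prefix length actually configured on the WAN interface.

```python
import ipaddress


def same_network(addr_a, addr_b, prefix_len):
    """Return True if both addresses fall inside the same network of the
    given prefix length (for example 16 for a classic class B range)."""
    # strict=False lets us pass a host address and get its containing network
    network = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in network


if __name__ == "__main__":
    home = "109.72.10.20"    # example home address from the post
    remote = "109.72.20.30"  # example remote address from the post
    print(same_network(home, remote, 16))  # same /16 range?
    print(same_network(home, remote, 24))  # same /24 range?
```

With the example addresses the /16 check is true and the /24 check is false, so the remote box would only be treated as directly connected if the WAN were configured with a mask as wide as /16.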

Attached is the image of my routes.

And how did I find out?
When I connect to the OpenVPN I can see on the remote box, in the list of currently connected clients, my dynamic IP (from Vodafone) and not my fixed public IP:

So I do expect, that once the FUP is reset, I will again be able to work.

Thank you all for your ideas, I have learned many new things again from you guys! And I am sorry for not identifying the routing problem at the very beginning; those are the things I simply take for granted and do not even dare to question.

P.S. If I can resolve the issue with routing to the remote site over my primary GW, it would perhaps resolve sooner. Would a static route help?

P.P.S. Yes, I have the primary GW as my default GW.

P.P.P.S. The pings to remote site behaved exactly the same way over VPN and outside vpn, because all go through the secondary LTE GW (now I know).

stephenw10

How is the server configured? Which interface is it listening on?

It's possible to have it listening on both WANs using a group or port forwards.

Is the remote client also configured to use an interface group?

It's possible to end up with the tunnel on the LTE WAN if the main WAN was down when it came up. It will not move back when the WAN comes back up, that does not break established connections.

If you had CGN and the two devices were actually inside the same subnet then I could imagine it might choose to use that as a source. It's hard to imagine they are though. None of those IPs are CGN specific. And carriers do not normally allow internal connections between clients.
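As a quick check of the "none of those IPs are CGN specific" point: the shared address space reserved for carrier-grade NAT is 100.64.0.0/10 (RFC 6598). A minimal sketch, with `is_cgn` as a hypothetical helper name and the example addresses from earlier in the thread:

```python
import ipaddress

# Shared address space reserved for carrier-grade NAT (RFC 6598)
CGN_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_cgn(addr):
    """Return True if addr lies inside the CGN shared address space."""
    return ipaddress.ip_address(addr) in CGN_RANGE


if __name__ == "__main__":
    for addr in ("109.72.10.20", "109.72.20.30", "100.64.1.1"):
        print(addr, is_cgn(addr))
```

An address outside this range can still sit behind some other form of NAT at the provider, so a negative result only rules out the standard CGN block.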

Steve

keson

@stephenw10 Thanks Steve for your reply.
Both server and client (the server is the remote pfSense box, my side is the client) have the OpenVPN server/client listening/originating on the primary WAN interface. (The remote side has no other WAN interface, just the one.)

So the remote server does not use any interface group.

I have also created another OpenVPN for remote access exactly the same - both originating and listening interface is the primary WAN.

I have also tested all combinations of originating the connection from the interface group.

The question of the tunnel staying on the LTE WAN and not switching back to the primary WAN also came to my mind, but I have rebooted/restored the whole box so many times and even disabled the LTE WAN entirely. As soon as I switch off the LTE, I get no connection and no ping to the remote site. The connection drops.

I have simplified everything to the bare minimum: disabled all VPNs and anything that could interfere, switched off LTE, and set only one gateway, the primary one.

As soon as the LTE goes down I get no ping:

The interfaces look like this:

and the routes are also much simpler:

With this bare setup I cannot ping the remote 109.72.xx.xx from my 109.72.yy.yy.

I am nearly convinced that the error is not in the pfSense logic or configuration but in the nature of this setup. Or perhaps even more: the internet provider is blocking this on the way... I can ping the gateway, but not the target, and the traceroute also ends there:

As you said, the carrier does not allow communication between its end users.

And I am also nearly sure that as soon as the FUP on my LTE SIM card is reset, I will be able to work again.

Steve, thank you for your great effort. I am really grateful for all the advice from you and @Gertjan.

stephenw10

So the tunnel stays up but somehow traffic across stops?

Or more like you lose connectivity to the other site entirely...

Like it's blocked in your ISP somehow.

The client then must have some failover setup, like two remote statements, since the WAN gateway would not go down.

Steve

keson

@stephenw10 The tunnel drops as soon as I kill the secondary WAN. And it does not reconnect/connect any longer. PING also drops.
So yes, it must be the ISP blocking it.
The client (me) has the failover, which basically was what I was using without noticing it. As I had plenty of data left on LTE and I never really checked on the other side which IP I was coming from, I simply set up a tunnel, it connected, and I never really questioned it.
So it first failed when the failover was "out of data", and that created this whole thread.

It is always best to figure out what the cause is and confirm it, so there is no mystery left.
I owe you guys beer, whenever you stop in CZ, you are welcome :) Accommodation as a bonus !

stephenw10

So two remote statements with different binding IPs on the client?

I would not expect a failover group to come into play here as the WAN is not actually down, it's just that route that fails.

You might try switching to a TCP tunnel or using a different port. If something is deliberately blocking OpenVPN there it may pass it.

It could be an over-matching ACL at the ISP blocking traffic between any of their internal client IPs somehow. Not much you can do about that...

Steve

keson

Good morning,
As of 1.8. I can confirm that the secondary WAN is up and working ”as before” - the FUP was reset.

So it really is an ISP “feature”: they block access between their clients. Because of my multi-WAN setup I was not aware of this limitation, and I “suddenly” ran into it once the secondary WAN stopped working due to the applied FUP.

Thanks to all of you for your support.