Trying to NAT on 2 pfSense boxes on the same LAN and different WAN

fabsah

Hello,

I've modified my post and added more details (IPs, gateways, screenshots, etc.) in this thread. Please read the whole story here:
https://forum.pfsense.org/index.php?topic=92576.msg513532#msg513532
Thanks

Context: I'm working as a volunteer for a medium-sized non-profit. Budget is always an issue. I'm in no way a network expert, just your regular IT guy. I started a few months ago in this organisation and my first task was to deploy a pfSense firewall. At the time, we were much smaller and the following setup was more than enough:

This is our current setup, working perfectly :

pfSense 2.1.5 on a retired desktop computer with 3 Intel NICs (let's call this PRODUCTION):

em0 - WAN1 - A cable modem with static public IP (100/50 Mbit/s main connection)
em1 - WAN2 - A VDSL2 modem with static public IP (30/6 Mbit/s backup connection)
em2 - LAN (172.16.2.100)
5 x 48-port TP-Link switches (one on each level of the building)
Multiple Wi-Fi APs spread across the building.
Several servers (2 Windows Terminal Servers, NASes, several Ubuntu Server VMs, etc.).

The 2 WANs are configured as a failover group. WAN2 only kicks in if WAN1 fails. It happened once during the past 12 months. But it works.
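
The failover behaviour described above can be pictured with a small sketch. This is only a toy model, not pfSense code: the `Gateway` class and `pick_gateway` function are hypothetical names, and the "tier" field simply mirrors the idea that a lower tier is preferred and a higher tier is used only when the preferred gateway is down.

```python
from dataclasses import dataclass


@dataclass
class Gateway:
    name: str
    tier: int      # lower tier = preferred
    is_up: bool


def pick_gateway(gateways):
    """Return the preferred gateway that is up, or None if all are down."""
    candidates = [g for g in gateways if g.is_up]
    if not candidates:
        return None
    return min(candidates, key=lambda g: g.tier)


wan1 = Gateway("WAN1", tier=1, is_up=True)
wan2 = Gateway("WAN2", tier=2, is_up=True)

print(pick_gateway([wan1, wan2]).name)   # WAN1 while it is healthy
wan1.is_up = False
print(pick_gateway([wan1, wan2]).name)   # WAN2 only once WAN1 is down
```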

There's a DHCP server on the LAN. It serves DHCP leases from 172.16.60.1 to 172.16.60.250.
There is a Squid proxy with squidGuard for basic content filtering.
Some NAT rules are enabled from WAN -> LAN (for remote SSH access to some servers).
There is an OpenVPN server (on the pfSense box) on WAN1. The purpose of this server is to connect to our Windows terminal server and to remotely mount SMB shares.

Everything is humming nicely. With my limited experience and knowledge, I've been able to manage this whole thing and add services according to the organisation needs, until now.

We now want to upgrade the redundancy and the security level. As you can see, if this retired computer fails, the whole network goes down.
So, we've bought 2 dedicated pfSense appliances (let's call them TEST1 and TEST2).
Those are Intel Atom boxes with 4 cores, 4 GB of RAM, an SSD and 8 NICs.
The goal is to set up TEST1 and TEST2 as a CARP cluster and to replace PRODUCTION.

Here is how TEST1 is currently set up:

em0 - WAN3 - The same cable modem with ANOTHER static public IP (100/50 Mbit/s main connection)
em1 - LAN (172.16.1.50)
em2 - WAN2 - A VDSL2 modem with static public IP (30/6 Mbit/s backup connection)

My original, naive plan was to install TEST1 alongside PRODUCTION. Of course, the DHCP server on the TEST1 LAN would be disabled at first. I would manually and progressively replicate the services and configuration from PRODUCTION to TEST1. I would be the only one using TEST1 as a gateway until everything was ready. Once everything worked fine, I would unplug PRODUCTION and cross my fingers ;).

Everything was OK until I started replicating and testing NAT rules on TEST1.

The problem I'm facing is accessing any of our servers from WAN3 through TEST1.
I've set up NAT accordingly.

It works flawlessly when I ssh from my home to the pfSense firewall TEST1 through the NAT rule set up on the WAN3 interface on TEST1. From there, I can ping or "piggyback-ssh" to any server on our LAN.
It doesn't work if I try to ssh from home (or any other location outside our LAN) to SERVER1 (172.16.1.11) through WAN3. Of course, I've tried different ports, checked my NAT rules ad nauseam, and even called the cable company to make sure there was a full NAT in place.

I think I'm out of my depth to solve this issue.

From what I've gathered by searching the forum, since the servers on the LAN are using PRODUCTION as a gateway, and for other reasons related to ARP / layer 2, what I'm trying to achieve is either not possible or too complicated for my level of expertise.

Would some charitable soul confirm that I'm not doing this the right way and point me toward the right path to success ;) ?

Thanks !

Derelict

What interface are you using to sync states/configs?

fabsah

Hello Derelict,

I'm not. The 2 firewalls (PRODUCTION and TEST1) are not synced. In the future, TEST1 and TEST2 will be synced. For now, I'm just trying to deploy the new firewall, TEST1, alongside PRODUCTION for testing purposes.

Do you think the lack of sync is the issue?

Thanks !

Derelict

No. Not really.

How many public IPs do you have on each service and how are they obtained? Are they routed to you? DHCP?

You're going to have to provide more info. Like real public IPs, etc. You can change them for "privacy", just stay consistent.

I'm guessing you are seeing an asymmetric routing issue. The server you're connecting to has PRODUCTION as its default gateway, so you connect in through TEST1 but the return traffic is routed out through PRODUCTION. PRODUCTION has no state for that connection, so it drops the replies as out of state; even if they got through, they would reach you from PRODUCTION's IP address instead of TEST1's.

https://doc.pfsense.org/index.php/Asymmetric_Routing_and_Firewall_Rules
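
To make the asymmetric path concrete, here is a minimal sketch of two stateful firewalls that each only pass replies for connections they saw being opened. It is a toy model under loud assumptions: `StatefulFirewall` and `try_connection` are hypothetical names, state is reduced to a (client, server) pair, and 203.0.113.10 is a documentation address standing in for the home client.

```python
class StatefulFirewall:
    """Toy stateful firewall: only tracks (client, server) connection pairs."""

    def __init__(self, name):
        self.name = name
        self.states = set()

    def forward_inbound(self, client, server):
        # A port forward that matches creates a state entry for the connection.
        self.states.add((client, server))

    def forward_reply(self, server, client):
        # A reply is passed only if this firewall saw the connection being opened.
        return (client, server) in self.states


def try_connection(client, server, entry_fw, server_default_gw):
    """Return True if the reply can leave through the server's default gateway."""
    entry_fw.forward_inbound(client, server)
    return server_default_gw.forward_reply(server, client)


production = StatefulFirewall("PRODUCTION")
test1 = StatefulFirewall("TEST1")

# Server still points at PRODUCTION: the reply hits a firewall with no state.
print(try_connection("203.0.113.10", "172.16.1.11", test1, production))  # False
# Server points at TEST1: the reply goes back the way the request came in.
print(try_connection("203.0.113.10", "172.16.1.11", test1, test1))       # True
```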

It sounds like you're on the right track. The only thing I'm not sure about is squid/squidguard on a CARP/HA pair. I've never done that.

fabsah

Thank you so much for taking interest in my problem and helping me resolve this. I feel less alone :D

I'll be at the office tomorrow and will provide you with the information you've requested (and screenshots of NAT and GW, that shouldn't hurt).

fabrice

fabsah

Hello,

I've amended the first post to include more information and the results of some more testing I did this morning, as requested.

I've also included screenshots, labelled Arrakis*.png for the production firewall and Cerberus*.png for the testing firewall (the one I'm having trouble with).

Here it is :

Hello,

Context: I'm working as a volunteer for a medium-sized non-profit. Budget is always an issue. I'm in no way a network expert, just your regular IT guy. I started a few months ago in this organisation and my first task was to deploy a pfSense firewall. At the time, we were much smaller and the following setup was more than enough:

This is our current setup, working perfectly :

NUMERICABLE Cable modem (gateway - NO DHCP - Full NAT) :
PORT 1 : 212.xxx.xxx.53/31
PORT 2 : 212.xxx.xxx.145/29

BELGACOM VDSL2 modem : public IP and 192.168.1.1 (gateway, DHCP, Full NAT)

pfSense 2.1.5 on a retired desktop computer with 3 Intel NICs (let's call this ARRAKIS):

em0 - NUMERICABLE PORT1 212.xxx.xxx.54 - A cable modem with static public IP (100/50 Mbit/s main connection)
em1 - BELGACOM 192.168.1.2 - A VDSL2 modem with static public IP (30/6 Mbit/s backup connection)
em2 - LAN (172.16.2.100)
5 x 48-port TP-Link switches (one on each level of the building)
Multiple Wi-Fi APs spread across the building.
Several servers (2 Windows Terminal Servers, NASes, several Ubuntu Server VMs, etc.).

The 2 WANs are configured as a failover group. WAN2 only kicks in if WAN1 fails. It happened once during the past 12 months. But it works.

There's a DHCP server on the LAN. It serves DHCP leases from 172.16.60.1 to 172.16.60.250.
There is a Squid proxy with squidGuard for basic content filtering.
Some NAT rules are enabled from WAN -> LAN (for remote SSH access to some servers).
There is an OpenVPN server (on the pfSense box) on NUMERICABLE. The purpose of this server is to connect to our Windows terminal server and to remotely mount SMB shares.

Everything is humming nicely. With my limited experience and knowledge, I've been able to manage this whole thing and add services according to the organisation needs, until now.

We now want to upgrade the redundancy and the security level. As you can see, if this retired computer fails, the whole network goes down.
So, we've bought 2 dedicated pfSense appliances (let's call them CERBERUS and TEST2).
Those are Intel Atom boxes with 4 cores, 4 GB of RAM, an SSD and 8 NICs.
The goal is to set up CERBERUS and TEST2 as a CARP cluster and to replace ARRAKIS.

Here is how CERBERUS is currently set up:

em0 - NUMERICABLE PORT2 212.xxx.xxx.146 - The same cable modem with ANOTHER static public IP (100/50 Mbit/s main connection)
em1 - LAN (172.16.1.50)
em2 - BELGACOM 192.168.1.3 - A VDSL2 modem with static public IP (30/6 Mbit/s backup connection)

My original, naive plan was to install CERBERUS alongside ARRAKIS. Of course, the DHCP server on the CERBERUS LAN would be disabled at first. I would manually and progressively replicate the services and configuration from ARRAKIS to CERBERUS. I would be the only one using CERBERUS as a gateway until everything was ready. Once everything worked fine, I would unplug ARRAKIS and cross my fingers ;).

Everything was OK until I started replicating and testing NAT rules on CERBERUS.

The problem I'm facing is accessing any of our servers from NUMERICABLE through CERBERUS.
I've set up NAT accordingly.

It works flawlessly when I ssh from my home to the pfSense firewall CERBERUS through the NAT rule set up on the NUMERICABLE interface on CERBERUS. From there, I can ping or "piggyback-ssh" to any server on our LAN.
It doesn't work if I try to ssh from home (or any other location outside our LAN) to SERVER1 (172.16.1.11) through NUMERICABLE. Of course, I've tried different ports, checked my NAT rules ad nauseam, and even called the cable company to make sure there was a full NAT in place.

UPDATE 20-APRIL - It works if I ssh from home to SERVER2 (172.16.210.50) through NUMERICABLE on CERBERUS IF SERVER2 has CERBERUS as a gateway (just tested it this morning).

This last update confirms Derelict's assertion that the servers are answering the NAT requests through the wrong gateway.

So, how to fix this ? Is there a way to make the server answer to different gateways, depending on where the request comes from ?
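
One way to reason about this question is to look at how a LAN server picks the next hop for its reply. The sketch below is a toy model, not a tested configuration: it assumes the LAN is a single 172.16.0.0/16 network (the mask is never stated in this thread), uses a documentation address for the home client, and `next_hop` is a hypothetical helper. It illustrates why a reply to a public address always goes to the default gateway, while a reply to an on-link address (for instance if the new firewall rewrote the source of forwarded connections to its own LAN address, a workaround nobody tried here) would go straight back to that firewall.

```python
import ipaddress

# Assumption: the LAN is a single 172.16.0.0/16 network (the thread never states the mask).
LAN = ipaddress.ip_network("172.16.0.0/16")
DEFAULT_GATEWAY = "172.16.2.100"   # ARRAKIS LAN address, the servers' current gateway
CERBERUS_LAN = "172.16.1.50"


def next_hop(reply_destination):
    """Where a LAN server sends a reply: directly if on-link, else via the default gateway."""
    if ipaddress.ip_address(reply_destination) in LAN:
        return reply_destination
    return DEFAULT_GATEWAY


# The server sees the client's public address: the reply leaves through ARRAKIS.
print(next_hop("203.0.113.10"))   # 172.16.2.100
# The server sees CERBERUS's LAN address as the source: the reply goes back to CERBERUS.
print(next_hop(CERBERUS_LAN))     # 172.16.1.50
```

The trade-off with rewriting the source address is that the server's logs would show the firewall's LAN address instead of the real client address.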

Thanks !

Attachments: Arrakis-Dashboard.png, Arrakis-GW1.png, Arrakis-GG2.png, Arrakis-GG1.png, Cerberus-Dashboard.png, Cerberus-NATdetails.png, Cerberus-NAT1.png

mer

Someone with more specific real world knowledge (I've got "book knowledge") should be able to provide more details, but I don't think you can with your current setup. If TEST1 and PRODUCTION were in a redundant configuration, then I think it would work as all the servers behind them would point to a single address.

Perhaps you could get in off hours and on one of the servers point it at TEST1 or set up a test server to point at TEST1. I think this (a test server, doesn't have to be doing much, just a machine you can have control of) would be helpful because when you get TEST1 and TEST2 into a redundant configuration, you will want to test that (have them up and running, fail one of them by pulling a network cable).

Just some thoughts.

fabsah

Hello Mer,

Thanks for your reply and for taking the time ;)

Yep, I've already been able to ssh from home to SERVER2 (an old laptop with sshd) through NUMERICABLE on CERBERUS IF SERVER2 has CERBERUS as a gateway. So I KNOW it will work when I retire Arrakis (PRODUCTION) and let Cerberus manage the network. I can do that on a Sunday. This is my plan B ;-)

I would like a plan A where both firewalls can co-exist for a few days. But I understand that what I'm asking might not be technically possible. I still have 10 days to figure this out. I hope someone has THE solution, but it's not the end of the world if I have to resort to plan B.

Thanks again !

fabrice

mer

If you want TEST1 and TEST2 in a redundant configuration (CARP), why not put TEST1, TEST2 and PRODUCTION into a group, with PRODUCTION as the master, let everything sync, then just retire PRODUCTION? It may take a bit of work getting everything talking and synched, plus you're changing the currently working config on PRODUCTION, but it would give you a head start on TEST1/TEST2 redundancy.

fabsah

The problem is that PRODUCTION/Arrakis is already maxed out with 3 NICs (it's an old retired desktop computer). So there's no slot left to add a dedicated NIC for CARP syncing. We could buy a PCI card with 4 Ethernet ports, but that would mean downtime for installing and reconfiguring. If we have to have downtime, I'd rather go for plan B and dedicate a whole Sunday to putting TEST1/TEST2 in production.

I'm aware of the amateurish aspect of this whole thing. We are trying to get more professional ;)

Thanks !

mer

Amateur, professional, just a matter of perspective. ;D Even though it's not optimal, you could use one of the existing interfaces as the carp interface, especially since you're looking at "10 days". Or just get TEST1/TEST2 all configured redundant and working, then take a Sunday and flip everything over. I'm guessing that TEST1/TEST2 are going to get the same LAN address as PRODUCTION so you don't have to push new routing tables out to all the clients? If so keep in mind "ARP cache" on the clients, if you get the same static IP on the WAN side, there may be upstream ARP cache to think about too.
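
The ARP cache point can be illustrated with a small sketch. This is a toy model only: `ArpCache` is a hypothetical class, the MAC addresses come from the documentation range, and the 1200-second timeout is an arbitrary value chosen for illustration (real timeouts depend on the client's operating system). It assumes, as guessed above, that the new firewall takes over the old LAN address.

```python
class ArpCache:
    """Toy ARP cache: entries are trusted until they are older than max_age seconds."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.entries = {}   # ip -> (mac, time_learned)

    def learn(self, ip, mac, now):
        self.entries[ip] = (mac, now)

    def lookup(self, ip, now):
        entry = self.entries.get(ip)
        if entry is None or now - entry[1] > self.max_age:
            return None   # unknown or expired: the host must ARP again
        return entry[0]


# Hypothetical MAC addresses and timeout, chosen only for illustration.
OLD_FW_MAC = "00:00:5e:00:53:01"
NEW_FW_MAC = "00:00:5e:00:53:02"

cache = ArpCache(max_age=1200)
cache.learn("172.16.2.100", OLD_FW_MAC, now=0)

# The gateway IP moves to new hardware, but the client still holds the old MAC.
print(cache.lookup("172.16.2.100", now=60))     # old MAC: frames go to the unplugged box
print(cache.lookup("172.16.2.100", now=1300))   # None: entry expired, client re-ARPs
cache.learn("172.16.2.100", NEW_FW_MAC, now=1300)
print(cache.lookup("172.16.2.100", now=1301))   # new MAC
```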

That's about all I have to offer except for good luck.

fabsah

Thanks Mer, the "Sunday flip-over" is more and more where I'm setting my mind, based on the feedback received on this forum and the research I've done. I'll check the ARP cache issue you've just mentioned and keep it in mind during the deployment phase.

;) :)

fabsah

Just a follow up :

We made the switch 9 days ago and it's been a painless process. Everything was well planned, if I may say ;)

Cerberus (the new firewall) was carefully tested by a few selected people before that. The only remaining issue was NAT related, because the servers were not using the new gateway.

We chose a Saturday to put the new firewall in production.

We basically:

Deactivated the LAN DHCP server on the "old" firewall.
Activated the LAN DHCP server on Cerberus.
Turned off the "old" firewall.
Shut down and restarted all the servers / VMs / network printers / Wi-Fi APs so they could use the new gateway.
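
After a cutover like the one listed above, a quick reachability check can confirm that services answer again. The following is only a sketch of that idea, not something used in this thread: `tcp_port_open` and `report` are hypothetical helpers, and which hosts and ports to check (for example port 22 for sshd on 172.16.1.11) is an assumption left to the caller.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def report(checks, timeout=3.0):
    """Return one status line per (host, port) pair in checks."""
    lines = []
    for host, port in checks:
        status = "ok" if tcp_port_open(host, port, timeout) else "UNREACHABLE"
        lines.append(f"{host}:{port} {status}")
    return lines
```

Calling `report([("172.16.1.11", 22), ("172.16.210.50", 22)])` from a machine outside or inside the LAN would return one line per server, which makes it easy to spot a host that still points at the old gateway.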

We had to tinker with the vHost / domain server / Terminal Server DNS configuration, but it was solved in under an hour. It took that long mainly because I never touch those servers (this is outsourced to a private company), so I had to google my way around to find where to make the corresponding changes.

I'm now in the process of configuring CARP / pfsync / XML-RPC between the 2 pfsense appliances.

Thanks to everyone for their help !

fabrice