(Somewhat) High Availability setup? CARP not an option!

TWalker82

Hi All,

For those that don't want to read the background to this skip the below -------------

I'm a network admin for a large secondary school in London.

We have a somewhat unusual situation for our internet access - our provider is Atomwide/London Grid For Learning, and it would be hard to move away from them, for both contractual and product tie-in reasons (they throw in a lot of added value products that we currently use).

We're currently using an old Cisco ASA 5505 box for firewall and network address translation. Some years ago it was public facing, but a few years ago our provider started providing and managing a new Cisco firewall, with the caveat that they have designated an internal IP range for us, which we don't want to use - switching across would be a huge headache.

So on the date that was set up, our Cisco ASA box was reconfigured as an intermediate firewall - it does 1:1 NAT and firewalling between the LGFL provided internal range and our existing internal IP addresses.

Our core switches are a pair of Cisco 4900M in a redundant configuration, and they do most of the routing (so they are the gateways on all our internal VLANs). They use our ASA as the gateway of last resort (it is on its own edge VLAN).

To make this work the ASA has a static route entry to use the Cisco core switches for traffic aimed at our internal IP ranges.

I have considered whether the ASA is necessary at all in this configuration, but unfortunately the 4900M switches have no NAT capability - if I tell them to use our service provider's Cisco box as the gateway of last resort, nothing works - presumably because it lacks a static route entry back into our core switches.

So, I want to replace the old ASA 5505 box with pfSense, virtualised on Hyper-V so that I don't need to worry about more hardware.

This has become particularly important of late because the ASA is capped at 100 Mbit/s, whereas our internet connection is getting upgraded next week to 1 Gbit/s (symmetrical). We've been on 200 Mbit/s for a few years now, and have been running a halfway house solution where our proxy server has a direct link onto the ISP firewall so that the web runs at the full speed.

We've got four Hyper-V 2019 servers, split across two blocks of the school with 10 Gbit/s Ethernet (so they should be plenty fast enough).

My first attempt was to put in two pfSense virtual machines in a CARP high availability configuration, but I discovered along the way that CARP (and the MAC spoofing option required to make it work on Hyper-V) is not compatible with SR-IOV enabled virtual switches. This is a bit of a deal breaker: removing SR-IOV would hurt the performance of our other virtual servers, and fitting extra network hardware/links and creating an additional virtual switch to support the pfSense VMs would take a lot of effort and physical switch capacity (especially if we're going to make it robust against switch failure and fast enough to support full duplex gigabit speeds).

So is there some other standby option available in pfSense? So I have one live pfSense router, and another that e.g. ping monitors the first one - and if it goes down for some reason, the failover takes over?

I don't need connections and states to be maintained in the course of the failover - it doesn't matter if people lose connection for a minute or two - I'd just like it to fix itself in under ten minutes or so instead of me having to somehow dial in and switch on a replica.

So one option that occurs is that I can have a hyper-v replica of the PFsense VM, and just stick in a bit of scripting/powershell to check the primary is up every minute or two, and if not boot the replica up.

I was also wondering about just having two live pfsense vms (with different IP addresses, not CARP) and then setting our core switches to flip between the two of them as availability requires (BGP perhaps? Or with IP SLA?)
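For reference, a rough sketch of what the IP SLA variant might look like on the core switches. Everything specific here is an assumption for illustration: 10.99.0.1 and 10.99.0.2 stand in for the two pfSense VMs on the edge VLAN, and the SLA/track number 1 is arbitrary. Exact syntax and IP SLA availability depend on the IOS version and feature set on the 4900M, so treat it as a starting point rather than tested config.

```
! Assumed addresses: 10.99.0.1 = primary pfSense, 10.99.0.2 = standby pfSense
ip sla 1
 icmp-echo 10.99.0.1
 frequency 5
ip sla schedule 1 life forever start-time now
!
track 1 ip sla 1 reachability
!
! Default route via the primary, only installed while the probe succeeds
ip route 0.0.0.0 0.0.0.0 10.99.0.1 track 1
! Floating static route via the standby (administrative distance 10)
ip route 0.0.0.0 0.0.0.0 10.99.0.2 10
```

Note that pinging the firewall's inside address only proves the VM is up, not that traffic is passing through it, so the probe target may need to be something beyond the firewall.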

Any thoughts on how I should proceed? Options are something like:

Add the extra hardware to make CARP work.
Some other PFSense failover mechanism I'm not aware of.
Set up the Cisco core switches to manage failover between 2 pfSense VMs (caveat: while this can maintain outbound access, inbound access to certain services will be down until both are live again - although if the pfSense VM is down it would kinda imply these services might be down on the hypervisor anyway!)
Automate Hyper-V replica failover

pete.s.

Are your Hyper-V servers HA?

If your Hyper-V servers are not, then what is the point of having the firewall HA? You just need a backup of the VM that you can restore - but you need that for all VMs regardless.
If your Hyper-V servers are HA, then just let Hyper-V provide failover on the single pfsense firewall VM.

So regardless of Hyper-V, basically just run one VM with pfsense and no CARP.

TWalker82

Nope, the Hyper-V servers aren't in a HA setup, although this is basically because of my paranoia about a split brain setup and the headaches that might cause. As mentioned, we have two hypervisors in one wing and two in another. If the link between the two wings became severed (OK, we have two physically distinct links so this is pretty unlikely) then we might have people in both wings working away saving their files and updating the database, which would just be horrible to resolve once the links are restored. I'd rather have the current situation with replicas in each wing on standby, so that if there is some kind of problem I make the judgement call on whether to power on replicas.

With a firewall, no important data is getting saved/changed on it, so it doesn't matter if a split brain scenario arises - so I suppose this does present a fifth option: set up a Starwind VSAN or similar and run a highly available pfSense VM.

And to be clear, obviously I do and will have backups of whatever I setup - unfortunately I live about an hour away from work so if things go wrong I'd much rather be able to sort them remotely if at all possible (right now with Covid 19 I'm mostly working remotely anyway) than have to get up early and rush into work to try and restore backups/otherwise fix the internet.

Of course a lot of this is paranoia anyway, as I do have an alternative route to remote in if the firewall is down - I just want to make sure the solution I'm putting in is robust.

nzkiwi68

Given everything you've said, I would simply install a hardware pair of clustered XG-1537 appliances from Netgate.

https://www.netgate.com/solutions/pfsense/xg-1537-1u.html

Really, it's so cheap, and you're supporting pfSense and the work they do.
If anyone complains about the cost, get a quote on a clustered pair of comparable performance SonicWALL, Fortigate or Palo Alto firewalls. These are good firewalls, I'm not knocking them at all, but, boy oh boy the cost difference will be enormous and don't forget to think about the ongoing annual costs too.

The XG-1537 is a beast, it's ridiculously powerful and it has 10GbE onboard. You can add a 4 port Intel GbE card if you just want to run 1 GbE for now.

My experience with the Netgate gear is it has been reliable, at least as reliable as anything else I've used.

pete.s.

That's fine except that he doesn't need that performance and already has perfectly fine Hyper-V hosts that can do the work, and there don't appear to be any real security implications since it's mostly a NAT device.

I would just run with one pfSense on one Hyper-V server and have a backup on the other one for Hyper-V failures. Easier to set up, easier to manage and easier to upgrade compared to a HA setup.

If high availability absolutely was needed (can't see how it is), I'd just put in a standard dual or quad GbE network card in each Hyper-V server, dedicated to pfSense. Remember, it's not pfSense that is actually the problem here - it's Hyper-V and the network drivers/cards. And this is the workaround.

TWalker82

Thanks for the input guys - yeah, as Pete said, those devices look like overkill for our needs (and believe me, if you knew what my budget was and the general state of desktop hardware in the school you'd understand why I can't justify the outlay!)

Also, having done a bit more research, a Hyper-V high availability cluster won't work for our needs, because it would have to replace our existing Hyper-V Replication setup (a Hyper-V replica server can't be a member of a cluster), which is just too much to think about doing right now.

So this morning I've been looking at scripting the failover of a Hyper-V replica, to automate it - I've adapted the script published here - https://docs.microsoft.com/en-us/archive/blogs/keithmayer/automated-disaster-recovery-testing-and-failover-with-hyper-v-replica-and-powershell-3-0-for-free

Plan is that this PowerShell script is run on the Replica server on a regular basis - perhaps every minute. It checks which server is the replication primary for the named VM (pfsense), checks connectivity to that server, and if the test fails it commences an emergency failover and starts up the VM.

# Function checks that virtual machine primary is online
Function PrimarySiteAvailable {
	Param ([string]$HyperVHost)
	
	$Test = Test-VMReplicationConnection -AuthenticationType Kerberos -ReplicaServerName $HyperVHost -ReplicaServerPort 80 -ErrorAction SilentlyContinue

	If ( $Test -match "was successful") {
		Return $True
	}
	Else {
		Return $False 
	}
}

# Set the name of the replica VM here
$VMname = "pfsense"

# Get the short (non-FQDN) name of the replication primary for that VM.
# Split('.')[0] also works when the name has no domain suffix, where Substring/IndexOf would throw.
$pri = (Get-VMReplication -VMName $VMname).PrimaryServer.Split('.')[0]

# If we are running on the replication primary that implies the machine has already been failed over, do nothing.
if ($pri -ne $env:computername){
	# Perform the uptest
	$IsPrimarySiteUp = PrimarySiteAvailable -HyperVHost $pri
	
	# Failover and start VM if uptest failed
	If ($IsPrimarySiteUp -eq $False) {
		$VM = Start-VMFailover -VMName $VMname -PassThru -Confirm:$false

		Start-VM -VM $VM
	}
}
# Note that when the primary comes back online, although it may automatically start the VM that was down, it will pause it when it notices that the replica has been failed over and powered on.
# To perform maintenance on primary hypervisor either perform planned failover, or move the vm to a different host (zero downtime!) - the replica primary automatically updates when vm is moved

Thought I'd publish here whether I use it or not as it was surprisingly hard to track down.
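To run the check every minute, something along these lines should register it as a scheduled task on the Replica server. This is a sketch under assumptions, not something taken from the linked article: the script path and task name are made up for illustration, and running as SYSTEM may need swapping for a service account if the Kerberos connection test needs specific rights. On Server 2016/2019 a repetition interval with no duration should repeat indefinitely.

```powershell
# Assumption: the failover script above is saved at this (hypothetical) path on the Replica server
$ScriptPath = 'C:\Scripts\Invoke-PfSenseFailover.ps1'

# Run the script with Windows PowerShell, without loading a profile
$Action = New-ScheduledTaskAction -Execute 'powershell.exe' `
	-Argument "-NoProfile -ExecutionPolicy Bypass -File `"$ScriptPath`""

# Start now and repeat every minute
$Trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) `
	-RepetitionInterval (New-TimeSpan -Minutes 1)

# Run as SYSTEM so it works with nobody logged on; the task name is arbitrary
Register-ScheduledTask -TaskName 'pfSense replica failover check' `
	-Action $Action -Trigger $Trigger -User 'SYSTEM' -RunLevel Highest
```

A single failed check triggers a failover with the script as written, so it may be worth requiring two or three consecutive failures before acting.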

The one further option I came across yesterday is that OpenWRT can be set up with VRRP via Keepalived for high availability - the reason that this is significant is that with Keepalived, virtual MAC addresses are optional - quoted here:

Note on Using VRRP with Virtual MAC Address
To reduce takeover impact, some networking environment would require using VRRP with VMAC address. To reach that goal Keepalived VRRP framework implements VMAC support by the invocation of ‘use_vmac’ keyword in configuration file.
from - https://www.keepalived.org/doc/software_design.html

So as far as I can tell, if I don't use VMAC then it would work fine on my hypervisors with SR IOV, as the MAC spoofing option will not be required.
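As a rough idea of what that means in practice, a minimal keepalived.conf VRRP instance without use_vmac might look like the sketch below. The interface name, virtual_router_id and addresses are assumptions for illustration; on OpenWRT the package may generate this file from /etc/config/keepalived instead, and a NAT router would likely need a second instance for the other interface plus a vrrp_sync_group so both sides fail over together.

```
# Assumed names/addresses: eth0 = LAN-side interface, 10.99.0.254 = shared gateway IP
vrrp_instance LAN {
    # Both nodes start as BACKUP and the higher priority wins the election
    state BACKUP
    interface eth0
    # Must match on both routers
    virtual_router_id 51
    # Use a lower value (e.g. 100) on the second router
    priority 150
    # Advertisement interval in seconds
    advert_int 1
    virtual_ipaddress {
        10.99.0.254/24
    }
    # No use_vmac keyword, so the virtual IP is answered from the interface's real MAC
}
```

Without a virtual MAC, takeover relies on the new master sending gratuitous ARP so that neighbours update their ARP entry for the shared IP.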

Unfortunately the documentation doesn't look quite so good for setting that up, but I'm not sure I can resist trying it out.

TWalker82

So just to update what I settled on, I have gone with a pair of OpenWRT virtual machines running in a high availability setup with Keepalived and VRRP.

Keepalived works fine without any special settings on the hypervisor switch/VM. Some connections will drop when you power off the active instance, but they come back within five seconds or so. I did a test where I RDP'd from outside the routers to a device on the inside, loaded up a live TV stream on the machine inside the routers and powered off the active router: neither the RDP session nor the live TV stream was interrupted.

Shame that this isn't available within FreeBSD/pfSense (I understand keepalived on FreeBSD hasn't been updated since 2011), and that CARP has no option to run without changing MAC addresses.

Have to say OpenWRT also boots up quicker (in about 10 seconds) and routing performance was better - I was getting nearly 5 Gbit/s in my iperf3 tests, where pfSense under identical conditions would do a smidge over 2 Gbit/s.