Our Sites become unavailable randomly

wheemer

We have two older IBM servers running PFSense 2.2.2 in carp mode.

These things have been rock solid for two years now.

Recently we had a crash on our main box, so I submitted the crash dump via the webui.

Since then this box will stop our websites from loading after a while, very frequently. It works after a reboot for maybe half an hour then all sites we host go down and are unreachable.

I have since been forced to use "Enter Persistent Carp Maintenance Mode" so that our backup could take over.

This worked fine for a few days. On Tuesday I began setting up another PFSense box which has been taking me a while…

Starting yesterday morning at 11:30 am our secondary box began blocking our websites in the same manner as before. A reboot fixed it until this morning, where yet again our websites were all offline.

I'm a major noob with firewalls, however I was able to have PFSense working absolutely perfectly for two years. And now all of a sudden both my boxes die all the time.

What could this be? And how can this happen on two different boxes?

I think there is something wrong with 2.2.2

mer

What could it be? Hard to say without anyone seeing the configuration, rules, packages running, logs, etc.
You say "older IBM servers", so a possible "what could it be" is "hardware failure".

But again, it's really hard to diagnose anything without any information. Think of what happens when you say "My car won't start. Why?"

muswellhillbilly

Speaking from my own experience, I seldom tend to do in-place upgrades on old hardware. Rather, I'll choose newer hardware to install an updated version and transfer my configs over. I ran an update once on some old Dell blades and they wouldn't even boot up after that! You say you've had these firewalls in place for two years, yet 2.2.2 has only been out a short while. I assume from this you've performed an in-place update then? Might be worth sourcing some newer tin and see if that solves your problem.

Otherwise, as mer says, it really could be almost anything given the information provided.

wheemer

Well I am not going to replace them since I have no budget for that.

There's no way it was a hardware failure on both boxes at the same time.

I always update PFsense… I mean why would there be a built in updater if it's not good to use it?

I think the issue may be related to DNS, we have our DNS on windows boxes and PFSense just passes it through.

I was hoping there might be something in the UI I could look for, but after checking the logs I can't seem to find anything.

wheemer

I have reinstalled from scratch 2.2.2 on my primary box. So far so good, it's been an hour and it's working ok. We will see if it was some problem with an update, time will tell.

Thanks

Harvy66

@muswellhillbilly:

Speaking from my own experience, I seldom tend to do in-place upgrades on old hardware. Rather, I'll choose newer hardware to install an updated version and transfer my configs over. I ran an update once on some old Dell blades and they wouldn't even boot up after that! You say you've had these firewalls in place for two years, yet 2.2.2 has only been out a short while. I assume from this you've performed an in-place update then? Might be worth sourcing some newer tin and see if that solves your problem.

Otherwise, as mer says, it really could be almost anything given the information provided.

That is usually an issue when your hardware does not support modern standards such as AHCI, MSI-X, UEFI and ACPI. These standards have been around for a long time, but many places still sell hardware that does not support them. Make sure your hardware supports standards like these and you'll be good for a long time.

I always research my hardware before purchases and I haven't had any issues in 15+ years. Everything just works. Not to say these standards were around back then, but I always make sure I know what I'm buying to make sure it's as good as it can possibly be.

wheemer

My backup box has been reinstalled as well…

I have had no issue on the primary box yet.

So far so good.

NOYB

If it happens again, you might want to look into the possibility of the system being compromised or being the target of a DoS attack.

cmb

@wheemer:

What could this be? And how can this happen on two different boxes?

I think there is something wrong with 2.2.2

Sounds a lot like the symptoms of an IP or MAC conflict, though could be any number of other problems. It's most definitely not a general problem with 2.2.2.

Where rebooting fixes something with symptoms along these lines it's most often because of what rebooting does to the switch(es) and/or router(s) the system is connected to (updating CAM and ARP tables), and nothing to do with actually rebooting.

If it happens again, run a packet capture on WAN, filtered on one of the affected public IPs, and try to reach one of the sites in question. Stop the capture and see whether anything is actually getting there. Are the WAN IPs dropping, or only the CARP IPs?
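For anyone who prefers the shell to the GUI capture page, here is a rough sketch of what such a filtered capture could look like. It only builds a tcpdump command line; the interface name `em0`, the address `203.0.113.10` (a documentation-range placeholder), the ports and the packet count are all assumptions to be swapped for your own values.

```python
import ipaddress
import shlex


def capture_command(interface, public_ip, ports=(80, 443), count=200):
    """Build a tcpdump command line that captures web traffic for one IP.

    All arguments are illustrative; nothing here is taken from the
    original poster's configuration.
    """
    # Validate the address early; raises ValueError if it is malformed.
    ip = ipaddress.ip_address(public_ip)
    port_expr = " or ".join(f"port {p}" for p in ports)
    # BPF filter: only TCP traffic to or from the affected IP on those ports.
    bpf = f"host {ip} and tcp and ({port_expr})"
    # -n: no name resolution, -i: interface, -c: stop after N packets.
    args = ["tcpdump", "-n", "-i", interface, "-c", str(count), bpf]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    print(capture_command("em0", "203.0.113.10"))
```

If the capture stays empty while you try to load a site, the traffic is not reaching the firewall at all, which points upstream rather than at the firewall rules.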

If you're using the common VHIDs 1, 2, 3 etc. on your CARP IPs, I would change those to something significantly higher in the range. The VHID determines the virtual MAC, and VRRP uses the same virtual MAC space. It's possible your provider brought up VRRP using conflicting VHIDs, or you have something else on your network running CARP or VRRP with the same VHID/VRID, causing a MAC conflict. Rebooting would temporarily make that system "win back" the MAC in question with the WAN-side switch, but it would lose it again at some point.
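To make the VHID-to-MAC relationship concrete, here is a minimal sketch. It assumes the usual IPv4 layout where the virtual MAC is 00:00:5e:00:01 followed by the VHID (or VRID) as one hex byte; the helper names and the sample IDs are made up for illustration.

```python
def virtual_mac(vhid):
    """Return the IPv4 virtual MAC shared by CARP and VRRP for a given ID."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID/VRID must be between 1 and 255")
    # The last byte of the MAC is simply the VHID/VRID in hex.
    return f"00:00:5e:00:01:{vhid:02x}"


def conflicting_ids(our_vhids, other_ids):
    """IDs used on both sides would claim the same virtual MAC."""
    return sorted(set(our_vhids) & set(other_ids))


if __name__ == "__main__":
    ours = [1, 2, 3]       # hypothetical CARP VHIDs on the firewalls
    upstream = [3, 10]     # hypothetical VRRP VRIDs on the provider side
    for vhid in conflicting_ids(ours, upstream):
        print(f"ID {vhid} collides on MAC {virtual_mac(vhid)}")
```

Any ID that shows up in both lists maps to the same MAC on the shared segment, which is why moving to less commonly used VHIDs lowers the odds of a clash.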

firewalluser

@wheemer:

Well I am not going to replace them since I have no budget for that.

There's no way it was a hardware failure on both boxes at the same time.

I always update PFsense… I mean why would there be a built in updater if it's not good to use it?

I think the issue may be related to DNS, we have our DNS on windows boxes and PFSense just passes it through.

I was hoping there might be something in the UI I could look for, but after checking the logs I can't seem to find anything.

Actually, you would be surprised at how common it is, especially considering how batches of electronics are made. Having two identical machines, i.e. a small batch, exposes you to the same batch of RAM chips, the same batch of CPUs, the same batch of PSUs and the same batch of HDDs.

I have reinstalled from scratch 2.2.2 on my primary box. So far so good, it's been an hour and it's working ok. We will see if it was some problem with an update, time will tell.

Thanks

One of my first thoughts would be that your machine may have been compromised. Let's face it, who virus-checks their firewalls/routers?
http://krebsonsecurity.com/2015/01/lizard-stresser-runs-on-hacked-home-routers/

I'd also suggest rebooting the pfSense boxes after making any config changes, just to be sure everything sticks properly and conflicts don't arise, as there's a bug fixed in 2.2.3 which might have implications for your setup.

wheemer

I set up a different server with a clean install of 2.2.2 and imported my config. Everything was working fine over the weekend; however, on Monday it went down again. The strange thing is that I am always still able to remote desktop in through the box.

So some parts of PFSense must not be affected. Also, sometimes our web server's sites are offline, yet our email server's webmail works.

Again, please keep in mind this configuration was working for a couple years without issue.

wheemer

The network team at our fiber provider is saying that we are not under a denial of service attack… He says everything looks fine and that there is not that much traffic at all.

wheemer

Our website just went down again.

Our network team says there are 55 connections from Russia to port 53 on our DNS server.

I have PFBlockerNG enabled, where I am blocking all of Russia and all of China.

Could this be related to our issue?

firewalluser

What do your logs show? Have you packet captured yet? If so, have you tried some of the packets against a test webserver or firewall?

wheemer

I could not see anything in the logs, which makes sense since they should be denied.

Our provider has blocked the IP address and everything is back to normal.

I do not understand why our PFsense was able to be broken like that though. Seems a little bit unreliable that something as simple as DNS traffic can take down our whole website.

tim.mcmanus

@wheemer:

I could not see anything in the logs, which makes sense since they should be denied.

Our provider has blocked the IP address and everything is back to normal.

I do not understand why our PFsense was able to be broken like that though. Seems a little bit unreliable that something as simple as DNS traffic can take down our whole website.

Poorly configured DNS servers are a main source for DDOS attacks. I won't go into specifics, Google will give you some good reading, but someone can send a DNS query to your DNS server which generates a large response to the "target".
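To put rough numbers on why that matters, here is a back-of-the-envelope sketch. The byte sizes and the query rate are illustrative assumptions, not measurements from this thread.

```python
def amplification_factor(query_bytes, response_bytes):
    """How many bytes reach the target per byte the attacker sends."""
    if query_bytes <= 0:
        raise ValueError("query size must be positive")
    return response_bytes / query_bytes


def reflected_mbps(attacker_mbps, factor):
    """Traffic arriving at the target for a given attacker send rate."""
    return attacker_mbps * factor


if __name__ == "__main__":
    # Assumed sizes: a small spoofed query and a large response.
    factor = amplification_factor(query_bytes=60, response_bytes=3000)
    print(f"amplification: {factor:.0f}x")
    print(f"10 Mbit/s of queries -> {reflected_mbps(10, factor):.0f} Mbit/s at the target")
```

The point is only that a small stream of spoofed queries can turn into a much larger stream of responses, with your server's uplink sitting in the middle of it.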

I would advise against exposing a DNS server to the internet unless you absolutely need to and deeply understand how to configure it. IMHO, block port 53 from the WAN and everything should be good.

mer

Tim, you mean "block inbound to port 53 on WAN if it was not generated by LAN", yes?

tim.mcmanus

@mer:

Tim, you mean "block inbound to port 53 on WAN if it was not generated by LAN", yes?

Yeah, that makes sense.

I run DNS internally, but that BIND server also does root queries externally. There is no external port 53 access to it (block inbound to port 53 on WAN if it was not generated by LAN). Script kiddies are always on the lookout for a misconfigured service.
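As a toy illustration of "block inbound unless it was generated by LAN", here is a very simplified model of a stateful filter. It is not how pf is implemented; the class, the field names and the documentation-range addresses are all invented for the example.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int


class StatefulFilter:
    """Toy state table: inbound packets pass only as replies to outbound ones."""

    def __init__(self):
        self.states = set()

    def outbound(self, flow):
        # Remember the flow so the matching reply is allowed back in.
        self.states.add(flow)
        return True

    def inbound(self, flow):
        # A reply has source and destination swapped relative to the
        # outbound flow that created the state.
        reverse = Flow(flow.dst, flow.dport, flow.src, flow.sport)
        return reverse in self.states


if __name__ == "__main__":
    fw = StatefulFilter()
    # The internal resolver asks an outside server on port 53.
    fw.outbound(Flow("198.51.100.5", 40000, "192.0.2.53", 53))
    # The reply to that query is let through.
    print(fw.inbound(Flow("192.0.2.53", 53, "198.51.100.5", 40000)))
    # An unsolicited query from outside to port 53 is dropped.
    print(fw.inbound(Flow("203.0.113.9", 5555, "198.51.100.5", 53)))
```

That covers a resolver that only makes outbound queries. It does not help if you host authoritative DNS for the outside world, because those queries are unsolicited by definition and need an explicit pass rule.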

wheemer

It's pretty vague to say poorly configured without saying what you mean exactly.

We need port 53 open because we host our external DNS.

We have recursion disabled, and places like intodns.com say our DNS is fine.

wheemer

Also, our DNS is running on Windows 2012 R2 with all updates.

So all PFSense has to do is pass the packets through, yet it still tanks.