Connection loss after rebooting machines

ReasonNL

I've got a nasty issue that has had me pulling my hair out for countless hours (days, months)!

I've got pfSense (2.0 and 2.1) installed in a virtual machine on a KVM host with 4 NICs attached, 2 of which are virtual private networks (DMZs). It's running dnsmasq as the DNS forwarder.

I've got a dozen servers (all Ubuntu Server) running within the 2 DMZs. For some reason, a machine that is started after pfSense was started can't connect to anything outside its private network (DMZ); all other machines running in the same network are reachable and can connect to external systems perfectly fine.
DNS queries get refused and I can't connect to the Internet, but I can ping the pfSense box.
I've tried everything on the firewall side, such as making a rule allowing everything from everywhere, and it made no difference.
This problem even occurs randomly on running machines as well, resulting in loss of service for no apparent reason!

The fix always seems to be rebooting pfSense, but in some cases I managed to get connectivity back by just renaming the interface of the DMZ the affected machine is in.

What I've tried so far:

Installed a virtual desktop and checked all kinds of network connectivity.
Messed with the TinyDNS server, since one of the issues was DNS refusal.
Swapped the interfaces for physical ones, attaching real computers instead of virtual ones.
Installed 2.1 and used the virtio drivers instead of the e1000 (dreadful performance, BTW).
Reinstalled and reconfigured everything about 4 times, in steps.
Searched the Internet for alternative systems, because it's beginning to drive me crazy.

Nothing has fixed the issue. For now, adding or changing something in my network, or noticing that a website is unreachable, means rebooting pfSense and stopping all services for at least a minute, which is unacceptable most of the time.

Does anyone have a clue what to pursue next?
I can't think of anything else anymore, and it's making my life miserable….

ReasonNL

I have tried using static DHCP leases. Although this works fine (machines get their IP), they're still unreachable and unable to reach anything outside their network.
I have built a simpler infrastructure with 1 DMZ and a single machine, using the LiveCD (i386 instead of amd64), but the problem persists. Everything from within the LAN works fine; everything within the DMZ (same firewall and NAT rules, automatic or manual) only works if it was started before pfSense or if I rename the interface (basically just saving and applying the interface page; nothing actually needs to change, and it doesn't make a difference which interface either).
I can't imagine this is normal behavior and there must be something wrong, but I can't figure out what.
There is network connectivity: the machines can reach pfSense and get an IP from the DHCP server, but that's it.

Does nobody have any thoughts on what to monitor, check, or change?

dhatz

I'd check the ARP config (#arp -a)

Have you by any chance enabled static ARP? (Services -> DHCP server -> Static ARP)

ReasonNL

@dhatz:

I'd check the ARP config (#arp -a)

Have you by any chance enabled static ARP? (Services -> DHCP server -> Static ARP)

Thanks for your reply dhatz!

No, it's disabled on all interfaces using DHCP. (I added DHCP for static leases on the DMZs.)
The "arp (-i interface) -a" command does not return all current leases: entries for static leases are missing, but whenever someone accesses a server its entry is back again. The same behavior is noticeable in the DHCP lease status (Status -> DHCP Leases), which is probably an ARP dump anyway.

After some testing with this ARP table, however, this does seem to be related!
Booting up a machine that hasn't been running does make it appear in the list, but it has no connectivity until the entry times out once! Rebooting one that has been running, and thus already appeared in the list, also leaves it unreachable until the entry times out! There is definitely something going on there….
The same thing, however, does not happen on the LAN side.

ReasonNL

Flushing the ARP table with "arp -d -a" fixes it as well! (I guess renaming the interface also does this.)
Even though this is still a major issue, I'm glad it's fixable without hindering all other systems while doing so.

Rebooting a machine and flushing the ARP table is a hassle but acceptable for now, however it still happens randomly as well…

ReasonNL

Damn, the connectivity seems to be fixed, but the DNS issue remains. Still get REFUSED back on every query…

dhatz

What type of switch (physical switch, e.g. Cisco/HP/etc., or vSwitch) are you using?

ReasonNL

The physical switches are D-Link devices, but I don't think these have any relation to this issue. It all happens within one physical virtualization host. That host has 2 physical network cards, which connect to the LAN and WAN. Both DMZs are isolated virtual networks and only exist within the host machine.
Even connectivity from different devices from within this environment fails (it never leaves the physical machine at all). As a matter of fact, it only happens on the virtual private networks, regardless of the calling location.

I've added a cron job to the pfSense config to flush the ARP cache every 5 minutes. Even though the DNS issues persist, all servers have been reachable ever since. It just hinders automatic security updates for now.

Could there be something wrong in the virtual private network devices created by KVM?
Only pfsense seems to have issues with it though.

dhatz

@ReasonNL:

Both DMZs are isolated virtual networks and only exist within the host machine.
Even connectivity from different devices from within this environment fails (it never leaves the physical machine at all).

Well, in that case you'll have to investigate the vSwitch technology used.

In the past I've used Open vSwitch, VDE and VBox's various options and have noticed differences in their behavior, particularly when used with CARP.

cmb

That sounds a lot like some kind of IP conflict/proxy ARP gone insane somewhere. Check the ARP cache: is the MAC in there actually the real, legit MAC of the system with that IP? If not, something else is answering for that IP, and that's what you need to find and fix.
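
For anyone wanting to automate that comparison, here is a minimal sketch in Python. It assumes you have saved the text printed by "arp -a" (on the firewall or on one of the guests) and that you know which MAC each IP should have. All addresses, the sample output, and the function names below are made up for illustration; they are not taken from the setup in this thread.

```python
import re

# Matches the "(ip) at mac" part that both BSD and Linux "arp -a" lines contain.
# Entries shown as "(incomplete)" have no MAC and are skipped on purpose.
ARP_ENTRY = re.compile(
    r"\((\d{1,3}(?:\.\d{1,3}){3})\) at ([0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){5})"
)


def parse_arp(output):
    """Return {ip: mac} for every resolved entry in saved 'arp -a' output."""
    table = {}
    for line in output.splitlines():
        match = ARP_ENTRY.search(line)
        if match:
            ip, mac = match.groups()
            table[ip] = mac.lower()
    return table


def find_mismatches(output, expected):
    """List (ip, seen_mac, expected_mac) where the cache disagrees with expectations."""
    table = parse_arp(output)
    mismatches = []
    for ip, want in expected.items():
        seen = table.get(ip)
        if seen is not None and seen != want.lower():
            mismatches.append((ip, seen, want.lower()))
    return mismatches


# Hypothetical sample data, only to show the shape of the input.
SAMPLE = """\
? (10.0.1.1) at 52:54:00:aa:bb:01 on em2 permanent [ethernet]
? (10.0.1.10) at 52:54:00:aa:bb:10 on em2 expires in 1187 seconds [ethernet]
? (10.0.1.11) at 52:54:00:de:ad:11 on em2 expires in 903 seconds [ethernet]
? (10.0.1.12) at (incomplete) on em2 expired [ethernet]
"""

EXPECTED = {
    "10.0.1.10": "52:54:00:AA:BB:10",
    "10.0.1.11": "52:54:00:aa:bb:11",
}

if __name__ == "__main__":
    for ip, seen, want in find_mismatches(SAMPLE, EXPECTED):
        print(f"{ip}: ARP cache has {seen}, expected {want}")
```

A mismatch only tells you that something else answered for that IP; finding out what it is still takes a look at the switch or host configuration.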

ReasonNL

I've added all hosts as static leases to the DHCP server, so there isn't any conflicting MAC or IP address.
If that were the case it shouldn't have worked at all. It does, however, work until something in pfSense decides to mess things up, and flushing the ARP table always seems to fix it.
On a side note, all machines within one of these DMZs can reach each other perfectly fine. This has never been a problem.
On the virtual switch side, pfSense seems to block something somewhere. The box is reachable from within these networks and everything appears in the ARP table in pfSense, but any other traffic, besides the ping request from the server inside the virtual network, gets denied. Nothing appears in the firewall log, though.
I'm using the virtual interfaces libvirt creates when adding virtual networks in virt-manager.
It utilizes the brctl command to create a bridge with no physical end.
It's basically a default Ubuntu Server install with qemu-kvm and bridge-utils, managed by virt-manager (libvirt). These packages all come from the basic repository.

cmb

@ReasonNL:

I've added all hosts as static leases to the DHCP server, so there isn't any conflicting MAC or IP address.

That doesn't guarantee there are no conflicting MACs or IPs. None will be assigned by the firewall, but something else can still be configured with a conflicting one. You didn't answer my question: when there is a problem, before you flush the ARP cache, is the real actual MAC in there? If so, and flushing the ARP cache fixes it even though the entry comes back exactly the same as it was previously, it sounds like a vswitch issue where somehow it gets kicked back into reality by an ARP request. Doing a packet capture on the affected NIC from the firewall will confirm or deny that. It sees what traffic is "on the wire" (not quite a wire in this case, but what the vswitch sends to the NIC, before all processing). If you don't see the traffic coming in, then it's a vswitch issue.

ReasonNL

Even though it indeed does not guarantee there are no conflicting MAC addresses, pfSense does not allow entering the same MAC twice, "all" hosts get their IP from pfSense, and they "all" have connectivity with the pfSense box.
I didn't notice any conflicting addresses in the ARP table; not sure this is even possible, though.

I'll test your question tonight (the cron job currently keeps the issue from happening), but as far as I can tell from previous tests, I did not notice any change in the ARP table row for the affected host besides a new timeout.

The affected host does, however, have some connectivity after a system reboot (which seems to trigger the same issue in a predictable and repeatable manner): pinging the pfSense box from within the affected host works and it gets an ARP entry, even though the system itself isn't reachable or able to connect to anything outside its network anymore until pfSense gets forced to delete its ARP entries.

I'll check for any inconsistencies in the ARP table tonight, when I reboot a machine after disabling the cache flush cron job.

ReasonNL

Checked it, and it's exactly as I wrote in my previous post. All the ARP entries remain the same. Nothing seems to change on the pfSense side.

cmb

OK, that should rule out IP conflict or proxy ARP gone mad, so the important question now is: does tcpdump show the traffic coming into that NIC on the firewall when it's not working? My guess is no, and somehow frequent ARP requests fix the vswitch.

ReasonNL

Sorry for the somewhat late reaction, this issue has become a lower priority since it basically works now.
I will check what happens with tcpdump, will report back.

ReasonNL

I've been testing with tcpdump. A machine started within one of the virtual networks after pfSense started (with no ARP flush issued) does not show any traffic in tcpdump, but it does get an IP (from pfSense) and is able to ping the pfSense box, even though the packets don't show up in tcpdump. DNS queries don't show up either and are getting refused (standard reply when the DNS server is not available?).
Machines running before a manual ARP flush do show traffic when pinging the pfsense box and can query the DNS just fine.
Some weird stuff is happening here!
I can't imagine no one else is experiencing this. There must be more people running pfSense in a KVM environment.

Side note: this does happen randomly with running machines as well; testing this is quite cumbersome, though.

I've been experimenting with STP configurations, but this did not change anything besides flooding the tcpdump output with STP change packets (every 2 seconds).

cmb

If you're getting ping replies, and not seeing the request and reply in tcpdump, that goes back to my earlier suggestion of IP conflict. Something is responding to those pings, and if it's not in tcpdump there, it has to be some other device, assuming you're capturing on the right NIC and not filtering out that traffic.

ReasonNL

Your persistent thoughts about conflicting IPs and the missing tcpdump entries got me thinking, and I decided to do something radical: kill every running machine within the DMZs and stop all services for the night.
I turned off every virtual machine but one (pfSense included), while pinging the pfSense IP address from that box.
The replies stopped… but then they suddenly came back and kept coming.
You were right! Something did reply even though nothing was running anymore. After searching for what the hell this could be, I found the problem.
The virtual network uses the first available IP address for itself, and that IP was assigned to pfSense as well.
Nothing in the configuration in virt-manager indicates that the virtual network itself uses an address at all, but after digging through the config files on the server itself the suspect IPs popped up.
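
For anyone hitting the same thing: a minimal sketch in Python of how the check could be automated. It assumes you feed it the network definition that "virsh net-dumpxml <network-name>" prints, and it looks for an <ip> element whose address equals the one configured on the firewall. The network name, bridge name, and addresses in the sample are hypothetical, not the ones from my setup.

```python
import ipaddress
import xml.etree.ElementTree as ET


def bridge_addresses(network_xml):
    """Return the addresses libvirt assigns to the host side of a virtual network."""
    root = ET.fromstring(network_xml)
    found = []
    for ip in root.findall("ip"):
        address = ip.get("address")
        if address is None:
            continue
        # libvirt accepts either a prefix length or (for IPv4) a dotted netmask.
        size = ip.get("prefix") or ip.get("netmask")
        text = f"{address}/{size}" if size else address
        found.append(ipaddress.ip_interface(text))
    return found


def conflicts(network_xml, firewall_ip):
    """Return bridge addresses that collide with the firewall's own address."""
    firewall = ipaddress.ip_address(firewall_ip)
    return [iface for iface in bridge_addresses(network_xml) if iface.ip == firewall]


# Hypothetical isolated network definition (no <forward> element).
SAMPLE_XML = """\
<network>
  <name>dmz1</name>
  <bridge name='virbr1' stp='on' delay='0'/>
  <ip address='10.0.1.1' netmask='255.255.255.0'/>
</network>
"""

if __name__ == "__main__":
    for iface in conflicts(SAMPLE_XML, "10.0.1.1"):
        print(f"host bridge also owns {iface}; move the firewall or drop the <ip> element")
```

As far as I can tell, an isolated network defined without an <ip> element leaves the host bridge without an address, which avoids the clash altogether, but verify that against the libvirt documentation for your version.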

I feel like a complete moron for ditching your first suggestion, as it was right all along.
I can't thank you enough. I've spent countless hours trying to figure out what caused this problem, and it was right in my face the whole time. I've even looked at pfSense alternatives, concluding every time that they couldn't compete.

Again cmb THANKS a million!!! ;D

cmb

@ReasonNL:

I feel like a complete moron for ditching your first suggestion, as it was right all along.
I can't thank you enough. I've spent countless hours trying to figure out what caused this problem, and it was right in my face the whole time.

I've seen about everything there is to see with this kind of stuff countless times; people would be a lot better off if they just believed me. ;D At least you fessed up to it. Thanks for the follow-up.

Glad you found and fixed it. And that I was right. ;)