2.0 on ALIX 2D13 partial lockup
-
Our primary firewall/router runs pfSense on an ALIX 2D13 with a 2GB SanDisk CF. It's been running 1.2.3 for about a year (not continuously) with no major problems.
Both times I've tried to upgrade to 2.0, I've run into trouble.
In March I attempted to upgrade to 2.0-RC1 using the webConfigurator firmware tool. This worked fine for a few hours of surfing and twiddling a few settings, mostly related to multi-WAN and filtering. Then eventually I was unable to access the Internet, unable to ping or SSH into the ALIX, but still sporadically able to use the webConfigurator. I tried rebooting into the alternate slice and it never came back up, so then I re-imaged with 1.2.3 and all has been well.
A few days ago, I installed a completely fresh image of 2.0-RELEASE, did a bunch of initial configuration, and let it run in production for a few days. Today I tweaked more settings and ended up in a similar state. I'm fairly confident that my Internet access stopped several minutes after I last touched the configuration. When I do manage to get in and use Diagnostics > Reboot, the system comes back in a similar state. I haven't seen anything obviously unusual in RRD or syslog.
So, current state as it sits on my desk (unfortunately, without a serial cable handy)… DHCP is working: my laptop will get an IP. DNS is occasionally working: sometimes I can connect by name instead of IP. webConfigurator is often working: I can make at least a few requests at a time before they start timing out. At the moment SSH actually seems to be working. It isn't responding to ping, however.
I'd certainly like to get this system running 2.0 eventually. What should I be looking at?
-
Okay, this feels familiar: after adding firewall rules to specifically allow DNS and ICMP to the pfSense system itself from LAN, those work again. (My "default" rule for LAN sends traffic directly to a "LoadBalance" gateway group.)
I certainly don't know that ping was working before yesterday's trouble, but DNS must have been functioning for those preceding few days… maybe the relevant timeline is that 1-2 hours after booting with such a default-load-balance rule, something gets reset and stops allowing this traffic.
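To picture why the explicit rules help, here is a toy model of first-match rule evaluation. It assumes LAN rules are checked top to bottom and the first match decides which gateway is used; that is a simplification, not pf's actual code. The `Rule` class, the helper functions and the 192.168.1.1 LAN address are all made up for illustration; only the "LoadBalance" group name comes from the setup described above.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional

# Hypothetical LAN address of the firewall, for illustration only.
LAN_ADDRESS = "192.168.1.1"


@dataclass
class Rule:
    description: str
    destination: str         # CIDR network; "0.0.0.0/0" means "any"
    gateway: Optional[str]   # None = leave it to the system routing table


def first_match(rules: List[Rule], dst: str) -> Optional[Rule]:
    """Return the first rule whose destination network contains dst."""
    addr = ip_address(dst)
    for rule in rules:
        if addr in ip_network(rule.destination):
            return rule
    return None  # nothing matched: treated as blocked in this toy model


def where_does_it_go(rules: List[Rule], dst: str) -> str:
    """Describe how traffic to dst would be handled under this model."""
    rule = first_match(rules, dst)
    if rule is None:
        return "blocked"
    return rule.gateway or "system routing table"


# Only the catch-all rule: traffic to the firewall itself is policy-routed too.
default_only = [Rule("LAN default", "0.0.0.0/0", "LoadBalance")]

# Same catch-all, with a rule for the firewall's own address ahead of it.
with_local_rule = [
    Rule("to LAN address", LAN_ADDRESS + "/32", None),
    Rule("LAN default", "0.0.0.0/0", "LoadBalance"),
]

print(where_does_it_go(default_only, LAN_ADDRESS))     # LoadBalance
print(where_does_it_go(with_local_rule, LAN_ADDRESS))  # system routing table
print(where_does_it_go(with_local_rule, "8.8.8.8"))    # LoadBalance
```

In this model, DNS or ping aimed at the firewall's own LAN address gets handed to the gateway group unless an earlier rule catches it first.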
-
Probably something in the routing table expired, or a cached DNS entry did.
As you have found, you need a rule above the load-balance rule that uses the default gateway for local services. Otherwise all traffic gets routed to the load-balanced WANs.
Also, as you have found, you can't boot from the second slice if you've upgraded from 1.2.3 to 2.0. This is because the config file is not backwards compatible and it's shared between the two slices.
Steve
-
This is because the config file is not backwards compatible and it's shared between the two slices.
Ah, good to know. I had assumed the configuration would be part of the point of having the alternate slice.
So maybe adding those extra filter rules was the only necessary solution. The webConfigurator still feels more sluggish than before as it tries to load all the data for my dashboard, but perhaps it's not.
-
You are probably seeing a delay while it checks its version online, which you might also need a rule for (not sure).
-
More notes:
I assume DHCP continued to work since that protocol is never captured by pf?
A very similar set of firewall rules worked just fine in 1.2. Perhaps the anti-lockout rule, or a different structuring of services, directed DNS to the right place? I ended up adding a rule for: any traffic to LAN address -> * system routing table.
Some of my difficulty using SSH and webConfigurator may have just had to do with DNS.
I observe that loading the Traffic graph widget, on this hardware, can easily make the webConfigurator unresponsive for a good few minutes. And yes, another delay is the check for updates, which fails because snapshots.pfsense.org is mostly returning 404s right now.
-
then you need to update the firmware update settings … Mine returns "You are on the latest version." Then again, so do my 2.1 and 2.0.1 test machines.
-
I assume DHCP continued to work since that protocol is never captured by pf?
Interesting question. You'd think that, since DHCP is a layer 7 protocol, it would be blocked along with other IP packets. I guess that enabling a DHCP server on an interface implies you want to allow it. A more technical explanation would be interesting though.
Having the update server set to a (currently) non functioning URL causes quite a large delay.
Steve
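For anyone curious about the layering, here is a rough sketch of what a DHCPDISCOVER looks like on the wire. It only shows that DHCP is an application-layer payload carried in UDP (client port 68, server port 67) over IP; it says nothing about how pfSense or pf treats those packets. The function name `build_discover` and the example MAC and transaction ID are invented for illustration.

```python
import struct

DHCP_SERVER_PORT = 67   # UDP port the server listens on
DHCP_CLIENT_PORT = 68   # UDP port the client sends from
MAGIC_COOKIE = bytes([99, 130, 83, 99])  # marks the start of DHCP options


def build_discover(mac: bytes, xid: int) -> bytes:
    """Build a minimal DHCPDISCOVER payload (the part carried inside UDP)."""
    fixed = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,         # op: BOOTREQUEST
        1,         # htype: Ethernet
        6,         # hlen: MAC address length
        0,         # hops
        xid,       # transaction ID chosen by the client
        0,         # secs
        0x8000,    # flags: ask the server to reply by broadcast
        bytes(4),  # ciaddr: client has no address yet
        bytes(4),  # yiaddr
        bytes(4),  # siaddr
        bytes(4),  # giaddr
        mac,       # chaddr: struct pads this to 16 bytes with zeros
        b"",       # sname: 64 zero bytes
        b"",       # file: 128 zero bytes
    )
    # Option 53 (message type), length 1, value 1 = DISCOVER; 255 = end.
    options = bytes([53, 1, 1, 255])
    return fixed + MAGIC_COOKIE + options


# Example with a made-up, locally administered MAC address.
example_mac = bytes.fromhex("020000000001")
packet = build_discover(example_mac, 0x12345678)
print(len(packet))  # 244
```

A client would normally send this from 0.0.0.0 port 68 to 255.255.255.255 port 67, so it is ordinary UDP/IP traffic as far as the layering goes.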
-
then you need to update the firmware update settings
Okay, that turned out to be remarkably easy to fix.
I assume DHCP continued to work since that protocol is never captured by pf?
Interesting question. You'd think that, since DHCP is a layer 7 protocol, it would be blocked along with other IP packets. I guess that enabling a DHCP server on an interface implies you want to allow it. A more technical explanation would be interesting though.
Whoops, I had somehow gotten it into my head that DHCP went down to about level 4. That's a nice-sounding explanation. Of course I'd think that enabling dnsmasq means I want to allow it, but I suppose one doesn't enable dnsmasq on particular interfaces.
-
That's exactly what I thought, 'DHCP comes before IP and TCP surely'.
Wikipedia is never wrong! ::)
Steve