Lost LAN to Internet connectivity
We are running pfSense 2.2-BETA, Nov 17 snapshot - as a guest under Hyper-V 2012 R2.
This morning everything was working fine until, just after 8:30am, we noticed web pages were not responding. Eventually I looked at the pfSense console and saw these messages:
(ada0:ata1:0:1:0): RES: 51 00 00 00 00 00 00 00 00 00 00
(ada0:ata1:0:1:0): Error 5, Retries exhausted
(ada0:ata1:0:1:0): WRITE_DMA. ACB: ca 00 8f 01 00 40 00 00 00 00 40 00
(ada0:ata1:0:1:0): CAM status: Command timeout
(ada0:ata1:0:1:0): Retrying command
(ada0:ata1:0:1:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
(ada0:ata1:0:1:0): CAM status: ATA Status Error
(ada0:ata1:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 00 ()
(ada0:ata1:0:1:0): RES: 51 00 00 00 00 00 00 00 00 00 00
(ada0:ata1:0:1:0): Retrying command
(ada0:ata1:0:1:0): SETFEATURES ENABLE WCACHE. ACB: ef 02 00 00 00 40 00 00 00 00 00 00
(ada0:ata1:0:1:0): CAM status: ATA Status Error
(ada0:ata1:0:1:0): ATA status: 51 (DRDY SERV ERR), error: 00 ()
(ada0:ata1:0:1:0): RES: 51 00 00 00 00 00 00 00 00 00 00
(ada0:ata1:0:1:0): Error 5, Retries exhausted
[2.2-BETA][root@pfSense.custco.local]/root:
A while back, when I was testing on an old computer with a fussy hard drive (OK - it was going bad), rebooting had seemed to fix things temporarily. So I rebooted.
But when pfSense came back we still had no connection. I reviewed our rules and they did not seem to have changed. The dashboard showed all interfaces up and traffic coming to each.
I looked at the routing table and it seemed fine. I did not capture it at that point, but I'm pretty sure it was exactly like this:
[2.2-BETA][root@pfSense.custco.local]/var/log: netstat -nrW
Routing tables

Internet:
Destination        Gateway            Flags       Use    Mtu  Netif  Expire
default            184.108.40.206     UGS    61142076   1500  hn0
220.127.116.11/28  link#5             U       1325733   1500  hn0
18.104.22.168      link#5             UHS           2  16384  lo0
22.214.171.124     link#5             UHS           6  16384  lo0
127.0.0.1          link#3             UH     31505123  16384  lo0
192.168.1.0/24     link#6             U      90839193   1500  hn1
192.168.1.1        link#6             UHS           0  16384  lo0

Internet6:
Destination                   Gateway                       Flags  Use    Mtu  Netif  Expire
::1                           link#3                        UH       0  16384  lo0
fe80::%lo0/64                 link#3                        U        0  16384  lo0
fe80::1%lo0                   link#3                        UHS      0  16384  lo0
fe80::%hn0/64                 link#5                        U        0   1500  hn0
fe80::215:5dff:fe62:e311%hn0  link#5                        UHS      0  16384  lo0
fe80::%hn1/64                 link#6                        U       15   1500  hn1
fe80::215:5dff:fe62:e312%hn1  link#6                        UHS      0  16384  lo0
ff01::%lo0/32                 ::1                           U        0  16384  lo0
ff01::%hn0/32                 fe80::215:5dff:fe62:e311%hn0  U        0   1500  hn0
ff01::%hn1/32                 fe80::215:5dff:fe62:e312%hn1  U        0   1500  hn1
ff02::%lo0/32                 ::1                           U        0  16384  lo0
ff02::%hn0/32                 fe80::215:5dff:fe62:e311%hn0  U        0   1500  hn0
ff02::%hn1/32                 fe80::215:5dff:fe62:e312%hn1  U        0   1500  hn1
[2.2-BETA][root@pfSense.custco.local]/var/log:
From a PC/Mac on the LAN we:
Could ping pfSense's LAN and WAN IP addresses.
Could NOT ping 126.96.36.199, Google.com, or other well-known hosts that accept ICMP requests.
Could NOT ping the default gateway for our WAN IP on the ISP network.
Could ssh into pfSense.
When I connected to pfSense using ssh I could:
Ping any PC on the LAN
Ping or telnet to any external IP I wanted.
I spent a number of hours working through the pfSense 2.1 draft guide and similar topics on the forum. I even updated to the snapshot released this morning. That did not help and we were really in need of getting connectivity back.
Finally, thinking the original messages might have indicated a disk problem, I restored the pfSense VM from our 8am Hyper-V backup, and now everything works fine as before.
I still see the "(ada0:ata1:0:1:0)" messages on the console, but when I look in system.log the most recent one is from Nov 28, so the console display may just not have rolled.
Any ideas or assistance on what might have caused this would be appreciated. Just tell me what information I need to provide.
Thank you - Richard
In this case, I'd guess that disk error isn't related.
Guessing you're not getting any ARP replies for your gateway IP? Check Diag>ARP. Sounds like the most likely cause is your Hyper-V or Windows config got broken so your WAN NIC is no longer attached to your Internet connection.
OK. Does that jibe with the fact that once I was ssh'd into pfSense I could get to outside web sites?
I'm pretty new to pfSense and did not find wget or curl, so I just tried "telnet xxxxxxx.com 80" for google and a couple of other sites I knew.
Thank you again - Richard
I mis-read that as you couldn't ping the gateway from that host itself. Since that VM can get out, clearly you have connectivity at the host level. It's obviously routing correctly as well. Next most likely cause is the NAT or firewall config was broken. Check Diag>Backup/restore, Config History, see what changed.
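The reasoning in this exchange can be written down as a small decision function. This is only a sketch: the function name, its parameters, and the wording of the results are made up for illustration, and the mapping just restates the guesses in the replies above (no ARP replies from the gateway versus a broken NAT or firewall config).

```python
def triage(lan_reaches_firewall: bool,
           lan_reaches_internet: bool,
           firewall_reaches_internet: bool) -> str:
    """Map three ping/ssh observations to the most likely area to check.

    Hypothetical helper: the categories mirror the suggestions made in
    this thread, not an official pfSense troubleshooting procedure.
    """
    if lan_reaches_internet:
        return "no fault seen from the LAN"
    if not lan_reaches_firewall:
        return "check the LAN side (cabling, virtual switch, LAN interface)"
    if not firewall_reaches_internet:
        # The firewall itself cannot get out: look at the WAN attachment,
        # e.g. whether the gateway shows up under Diag>ARP.
        return "check the WAN side (gateway ARP, Hyper-V virtual switch)"
    # The firewall gets out but LAN clients do not: traffic is being
    # dropped or not translated on its way through the firewall.
    return "check NAT and firewall rules (Config History)"


# Observations reported in this thread: LAN hosts reach pfSense but not
# the Internet, while pfSense itself can reach the Internet.
print(triage(True, False, True))
```

With the observations from this thread the function points at the NAT/firewall config, which matches the revised guess above.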
Hmm - thought this was a thing of the past, but it happened again today. This time after I tried to do an update to the latest (12/19/2014) snapshot.
The update appeared to go OK. I watched/waited for the package updates for optional packages that had been previously installed (ntopng, darkstat, etc) to complete. Afterwards, no Internet connection.
When I tried to reapply my LAN interface settings, I got a message box with the following message at the top of the page:
Packages are currently being reinstalled in the background.
Do not make changes in the GUI until this is complete.
I finally tried reapplying the interface config (WAN hn0, LAN hn1) using the console. Still no luck.
I restored my VM from this morning's backup and all was well again.
I then went through the snapshot upgrade once again - same exact results.
Thank goodness for backups - but I'm a bit concerned about not being able to upgrade.
Any thoughts or ideas on why this is happening? I copied some of the folders (/etc, /var/log, /root) before I last restored - if that info might help.
Thank you - Richard
Exactly which packages do you have loaded? Does Internet work until the package reinstall finishes, or?
The packages I had loaded were:
I did not notice if the Internet was working before the packages re-installed. I will try to test this sometime later today or tomorrow.
Or should I uninstall the packages before doing the update?
Thanks - Richard
I figured you had some package installed that would have an impact on Internet connectivity, like maybe Squid. I guess pfblocker could fall into that category, though I'd expect anything it seriously broke would have broken filter reloads, which would have been spewing alerts at you.
I did not notice if the Internet was working before the packages re-installed.
Ah, in that case don't worry about it, I thought the way you worded part of it you were stating that things were fine until packages were reinstalled.
No need to do anything beyond the normal upgrade process and let the packages handle themselves.
Try the upgrade again, once it's booted back up, start a packet capture on WAN with count 0 and all else at defaults. Try to ping out to IPs on the Internet, try to load web pages, attempt a variety of things then stop the capture. Download the resulting pcap. The summary text may suffice to see something, can paste that here.
I have not had time to try the upgrade again - but will do it as soon as I have a chance and report back.
But in the meantime, we lost Internet connectivity again. Again - no workstations from the LAN can get out, but if I'm ssh'd into pfSense I can ping 188.8.131.52, do DNS resolution and connect to anyone on the LAN.
Based on some experience from a different install over the weekend I tried "pfctl -s nat" and sure enough no output at all for the nat configuration.
I next completely disabled NAT reflection, but that did not do any good.
I then restored a configuration backup from "2014-12-19 12:41:38" that I had made on Friday just after restoring a Hyper-V image and getting the system back working. But this did not fix things.
So I'm not sure what's going on if restoring a working config does not fix things.
I was able to restore from this morning's image backup of the virtual machine again … for the time being all is working fine again.
Does this suggest anything? If not I'll try the 2.2 upgrade as soon as I can, probably the next day or two.
Thank you - Richard
How is your outbound NAT configured?
I think this is what you're asking for - so I think the answer is "automatic".
Let me know if you need all of the NAT or other rules.
Yeah should be fine.
When "pfctl -sn" is empty, what do you get for "grep nat /tmp/rules.debug"? Are there any of the disk errors happening around the same time? I'm wondering if somehow it's failing to read the config, or failing to read the raw ruleset, because of the disk error. That's seemed to be cosmetic-only on my Hyper-V systems, but it's possible that's causing the problem. The NAT ruleset being empty is definitely the source of the issue, it's just not clear how it ends up that way. I'm strongly suspecting something specific to your Hyper-V environment like disk reads failing, as if that were a general issue, we'd have hit it internally in our testing and hundreds of people would be on this board griping.
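For anyone who wants to script that comparison, here is a rough sketch. It assumes a Python 3 interpreter is available somewhere to run it (a stock pfSense install may not have one, so it may be easier to paste captured output into the two pure functions on another machine). The function names and the wording of the three outcomes are my own; only `pfctl -sn` and `/tmp/rules.debug` come from this thread.

```python
import subprocess


def loaded_nat_rules(pfctl_output: str) -> list:
    """Non-empty lines of `pfctl -sn` output, i.e. NAT rules pf has loaded."""
    return [line for line in pfctl_output.splitlines() if line.strip()]


def generated_nat_lines(rules_debug_text: str) -> list:
    """Rough equivalent of `grep nat /tmp/rules.debug` (substring match)."""
    return [line for line in rules_debug_text.splitlines() if "nat" in line]


def diagnose(loaded: list, generated: list) -> str:
    """Interpretation is an assumption based on the hypotheses in this thread."""
    if loaded:
        return "NAT rules are loaded in pf"
    if generated:
        return "rules.debug has NAT lines but pf has none: the ruleset was not loaded"
    return "rules.debug has no NAT lines: the ruleset was generated without NAT"


def read_live_state():
    """Collect both inputs on the firewall itself (needs root and pfctl)."""
    out = subprocess.run(["pfctl", "-sn"], capture_output=True, text=True).stdout
    with open("/tmp/rules.debug") as fh:
        return loaded_nat_rules(out), generated_nat_lines(fh.read())
```

On the firewall, `print(diagnose(*read_live_state()))` at the moment connectivity is lost would show which of the two cases applies. Checking system.log for disk errors around the same time still has to be done separately.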
I am also seeing this under Hyper-V 3.0 and the 2.2 RC. It seems that every so often, apinger marks the WAN interface as down.
Dec 29 09:14:59 apinger: ALARM: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 09:15:21 apinger: alarm canceled: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 09:20:15 apinger: ALARM: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 09:20:31 apinger: alarm canceled: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 09:35:07 apinger: ALARM: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 09:35:28 apinger: alarm canceled: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 14:38:15 apinger: ALARM: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 14:38:35 apinger: alarm canceled: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 14:39:14 apinger: ALARM: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 14:39:30 apinger: alarm canceled: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 14:52:38 apinger: ALARM: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 14:52:54 apinger: alarm canceled: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 15:12:31 apinger: ALARM: WAN_DHCP(68.67.x.x) *** down ***
Dec 29 15:12:48 apinger: alarm canceled: WAN_DHCP(68.67.x.x) *** down ***
It also seems to be related to traffic or packet load, as the frequency is greatly diminished overnight when I am not using my network.
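To put numbers on that, the apinger lines can be paired up into outages and counted per hour. The sketch below is only illustrative: the function names are mine, the regular expression is written against the log lines quoted above rather than any documented apinger format, and the year (2014) is an assumption because syslog timestamps do not carry one.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the two kinds of apinger lines shown above.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) apinger: "
    r"(?P<kind>ALARM|alarm canceled): (?P<target>\S+) \*\*\* down \*\*\*$"
)


def parse_outages(lines, year=2014):
    """Pair each ALARM with the next 'alarm canceled' for the same target.

    Returns a list of (target, start_time, duration_in_seconds).
    """
    open_alarms = {}
    outages = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an apinger down/cancel line
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        target = m["target"]
        if m["kind"] == "ALARM":
            open_alarms[target] = when
        elif target in open_alarms:
            start = open_alarms.pop(target)
            outages.append((target, start, (when - start).total_seconds()))
    return outages


def outages_per_hour(outages):
    """Count outages by the hour of day in which they started."""
    return Counter(start.hour for _, start, _ in outages)


sample = [
    "Dec 29 09:14:59 apinger: ALARM: WAN_DHCP(68.67.x.x) *** down ***",
    "Dec 29 09:15:21 apinger: alarm canceled: WAN_DHCP(68.67.x.x) *** down ***",
    "Dec 29 14:38:15 apinger: ALARM: WAN_DHCP(68.67.x.x) *** down ***",
    "Dec 29 14:38:35 apinger: alarm canceled: WAN_DHCP(68.67.x.x) *** down ***",
]
for target, start, seconds in parse_outages(sample):
    print(target, start.isoformat(), seconds)
```

Feeding it a full day of system.log and comparing `outages_per_hour` against traffic graphs would show whether the alarms really track load, or just the hours someone is around to notice them.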
FWIW, I ran Smoothwall under this same Hyper-V config until a few days ago, when I noticed that pfSense 2.2 had gone RC. Smoothwall did not experience these issues, so this looks specific to Hyper-V + pfSense or Hyper-V + FreeBSD.
I'm going to disable gateway monitoring and see if that at least masks the underlying issue. Note, state killing on gateway failure is not enabled (the box is checked) so I don't think that's the cause.
I know Hyper-V is probably a low priority, but I am extremely excited to be able to run it with non-legacy NICs, so I'd really like to get this resolved and will help in any way possible. This is perfect for my 1Gbps connection at home (under Hyper-V I can hit 850Mbps, my Atom D2500 couldn't manage more than 500Mbps) and I'd like to start using it in our Hyper-V environment for my business in addition to our physical installations.
If you guys would like me to open a paid support case, I'd be more than happy. I can also provide you with access to pfSense installed in a Hyper-V VM if that would help.