2.3.4->2.4.4 Upgrade 100% Packet Loss on WAN Interface

rjarratt

I use pfSense for my home firewall (I am not an IT Pro). I have been using 2.3.4 successfully for quite some time but felt I ought to upgrade to the latest. I run it on Hyper-V. The upgrade seemed to go smoothly, I didn't see any issues reported and the admin portal comes up just fine etc.

However, the WAN_DHCP gateway is now showing offline, and the logs show 100% packet loss on the WAN interface (which I think is why the gateway is offline). netstat -r shows what seems to be a valid route to the default gateway. I have checked the firewall rules, there is nothing in there that will block anything, the firewall logs don't show traffic being blocked either.

I am not sure what the problem could be. Thankfully, as I run it on Hyper-V I can easily revert to my previous 2.3.4 installation, so I can post to this forum! Any ideas why the upgrade should cause all packets to be lost?

stephenw10

Two things to check for.

Make sure you have a default gateway set.
There is a new feature in 2.4.4 where you can set a gateway group as the default, and it will use Automatic to select one itself if there is not one set. Some edge cases are seeing issues with that (fixed in 2.4.5 snapshots).
In System > Routing > Gateways, select WAN_DHCP as the default IPv4 gateway if it is not already.

A change to the DHCP client in 2.4.4 means it now correctly respects an MTU setting given to it by an upstream server. If you have any custom options on the WAN it will take that value now where it previously ignored it.
Some DHCP servers seem to be handing out crazy values that were previously ignored.
Check the WAN MTU in Status > Interfaces.

Steve

rjarratt

Thank you, I will try those when I get the next opportunity in a few days from now. I will report back.

rjarratt

I made the two suggested checks.

First, on the gateway: WAN_DHCP seems to be selected as the default, as per the attached screenshot (pfSense Gateways.JPG).

I checked the Gateway Groups and Static Routes pages, both are empty.

Is all that as it should be?

On the MTU it is showing 1500, so I think that is OK.

stephenw10

Did you try changing the gateway monitoring to a different IP?
With only one gateway, though, it will always be used even if it shows as offline. I assume you cannot actually connect out at all?

It is pulling a DHCP lease though so there is a link of some sort.

Check the routing table in Diag > Routes just to be sure there is a default route. Although the gateway is marked default you still have the selection set to automatic.

Steve

rjarratt

I have tried another monitoring address but it doesn't make a difference (I used 8.8.8.8). I did try changing the default gateway not to be automatic, explicitly choosing WAN_DHCP, but again to no avail. I have checked Diagnostics->Routes and there is a default route. You are correct though, I cannot connect out at all, nothing I do results in any packets on the WAN interface, but it does manage to lease a DHCP address.

stephenw10

Is it giving you a rational address via DHCP? Gateway IP in the same subnet?

Check the DHCP logs match what you are seeing on the interface.

Check the system logs for errors.

I assume it's pulling a DHCP lease directly from your ISP?

Steve

rjarratt

I did check for all these things before and they all seemed fine as I recall. I won't get another chance to check again until next weekend, I will report back then.

Thanks

Rob

rjarratt

I checked the DHCP logs and the gateway IP is in the same subnet as the leased IP address.

I do see some IPv6 errors, but I am assuming they are benign as my ISP is IPv4.

I looked in the system logs. I am getting some errors relating to the time of day (there is an oddity with FreeBSD not seeming to get the time from Hyper-V correctly). The errors are like this one:

rc.bootup: The command '/usr/bin/nice -n20 /usr/local/bin/rrdtool update /var/db/rrd/ipsec-packets.rrd N:U:U:U:U:U:U:U:U' returned exit code '1', the output was 'ERROR: /var/db/rrd/ipsec-packets.rrd: illegal attempt to update using time 1541839764 when last update time is 1571980920 (minimum one second step)'
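For reference, decoding the two epoch timestamps in that error shows how far apart they are. This is only a minimal Python sketch; the variable and function names are illustrative, and the two numbers are copied from the log line above.

```python
from datetime import datetime, timezone

# Both values are copied from the rrdtool error above.
attempted = 1541839764    # time rrdtool tried to write
last_update = 1571980920  # time already stored in the RRD file


def as_utc(ts: int) -> str:
    """Render an epoch timestamp as a readable UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")


# Whole days between the stored time and the attempted update.
skew_days = (last_update - attempted) // 86400

print(as_utc(attempted))    # 2018-11-10 08:49:24 UTC
print(as_utc(last_update))  # 2019-10-25 05:22:00 UTC
print(skew_days)            # 348
```

In other words, the RRD file was last written with a clock almost a year ahead of the current one, which fits the time-sync oddity described above.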

In the Gateways part of the system log I see lots of errors from dpinger, I assume they are symptoms rather than cause though. Here they are:

send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr a.b.c.1 bind_addr a.b.c.86 identifier "WAN_DHCP "

WAN_DHCP a.b.c.1: Alarm latency 0us stddev 0us loss 100%
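As a rough illustration of how to read that alarm line, the sketch below pulls out its fields and compares the loss figure with the `loss_alarm 20%` value from the dpinger start-up line. The regular expression is a hypothetical pattern written for this one line only, not dpinger's documented format, and the `>=` comparison is an assumption about how the threshold is applied.

```python
import re

# Alarm line copied from the log above (addresses were masked by the poster).
line = "WAN_DHCP a.b.c.1: Alarm latency 0us stddev 0us loss 100%"
LOSS_ALARM_PCT = 20  # "loss_alarm 20%" from the dpinger start-up line

# Hypothetical pattern: it matches this one line layout and nothing else.
pattern = re.compile(
    r"^(?P<gateway>\S+) (?P<addr>\S+): Alarm "
    r"latency (?P<latency_us>\d+)us stddev (?P<stddev_us>\d+)us "
    r"loss (?P<loss_pct>\d+)%$"
)

match = pattern.match(line)
fields = match.groupdict() if match else {}

loss = int(fields["loss_pct"])
latency = int(fields["latency_us"])

# Assumed comparison: alarm when loss reaches the configured percentage.
alarm = loss >= LOSS_ALARM_PCT

print(fields["gateway"], loss, latency, alarm)
```

A latency of 0us together with 100% loss is consistent with no replies arriving at all, rather than with a slow or lossy link.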

stephenw10

When DHCP works and nothing else does, that's usually a bad firewall rule. Though here that would be somewhere upstream, unless you have a blocking floating OUT rule.

Try running a packet capture on WAN. Do you see the ping packets leaving? Do you see any packets coming back?

Steve

rjarratt

I tried a packet capture before, and I tried it again today. I don't see any packets originating from computers on the LAN going out on the WAN. I do see some DNS lookups that appear to be coming from pfSense itself. I have looked at the firewall rules and there doesn't really seem to be anything that would block traffic. Looking at the firewall logs I see traffic being blocked, but it all appears to be IPv6.

stephenw10

Are those DNS lookups actually working? Does Diag > DNS Lookup work?

Can you packet capture the DHCP exchange?

What is the MAC of the gateway? Maybe it's something odd. Though that should affect 2.3.4 just the same.

Steve

johnpoz

@rjarratt said in 2.3.4->2.4.4 Upgrade 100% Packet Loss on WAN Interface:

gateway IP is in the same subnet as the leased IP address.

And is this a public IP or a private IP? This is a VM, right... Are we sure the interfaces are not moving about and changing order on update? If your pfSense cannot talk to your gateway, you're going to have a problem.

Your gateway and public IP should be the same on 2.3.4 as they are after you upgrade... I do not see your MAC changing on your VM... So if you can ping your gateway when you're on 2.3.4 and not when on 2.4.4, something really odd is going on.

rjarratt

DNS lookup from the configurator does not work. I have just noticed a small difference between the old version and the new version in the ifconfig output.

Here is the output on the old version (edited because my post is being marked as spam):
hn1:
ether 00:c0:df:10:58:09
status: active

And here is the output on the new version:
hn1: ether 00:c0:df:10:58:09
hwaddr 00:15:5d:00:1f:06
status: active

There is a new line for "hwaddr". I am not sure what the difference is, but the hwaddr value is the default MAC address I have set in Hyper-V, but I have also set the option to allow the Guest OS to change the MAC address. The MAC address reported by the Configurator is always the "ether" one.
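To make the comparison concrete, here is a small Python sketch that extracts the two addresses from the 2.4.4 output above and flags whether they differ. The helper name `mac_fields` is made up for illustration. As far as I know, FreeBSD's ifconfig prints `hwaddr` only when the address in use differs from the adapter's original one, but treat that as an assumption rather than something confirmed in this thread.

```python
import re

# ifconfig excerpt from the 2.4.4 install, copied from the post above.
new_ifconfig = """\
hn1: ether 00:c0:df:10:58:09
hwaddr 00:15:5d:00:1f:06
status: active
"""


def mac_fields(text: str) -> dict:
    """Return the 'ether' and 'hwaddr' MAC addresses found in ifconfig text."""
    pairs = re.findall(
        r"\b(ether|hwaddr)\s+([0-9a-f]{2}(?::[0-9a-f]{2}){5})", text
    )
    return dict(pairs)


macs = mac_fields(new_ifconfig)

# True when the address in use (ether) is not the adapter's own (hwaddr).
overridden = "hwaddr" in macs and macs["hwaddr"] != macs["ether"]

print(macs["ether"], macs["hwaddr"], overridden)
```

The 00:15:5d prefix is the one Hyper-V uses for the addresses it generates, so the `hwaddr` line matches the address configured in Hyper-V, while `ether` is the address the guest has set for itself.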

Note that I can ping the gateway (192.168.0.1) in 2.4.4 from the LAN. The IP address I get in the ifconfig outputs above is the same in both cases.

I also noticed this error, is it relevant?
arpresolve: can't allocate llinfo for 82.28.4.1 on hn1

stephenw10

The arpresolve errors on WAN imply it is trying to ARP for that IP and failing. I assume 82.28.4.1 is the WAN gateway IP?

That does look like what you see if you hit the DHCP issue I mentioned in my first reply.
https://redmine.pfsense.org/issues/8507

I would expect to see a bad MTU if you were hitting that, but it's worth adding the workaround line to your DHCP options anyway:

In the "Lease Requirements and Requests" section for WAN DHCP, in the field "Option modifiers", add the text without quotes: "supersede interface-mtu 0"

Or try a 2.4.5 snapshot, which has that fixed already.

Steve

rjarratt

I had a closer look at the issue and tried it out with no luck. Before I started I did this:

: netstat -4rnW
Routing tables

Internet:
Destination        Gateway            Flags       Use    Mtu      Netif Expire
default            82.28.4.1          UGS         234   1500        hn1
82.28.4.0/22       link#6             U           186   1500        hn1
82.28.4.86         link#6             UHS           0  16384        lo0
127.0.0.1          link#2             UH           31  16384        lo0
192.168.0.0/24     link#5             U           124   1500        hn0
192.168.0.1        link#5             UHS           0  16384        lo0

I then set the advanced option anyway, released and renewed the lease and I even rebooted after changing the option, without success. The netstat looked the same as above after setting the option and rebooting etc.
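As a sanity check on that table, the sketch below parses the rows and confirms that the default gateway sits inside a directly connected network on the same interface. It is a minimal Python example that assumes the six-column layout shown above; the function name `parse_routes` is illustrative only.

```python
import ipaddress

# Rows copied from the netstat -4rnW output above.
routes = """\
default            82.28.4.1          UGS         234   1500        hn1
82.28.4.0/22       link#6             U           186   1500        hn1
82.28.4.86         link#6             UHS           0  16384        lo0
127.0.0.1          link#2             UH           31  16384        lo0
192.168.0.0/24     link#5             U           124   1500        hn0
192.168.0.1        link#5             UHS           0  16384        lo0
"""


def parse_routes(table: str) -> list:
    """Split each row into the columns this check needs."""
    rows = []
    for line in table.splitlines():
        parts = line.split()
        if len(parts) >= 6:
            dest, gateway, _flags, _use, mtu, netif = parts[:6]
            rows.append(
                {"dest": dest, "gateway": gateway, "mtu": int(mtu), "netif": netif}
            )
    return rows


rows = parse_routes(routes)
default = next(r for r in rows if r["dest"] == "default")
gateway_ip = ipaddress.ip_address(default["gateway"])

# Directly connected networks on the same interface as the default route.
on_link = [
    ipaddress.ip_network(r["dest"])
    for r in rows
    if "/" in r["dest"]
    and r["gateway"].startswith("link#")
    and r["netif"] == default["netif"]
]
gateway_on_link = any(gateway_ip in net for net in on_link)

print(default["gateway"], default["netif"], default["mtu"], gateway_on_link)
```

With this table the default route points at 82.28.4.1 via hn1 with an MTU of 1500, and that gateway falls inside 82.28.4.0/22, so the routing table itself looks sane.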

Babiz

I see the same issue on my APU2 running pfSense: after upgrading to the latest release, restoring the configuration and rebooting, dpinger fails over PPPoE with monitor IP 8.8.4.4.

So I guess something gets messed up in the pfSense internals.

My solution was simple: delete all gateways, reboot, reassign the interfaces from the CLI/SSH, and reboot again.

Yeah, the pfSense concept is beautiful, but we know it's not perfect.

stephenw10

If you were hitting that DHCP issue you would not see a change in the routing table, only in the interface MTU.

The ARP resolve errors imply the gateway is not responding to ARP. Try running a packet capture on WAN to be sure it is actually sending the ARP requests and that the gateway is really not responding.

Steve