May 2nd Snapshot doesn't work, breaks everything! Beware

jimp

I don't need 4M worth of records. I don't have time to sort through all of that. Just the last dozen or so lines of each log file is sufficient.

I think we have a lead on part of the problem. I pushed a fix for one potential path that could break it, but there is one other that I haven't tracked down yet.

https://redmine.pfsense.org/issues/8504

More interesting to me now than logs are two things:

1. The <gateways> section of your configuration(s) before and after upgrade, or at least after. You can redact IP addresses but do not alter anything else.
2. Whether or not you have a default route for IPv4 or IPv6 in "netstat -rnW" after upgrade.
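
For anyone who wants to check the second item quickly, here is a minimal Python sketch that scans saved "netstat -rnW" output for default routes per address family. It assumes the usual FreeBSD layout with "Internet:" and "Internet6:" section headers; the function name and the sample text are made up for illustration and only approximate real output.

```python
def default_routes(netstat_output):
    """Return the default-route gateways found per address family."""
    found = {"IPv4": [], "IPv6": []}
    family = None
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Internet:":
            family = "IPv4"
        elif fields[0] == "Internet6:":
            family = "IPv6"
        elif fields[0] == "default" and family and len(fields) > 1:
            # Second column is the gateway in the assumed layout.
            found[family].append(fields[1])
    return found


# Hypothetical sample, not taken from any system in this thread.
SAMPLE = """\
Routing tables

Internet:
Destination        Gateway            Flags       Use    Mtu      Netif Expire
default            192.0.2.1          UGS         120   1500       igb0
127.0.0.1          link#4             UH           10  16384        lo0

Internet6:
Destination        Gateway            Flags       Use    Mtu      Netif Expire
::1                link#4             UH            0  16384        lo0
"""

routes = default_routes(SAMPLE)
for family, gateways in routes.items():
    print(family, "default route:", ", ".join(gateways) or "MISSING")
```

An empty list for a family means no default route was found for it, which is the condition asked about above.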

jimp

OK, there are at least three separate issues here from the looks of it:

0. Harmless route errors spamming the console/logs https://redmine.pfsense.org/issues/8497 (Fixed now)
1. An issue with the upgrade code not converting and handling default gateways properly in some cases https://redmine.pfsense.org/issues/8504 (Also fixed)
2. An issue where certain DHCP WANs (igb interfaces at least) constantly link cycle which leads to all sorts of other symptoms (services not running, IP addresses/routes missing, GUI inaccessible, etc) https://redmine.pfsense.org/issues/8506

We're still working on that last one.

Now what I need to know is:

What hardware are you running where this is happening?
What type of network interface is it happening to? (Both systems here, and the logs posted in the thread are all igb, but we don't know if that's a coincidence or not)
Check "clog /var/log/system.log | grep link" and/or "dmesg | grep link" output to see if the link is flapping
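
If the grep output is long, a small script can count the transitions per interface. This is only a sketch: it assumes the kernel messages look like "igb0: link state changed to UP", and the sample lines below are illustrative. How many changes count as flapping is a judgment call, so the script only reports counts.

```python
import re
from collections import Counter

# Matches kernel link messages in the assumed syslog format.
LINK_RE = re.compile(r"kernel: (\w+): link state changed to (UP|DOWN)")


def count_link_changes(log_text):
    """Count link state changes per interface in a block of log text."""
    counts = Counter()
    for match in LINK_RE.finditer(log_text):
        counts[match.group(1)] += 1
    return counts


# Illustrative sample in the assumed format.
SAMPLE_LOG = """\
May 11 17:55:37 pfSense kernel: igb0: link state changed to UP
May 11 17:55:37 pfSense kernel: igb0: link state changed to DOWN
May 11 17:55:42 pfSense kernel: igb0: link state changed to UP
May 11 17:55:43 pfSense kernel: igb0: link state changed to DOWN
"""

for interface, changes in count_link_changes(SAMPLE_LOG).items():
    print(interface, "changed link state", changes, "times")
```

Many changes within a few seconds on the same interface point to the link cycling issue rather than a single cable event.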

tmushy

Updated to the latest beta and still getting issues.
I'm using a Qotom box.

May 11 17:55:36 pfSense php-fpm[22628]: /rc.linkup: DEVD Ethernet attached event for wan
May 11 17:55:36 pfSense php-fpm[22628]: /rc.linkup: HOTPLUG: Configuring interface wan
May 11 17:55:37 pfSense kernel: igb0: link state changed to UP
May 11 17:55:37 pfSense kernel: igb0: link state changed to DOWN
May 11 17:55:42 pfSense kernel: igb0: link state changed to UP
May 11 17:55:43 pfSense php-fpm[22628]: /rc.linkup: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1526086543] unbound[66133:0] error: bind: address already in use [1526086543] unbound[66133:0] fatal error: could not open ports'
May 11 17:55:43 pfSense kernel: igb0: link state changed to DOWN
May 11 17:55:45 pfSense php-fpm[71870]: /rc.linkup: DEVD Ethernet detached event for wan

It's just looping the same thing over and over.

LostInIgnorance

JimP, let us know when we can begin testing snapshots again as I can't keep rebuilding and restoring my firewall.

jimp

@LostInIgnorance:

JimP, let us know when we can begin testing snapshots again as I can't keep rebuilding and restoring my firewall.

Which is why you don't run snapshots on important production firewalls, at least not without proper lab testing first.

No progress since my last post except that an additional issue has been found:

3. Interface MTU being set incorrectly in some cases https://redmine.pfsense.org/issues/8507 - This can lead to what appears to be partially working connectivity. Some sites will load, others will fail, and some may partially work and be partially broken due to resources that can't be fetched. Browsers may return a blank page rather than an error, or fail to fetch links at all.

LostInIgnorance

JimP, this is not an important firewall. It is only used for my home environment, but I get to listen to my wife complain about not being able to get online. It is more of an annoyance to reload than anything else. Let me know if there are more logs or testing you need on this.

jimp

@LostInIgnorance:

I get to listen to my wife complain about not being able to get online.

If it's carrying your wife's traffic then that is THE very definition of an important production firewall :-)

@LostInIgnorance:

Let me know if there are more logs or testing you need on this.

I think we have an OK grasp of the general issues at the moment but a lack of leads on where the problem lies. So far all I've seen are symptoms and not the root cause yet, but since it's so tricky to reproduce in a lab setup it's a pain to try to dig into it for any length of time.

LostInIgnorance

JimP, I think you're on to something with the MTU size. I can tell you that the interface that is connecting (igb2) shows a default gateway and an IP, then it disappears from the "netstat -rnW" output.
I am also available after 6p CST if you would like remote access. As this appliance is a mirror of the C2758 Atom you used to sell, I am hoping there are not too many people that will experience this issue.

jimp

The next round of snapshots should be better here. It was related to the MTU. Turns out in 11.2, FreeBSD improved dhclient so it could handle the MTU, but it took the upstream MTU unconditionally and had no way to ignore the value. In each case I've seen so far, the ISP has sent a bogus MTU back which caused two things:

1. On e1000 and some other drivers, setting the MTU causes the link to go down and back up, which triggers the interface event scripts, which restart dhclient, which sets the MTU again, which makes the link go down and back up, repeat, repeat, repeat, boom.
2. On other drivers, the MTU would be set to this value but it may not have been right. In my case and for others, this was a stupid low value like 576, which meant some sites would work and others would fail or be half broken.

We now have a patch in the tree from a FreeBSD dev, which will be in the next set of snapshots, that lets us ignore the incoming MTU with a supersede in the dhclient config (which I also added in the tree). Hopefully this returns sanity to the cases affected by these issues.
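
To make the feedback loop easier to picture, here is a toy Python model of it. It is not the actual dhclient or pfSense code; the function name, the parameters, and the starting MTU of 1500 are assumptions for illustration. It models the described chain (link event, dhclient restart, MTU applied, link bounce) and shows that ignoring the offered MTU breaks the cycle.

```python
def simulate_wan(offered_mtu, ignore_offered_mtu=False,
                 mtu_change_bounces_link=True, max_events=10):
    """Toy model: return (link events handled, final interface MTU)."""
    mtu = 1500                 # assumed MTU before any lease is applied
    pending_link_events = 1    # the initial link-up
    handled = 0
    while pending_link_events and handled < max_events:
        pending_link_events -= 1
        handled += 1
        # The link event script restarts dhclient, which receives the
        # lease from the ISP, including its MTU option.
        if ignore_offered_mtu:
            continue           # MTU untouched, so nothing re-triggers
        mtu = offered_mtu      # applied unconditionally
        if mtu_change_bounces_link:
            pending_link_events += 1   # driver drops and raises the link
    return handled, mtu


print("bouncing driver, MTU applied:", simulate_wan(576))
print("bouncing driver, MTU ignored:",
      simulate_wan(576, ignore_offered_mtu=True))
print("non-bouncing driver, MTU applied:",
      simulate_wan(576, mtu_change_bounces_link=False))
```

In the first case the model only stops because of the max_events cap, which stands in for the endless cycling. In the third case the loop ends but the interface keeps the bogus MTU, which corresponds to the half-broken connectivity symptom.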

Dazog

Latest Build fixes issues with my DHCP WAN connection.

Bug is squashed.

Thank you for the hard work.

jimp

@Dazog:

Latest Build fixes issues with my DHCP WAN connection.

Bug is squashed.

Thank you for the hard work.

Did you have the link cycling issue, the MTU issue, or both?

pfSenseTest

@jimp:

Did you have the link cycling issue, the MTU issue, or both?

I had the link cycling issue on the Netgate MBT-4220 system and the latest snapshot fixed it.

Dazog

@jimp:

@Dazog:

Latest Build fixes issues with my DHCP WAN connection.

Bug is squashed.

Thank you for the hard work.

Did you have the link cycling issue, the MTU issue, or both?

Cycling Issue.

w0w

I am not sure if it's related, but after upgrading from the 20 Apr snapshot to 19 May I lost connectivity to the internet. It shows that the PPPoE WAN is up and running, but I cannot ping any IP on the internet from pfSense or the LAN. I don't see anything unusual in the logs except messages that pkg cannot reach its servers. Rolling back the ZFS snapshot restores the connection immediately.

P.S. It looks like it got connected at some stage, because it shows my dynamic DNS as updated and it reinstalled packages once, but it cannot get the package list anymore. Ping to 8.8.8.8 shows 100% loss, and traceroute does not even start to trace.

What else can I do to analyze it?

jimp

That doesn't quite sound like it's related to anything in this thread. Start a new thread and post details of your setup there, at least show the routing table and what the gateways list/page looks like, maybe the config.xml entries for the gateways you have configured. You can redact any IP addresses in that info.
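
For the redaction, one possible approach is sketched below in Python. It replaces each distinct IP address with a stable placeholder, so it stays visible which entries shared an address, and leaves everything else untouched. The function name and the sample line are hypothetical, and anything that happens to parse as an address (a four-part version string, for example) would also be replaced, so review the result before posting.

```python
import ipaddress
import re

# Runs of characters that could form an IPv4 or IPv6 address.
CANDIDATE_RE = re.compile(r"[0-9A-Fa-f:.]+")


def redact_addresses(text):
    """Replace each distinct IP address with ADDR1, ADDR2, ..."""
    placeholders = {}

    def replace(match):
        token = match.group(0)
        trailing = ""
        # A sentence-ending dot is not part of the address.
        if token.endswith("."):
            token, trailing = token[:-1], "."
        try:
            ipaddress.ip_address(token)
        except ValueError:
            return match.group(0)   # not an address, keep as-is
        if token not in placeholders:
            placeholders[token] = "ADDR{}".format(len(placeholders) + 1)
        return placeholders[token] + trailing

    return CANDIDATE_RE.sub(replace, text)


# Hypothetical sample line, not a real configuration.
sample = ("<gateway>192.0.2.1</gateway> <monitor>192.0.2.1</monitor> "
          "<gateway>2001:db8::1</gateway> mtu 1500")
redacted = redact_addresses(sample)
print(redacted)
```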

w0w

OK jimp.

jimp

Looks like we can consider all of these issues resolved as far as I can see. Every system I'm aware of that has tried the updated code is working properly now.

tmushy

I can confirm the latest snapshot has indeed fixed all my issues!
Thank you for resolving this. Working great now.

LostInIgnorance

JimP, I am sorry I didn't get to it earlier, but I was out of town. I just upgraded this morning and everything is working correctly as it should. Thanks for getting this fixed.