WAN going UP and DOWN in CE 2.7

Gertjan

Actually, you have described almost exactly the situation I have right now.
And afaik, it has been bugging me for at least the last two pfSense+ versions.

When I reboot pfSense, all goes well: interfaces come up, packages start, and all is quiet on the horizon.

Then, when I change 'nothing' on a (WAN) interface, I keep seeing, every xx seconds or so, "/rc.linkup: DEVD Ethernet attached event for wan", and this restarts packages. Somewhere during this chain of events, another "/rc.linkup: DEVD Ethernet attached event for wan" is received.

The only way I can break out of this loop is : reboot pfSense.

I'm using a 4100 Plus, with one of the two ix interfaces as an IPv4 (DHCPc) and IPv6 (DHCPc) WAN, connected to my upstream ISP router. I thought it could be my ISP router pulling down the interface, but your test with a switch now makes me doubt that.

My other 4 em(0->4) interfaces are used for my LANs.

I'm not using VLANs.
Packages : Nut, pfBlockerNG, Avahi and freeradius3, plus an OpenVPN server listening on WAN IPv4 UDP.

I guess I have to go back to total defaults to see what happens with a clean system (no settings changes on my side), to find out whether this is a structural issue (a bug).

emefff

@Gertjan

I am far from being an expert in this, but my guess is, that some basic function in BSD14 is not working correctly. Perhaps something is wrong with the NIC driver? My installed packages are:
Cron 0.3.8_3
iperf 3.0.3
nmap 1.4.4_7
pfBlockerNG-devel 3.2.0_5
Service_Watchdog 1.8.7_1
snort 4.1.6_8
System_Patches 2.2.4

My setup is nothing special: HP Prodesk SFF with Intel i350-T4. One WAN, one LAN. It worked for years, sometimes it was on for 2 months. Now, just like you, I reboot maybe once a day. Then it runs for 30 minutes or 2-3 hours until the first notifications fill my mailbox (watchdog restarts pf_filter or pf_dnsbl..), sometimes the connection just breaks down completely.

I have no idea what to make of it.

emefff

Hello,
this problem is persisting, does anybody have an idea please?
Mario.

Gertjan

@emefff said in WAN going UP and DOWN in CE 2.7:

that some basic function in BSD14 is not working correctly

That would have triggered a top-level DEFCON alert at FreeBSD bug support (freebsd.org).
So, no.

It's something local : a script, a combination of usage and events.
I still have to add a switch between the pfSense WAN and the ISP router (on its LAN side) so I can rule out that my ISP router pulls down its LAN port after pfSense has pulled the link down as a result of a package restart.
The issue is bothering me, as pfSense otherwise works flawlessly.

Or : reinstall from scratch with defaults everywhere. My WAN is default (DHCP) and my LAN is default (192.168.1.1/24), so if I then see the same behavior (again : WAN up/down events in a loop according to the system log), I will know it's pfSense and not my own combination of settings / packages.

@emefff said in WAN going UP and DOWN in CE 2.7:

(watchdog restarts pf_filter or pf_dnsbl..)

As soon as the Service_Watchdog is needed, you should repair or act upon the issue right away = discover why the two PHP processes pf_filter and pf_dnsbl don't restart immediately.
I don't recall seeing these two not running.

Btw : I don't have pfBlockerNG doing its internal maintenance every 2 hours or so.
Once a day is more than enough for me.

pf_dnsbl is doing all the heavy lifting when DNSBL feeds are renewed.
If you have many DNSBL entries, like hundreds of thousands, then PHP ( !!) is used to sort and clean them. That can take a long time, and it is a reason to be moderate with the number and size of the DNSBL feeds you use.

If, for example, pf_dnsbl takes some time to do its work or to start up (run top or htop on the command line during such an event and you'll see it at the top), and the watchdog then kicks in because it thinks the process is not running, you have just added another layer to the problem.

Also, feeds are not set to be checked for new (updated) downloads every xx minutes. Once a week is fine for me. Result : it (pfB) restarts maybe once a week ... and DNS is thus rock solid. See the "DNS" munin graphs above.

The watchdog package is a developer package. The perfect tool to make bad things worse.

Btw : iperf, nmap, snort are core network packages. If I were suspecting 'network' issues, I would ditch them first.
Cron and System_patches are just GUI items, and do 'nothing'.

emefff

@Gertjan
Hello,
thanks again for answering.

I don't recall seeing these two not running.

Well, the GUI seems to be somewhat slow here. What I do see is the WAN going down here

Once a day is more than enough for me :

I also only do the update and maintenance stuff once a day at 3 o'clock plus some odd minute, when absolutely nobody should be doing anything.
I have to admit I probably have a lot of feeds running, but on the other hand I have 16GB of RAM and a lot of storage. I haven't ever thought about it being a problem. Maybe I should clean up the feeds a little.

Also, feeds are not set to be checked for new (updated) downloads every xx minutes. Once a week is fine for me.

This I do once a day. That's done at the odd time above.

Btw : iperf, nmap, snort are core network packages.

Snort I cannot ditch (paid subscription), and the other two are run maybe once a month or once in three months. If iperf or nmap is a problem, then BSD has problems too.

So, as you see, I really don't do anything very special.

Thanks,

Mario.

Gertjan

@emefff said in WAN going UP and DOWN in CE 2.7:

What I do see is the WAN going down here....

If you have some time left ....
Instead of looking at the extremely slow (to update) GUI :

Open the console or SSH, choose option 8 and :

tail -f /var/log/system.log

and now you can see in real time what the system does, both when you do something in the GUI and when the system acts by itself.
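To get a count rather than watching the log scroll by, the log can also be scanned for the message quoted earlier in this thread. The following is only a minimal Python sketch, not a pfSense tool: the function name `count_linkup_events` is made up for illustration, the "detached" variant of the message is an assumption, and everything in front of "/rc.linkup" (timestamp, host, process) is deliberately ignored because its exact layout is not shown in this thread.

```python
import re
from collections import Counter

# Matches the message quoted in this thread:
#   "/rc.linkup: DEVD Ethernet attached event for wan"
# The "detached" alternative is an assumption; the prefix before
# "/rc.linkup" is not parsed at all.
LINKUP_RE = re.compile(
    r"/rc\.linkup: DEVD Ethernet (attached|detached) event for (\S+)"
)


def count_linkup_events(lines):
    """Return a Counter keyed by (interface, event) for rc.linkup messages."""
    counts = Counter()
    for line in lines:
        match = LINKUP_RE.search(line)
        if match:
            event, interface = match.groups()
            counts[(interface, event)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines, only to show the call; real input would be
    # the lines of a copy of /var/log/system.log.
    sample = [
        "/rc.linkup: DEVD Ethernet attached event for wan",
        "some unrelated log line",
        "/rc.linkup: DEVD Ethernet attached event for wan",
    ]
    for (interface, event), n in sorted(count_linkup_events(sample).items()):
        print(interface, event, n)
```

Passing an open file object instead of the sample list works the same way, since the function only iterates over lines. A count that keeps growing between two runs would confirm the loop described above.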

About pfB : with many DNSBL feeds, see it like this :
10 K entries : a fraction of a second
1 M entries : several seconds
10 M entries or more : tens of seconds, even minutes

before all the lists are merged, the duplicates removed and the networks collapsed.
Even when you have a huge amount of memory and a fast quad-core whatever, it doesn't matter.
The PHP engine (with a limited, fixed ! space of allocated memory - not even one Mbytes) executes the "pf_dnsbl" file, doing its thing. The number of items to handle will put an exponential ( ! ) load on the task, hence the delay of the (re)start.
This task in particular is a good candidate to be rewritten in C.
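As an illustration of the "merge the lists, remove the duplicates" step described above, here is a minimal Python sketch. It is not pfBlockerNG's code (which, as stated, is PHP); the function name `merge_feeds` and the normalisation rules (lower-casing, dropping a trailing dot, skipping "#" comment lines) are assumptions made for the example.

```python
def merge_feeds(feeds):
    """Merge several domain lists into one sorted, duplicate-free list.

    `feeds` is an iterable of iterables of strings, one string per feed line.
    """
    seen = set()
    for feed in feeds:
        for raw in feed:
            # Normalise so "Ads.Example.com." and "ads.example.com" collapse.
            domain = raw.strip().lower().rstrip(".")
            if not domain or domain.startswith("#"):
                continue  # skip blank lines and comment lines
            seen.add(domain)
    return sorted(seen)


if __name__ == "__main__":
    # Made-up feed contents, only to show the call.
    feed_a = ["Ads.example.com", "tracker.example.net."]
    feed_b = ["ads.example.com", "# a comment", ""]
    print(merge_feeds([feed_a, feed_b]))
```

With a hash set the merge itself touches each line once, so the remaining cost is mostly the final sort and the memory needed to hold every unique entry. That memory is the part a tightly limited interpreter would feel first with millions of lines.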

Btw : all this doesn't explain the WAN flapping, the subject of the thread.

emefff

@Gertjan
Now it's getting weirder. I made the following changes about 8 hours ago:

1.) Removed iperf and nmap packages.
2.) Removed every feed in pfblocker that had errors in pfblockerng.log (curl error etc.) and forced update.

I haven't had a WAN interruption since. Thus, I also had no reloads of pfb_dnsbl and pfb_filter.

Very strange, but for now it seems to work. IDK if this is a 'solution' or just a coincidence, though. I will observe and report back if it's getting worse again,

Mario.

Gertjan

@emefff said in WAN going UP and DOWN in CE 2.7:

Very strange, but for now it seems to work

I'm not surprised at all.
Because I've seen the same behavior.

At work I've been using a big ex-server PC for my pfSense needs. Loads of GB of RAM, Xeon processor etc.
Still, the PHP processes doing all the GUI lifting, and all the PHP scripts like the ones used by pfBlockerNG, are rather limited in their max allowed memory (RAM) usage.

I keep this list to a bare minimum.
As soon as there are not tens or hundreds of thousands, but millions of "DNSBL lines" (host names), every "Firewall > pfBlockerNG > Update" Force All run takes far too much time.
For me, 10, 20 seconds is a max. This is also the time that DNS is unavailable to the system and all connected networks.
My 'cron' = auto pfBlockerNG update task is set to once a day, and files are checked for possible updates once a week.

I'm using a Netgate 4100 max version, it has 4 Gbytes of memory and a 128 Gbytes disk. Still, it's too easy to bring pfBlockerNG to a crawl when using many or big DNSBL feeds.
PHP just isn't the right tool to do that much file parsing.

Btw : I'm using pfBlockerNG "Unbound Python Mode", as the other mode isn't really an option anymore.
"unbound mode" uses PHP to do the DNS parsing ....

emefff

@Gertjan
Hello,
I just wanted to report back my findings. 20-30 hours after my last post above, the fun started again. Hotplug events were not that frequent, but increased with running time of the appliance.
DNSBL was yellow in above graphics and I drew a connection to that. If it was out of sync, everything got a lot worse, complete LOC included.

Today, a few hours ago, I also switched to 'unbound python mode' instead of the other mode. Since then, I have not had a single saving event of pfblocker (these occurred very often in the recent past after switching to CE 2.7) and also no hotplugging event of the WAN.

I was too optimistic last time, but with unbound python it looks much better.

Mario.

emefff

Hello again,
two days ago, I did a fresh install of pfSense CE 2.7, to speed it up I did a restore from a backup (Backup & Restore option with .xml).

Sadly, the hotplugging events on the i350 card came back. So it seems a fresh install does not help.

The only thing I can do to make my life easier (the frequent LOC make me want to pull my hair out) is to reboot in the morning.

Today, for the first time, I chose to 'Reroot' because rebooting also often hangs and I do not want to stress that flimsy button on my Prodesk 400 (the button is basically complete junk).
However, pfSense showed me a very informative error report (well, surely informative for an expert, but maybe not for me) that I attached.

I hope someone with more knowledge can please tell us what it is about. I assume it has something to do with the hotplugging events of my NIC,

thanks in advance,

Mario.

info.0 textdump.tar.0

Gertjan

@emefff said in WAN going UP and DOWN in CE 2.7:

I did a fresh install of pfSense CE 2.7, to speed it up I did a restore from a backup (Backup & Restore option with .xml).

You've managed to make your pfSense identical to the version you had before the reinstall.
At the bit level.
IT people would say : you've done a NOP or No Operation.
Doing nothing would have had the same result.

@emefff said in WAN going UP and DOWN in CE 2.7:

Today, for the first time, I chose to 'Reroot' because rebooting also often hangs ....

Wait.
Did you see it booting ?
You have a screen : you can see the boot process.
When the system wakes up, it knows nothing about disks, NICs or what OS it will be using.
Then the BIOS locates a bootable drive, reads the boot sector and loads whatever is mentioned over there.
This will load the FreeBSD kernel.
The FreeBSD kernel doesn't know it will be a 'pfSense' system.
It will enumerate all the hardware it finds in the system.
If booting stops in that process, the issue is not 'pfSense' but the kernel having a hard time with a device, for example through a NIC driver.

The report tells me :
You've demanded a system reboot.
Then the kernel hits a VM (virtual memory) fault.
That's not a pfSense thing, for me, that's the thin boundary between the kernel and your hardware.

Try another system ^^
Btw : to motivate you : I've been using pfSense for more than a decade and never had to hardware reset it. Not sure my device even has a reset button.
Btw : resetting a device like pfSense is (can be) bad for the file system. Press the reset button on your Windows PC several times, and it won't boot anymore either.

Also : install snort only if you are sure your device has been rock solid for the past xx months.
This goes for any very resource-demanding software.

Again : these words are mine - I'm just another pfSense user.

emefff

@Gertjan
It wasn't clear before the fresh install, that it is the same. If it was so perfect, how come it's problematic since 2.7? And no, it is not the same bit for bit. There are no logs and other files that change during operation etc. Also, files are in different places of the SSD.

No I did not see it booting.

It's not my hardware that is at fault, it's CE 2.7. My hardware worked for years with 2.5 and 2.6 and I did not change any major stuff in 2.7.

I won't try another system, but if I do, it won't be pfSense.

Mario.

Gertjan

@emefff said in WAN going UP and DOWN in CE 2.7:

how come it's problematic since 2.7?

Can't tell. Most probably the (a new) NIC driver included in the kernel.
Previous pfSense versions used FreeBSD 12.x, pfSense 2.7 uses FreeBSD 14.
Like Windows 10 before and now Windows 11 : there are differences ^^

@emefff said in WAN going UP and DOWN in CE 2.7:

And no, it is not the same bit for bit. There are no logs and other files that change during operation etc. Also, files are in different places of the SSD.

I meant : running the same 'code'.

Stef93

@emefff
This seems to be something I have been struggling with for 2 weeks, and I finally seem to have solved it. The first thing to do is go to System > Routing > Gateways > Edit > Show Advanced Options > Packet Loss Thresholds and change the default value from 10-20 to 10-75.
The second thing I did was remove the interface and rebind it (I had to fix it wherever it was referenced: NAT, OpenVPN, etc., a lot of places).
The third thing I did was turn off the machine and completely de-energize it. I made sure the lights on the network card had stopped blinking (WoL), waited another minute, and only after that turned it on.
As a result, the problem went away... https://docs.netgate.com/pfsense/en/latest/hardware/tune.html There are a lot of tips here, BUT I suggest applying them only if there are problems! Therefore, I advise you to remove all the tuning that you may have done.

Stef93

It takes a long time to explain these points, but they were carried out at intervals of 2-3 days. For example, I changed 10-20 to 10-75 in order to see the losses; by default, with more than 20% loss, pfSense considers the gateway lost.
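To make the effect of the two numbers concrete, here is a small Python sketch of how a low/high packet loss threshold pair could classify a gateway. This is an illustration only, not pfSense's monitoring code: the function name `gateway_status`, the "warning" state between the two thresholds, and the use of strict "greater than" comparisons are assumptions.

```python
def gateway_status(loss_pct, low=10.0, high=20.0):
    """Classify a gateway from its measured packet loss percentage.

    low/high mirror the "10-20" default pair mentioned in this thread.
    """
    if not 0.0 <= loss_pct <= 100.0:
        raise ValueError("loss_pct must be between 0 and 100")
    if loss_pct > high:
        return "down"      # above the high threshold: gateway considered lost
    if loss_pct > low:
        return "warning"   # assumed intermediate state, gateway still in use
    return "online"


if __name__ == "__main__":
    # The same 30% loss is judged differently under the two settings.
    print(gateway_status(30.0))               # default 10-20 pair
    print(gateway_status(30.0, 10.0, 75.0))   # relaxed 10-75 pair
```

Under these assumptions, raising the high threshold does not reduce the loss itself; it only stops moderate loss from marking the gateway as lost, which makes the underlying loss easier to observe.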

emefff

Hello,

I tried the threshold change from 10-20 to 10-75 and increased the MBUFS, nothing changed.

What did seem to get rid of the hotplugging events (at least for now 24h without any hotplugging event, which was NEVER the case since 2.7) is changing from snort to suricata. The funny thing is: sometime in CE 2.6 I changed from suricata to snort because of trouble with suricata.

I am confident this is solved, but will report back if I was wrong,

thanks everybody,

Mario.

emefff

Hi,

I have been running this config without unplanned hotplugging events of NIC for more than a week now. It was definitely the Snort package that caused these events,

Mario.

JPCNS